Exceptions and Fault tolerance
This tutorial is from the MBrace Starter Kit.
This tutorial gives an overview of the MBrace exception handling features and of its fault tolerance mechanism.
Exception handling
Just like async workflows, MBrace cloud workflows support exception handling:
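As a minimal sketch of such a workflow, the snippet below throws a user exception inside a cloud block and handles it on the client. It assumes the starter kit's cluster script is already loaded so that `Config.GetCluster()` is in scope; the name `kaboom` is purely illustrative.

```fsharp
// Assumption: ThespianCluster.fsx or AzureCluster.fsx from the starter kit
// has been loaded, so that Config.GetCluster() is available.
open MBrace.Core

let cluster = Config.GetCluster()

// A cloud workflow whose body throws a user exception.
let kaboom : Cloud<unit> = cloud { do failwith "kaboom!" }

// Run it on the cluster; the exception is rethrown on the client,
// where an ordinary try/with block can handle it.
try
    cluster.Run(kaboom)
with e ->
    printfn "Caught on the client: %s" e.Message
```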
Sending the above computation to your cluster has the expected behaviour: any user exception is caught and rethrown on the client side.
This has interesting ramifications when our cloud computation spans multiple machines:
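A hedged sketch of a computation of this kind follows. The helper `div` and the input range are illustrative choices; the point is only that exactly one of the parallel children divides by zero.

```fsharp
open MBrace.Core

// Divides m by n inside its own cloud workflow.
let div (m : int) (n : int) : Cloud<int> = cloud { return m / n }

// Ten child work items are distributed across the cluster;
// the child with i = 5 divides by zero and throws.
let parallelSum : Cloud<int> = cloud {
    let! results = Cloud.Parallel [ for i in 1 .. 10 -> div 10 (5 - i) ]
    return Array.sum results
}
```

Passing `parallelSum` to `cluster.Run` would then surface the `DivideByZeroException` on the client.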
In the example above, we perform a calculation in parallel across the cluster in which one of the child work items is going to fail with a user exception. In the event of such an uncaught error, the exception bubbles up to the parent computation, actively cancelling any outstanding sibling computations.
While the stacktrace offers a precise indication of what went wrong, it may be a bit ambiguous on how and where it went wrong. Let's see how we can use MBrace to improve this in our example:
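One way to do this, sketched below, is to wrap the failure in a custom exception that records the worker and the input involved. The field layout and message format are illustrative assumptions, and the sketch assumes that `Cloud.CurrentWorker` yields a reference (`IWorkerRef`) to the worker executing the current work item.

```fsharp
open MBrace.Core

// Wraps a failure together with the worker and the input that produced it.
// Fields are positional: Data0 = worker, Data1 = input, Data2 = original exception.
exception WorkerException of IWorkerRef * int * exn
    with
        override e.Message =
            sprintf "Worker '%O' given input %d has failed with exception:\n'%O'"
                e.Data0 e.Data1 e.Data2

// Performs the division, annotating any failure with its execution context.
let div (n : int) : Cloud<int> = cloud {
    try
        return 10 / (5 - n)
    with e ->
        let! worker = Cloud.CurrentWorker
        return raise (WorkerException(worker, n, e))
}

let parallelSum : Cloud<int> = cloud {
    let! results = Cloud.Parallel [ for i in 1 .. 10 -> div i ]
    return Array.sum results
}
```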
This yields a stack trace that also describes how and where the failure occurred.
It is also possible to catch exceptions raised by distributed workflows:
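A minimal sketch, assuming the same illustrative `div` helper as before: the `try`/`with` block sits in the parent workflow and catches the `DivideByZeroException` raised by one of the children.

```fsharp
open System
open MBrace.Core

let div (m : int) (n : int) : Cloud<int> = cloud { return m / n }

// The handler runs in the parent workflow, so it sees the exception
// raised by whichever child failed.
let tryParallelSum : Cloud<int option> = cloud {
    try
        let! results = Cloud.Parallel [ for i in 1 .. 10 -> div 10 (5 - i) ]
        return Some (Array.sum results)
    with :? DivideByZeroException ->
        return None
}
```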
This suppresses any exception raised by the Parallel workflow and returns a proper value to the client.
Computing partial results
This default behaviour (i.e. bubbling up and cancellation) is often undesirable, particularly when we are running an expensive distributed computation. In such cases, aggregating partial results is usually the preferred way to go. Let's see how we can encode this behaviour using MBrace:
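One possible encoding, sketched here with a hand-written helper, is to catch the exception inside each child and return it as a value, so that no exception escapes and no sibling is cancelled. The names `captureOutcome` and `partialResults` are hypothetical.

```fsharp
open MBrace.Core

let div (m : int) (n : int) : Cloud<int> = cloud { return m / n }

// Runs a workflow and captures its outcome as a value
// instead of letting an exception escape to the parent.
let captureOutcome (work : Cloud<'T>) : Cloud<Choice<'T, exn>> = cloud {
    try
        let! value = work
        return Choice1Of2 value
    with e ->
        return Choice2Of2 e
}

// Every child completes; a failing child contributes a Choice2Of2 entry.
let partialResults : Cloud<Choice<int, exn> []> =
    Cloud.Parallel [ for i in 1 .. 10 -> captureOutcome (div 10 (5 - i)) ]
```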
When executed, this returns an array that holds, for each child computation, either its result or the exception it raised.
Fault tolerance
It is important at this point to distinguish between user exceptions, i.e. runtime errors generated by user code, and faults, i.e. errors that happen because of problems in the MBrace runtime. Faults can happen for a multitude of reasons:
- Bugs in the runtime implementation.
- Sudden death of a worker node: VMs of a cloud service can often be reset by the administrator without warning.
- User errors that can cause the worker process to crash, such as stack overflows.
Let's have a closer look at an example of a fault, so that we gain a better understanding of how faults work. First, we define a cloud function that forces the death of a worker by calling Environment.Exit on the process in which it is executed:
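A minimal sketch of such a function is shown below; the exit code is an arbitrary choice. Running it deliberately kills a worker process, so it should only be tried on a disposable cluster.

```fsharp
open System
open MBrace.Core

// Terminates the worker process that executes this work item.
let die () : Cloud<unit> = cloud { Environment.Exit 1 }
```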
Let's try to run this on our cluster, for example by passing die() to cluster.Run. Sure enough, after a while we will receive a FaultException on the client side.
Note that this computation has killed one of our worker instances.
If working with MBrace on Azure, the service fabric will ensure that the dead instance will be reset.
If working with MBrace on Thespian, you will have to manually replenish your cluster instance by calling the cluster.AttachNewLocalWorkers() method.
If we now list the cloud processes of the cluster, we can indeed verify that the last computation has faulted.
MBrace responds to faults in our cloud process by raising a FaultException. What differentiates fault exceptions from normal user exceptions is that they often cannot be caught by exception handling logic at the user level. For instance:
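The following sketch illustrates the problematic shape: the handler and the faulting call live in the same work item. The function name and the `faulty` parameter are hypothetical; `faulty` stands for any workflow that kills its worker.

```fsharp
open MBrace.Core

// The try/with block is part of the same work item as the faulting call,
// so it dies together with the worker and never gets a chance to run.
let catchInSameWorkItem (faulty : Cloud<unit>) : Cloud<unit> = cloud {
    try
        return! faulty
    with :? FaultException ->
        return ()
}
```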
A handler placed inside the faulting work item will not have any effect on the outcome of the computation. This happens because the exception handling clause is itself part of the work item that was being executed by the worker that was killed. Compare this against:
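A sketch of the contrasting shape, again with a hypothetical `faulty` parameter: the faulting workflows run as child work items of `Cloud.Parallel`, while the handler stays in the parent.

```fsharp
open MBrace.Core

// The children run as separate work items; the handler belongs to the
// parent work item, which survives the death of a child's worker.
let catchInParent (faulty : Cloud<unit>) : Cloud<unit> = cloud {
    try
        let! _ = Cloud.Parallel [ faulty; faulty ]
        return ()
    with :? FaultException ->
        return ()
}
```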
This works as expected, since the exception handling logic runs in the parent work item.
Working with fault policies
In stark contrast to our pathological die() example, most real faults actually happen because of transient errors in cluster deployments. It is often the case that we want our computation not to stop because of such minor faults. This is where fault policies come into play.
When sending a computation to the cloud, a fault policy can be specified. This indicates whether, and for how long, a faulting computation should be retried:
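The sketch below defines a workflow that faults with some probability and submits it with an explicit policy. It assumes that `cluster.Run` accepts an optional `faultPolicy` argument and that `FaultPolicy.NoRetry` is available; the helper `runWithoutRetries` is hypothetical, and the cluster script is assumed to be loaded.

```fsharp
// Assumption: ThespianCluster.fsx or AzureCluster.fsx from the starter kit
// has been loaded, so that Config.GetCluster() is available.
open System
open MBrace.Core

let cluster = Config.GetCluster()

/// Kills the executing worker with a probability of 1 / n.
let diePb (n : int) : Cloud<unit> = cloud {
    if Random().Next(0, n) = 0 then Environment.Exit 1
}

// 100 children, each with a 1-in-10 chance of killing its worker.
let test () : Cloud<unit> = cloud {
    let! _ = Cloud.Parallel [ for i in 1 .. 100 -> diePb 10 ]
    return ()
}

// Submits the workflow so that the first fault fails the whole computation.
let runWithoutRetries () : unit =
    cluster.Run(test (), faultPolicy = FaultPolicy.NoRetry)
```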
The computation as defined above has a significant probability of faulting at some point during its execution. We can compensate by applying a more flexible retry policy:
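As a hedged illustration, a policy with a bounded number of retries could be applied as follows. The helper `runWithRetries` is hypothetical, and the sketch assumes `FaultPolicy.WithMaxRetries` takes the maximum retry count as its first argument.

```fsharp
// Assumption: ThespianCluster.fsx or AzureCluster.fsx from the starter kit
// has been loaded, so that Config.GetCluster() is available.
open MBrace.Core

let cluster = Config.GetCluster()

// Runs a workflow, retrying each faulted work item up to five times
// before the fault is reported to the client.
let runWithRetries (work : Cloud<'T>) : 'T =
    cluster.Run(work, faultPolicy = FaultPolicy.WithMaxRetries(5))
```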
This will cause any faulting part of the computation to yield a FaultException only after it has faulted more than 5 times.
Fault policies can also be scoped in our cloud code:
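A sketch of what scoping might look like, assuming a curried `Cloud.WithFaultPolicy policy workflow` combinator and a `FaultPolicy.InfiniteRetries()` factory. The `flaky` parameter is hypothetical and stands for any workflow that may fault.

```fsharp
open MBrace.Core

let scoped (flaky : Cloud<unit>) : Cloud<unit> = cloud {
    // Uses whatever fault policy the enclosing computation was started with.
    let! _ = Cloud.Parallel [ for i in 1 .. 10 -> flaky ]
    // Overrides the policy for this sub-computation only.
    let! _ =
        Cloud.WithFaultPolicy (FaultPolicy.InfiniteRetries())
            (Cloud.Parallel [ for i in 1 .. 10 -> flaky ])
    return ()
}
```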
The example above uses the inherited fault policy for the first parallel computation, whereas the second parallel computation uses a custom fault policy, namely InfiniteRetries.
Faults & Partial results
The question that now arises is how to recover partial results of a distributed computation in the presence of faults. This can be achieved through the use of runtime introspection primitives:
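As a minimal sketch, the workflow below kills its worker on the first attempt and returns a value once it detects that it is being retried. The returned values are arbitrary, and the retry only takes place if the fault policy in effect allows at least one retry.

```fsharp
open System
open MBrace.Core

let retryAware : Cloud<int> = cloud {
    let! isRetry = Cloud.IsPreviouslyFaulted
    if isRetry then
        // A previous attempt faulted: take the safe path.
        return 42
    else
        // First attempt: simulate a fault by killing the worker.
        Environment.Exit 1
        return 0 // never reached
}
```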
The Cloud.IsPreviouslyFaulted primitive returns true if the current work item is part of a computation that has previously faulted and is currently being retried. We can use this knowledge to dynamically alter its mode of execution.
Let's now have a look at a more useful example: a parallel combinator, protectedParallel, that returns partial results even in the presence of faults. It relies on a result type, Result<'T>, whose cases distinguish successful values (Success of 'T), user exceptions (Exception of exn) and faults (Fault of exn):
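The sketch below shows one way such a combinator could be written. It assumes that `Cloud.TryGetFaultData()` yields an optional record exposing the fault as `FaultException`. The test workflow `test2` and its inputs are illustrative, and a fault is only reported as `Fault` if the fault policy in effect allows at least one retry.

```fsharp
open System
open MBrace.Core

type Result<'T> =
    | Success of 'T
    | Exception of exn
    | Fault of exn

let protectedParallel (computations : seq<Cloud<'T>>) : Cloud<Result<'T> []> = cloud {
    let protect (computation : Cloud<'T>) = cloud {
        let! faultData = Cloud.TryGetFaultData()
        match faultData with
        | None ->
            // First attempt: run the computation, capturing user exceptions.
            try
                let! value = computation
                return Success value
            with e ->
                return Exception e
        | Some data ->
            // Retry after a fault: report the fault instead of running again.
            return Fault data.FaultException
    }
    return! computations |> Seq.map protect |> Cloud.Parallel
}

// One child succeeds, one throws a user exception, one kills its worker.
let test2 () : Cloud<Result<int> []> =
    protectedParallel [
        cloud { return 42 }
        cloud { return failwithf "user error %d" 1 }
        cloud { Environment.Exit 1; return 0 }
    ]
```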
The example makes use of the Cloud.TryGetFaultData() primitive to determine whether the current computation is a retry of a previously faulted operation. To test the workflow, run protectedParallel over a mix of computations that succeed, throw a user exception and fault, and inspect the resulting array.
Summary
In this tutorial, you've learned how to reason about exceptions and faults in MBrace. Continue with further samples to learn more about the MBrace programming model.
Note, you can use the above techniques from both scripts and compiled projects. To see the components referenced by this script, see ThespianCluster.fsx or AzureCluster.fsx.