Creating and Using Cloud Files
This tutorial is from the MBrace Starter Kit.
MBrace clusters have a cloud file system mapped to the corresponding cloud fabric. This can be used like a distributed file system such as HDFS.
Accessing the Cloud File System from F# scripts
First, let's define and use some Unix-like file functions to access the cloud file system from your F# client script. (Using these is optional: you can also use the MBrace API directly.)
You now use these functions to create directories and files:
Now check you've created the files correctly:
Now remove the directory of data:
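The steps in this section can be tried out with local stand-ins. The block below is a minimal sketch that assumes nothing about the MBrace API: it defines helpers with Unix-like names (`ls`, `lsRec`, `mkdir`, `rmdirRec`, `cat`, `write`, `writeLines`) on the local disk using `System.IO`, then creates, checks, and removes a directory. The helper bodies, the location of `root`, and the file names are illustrative assumptions; in a cluster script the same names would wrap the cloud file system instead.

```fsharp
open System.IO

// Local stand-ins: these operate on the local disk, NOT on MBrace cloud storage.
// The root directory below is a hypothetical location chosen for this sketch.
let root = Path.Combine(Path.GetTempPath(), "cloud-files-sketch")

let ls (path: string) = Directory.EnumerateFiles(path)
let rec lsRec (path: string) : seq<string> =
    seq {
        yield! Directory.EnumerateFiles(path)
        for d in Directory.EnumerateDirectories(path) do
            yield! lsRec d
    }
let mkdir (path: string) = Directory.CreateDirectory(path) |> ignore
let rmdirRec (path: string) = Directory.Delete(path, true)   // recursive delete
let cat (path: string) = File.ReadAllText(path)
let write (path: string) (text: string) = File.WriteAllText(path, text)
let writeLines (path: string) (lines: seq<string>) = File.WriteAllLines(path, lines)

// Create a directory with two files.
let dir = Path.Combine(root, "data")
mkdir dir
write (Path.Combine(dir, "file1.txt")) "Hello World!"
writeLines (Path.Combine(dir, "file2.txt")) [ "a"; "b" ]

// Check that the files were created correctly.
let names =
    lsRec root
    |> Seq.map (fun (p: string) -> Path.GetFileName(p))
    |> Seq.sort
    |> Seq.toList
let text = cat (Path.Combine(dir, "file1.txt"))

// Remove the directory of data.
rmdirRec root
```

Keeping the helpers as thin one-line wrappers means a script can switch between local and cloud storage by changing only these definitions.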
Programmatic upload of data as part of cloud workflows
The Unix-like abbreviations from the previous section are for use from your client scripts. You can also use the MBrace cloud file API directly from cloud workflows.
First, create a local temp file.
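A minimal sketch of this step, using only `System.IO`. The row layout (a timestamp, a row index, and a small integer value, separated by commas), the fixed start date, and the row count are assumptions made for illustration.

```fsharp
open System
open System.IO

// Hypothetical row layout: sortable timestamp, row index, numeric value.
let localTmpFile =
    let path = Path.GetTempFileName()       // creates an empty file and returns its path
    let start = DateTime(2020, 1, 1)        // arbitrary fixed start date, so the output is reproducible
    let lines =
        [ for i in 0 .. 999 ->
            sprintf "%s,%d,%d" (start.AddSeconds(float i).ToString("s")) i (i % 10) ]
    File.WriteAllLines(path, lines)
    path
```

`localTmpFile` then holds the path of a 1000-line file on the client machine, ready to be uploaded.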
Next, you upload the created file to the tmp container in cloud storage. The tmp container will be created if it does not exist.
After uploading the file, you remove the local file.
Now process the file with a cloud expression, which runs in the MBrace cluster.
Using multiple cloud files as input to distributed cloud flows
Processing one small file in the cloud is not of much use. However, multiple large cloud files can be used as inputs to distributed cloud flows, in much the same way as inputs to map-reduce jobs in Hadoop.
Next you generate a collection of 100 cloud files and process them using a distributed cloud flow.
A collection of cloud files can be used as input to a cloud parallel data flow, summing the third column of each line of each file in a distributed way.
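To make the computation concrete, here is a local, single-machine sketch of the same pipeline shape written with `Seq` instead of `CloudFlow`. The file count, the `name,index,value` row layout, and the directory location are assumptions for illustration. In a cloud flow, `CloudFlow.OfCloudFileByLine`, `CloudFlow.map`, and `CloudFlow.sum` would presumably take the places of `Seq.collect`, `Seq.map`, and `Seq.sum`, with the work partitioned across the cluster.

```fsharp
open System.IO

// Write a few small CSV files to a scratch directory (local disk, not cloud storage).
let dataDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName())
Directory.CreateDirectory(dataDir) |> ignore

let files =
    [ for i in 1 .. 5 do
        let path = Path.Combine(dataDir, sprintf "file%d.csv" i)
        // Assumed layout: a label, a row index j, and the value i * j.
        let lines = [ for j in 1 .. 10 -> sprintf "row,%d,%d" j (i * j) ]
        File.WriteAllLines(path, lines)
        yield path ]

// Read every line of every file and sum the third column.
let sumOfThirdColumn =
    files
    |> Seq.collect (fun (p: string) -> File.ReadLines(p))
    |> Seq.map (fun line -> line.Split(',').[2] |> int)
    |> Seq.sum

// Clean up the scratch data.
Directory.Delete(dataDir, true)
```

With these inputs the third column of file `i`, row `j` is `i * j`, so the total is (1 + ... + 5) × (1 + ... + 10) = 15 × 55 = 825.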
Finally, clean up the cloud data.
Summary
In this tutorial, you've learned how to use cloud files, from some simple Unix-like operations to using multiple cloud files as partitioned inputs into CloudFlow programming. Continue with further samples to learn more about the MBrace programming model.
Note that you can use the above techniques from both scripts and compiled projects. To see the components referenced by this script, see ThespianCluster.fsx or AzureCluster.fsx.