Example: Cloud-distributed WordCount
This example is from the MBrace Starter Kit.
This example implements the classic word count task commonly associated with distributed Map/Reduce frameworks. We use CloudFlow for the implementation and textfiles.com as our data source.
First, some basic type definitions:
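As a minimal sketch of such definitions, the snippet below pairs each word with a 64-bit count and adds a small tokenizer. The names `WordFrequency`, `WordCount`, `separators` and `splitWords` are assumptions made for illustration, as is the choice of separator characters and the lower-casing of words.

```fsharp
open System

/// A word paired with the number of times it occurs (assumed shape).
type WordFrequency = string * int64

/// The result of a word count: one frequency entry per word.
type WordCount = WordFrequency []

/// Characters treated as word separators (an illustrative choice).
let separators = [| ' '; '\t'; '.'; ','; ';'; ':'; '!'; '?'; '"'; '('; ')' |]

/// Splits a line of text into lower-cased words, dropping empty entries.
let splitWords (line : string) : string [] =
    line.Split(separators, StringSplitOptions.RemoveEmptyEntries)
    |> Array.map (fun w -> w.ToLowerInvariant())
```

A regular-expression tokenizer would work equally well here; splitting on a fixed character set simply keeps the sketch short.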
Now, define the words to ignore in the word count:
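One way to hold such a list is a case-insensitive hash set, which gives constant-time membership checks while filtering. The sketch below assumes that approach; the words shown are only an illustrative subset, and `noiseWords` and `isNoiseWord` are names chosen for this example.

```fsharp
open System
open System.Collections.Generic

/// Words ignored by the word count (illustrative subset, not a complete list).
let noiseWords =
    HashSet<string>(
        [ "a"; "about"; "all"; "also"; "an"; "and"; "are"; "as"; "at"; "be"
          "but"; "by"; "for"; "from"; "in"; "is"; "it"; "of"; "on"; "or"
          "that"; "the"; "this"; "to"; "was"; "with" ],
        StringComparer.OrdinalIgnoreCase)

/// Returns true when the word should be excluded from the count.
let isNoiseWord (word : string) : bool =
    noiseWords.Contains word
```

Because the set uses `StringComparer.OrdinalIgnoreCase`, `isNoiseWord "The"` and `isNoiseWord "the"` give the same answer without normalizing the input first.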
We are now ready to define our distributed workflows. First, we create a distributed download workflow that caches the contents of the supplied URLs across the cluster. The workflow returns a PersistedCloudFlow, which later flow queries can consume directly.
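To make the download-then-cache idea concrete, here is a single-machine analogue that does not use MBrace at all. It assumes a hypothetical `fetch` function supplied by the caller, downloads each distinct URL once, and materializes all lines in an in-memory array. The names `Fetcher` and `downloadAndCacheLocal` are invented for this sketch; on a cluster, the same role is played by a CloudFlow that is persisted across the workers.

```fsharp
open System

/// Hypothetical fetcher: given a URL, returns the text of the document.
type Fetcher = string -> string

/// Fetches each distinct URL once, splits the text into lines, and
/// materializes the result in memory so later queries can reuse it.
let downloadAndCacheLocal (fetch : Fetcher) (urls : seq<string>) : string [] =
    urls
    |> Seq.distinct
    |> Seq.collect (fun url ->
        (fetch url).Split([| "\r\n"; "\n" |], StringSplitOptions.None))
    |> Seq.toArray
```

Passing the fetcher as a parameter keeps the sketch free of network code and makes it easy to try with canned text before pointing it at real URLs.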
The wordcount function can now be defined:
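The shape of such a function can be sketched locally with F#'s `Seq` module: split lines into words, drop noise words, count occurrences, sort by descending count, and keep the top `cutoff` entries. This is a single-machine sketch, not the distributed implementation. It assumes the CloudFlow module offers similarly named combinators for the distributed case, and `computeWordCountLocal`, the separator set and the tiny noise list are illustrative.

```fsharp
open System
open System.Collections.Generic

type WordFrequency = string * int64
type WordCount = WordFrequency []

/// Splits a line into words on a small, illustrative set of separators.
let splitWords (line : string) : string [] =
    line.Split([| ' '; '\t'; '.'; ','; '!'; '?' |], StringSplitOptions.RemoveEmptyEntries)

/// Illustrative noise words; a real list would be much longer.
let noiseWords =
    HashSet<string>([ "a"; "and"; "of"; "the" ], StringComparer.OrdinalIgnoreCase)

/// Counts words across all lines and returns the `cutoff` most frequent ones.
let computeWordCountLocal (cutoff : int) (lines : seq<string>) : WordCount =
    lines
    |> Seq.collect splitWords                              // lines -> words
    |> Seq.map (fun w -> w.ToLowerInvariant())             // normalize case
    |> Seq.filter (fun w -> not (noiseWords.Contains w))   // drop noise words
    |> Seq.countBy id                                      // word -> occurrences
    |> Seq.map (fun (w, c) -> w, int64 c)
    |> Seq.sortByDescending snd                            // most frequent first
    |> Seq.truncate cutoff                                 // keep the top entries
    |> Seq.toArray
```

For example, `computeWordCountLocal 10 [ "the cat and the hat"; "a cat sat" ]` counts `cat` twice and `hat` and `sat` once each, with the noise words removed before counting.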
Test the wordcount sample using textfiles.com
Step 1. Determine the URIs of the data inputs from textfiles.com
Step 2. Download the URIs across the cluster and load them into memory
Step 3. Perform the word count on the downloaded data
Check progress:
Wait for the results:
In this tutorial, you've learned how to perform a scalable textual analysis task using MBrace. Continue with further samples to learn more about the MBrace programming model.
Note that you can use the above techniques from both scripts and compiled projects. To see the components referenced by this script, see ThespianCluster.fsx or AzureCluster.fsx.
