There has been a lot of talk about the Microsoft Worldwide Partner Conference 2009, but another event, largely overshadowed, was the 10th annual Microsoft Research Faculty Summit. During this summit, Tony Hey, the Microsoft External Research Vice President, announced the release of two tools intended to help transform research in the academic world: Project Trident, and Dryad with DryadLINQ. You can watch the announcement here. These tools are freely available to academic researchers and scientists and can be downloaded here. So, what are they and why should we care?
Roger Barga, a principal architect at Microsoft Research, heads Project Trident: A Scientific Workbench, which aims to make complex data easy to manage and visualize on a large scale. The team worked with researchers at the Monterey Bay Aquarium Research Institute and the University of Washington, among others, to develop this workbench, which is written in C# using Windows Workflow Foundation and Windows Presentation Foundation.
Although its background is oceanographic, collecting data from sensors in the Pacific Ocean and then analyzing and visualizing it, the workbench is much more general purpose than that. I had the pleasure of visiting with Roger and his team last month when they were in Washington, DC for the Microsoft Research Roadshow at the Newseum and saw the capabilities firsthand. You could imagine any number of data sets and domains in which this might be used to give scientists and other domain experts the ability to model and visualize data without having to know the guts of .NET. What’s more, you can use the other half of this release, Dryad, on an HPC cluster to help you crunch that data to fit your needs.
You can download Project Trident today and give the team feedback on Microsoft Connect.
Dryad and DryadLINQ
So, what are Dryad and DryadLINQ? They come from a project that has been incubating at Microsoft Research Silicon Valley for the past four years, with the goal of making distributed data-parallel computing accessible to all developers through a common programming model that many in the .NET community already understand: LINQ. Simply put, developers can write LINQ queries as if they were local expressions, and Dryad will then distribute the work across the cluster, which could contain thousands of machines. For some examples of what the code might look like, refer to the “Some Sample Programs Written in DryadLINQ” paper. A quick example would be building a histogram in C#:
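// Parameter list reconstructed from the names used in the body (directory, fileName, k).
public static IQueryable<Pair> BuildHistogram(string directory, string fileName, int k)
{
    // Open the partitioned input table that holds the text lines.
    string uri = DataPath.FileUriPrefix + Path.Combine(directory, fileName);
    PartitionedTable<LineRecord> inputTable = PartitionedTable.Get<LineRecord>(uri);
    // Split each line into words, group identical words, and count each group.
    IQueryable<string> words = inputTable.SelectMany(x => x.line.Split(' '));
    IQueryable<IGrouping<string, string>> groups = words.GroupBy(x => x);
    IQueryable<Pair> counts = groups.Select(x => new Pair(x.Key, x.Count()));
    // Keep only the k most frequent words.
    IQueryable<Pair> ordered = counts.OrderByDescending(x => x.Count);
    IQueryable<Pair> top = ordered.Take(k);
    return top;
}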
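To get a feel for the "write it as if it were local" idea without a cluster, here is a minimal sketch of the same operator chain running as plain LINQ to Objects. It is not DryadLINQ code: the `LocalHistogram` class is a hypothetical name, an in-memory array stands in for the `PartitionedTable`, `KeyValuePair<string, int>` stands in for the sample's `Pair` type, and the extra `ThenBy` is only there to make ties deterministic.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LocalHistogram
{
    // Hypothetical local stand-in for the DryadLINQ sample: the same
    // SelectMany / GroupBy / Select / OrderByDescending / Take chain,
    // applied to an in-memory sequence of lines.
    public static List<KeyValuePair<string, int>> BuildHistogram(
        IEnumerable<string> lines, int k)
    {
        return lines
            .SelectMany(line => line.Split(' '))           // lines -> words
            .GroupBy(word => word)                         // identical words together
            .Select(g => new KeyValuePair<string, int>(g.Key, g.Count()))
            .OrderByDescending(p => p.Value)               // most frequent first
            .ThenBy(p => p.Key, StringComparer.Ordinal)    // stable order for ties
            .Take(k)
            .ToList();
    }

    public static void Main()
    {
        var lines = new[] { "the quick brown fox", "the lazy dog", "the fox" };
        foreach (var pair in BuildHistogram(lines, 2))
        {
            Console.WriteLine(pair.Key + ": " + pair.Value);
        }
    }
}
```

With the three sample lines above, the two entries printed are `the: 3` and `fox: 2`. The point of the comparison is that the query body barely changes between the local version and the distributed one; what changes is the data source it is applied to.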
You may have heard about the Dryad project before, when Michael Isard, a Dryad technical lead, was interviewed by Carl and Richard on .NET Rocks show 378. In that interview, Michael explained the basics of Dryad, a general-purpose execution engine for coarse-grain data-parallel applications built upon Windows High Performance Computing (HPC) Server, and DryadLINQ, a set of language extensions that enable a programming model for large-scale distributed computing using LINQ. When quizzed at the time the episode was recorded, Michael stated that there were no plans to productize the venture. Fast forward to this week, and it has been made available for scientific research. Will it remain limited to that? I certainly doubt it.
How does the technology differ from Google MapReduce and Hadoop? In the introduction to DryadLINQ, the case is laid out:
The MapReduce system adopted a radically simplified programming abstraction, however even common operations like database Join are tricky to implement in this model. Moreover, it is necessary to embed MapReduce computations in a scripting language in order to execute programs that require more than one reduction or sorting stage. Each MapReduce instantiation is self contained and no automatic optimizations take place across their boundaries. In addition, the lack of any type system support or integration between the MapReduce stages requires programmers to explicitly keep track of objects passed between these stages, and may complicate long-term maintenance and re-use of software components.
Several domain-specific languages have appeared on top of the MapReduce abstraction to hide some of this complexity from the programmer, including Sawzall, Pig, and other unpublished systems such as Facebook’s HIVE. These offer a limited hybridization of declarative and imperative programs and generalize SQL’s stored-procedure model. Some whole query optimizations are automatically applied by these systems across MapReduce computation boundaries. However, these approaches inherit many of SQL’s disadvantages, adopting simple custom type systems and providing limited support for iterative computations. Their support for optimizations is less advanced than that in DryadLINQ, partly because the underlying MapReduce execution platform is much less flexible than Dryad.
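To make the Join complaint in the passage above concrete, here is a small sketch of what a typed join followed by a second aggregation stage looks like with the standard LINQ operators. It runs locally as LINQ to Objects and says nothing about how Dryad would plan or distribute such a query; the `JoinSketch` class, the `Sensor` and `Reading` types, and the sample values are all hypothetical and invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class JoinSketch
{
    // Hypothetical record types, invented for this illustration.
    public class Sensor { public int Id; public string Site; }
    public class Reading { public int SensorId; public double Celsius; }

    // Joins readings to sensors on the sensor id, then averages per site.
    // The join and the grouping are two stages expressed in one typed query.
    public static Dictionary<string, double> AverageBySite(
        IEnumerable<Sensor> sensors, IEnumerable<Reading> readings)
    {
        return sensors
            .Join(readings,
                  s => s.Id,                // key on the sensor side
                  r => r.SensorId,          // key on the reading side
                  (s, r) => new { s.Site, r.Celsius })
            .GroupBy(x => x.Site)
            .ToDictionary(g => g.Key, g => g.Average(x => x.Celsius));
    }

    public static void Main()
    {
        var sensors = new[]
        {
            new Sensor { Id = 1, Site = "SiteA" },
            new Sensor { Id = 2, Site = "SiteB" }
        };
        var readings = new[]
        {
            new Reading { SensorId = 1, Celsius = 10.0 },
            new Reading { SensorId = 1, Celsius = 12.0 },
            new Reading { SensorId = 2, Celsius = 8.0 },
            new Reading { SensorId = 3, Celsius = 99.0 } // no matching sensor, dropped
        };
        foreach (var entry in AverageBySite(sensors, readings))
        {
            Console.WriteLine(entry.Key + ": " + entry.Value);
        }
    }
}
```

Because the compiler checks the element types flowing between the join and the grouping, the bookkeeping that the quoted passage describes as manual in MapReduce is handled by the type system here.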
Erik Meijer had the chance to sit down with Roger Barga this week to talk about this subject as well as the Dryad and DryadLINQ technology in their Expert to Expert talk, which is well worth a watch. So, if you’re in the field of research, download Dryad and DryadLINQ today, play around with the examples in C# and VB, and give the team feedback on Microsoft Connect.
These collaboration tools released by Microsoft Research have the potential to transform the way that some scientific research is done. Moreover, they help answer the age-old question of “What is Microsoft’s answer to MapReduce and Hadoop?” Even more interesting, future releases of this software are promised to land on CodePlex soon, which to me is one of the best parts of all.