
Systemic 2 Cookbook: Parallel computation with R

This post is the first in the “Systemic Cookbook” series, which will show how to use Systemic 2 for scientific purposes (especially its more advanced features not readily accessible from its user interface). More comprehensive documentation for Systemic is forthcoming.

All posts in this series will be collected in the “Systemic 2” category.


Systemic 2 automatically parallelizes so-called “embarrassingly parallel” workloads, which need no shared data or synchronization. Such workloads include bootstrap resampling and periodograms (each resampled data set is analyzed independently), Markov Chain Monte Carlo (each chain runs in parallel and synchronizes only at infrequent intervals), the jackknife, and other algorithms. Some algorithms can even be launched on Xgrid-connected clusters (Why, Apple?!).

Most embarrassingly parallel workloads, however, will be defined by the user rather than built into Systemic. In the rest of this article, I will show how to easily parallelize dynamical integration over a grid using an R script. (The following problem could just as easily be set up in C, but I will show how to accomplish it in R, since the code is most transparent in a high-level language.)

Setting up dynamical integration over a grid

The (somewhat artificial) problem is as follows. Imagine that we would like to investigate the survival time of a two-planet system, as we increase the eccentricity of the outer planet. Intuitively, we expect the system to become unstable faster as the eccentricity increases.

Let’s first set up a sample system and integrate it forward in time with Systemic 2.

To run this script, type

Rscript scriptname.r

from the shell (edit the script beforehand to point it to the correct path to Systemic).
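In sketch form, such a script might look like the following. The Systemic-specific pieces (sourcing systemic.r, knew, kadd.planet, kintegrate and the SWIFT constant) are drawn from the Systemic 2 R API, but the exact paths, call signatures and planet parameters below are illustrative and should be checked against the Systemic documentation.

    # Load the Systemic 2 R library (adjust the path to your installation)
    setwd("~/Systemic2")
    source("R/systemic.r")

    # Create a new kernel: the object holding all data and functions
    # associated with the planetary system
    k <- knew()
    k$mstar <- 0.95                     # 0.95 solar-mass star

    # Add two planets (periods in days, masses in Jupiter masses; illustrative values)
    kadd.planet(k, period = 300, mass = 1, ecc = 0.05)
    kadd.planet(k, period = 700, mass = 1, ecc = 0.10)

    # Integrate forward 50,000 years with the SWIFT backend
    # (time expressed in days, assuming Systemic's internal units)
    ret <- kintegrate(k, 50000 * 365.25, int.method = SWIFT)

    # Plot the integration to a PDF file
    pdf("integration.pdf")
    plot(ret)
    dev.off()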

This script sets up a two-planet system around a 0.95 solar mass star, integrates it forward 50,000 years using the SWIFT backend, and plots the result to a PDF file. The “k” object is called a kernel within Systemic; it contains all the data and functions associated with the planetary system (internally, it is a handle returned by the C library).

Next, we create a grid of eccentricity and longitude of periastron angles. The rationale here is to average the survival times for each eccentricity over the longitude of pericenter, in order to remove the dependency on the phase angle (again, this is an artificial exercise!).

This part of the script involves some idiomatic R functions (not too dissimilar from the strategies used by Matlab, Numpy and other vectorized languages). To create a grid of eccentricity/phase angles (a cartesian product), I use the expand.grid function. Subsequently, I loop over the (e, lop) pairs using mapply, which applies a function to each pair of elements from the two vectors and collects the return values.
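As a toy illustration of the pattern (independent of Systemic):

    # expand.grid builds the cartesian product: one row per (a, b) combination
    params <- expand.grid(a = 1:3, b = c(10, 20))

    # mapply calls f once per (a, b) pair and collects the return values
    f <- function(a, b) a + b
    mapply(f, params$a, params$b)
    # [1] 11 12 13 21 22 23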

The function called by mapply sets up a kernel with the given eccentricity and longitude angle, and integrates the system forward in time using kintegrate, the Systemic function that returns an object containing information about the dynamical evolution of the system. One of the fields of this object is “survival.time”: how long the system survives before a close encounter or the ejection of one of the bodies. Then, we collect the survival times, group them by eccentricity (using split) and take the median (using sapply). Finally, we plot the eccentricity vs. the survival time.
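Putting the pieces together, the core of the script looks roughly like this (again, the Systemic calls and planet parameters are illustrative sketches rather than the original code; survival.time is the field described above):

    # Worker: given (e, lop), build the system and return its survival time
    survival <- function(e, lop) {
        k <- knew()
        k$mstar <- 0.95
        kadd.planet(k, period = 300, mass = 1, ecc = 0.05)
        kadd.planet(k, period = 700, mass = 1, ecc = e, lop = lop)
        ret <- kintegrate(k, 50000 * 365.25, int.method = SWIFT)
        return(ret$survival.time)
    }

    # Cartesian grid of eccentricities and longitudes of pericenter
    grid <- expand.grid(e = seq(0.1, 0.7, by = 0.05), lop = seq(0, 330, by = 30))

    # One integration per (e, lop) pair
    times <- mapply(survival, grid$e, grid$lop)

    # Group survival times by eccentricity and take the median over lop
    med <- sapply(split(times, grid$e), median)

    # Plot eccentricity vs. median survival time
    plot(as.numeric(names(med)), med, type = "b",
         xlab = "Eccentricity", ylab = "Median survival time")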

Parallelizing the problem

Recent R builds include the parallel package, which provides parallel implementations of the *apply (map) family of functions that act as drop-in replacements for the R builtins. Parallelization is achieved by transparently launching a number of “worker” processes, executing the function on each worker, and finally collecting the results (the workers are heavyweight processes rather than threads, so you will see several R processes if you inspect your running programs).
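As a quick illustration, mclapply takes the same arguments as lapply, plus a few mc.* options (the workers are forked, so this works as described on Unix-like systems such as OS X and Linux):

    library(parallel)

    slow.square <- function(i) { Sys.sleep(0.5); i^2 }

    # Serial version
    x <- lapply(1:8, slow.square)

    # Parallel drop-in replacement; mc.cores sets the number of worker processes
    y <- mclapply(1:8, slow.square, mc.cores = detectCores())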


Here is the revised script:
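In sketch form, reusing the illustrative survival function and grid from above, the relevant portion becomes:

    library(parallel)

    # Same worker and grid as before; the only change is swapping mapply for
    # its multicore counterpart (mc.cores defaults to the "mc.cores" option, or 2)
    times <- mcmapply(survival, grid$e, grid$lop, mc.cores = detectCores())

    med <- sapply(split(times, grid$e), median)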

Note that the only change was using the “mcmapply” (multicore-mapply) function instead of the “mapply” builtin! Since this problem is eminently parallelizable, the speedup is linear in the number of physical cores.

Closing words

In this case, parallelization was trivial; in other cases, it might be more involved (for instance, when load balancing is required).
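For example, with the parallel package the mc.preschedule argument controls how jobs are handed out to the workers; turning it off trades some overhead for better balancing when individual integrations take very different amounts of time (again reusing the illustrative names from above):

    # mc.preschedule = TRUE (the default) splits the jobs into one batch per core
    # up front; FALSE forks one job at a time, so fast and slow integrations are
    # spread more evenly across the workers
    times <- mcmapply(survival, grid$e, grid$lop,
                      mc.preschedule = FALSE, mc.cores = detectCores())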

You gotta get a gimmick: BAM

BAM is a new, extremely lightweight templating system that converts files containing a mixture of code, metadata and text into plain text output. We use a fully-featured version of BAM as an “automated paper writer” in my research group at UC Santa Cruz. A short guide to BAM is available here.


BAM was born out of my conversations with Greg Laughlin on how to automate the somewhat repetitive task of writing certain types of scientific papers and reports. Since we already had a planet discovery tool at our disposal, we decided to focus on planet discovery papers. In its full version, BAM is deeply integrated with the Systemic Console to produce first drafts of planet discovery papers from a reusable template, with the aim of automating as much as possible the writing of the abstract, introduction and quantitative analysis. The Console-integrated BAM has some cool NLG features that we used to write Meschiari et al., 2011 (discovery of HD31253b, HD218566b, HD177830c, HD99492c); since these planets are somewhat unremarkable, the paper lent itself well to automated writing of the majority of the text. Ideally, BAM templates evolve to contain enough logic to “comment” on the data they process; for instance, a template could write a few remarks about the distribution of the newly discovered planets in period-eccentricity and period-mass plots, such as the one to the right (where the dots are automatically downloaded from the Extrasolar planet encyclopaedia and placed on the plot by BAM as well).

Following the spirit of literate programming, the paper template contained the procedure for the data analysis, fitting, error estimation and all the plots intermingled with the LaTeX layout and fragments of text. I presented this tool at the European Science Foundation conference (presentation in Keynote format), including a somewhat humorous video of the Console discovering, analyzing and publishing four planetary systems in real time based on the directions of the paper template (see video at Greg’s website). It’s become one of my favorite gimmicky (but useful) tools to show off to people.

While we keep the Console-integrated BAM version private for use by our team, I am making a rewritten and simplified version available to download now. You should consider this a preliminary 0.1 version and expect bugs and limitations to crop up; features will be added back in future revisions.

Data visualization in Cocoa

Probably one of the best advantages of an interactive GUI for scientific programs is the ability to visualize data in real time. While you could just as well output a full dump of your data and plot it with an external program, there is great pleasure in exploring how your model reacts as you muck and futz around with the parameter space.

[Screenshot: the Systemic Console]

In the Systemic Console program (above), I wrote a flexible Swing component for plotting. As you can see, it takes care of scatter plots (with optional error bars), line plots, histograms and 3D line plots. Plot styles are easy to customize and are close to most journal styles by default. The component takes care of printing, zooming and tracking the mouse pointer, and has a customization panel. Writing this component myself and optimizing it for quick redrawing took quite a bit of time. The graphical performance of Swing under OS X was often under par, especially when trying to get antialiasing right; there is no built-in way to generate PDFs (I ended up using the excellent iText for that); and finally, as usual, a lot of the tedium derived from Java’s verbosity. What’s available for Cocoa?

Plotting with Cocoa

I’m still quite a way from worrying about the UI of my new application; currently, I am mostly thinking about the overall framework and how much of it to write in C versus wrapping in shiny Objective-C classes. However, I’ve spent a bit of time researching the current landscape of data visualization. I will survey some of the options below, along with pros and cons as I see them.

1. SM2DGraphView

Website, Documentation, Open source
An open-source framework used by a few Mac applications. It can show line and scatter plots, and pie charts. It is very well documented and easy to use (see e.g. this tutorial at MacResearch).

  • PROS: Easy to use, well documented, open source
  • CONS: Plots look outdated; it is difficult to get output in line with journal standards; output quality is low compared to the other options

2. CorePlot

Website, BSD license
I have only had time to give this framework a cursory look, but from glancing at the examples and the documentation, it seems very complete and decently documented. The project is frequently updated, and there are a good number of tutorials and Stack Overflow posts about it, mainly thanks to its compatibility with iOS.

  • PROS: Complete documentation (including a .docset for Xcode), plots look good on screen
  • CONS: None in particular, though it seems more complicated to set up than the other options.

3. DataGraph framework

Website, Free for open source projects, $400 otherwise (including in-house projects)

This is basically the full plotting component powering the excellent DataGraph. The framework is available for free for open source projects.

The philosophy is very different from the other components listed above. A plot is defined beforehand by creating a template in DataGraph, rather than programmatically in the code; this is similar to how Aqua UIs are usually serialized in nib files designed with Interface Builder rather than created programmatically. This makes setting up a static plot almost trivial. You can also connect sliders, color wells and other interface elements with plot and formula parameters extremely easily. Zooming, panning and exporting are taken care of by the framework. Finally, the output quality is the same as DataGraph’s, and therefore excellent.

  • PROS: Quality is awesome, free for open source projects, trivial to get up and running with a pre-made DataGraph template
  • CONS: As far as I can tell, you can’t add new plots programmatically, so you might have to allow for several plots in the template and dynamically show/hide them. Documentation is very scarce.

4. A custom NSView

This might be easier than it looks, and is basically the path I took for the Systemic Console to get exactly the look and functionality I needed. It has the advantage of implementing only what you need, without unwanted baggage.

An object inheriting from NSView will do its drawing in the drawRect: method. The actual drawing can be performed using NSBezierPaths to draw lines, ovals, rects and rounded rects with quality antialiasing. Finally, PDF creation is handled by Cocoa automatically (see, e.g., this).

  • PROS: Completely customizable presentation, PDF export for free, probably the best solution if only handling a few uncomplicated plot types.
  • CONS: Writing everything yourself, which is time-consuming to get right. NSBezierPath is not available on iOS.

Conclusions

This is just a quick and undoubtedly incomplete survey of options for plotting with a Cocoa view. For my project, I will probably go with #3 or #4, with Core Plot a close third choice and the best option for iOS development. The DataGraph framework is truly impressive for output quality and breadth of options, and trivially easy to set up for static plots.

Please let me know if there’s any library I missed in this post. You can leave a comment below or send me a message with the “Ask anything” link at the top of the blog.