Data Science at the Command Line webcast

Yesterday, I attended a very handy webcast by Jeroen Janssens called Data Science at the Command Line (a book is on its way). While I do most of my data manipulation from R, it is undeniably convenient to be able to run some simple tasks interactively from the command line, or as part of a shell script or Makefile.

The presentation touched on several command-line tools that I either wasn’t aware of or had forgotten about. If you missed the webcast, this website has a helpful list of commands, both well-known and obscure. Below are a few that I had not known about and that might be useful to others.

parallel

parallel is a shell command to execute a series of commands in batch, over several CPUs on a local machine or over several computers (using a combination of ssh and rsync to connect and transfer files). I used to use a combination of Xgrid and some homegrown shell scripts to achieve that, but Xgrid is unfortunately no more (RIP!). parallel seems like a way to quickly get a small subset of Xgrid’s functionality.

cowsay

cowsay is like echo, but with cute ASCII art animals.

~$ cowsay "I'd rather get error messages from a cute animal."
 ______________________________________ 
/ I'd rather get error messages from a \
\ cute animal.                         /
 -------------------------------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Very useful to add some levity to your error messages! Use cowsay -l to get a list of animal templates.

I used the Node.js cowsay package to create this little Node.js app (for fun, and to learn a minimal amount of Express.js). The command-line cowsay itself can be installed via Homebrew or MacPorts.

asciinema

I saw Jeroen use this tool during the webcast to record his terminal session. Asciinema looks very very cool and potentially helpful to create terminal-based tutorials. Install via pip.

csvkit & jq

csvkit is a suite of command line tools to deal with CSV files, but works quite well for tab-separated data as well (which I deal with often). Particularly useful so far: csvlook (nicely formatted table in the terminal), csvstat (column statistics) and csvsql (SQL queries on CSV files). jq can be used to manipulate JSON data, potentially piped into csvkit to create simple text tables.

An interactive Barnes-Hut tree

[TL;DR: if you’d like to play with a simple Barnes-Hut octree code, scroll down to the little embedded app.]

Gah! It’s been quite a while since my last post. Despite my best intentions, work (and a lot of feedback from Super Planet Crash!) has taken precedence over blogging. I do have a sizable list of interesting topics that I’ve been meaning to write about, however, so over the next few weeks I’ll try to keep to a more steady posting clip.

Super Planet Crash has been a resounding success. I have been absolutely, positively astounded by the great feedback I received. My colleagues and I have been coming up with lots of ideas for improving the educational value of SPC, adding new, interesting physics, and addressing some of the complaints. To be able to dedicate more time to it, we’ve spent the past few months furiously applying for educational and scientific grants to fund development. Hopefully something will work out; my goal is to make it into a complete suite of edu-tainment applications.

When Giants Collide

I’ve recently started experimenting with a new visualization that I think will turn out pretty darn cool. Its draft name is When Giants Collide. When Giants Collide will address a common request from planetary crashers: “Can I see what happens when two giant planets collide?”

A sketch of the interface.

When Giants Collide will be a super-simple JavaScript app (so it will run in your browser) that will simulate the collision of two massive spheres of gas. The simulation will have to model both gravity and the dynamics of the gas: to address this, I’ve been dusting off and reviewing an old Smoothed-Particle Hydrodynamics (SPH) code I worked on for a brief period in graduate school. SPH is a very simple technique for cheaply simulating gas flows with good spatial accuracy, and is somewhat straightforward to code. There are some shortcuts that have to be taken, too — large time steps, low particle counts, and more (e.g., a polytropic equation of state for the gas giants; more on this in future posts). These shortcuts come at the expense of realism, but will enable fast, smooth animation in the browser.

Gravity with the Barnes-Hut algorithm

Gravity is an essential ingredient of When Giants Collide! Even with very low particle counts (say, N = 1000), a brute force calculation that just sums up the mutual gravitational force between particles won’t do if you want to run the simulation at 60 frames per second. Direct summing is an N^2 operation:

(this is a simple force accumulator written in R).
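
To make the N^2 scaling concrete, here is a minimal direct-summation sketch in JavaScript (the language of the app); it is not the R snippet mentioned above. The function name directAccelerations, the choice of units with G = 1, and the softening length eps are all assumptions made for illustration.

```javascript
// Direct-summation gravity: every particle feels every other particle.
// Assumptions (not from the original post): G = 1, particles are plain
// objects {x, y, z, m}, and `eps` is a small softening length that
// avoids a division by zero for very close pairs.
function directAccelerations(particles, eps = 1e-3) {
  const n = particles.length;
  const acc = particles.map(() => [0, 0, 0]);
  // Loop over the N(N-1)/2 distinct pairs: the cost still grows as N^2.
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      const dx = particles[j].x - particles[i].x;
      const dy = particles[j].y - particles[i].y;
      const dz = particles[j].z - particles[i].z;
      const r2 = dx * dx + dy * dy + dz * dz + eps * eps;
      const invR3 = 1 / (r2 * Math.sqrt(r2));
      // Equal and opposite pulls: update both members of the pair.
      acc[i][0] += particles[j].m * dx * invR3;
      acc[i][1] += particles[j].m * dy * invR3;
      acc[i][2] += particles[j].m * dz * invR3;
      acc[j][0] -= particles[i].m * dx * invR3;
      acc[j][1] -= particles[i].m * dy * invR3;
      acc[j][2] -= particles[i].m * dz * invR3;
    }
  }
  return acc;
}
```

Doubling the number of particles roughly quadruples the number of pairs visited by the inner loop, which is why this approach becomes too slow for animation.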

A better approach, which involves only a slightly more complicated algorithm, is the Barnes-Hut algorithm (a short Nature paper with more than 1,000 citations!). The algorithm recursively subdivides space into cubes and loads them with particles, such that every leaf cube contains either 0 or 1 particles. This is represented in code with an octree structure. Once such a tree is constructed, one can calculate the gravitational force on a given particle in the brute-force way for close particles, and in an approximate way for distant particles (a whole distant cube is treated as a single mass located at its center of mass); whether to use one or the other is determined by walking the tree down from the top. This brings the cost down from N^2 to roughly N log N. An excellent explanation (with great visuals!) is provided in this article.
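
The build-then-walk procedure described above can be sketched in a few dozen lines of JavaScript. This is a minimal illustration, not the code in the When Giants Collide repository: the names makeNode, insert, buildTree and accelOn, the opening angle theta, the softening eps, and the units with G = 1 are all assumptions. It also assumes that every particle lies inside the root cube and that no two particles share exactly the same position.

```javascript
// A node is a cube centred on (cx, cy, cz) with half-width `half`.
// mx, my, mz accumulate mass-weighted positions (for the centre of mass).
function makeNode(cx, cy, cz, half) {
  return { cx, cy, cz, half, mass: 0, mx: 0, my: 0, mz: 0,
           particle: null, children: null };
}

// Return (creating it if needed) the child octant that contains p.
function childFor(node, p) {
  const i = (p.x >= node.cx ? 1 : 0) | (p.y >= node.cy ? 2 : 0) |
            (p.z >= node.cz ? 4 : 0);
  if (!node.children[i]) {
    const h = node.half / 2;
    node.children[i] = makeNode(node.cx + (i & 1 ? h : -h),
                                node.cy + (i & 2 ? h : -h),
                                node.cz + (i & 4 ? h : -h), h);
  }
  return node.children[i];
}

// Insert p, splitting occupied leaves so each leaf holds at most one particle.
function insert(node, p) {
  if (node.children) {                   // internal node: descend
    insert(childFor(node, p), p);
  } else if (node.particle === null) {   // empty leaf: store p here
    node.particle = p;
  } else {                               // occupied leaf: split it
    const resident = node.particle;
    node.particle = null;
    node.children = new Array(8).fill(null);
    insert(childFor(node, resident), resident);
    insert(childFor(node, p), p);
  }
  node.mass += p.m;
  node.mx += p.m * p.x;
  node.my += p.m * p.y;
  node.mz += p.m * p.z;
}

// Build a tree whose root cube is [-half, half]^3 (an assumed domain).
function buildTree(particles, half = 1) {
  const root = makeNode(0, 0, 0, half);
  for (const p of particles) insert(root, p);
  return root;
}

// Walk the tree from the top and accumulate the acceleration on p.
// A cube of side s at distance r is used as a single point mass when
// s / r < theta; otherwise it is opened and its children are visited.
function accelOn(node, p, theta = 0.5, eps = 1e-3, out = [0, 0, 0]) {
  if (!node || node.mass === 0 || node.particle === p) return out;
  const dx = node.mx / node.mass - p.x;
  const dy = node.my / node.mass - p.y;
  const dz = node.mz / node.mass - p.z;
  const r2 = dx * dx + dy * dy + dz * dz;
  const size = 2 * node.half;
  if (!node.children || size * size < theta * theta * r2) {
    const s2 = r2 + eps * eps;           // softened squared distance
    const f = node.mass / (s2 * Math.sqrt(s2));
    out[0] += f * dx;
    out[1] += f * dy;
    out[2] += f * dz;
  } else {
    for (const child of node.children) accelOn(child, p, theta, eps, out);
  }
  return out;
}
```

Calling accelOn(buildTree(particles), p) returns the approximate acceleration on p. Setting theta to 0 forces every cube to be opened, which reproduces the brute-force sum; larger values trade accuracy for speed. Values much above 0.5 would need an extra check so that a cube containing p itself is never used as a point mass.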

The other advantage is that, once the tree has already been built for the gravity calculation, it can also be used to identify the nearest neighbors of a given particle through the same tree-walking procedure. The nearest neighbors are needed for the hydrodynamical part of the SPH algorithm (see, e.g., this review article by Stefan Rosswog or this one by Daniel Price).

An interactive tree

Below is an interactive JavaScript applet that subdivides space with the Barnes-Hut algorithm. You can add new points by clicking on the surface, or use the buttons to add new, random ones.

The code for building the Barnes-Hut tree from an array of 3D positions is available at the GitHub repository for When Giants Collide. I will be developing the code in the open, and will post periodically about my progress. Hopefully by the end of summer I will have an attractive app running on any modern device and web browser. Any ideas on how to gamify it?

Setting up a nice AUCTeX environment on Mac OS X

Most people I know use TeXShop on Mac OS X. While it’s a pretty good TeX editor, I think Emacs is overall vastly superior. Of course, I’m rather biased since I already use Emacs for everything else… Perhaps this post will be useful to other Emacs-addicted astronomers.

In my setup, I use the AUCTeX package coupled with the Skim PDF viewer (if you’re not using Skim, download it, it’s brilliant!). One of the advantages of this combination is that Emacs and Skim can be kept in sync, like in the screenshot below.

Skim + Emacs/AUCTeX nirvana. Note that the current highlighted line in Skim corresponds to the cursor position in Emacs.

I found it a bit difficult to set up the AUCTeX package with sensible defaults, so I’ll reproduce my configuration here in the hope that it will be useful to someone else.

The salient lines are the ones configuring latexmk and Skim. You should have latexmk installed if you are using the TeX Live distribution; Skim can be downloaded for free here. You can stick this script in your Emacs initialization file (see my dotemacs repository if you’d like to see my other Emacs configs). I shamelessly copied those lines from this Stack Overflow answer.

Two LaTeX gems: ShareLaTeX and latexdiff

Here are two really cool LaTeX tools every astronomer should enjoy.

ShareLaTeX is an online LaTeX writing tool. It’s great for collaboratively writing LaTeX documents of any size, and a life-saver when you don’t have access to your own laptop with a TeX installation on it — just grab a web browser, navigate to ShareLaTeX and write away, then grab the PDF product. (You can also chat with collaborators, browse revisions, and a bunch of other useful niceties.)

A sample ShareLaTeX project.

The folks behind ShareLaTeX generously announced today that they made their product open-source. Here is the GitHub page with their source code. It appears to be extremely easy to run your own local installation, if you so desire.

While working on a grant application (in the old, inefficient fashion: on a Dropbox shared folder) I wished there was some way to send my collaborators “diffs” of my changes to the PDF, to save them the time of hunting for the changed word or sentence. Emery Berger’s blog directed me to the latexdiff tool, which I had somehow never heard of! It’s quite easy to install (if you use MacPorts, it’s a simple sudo port install latexdiff), and the resulting PDF diffs are nice and clear.

A sample latexdiff output.

AstroTRENDS: A new tool to track astronomy topics in the literature

A screenshot of AstroTRENDS, showing three random keywords: Dark Energy, Spitzer, and White Dwarf. White Dwarfs are the “old reliable” of the group.

Inspired by this post by my good friend Augusto Carballido, I created a new web app called AstroTRENDS. It’s like Google Trends, for astronomy!

AstroTRENDS shows how popular specific astronomical topics are in the literature throughout the years. For instance, you could track the popularity of Dark Energy vs. Dark Matter, or the rise of exoplanet-themed papers since the discovery of the first exoplanets in 1992. As an example, check out this post I wrote about whether the astronomical community has settled on the “extrasolar planet” or “exoplanet” moniker.

You can normalize keywords with respect to one another, or the total article count, to track relative trends in popularity (say, the growth of “Transits” papers compared to “Radial Velocity” papers). Finally, you can click on a specific point to see all the papers containing the keyword from that year (maybe that spike in a keyword is connected to a discovery, a new theory or the launch of a satellite?).

How does it work? I crawled ADS for a small number of keywords that I thought were interesting (but you can ask me for more!), and counted, for each year between 1970 and 2013, how many refereed articles were published with that keyword in the abstract. Keywords made up of multiple words are enclosed in quotes, to specify that all of the words must be in the abstract.
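
The counting and normalization steps can be sketched as follows. This is only an illustration of the bookkeeping, not the AstroTRENDS code: it works on a hypothetical in-memory array of {year, abstract} records rather than querying ADS, the function names countByYear and normalize are made up, and matching is done with a simple case-insensitive substring test (so “dark” would also match “darkness”).

```javascript
// Count, for each year in [firstYear, lastYear], how many abstracts
// contain every word of `keyword`.
// `articles` is a hypothetical array of {year, abstract} records.
function countByYear(articles, keyword, firstYear = 1970, lastYear = 2013) {
  const counts = {};
  for (let y = firstYear; y <= lastYear; y++) counts[y] = 0;
  const words = keyword.toLowerCase().split(/\s+/).filter(Boolean);
  for (const { year, abstract } of articles) {
    if (year < firstYear || year > lastYear) continue;
    const text = abstract.toLowerCase();
    // All words of a multi-word keyword must appear in the abstract.
    if (words.every((w) => text.includes(w))) counts[year] += 1;
  }
  return counts;
}

// Divide one yearly series by another (e.g. another keyword, or the
// total article count) to get a relative trend. Years where the
// reference count is zero are reported as 0.
function normalize(counts, reference) {
  const out = {};
  for (const y of Object.keys(counts)) {
    out[y] = reference[y] > 0 ? counts[y] / reference[y] : 0;
  }
  return out;
}
```

For example, normalize(countByYear(articles, "transits"), countByYear(articles, "radial velocity")) would give the year-by-year ratio of the two keywords for whatever records are in articles.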

Play and have fun with it, and if you find an interesting trend, you can share it with others by copying and pasting the address from the “Share” box. (Feel free to send it to me, too!)

Open AstroTRENDS

Two useful tools: git-flow and git-gutter+

This is just a quick post linking to two very useful tools I just started incorporating into my workflow.

The first is git-flow. Git-flow gives a very nice structure to feature development using git: it imposes some discipline on branching and committing features to the “master” branch, while also providing a very helpful branch naming scheme and a clear path from developing a feature to incorporating it into a release. This post gives a clear overview of how to get started with it.

Fringe Git indicators in Emacs.

The second is an Emacs package called git-gutter+ (in its “fringe” flavor). It lets you view Git changes directly from the current buffer (the graphical symbols in the fringe shown in the screenshot). Install it from the package manager (MELPA) and enable it in one of your .el initialization scripts.