Data Science at the Command Line webcast

Yesterday, I attended a very handy webcast by Jeroen Janssens called Data Science at the Command Line (a book is on its way). While I do most of my data manipulation from R, it is undeniably convenient to be able to run some simple tasks interactively from the command line, or as part of a shell script or Makefile.

The presentation touched on several command tools that I either wasn’t aware of, or had forgotten about.  If you missed the webcast, this website has a helpful list of commands — of both the well-known and the obscure variety. Below are a few that I had not known about and might be useful to others.

parallel

parallel is a shell command to execute a series of commands in batch, over several CPUs on a local machine or over several computers (using a combination of ssh and rsync to connect and transfer files). I used to use a combination of Xgrid and some homegrown shell scripts to achieve that, but Xgrid is unfortunately no more (RIP!). parallel seems like a way  to quickly get a small subset of Xgrid’s functionality.

cowsay

cowsay is like echo, but with cute ASCII art animals.

~$ cowsay "I'd rather get error messages from a cute animal."
 ______________________________________ 
/ I'd rather get error messages from a \
\ cute animal.                         /
 -------------------------------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Very useful to add some levity to your error messages! Use cowsay -l to get a list of animal templates.

I used the Node.js cowsay package to create this little Node.js app (for fun, and to learn a minimal amount of Express.js). Install via Homebrew or MacPorts.

asciinema

I saw Jeroen use this tool during the webcast to record his terminal session. Asciinema looks very very cool and potentially helpful to create terminal-based tutorials. Install via pip.

csvkit & jq

csvkit is a suite of command line tools to deal with CSV files, but works quite well for tab-separated data as well (which I deal with often). Particularly useful so far: csvlook (nicely formatted table in the terminal), csvstat (column statistics) and csvsql (SQL queries on CSV files). jq can be used to manipulate JSON data, potentially piped into csvkit to create simple text tables.