How to Bash for Data Science

or: How I Learned to Stop Worrying and Love the Terminal

Anmol Garg
The Data Experience
3 min read · Nov 9, 2015


Driven by a strong desire to work with Internet-of-Things data, I chose to transition from being a pure statistician to a well-rounded data scientist. I quickly went from writing R scripts in RStudio to living in terminal sessions, programming in Python, collaborating using git, writing HTML/CSS/JavaScript, etc. Fortunately, tools for data scientists, namely Anaconda, Jupyter, and Pandas, can make this transition much easier. Unfortunately, they can also instill a false sense of confidence in the command line, and you may find yourself in over your head at the worst moment. I know I have.

As I continued my data science career, I found myself working on a wider set of tasks than I ever experienced as a statistician, including building analysis and visualization tools, maintaining remote servers, setting up databases, and contributing to libraries. Since I had primarily focused on Python during my transition, I had learned only the bare minimum of command line commands. For reference, I kept a cheat sheet of git commands, server commands, and other commonly used commands that I didn’t really understand.

My understanding of git in 2013

I needed to become more efficient and more powerful in the command line, so I attempted to move as much of my work as possible to the terminal (specifically iTerm 2.9). I also worked on my command line vocabulary by memorizing lots of different commands with accompanying options and routines to help in my workflow. It doesn’t take long running Unix commands to find that the built-in commands are less than adequate: fairly simple things become complicated because you have to set specific flags and memorize unintuitive command names. I also disliked the terminal’s appearance and found it lacking useful information.

What to do? Bash profile customization to the rescue! With the right pieces set up, you can reassign built-in commands to work as you want (e.g. always copy files interactively), create new commands (e.g. open a directory in Google Chrome), modify the appearance (e.g. change the prompt information shown and the color scheme), add data in useful places (e.g. change the tab header to show the working directory), and add functionality (e.g. create a function to show live computer stats).
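To make those categories concrete, here is a minimal sketch of what such entries might look like. It is not taken from my actual profile: the function name `chrome`, the colors, and the prompt layout are illustrative assumptions, and `open -a` only exists on Mac OS X.

```shell
# Illustrative ~/.bash_profile entries (names and colors are assumptions).

# Reassign a built-in command: always ask before overwriting when copying.
alias cp='cp -i'

# New command: open a directory (default: the current one) in Google Chrome.
# `chrome` is a hypothetical name; `open -a` is specific to Mac OS X.
chrome() {
    open -a "Google Chrome" "${1:-.}"
}

# Appearance: green user@host, blue working directory, then the prompt sign.
PS1='\[\e[0;32m\]\u@\h\[\e[0m\]:\[\e[0;34m\]\w\[\e[0m\]\$ '

# Useful data: before each prompt, write the working directory into the
# terminal tab/window title using the standard OSC 0 escape sequence.
PROMPT_COMMAND='printf "\033]0;%s\007" "$PWD"'
```

After reloading the profile (`source ~/.bash_profile`), `cp` prompts before overwriting a file and `chrome ~/Documents` opens that folder in the browser.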

Running my self-defined `ii` function to show computer information
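For a rough idea of how a function like this can be put together, here is a simplified sketch. It is a hypothetical stand-in rather than the `ii` from my profile, and the fields it prints are assumptions chosen because the underlying commands are widely available.

```shell
# Hypothetical, simplified `ii`: print a short summary of the machine.
ii() {
    echo "Host:    $(uname -n)"    # network node name of this machine
    echo "System:  $(uname -sr)"   # kernel name and release
    echo "User:    $(id -un)"      # name of the current user
    echo "Date:    $(date)"        # current date and time
    # uptime is not guaranteed to be installed, so check before calling it.
    if command -v uptime >/dev/null 2>&1; then
        echo "Uptime:  $(uptime)"
    fi
}
```

Typing `ii` at the prompt then prints one labeled line per item.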

Now that I am slightly more experienced (and after accidentally wiping my old .bash_profile and having to recreate one from scratch), I wanted to share my .bash_profile (as well as back it up) and hope you are able to take some pieces — if not the whole thing — to make your own terminal more approachable and convenient for your data science needs. This profile is designed and tested on Mac OS X with Sublime Text, Google Chrome, and Jupyter installed. See here for a primer on bash_profiles and how to edit your own.

With all this awesome customized functionality, I never want to leave the terminal and it’s always the first program I open.

Note that I moved several important pieces of .bash_profile to .bashrc, namely my path and Java pointer. I did this to (1) keep my .bash_profile clean and easily shareable and (2) separate user-specific information from more general aliases and functions. There are more reasons to split information between these files depending on your OS and needs, but you can also put everything in .bash_profile if you prefer. If you do decide to use a .bashrc, be sure to source it from .bash_profile.
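One way to do that sourcing is sketched below. The helper name `source_if_present` is a hypothetical choice for this example; the existence check keeps a new shell from printing an error when the file is missing.

```shell
# Sketch for ~/.bash_profile: load another startup file only if it exists.
# `source_if_present` is a hypothetical helper name.
source_if_present() {
    if [ -f "$1" ]; then
        . "$1"    # `.` is the portable spelling of bash's `source`
    fi
}

# In a real ~/.bash_profile, this line would pull in the user-specific
# settings (path, Java pointer) kept in ~/.bashrc:
#   source_if_present "$HOME/.bashrc"
```

The call is left commented out here so the snippet can be pasted and tried without touching an existing setup; uncomment it once your .bashrc is in place.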

This work, like most work in CS, is standing on the shoulders of giants: credit to Nate Landau and Barry Clark for inspiration.


Anmol Garg
The Data Experience

data scientist at Tesla, UW huskies fan, SF. all thoughts are my own.