Jeeves: Predicting Bash Commands via ML

Alessandro Scoccia Pappagallo
Published in Unkempt Thoughts
4 min read · Apr 7, 2018

I’ll admit it: I have a bad memory. I am terrible with dates and possibly even worse with code, which is why I have dozens of cheat sheets among my bookmarks. This is especially true with Bash, as there are plenty of commands I only run a few times a year.

Wouldn’t it be easier, though, if I could instruct the terminal directly any time I don’t remember a command? Something like: “Please, start a new TMUX session” or “Create a new file in this folder”.

I decided to give it a shot.

I start from the data. Nothing fancy, just two columns for each data point: an instruction and the command that instruction refers to. From an ML perspective, this is a multiclass classification problem. A subset of the dataset would look something like this:
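The original table isn’t reproduced here, but a hypothetical subset (the instruction/command pairs below are illustrative, not the post’s actual data) could be built like this:

```python
import pandas as pd

# Hypothetical sample of the two-column dataset; the real pairs may differ.
data = pd.DataFrame({
    "instruction": [
        "start a new tmux session",
        "create a new file in this folder",
        "list all files, including hidden ones",
    ],
    "command": ["tmux new", "touch new_file", "ls -a"],
})
print(data)
```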

As of now I don’t have any signals, though, just text. I decide to use a simple approach: character n-grams. I can train a sklearn.CountVectorizer() over the whole dataset and then use the presence or absence of n-grams as signals.

Something like this should do the job:
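A minimal sketch of that step might be the following (the exact n-gram range is an assumption, since the post doesn’t specify it):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Character 2- and 3-grams, encoded as binary presence/absence features.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 3), binary=True)
X = vectorizer.fit_transform([
    "start a new tmux session",
    "create a new file in this folder",
])
```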

Before building the model I need to convert the labels (the actual commands) into something more manageable, i.e. a numeric encoding. We can do this with sklearn.LabelBinarizer(). See:
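Along these lines (note that LabelBinarizer() actually produces one-hot vectors rather than plain integers; the commands used here are illustrative):

```python
from sklearn.preprocessing import LabelBinarizer

binarizer = LabelBinarizer()
# Each command becomes a one-hot row; classes_ keeps the original strings
# so we can map predictions back with inverse_transform() later.
y = binarizer.fit_transform(["tmux new", "touch new_file", "ls -a"])
```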

Okay, now we need the actual model. I decide to use, surprise surprise, sklearn.RandomForestClassifier(). Instead of working directly with RandomForestClassifier(), though, I wrap it in sklearn.GridSearchCV() so as to get both a well-optimized model and a cross-validated assessment at the same time. Two notes: (a) we are not passing the number of trees to the grid because we want to keep our model light, and (b) we’re using accuracy as we don’t have a proper positive class. We can use the same interface we used so far, so:

Speaking of interface, it’s useful to consider the way users are going to interact with the software before we commit to any specific architecture. In my case, this is the structure I had in mind:

- utils.py, the library containing most of the classes and functions; not to be used directly by the user.
- train.py, the script that trains the model and saves it on disk. Technically, a user would only need to execute this once.
- predict.py {instruction}, the script to use to get the actual predicted command. {instruction} refers to the instruction passed to the script, e.g. predict.py create a new file.

We can then drop all the code we wrote so far into utils.py and import it from train.py, this way:

Awesome, let’s give it a shot.

It works! The accuracy for this first iteration is 70%, not too bad. Sklearn complains, though, that we are splitting an already small dataset into training and test sets, which sometimes leaves very few instances per class. That’s not good.

We can solve this issue and help our model generalize better by generating more data algorithmically. I create a DataExpander() class and give it a few methods that create a new instruction from an existing one, e.g. by swapping a word or introducing a typo.
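The post doesn’t show the class itself, so here is a hypothetical sketch of what such expansion methods could look like (method names and details are mine):

```python
import random


class DataExpander:
    """Generates instruction variants to enlarge the training set (sketch)."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def swap_words(self, instruction):
        # Swap two adjacent words, e.g. "a new file" -> "new a file".
        words = instruction.split()
        if len(words) < 2:
            return instruction
        i = self.rng.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
        return " ".join(words)

    def introduce_typo(self, instruction):
        # Drop one character at random, e.g. "file" -> "fle".
        if not instruction:
            return instruction
        i = self.rng.randrange(len(instruction))
        return instruction[:i] + instruction[i + 1:]

    def expand(self, instruction, n=10):
        # Produce n noisy variants of an instruction, keeping its label.
        ops = [self.swap_words, self.introduce_typo]
        return [self.rng.choice(ops)(instruction) for _ in range(n)]
```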

The final versions of our utils.py and train.py then look respectively like this and like this. Let’s try to train the model again:

We now have a way larger dataset (100 times larger in fact!) and the accuracy of our model seems to have increased to ~99%. Great news.

The code in predict.py looks very similar to train.py, the major difference being that we retrieve the model from disk rather than instantiating it from scratch.

Let’s give it a try:

Great, it works like a charm! I don’t like the idea that I need to navigate to this specific folder every time I want to use the model (also because I would need to remember its location), so I create a very simple Bash program and drop it in /usr/bin/. I also take the opportunity to give the program a catchy name: Jeeves!
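Such a wrapper can be as small as this (the project path is a placeholder, not the post’s actual location):

```shell
#!/usr/bin/env bash
# /usr/bin/jeeves (sketch): forward all arguments to predict.py.
# Substitute the placeholder path with wherever the project lives.
python /path/to/jeeves/predict.py "$@"
```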

I can now execute Jeeves from anywhere just by typing jeeves {instruction}.

Awesome! What’s next?

- We only trained the model on six commands, so we need to expand the training set to cover more classes.
- Currently we are only using “static” commands with hardcoded arguments. We need to find a way to teach Jeeves to run commands like “touch my_file”, inferring “my_file” directly from the instruction.
- With the current approach, the number of features grows with the number of instances (especially when brand-new instructions are provided), so the system does not scale well. We may want to reduce the number of features by applying dimensionality reduction techniques.
