Handy Command-Line One-liners for Starting Data Scientists

Rama Ramakrishnan
6 min read · Jun 3, 2017


[6/5/2017 update: I was asked if I had a PDF version of the one-liners below. Here it is — Data-Science-One-Liners.pdf]

Experienced data scientists use Unix/Linux command-line utilities (like grep, sed and awk) a great deal in everyday work. But starting data scientists, particularly those without programming experience, are often unaware of the power and elegance of these utilities.

When interviewing candidates for data scientist positions, I ask simple data manipulation questions that can be done with a command-line one-liner. But often the answer is “I will fire up R, import the CSV into a data frame, and then …” or “I will load the data into Postgres and then …”.

The command-line can be much simpler and faster, especially for getting large data files ready for consumption by specialized tools like R. For example, rather than try to load a million-row CSV into R and sample 10% of it, you can quickly create a 10% sample using this one-liner …

awk -v X=10 'BEGIN {srand()} rand() <= 0.01*X' filename > newfile

and then load the resulting (roughly) 100k-row CSV into R for modeling.

To illustrate the usefulness of the command-line for basic data wrangling, I have listed some examples below. I assume that the incoming file has rows and columns, the columns are delimited with a comma, space or tab, and there’s a header row with the names of the columns.
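For concreteness, here is a tiny made-up comma-delimited file of the kind assumed below (the column names and values are purely hypothetical):

order_id,region,units_sold

1001,east,7

1002,west,3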

Both the ‘one-liner’ in the title and some of the actual one-liners below come from the venerable lists of awk one-liners and sed one-liners.

Count the number of lines in a file

wc -l filename

Show the column names, one in each line, preceded by line numbers (i.e., grab the header row, transpose it and prefix line numbers)

Helpful when you have numerous columns in a new datafile and want to get the lay of the land; e.g., knowing that “average_selling_price” is column #39 is useful for many of the column-oriented examples below.

(The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)

head -1 filename | tr '\t' '\n' | nl

head -1 filename | tr ',' '\n' | nl

head -1 filename | tr ' ' '\n' | nl
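With the tiny hypothetical file above, the comma-delimited version would print something like:

     1  order_id

     2  region

     3  units_sold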

Page through the file with line numbers showing

less -N filename

Show the first line/first few lines/last few lines

head -1 filename

head filename

tail filename

Show line #4212

Very useful when you are trying to load the file into a database and the load fails at line #4212, for instance. Also, this command will conveniently quit after printing the 4212th line; very considerate if your file has a million lines!

sed '4212q;d' filename
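A related trick: show a few lines of context around the problem line, again quitting early so the rest of a huge file is never read (the line numbers here are just an example).

sed -n '4210,4214p;4214q' filename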

Show lines with “foo” in any field/show lines with “foo” in any field, ignoring foo’s case

grep 'foo' filename

grep -i 'foo' filename
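Two handy variations: -n shows the line number of each match, and -c just counts the matching lines.

grep -n 'foo' filename

grep -c 'foo' filename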

Show lines with ‘foo’ in field #18

(The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)

awk -F'\t' '$18 == "foo"' filename

awk -F, '$18 == "foo"' filename

awk '$18 == "foo"' filename

Show rows that have fewer fields than the header row

To check if any rows are incomplete

awk 'NR==1 {x=NF}; NF < x' filename
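A small variation along the same lines (sketched here with the default whitespace delimiter; add -F, or -F'\t' for comma- or tab-delimited files): print the line number and field count of each short row, so you know exactly where to look.

awk 'NR==1 {x=NF}; NF < x {print NR": "NF" fields"}' filename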

Remove lines with “foo” in any field and save the rest into a new file

sed '/foo/d' filename > newfile

Remove lines with ‘foo’ in field #18 and save the rest into a new file (The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)

awk -F'\t' '$18 != "foo"' filename > newfile

awk -F, '$18 != "foo"' filename > newfile

awk '$18 != "foo"' filename > newfile

Remove the first line and save the rest into a new file

Great for stripping a header row before further processing

sed '1d' filename > newfile

Remove the first 8 lines and save the rest into a new file

sed '1,8d' filename > newfile

Remove line #42 and save the rest into a new file

sed '42d' filename > newfile

Remove lines 233 to 718 and save the rest into a new file

sed '233,718d' filename > newfile

Remove the last line and save the rest into a new file

sed '$d' filename > newfile

Remove the last 8 lines and save the rest into a new file

sed -e :a -e '$d;N;2,8ba' -e 'P;D' filename > newfile

(Ugh! Let me know if you know of a better way.)

[7/1/2017 update: Prem Swaroop suggested a better way. While it is a 2-liner, it is much nicer than the sed monstrosity above]

num=$(wc -l < filename); awk "NR <= $num-8" filename > newfile
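If your head is the GNU coreutils version (standard on Linux), a negative count does this directly; note that this does not work with the BSD head that ships with macOS.

head -n -8 filename > newfile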

Remove blank lines from the file and save the rest into a new file

sed '/^$/d' filename > newfile

Remove duplicate lines and save the rest into a new file

(if you want the original order preserved)

awk '!seen[$0]++' filename > newfile

(if you don’t need the original order preserved)

sort -u filename > newfile
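If you just want to see which lines are duplicated (each duplicated line printed once), uniq -d does that:

sort filename | uniq -d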

Remove lines with a missing value in field #18 and save the rest into a new file

(The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)

awk -F'\t' '$18 != ""' filename > newfile

awk -F, '$18 != ""' filename > newfile

awk '$18 != ""' filename > newfile
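Before dropping anything, you may want to count how many rows have a missing value in that field (tab-delimited case shown; adapt -F as above):

awk -F'\t' '$18 == ""' filename | wc -l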

Show just col #42

(The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)

cut -f42 filename

cut -d, -f42 filename

cut -d' ' -f42 filename

Show the unique values in column #42 with counts

Useful for understanding a categorical field. A histogram, essentially. The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files.

cut -f42 filename | sort | uniq -c

cut -d, -f42 filename | sort | uniq -c

cut -d' ' -f42 filename | sort | uniq -c
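To see the most frequent values first, pipe the counts through sort -rn (tab-delimited case shown):

cut -f42 filename | sort | uniq -c | sort -rn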

(The three examples below assume that the file is tab-delimited. For comma and space delimited files, modify as shown in other earlier examples)

Remove the 1st field and save the rest into a new file

cut -f2- filename > newfile

Remove field #42 and save the rest into a new file

cut -f1-41,43- filename > newfile

Remove fields #19–42 and save the rest into a new file

cut -f1-18,43- filename > newfile

Stack files row-wise

Useful if you have two or more files with the same columns and you need to ‘pancake’ stack them. Assumes file2 doesn’t have a header row. If it does, first remove it using a one-liner :-)

cat file1 file2 > newfile
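If file2 does have a header row, one way is to strip it on the fly while stacking (two commands on one line):

cat file1 > newfile; sed '1d' file2 >> newfile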

Stack files column-wise

Useful if you have two or more files with the same rows but different sets of columns and you need to combine them side-by-side

paste file1 file2 > newfile
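By default paste joins the files with a tab; if you want comma-delimited output instead, set the delimiter with -d:

paste -d, file1 file2 > newfile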

Randomly shuffle the rows of a file and save to a new file

awk 'BEGIN{srand();}{print rand()"\t"$0}' filename | sort -k1 -n | cut -f2- > newfile
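On systems with GNU coreutils, shuf does the same thing more directly:

shuf filename > newfile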

Randomly choose X% of rows and save to a new file (X= 10% in the code snippet below)

awk -v X=10 'BEGIN {srand()} rand() <= 0.01*X' filename > newfile

Select every 10th row and save to a new file

awk 'NR%10 == 0' filename > newfile

Split a file into two files with 3000 rows in the first file and the rest in the second

Useful for train/test splitting

csplit -sf prefix filename 3001

(the resulting files will be prefixed with whatever you specify as ‘prefix’)
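An equivalent without csplit, if you prefer (the output file names are just placeholders); if you want a random split, shuffle the file first with the one-liner above:

head -n 3000 filename > trainfile; tail -n +3001 filename > testfile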

[6/5/2017 update: Here’s a PDF version of the one-liners above — Data-Science-One-Liners.pdf]

This barely scratches the surface of what’s possible with the command-line. Even just the ones listed above can be chained to do some pretty complex tasks in a single sequence.
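For example, this sketch chains three of the one-liners above: strip the header row, keep only the tab-delimited rows with 'foo' in field #18, and then take a roughly 10% random sample:

sed '1d' filename | awk -F'\t' '$18 == "foo"' | awk -v X=10 'BEGIN {srand()} rand() <= 0.01*X' > newfile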

There are numerous resources online if you want to learn more about awk, sed, and the rest of the command-line toolkit.
