Handy Command-Line One-liners for Starting Data Scientists
[6/5/2017 update: I was asked if I had a PDF version of the one-liners below. Here it is — Data-Science-One-Liners.pdf]
Experienced data scientists use Unix/Linux command-line utilities (like grep, sed and awk) a great deal in everyday work. But starting data scientists, particularly those without programming experience, are often unaware of the power and elegance of these utilities.
When interviewing candidates for data scientist positions, I ask simple data manipulation questions that can be done with a command-line one-liner. But often the answer is “I will fire up R, import the CSV into a data frame, and then …” or “I will load the data into Postgres and then …”.
The command-line can be much simpler and faster, especially for getting large data files ready for consumption by specialized tools like R. For example, rather than try to load a million-row CSV into R and sample 10% of it, you can quickly create a 10% sample using this one-liner …
awk -v X=10 'BEGIN {srand()} rand() <= 0.01*X' filename > newfile
and then load the resulting 100k row CSV into R for modeling.
To illustrate the usefulness of the command-line for basic data wrangling, I have listed some examples below. I assume that the incoming file has rows and columns, the columns are delimited with a comma, space or tab, and there’s a header row with the names of the columns.
The ‘one liner’ in the title as well as some of the actual one-liners below are from the venerable list of awk one-liners and sed one-liners.
Count the number of lines in a file
wc -l filename
Show the column names, one per line, preceded by line numbers (i.e., grab the header row, transpose it and prefix line numbers)
Helpful when you have numerous columns in a new datafile and want to get the lay of the land, e.g., knowing that “average_selling_price” is column #39 is useful for many of the column-oriented examples below.
(The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)
head -1 filename | tr '\t' '\n' | nl
head -1 filename | tr ',' '\n' | nl
head -1 filename | tr ' ' '\n' | nl
Page through the file with line numbers showing
nl filename | less
Show the first line/first few lines/last few lines
head -1 filename
head filename
tail filename
Show line #4212
Very useful when you are trying to load the file into a database and the load fails at line #4212, for instance. Also, this command will conveniently quit after printing the 4212th line; very considerate if your file has a million lines!
sed '4212q;d' filename
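If you also want a few lines of context around that spot, a print-and-quit variant of the same idea should work (the numbers below are simply line #4212 plus or minus 5):
sed -n '4207,4217p;4217q' filename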
Show lines with “foo” in any field/show lines with “foo” in any field, ignoring foo’s case
grep 'foo' filename
grep -i 'foo' filename
Show lines with ‘foo’ in field #18
(The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)
awk -F'\t' '$18 == "foo"' filename
awk -F, '$18 == "foo"' filename
awk '$18 == "foo"' filename
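If you want the field #18 match to ignore foo's case (like grep -i above), awk's tolower() should do it; the tab-delimited version is shown, adjust -F as above:
awk -F'\t' 'tolower($18) == "foo"' filename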
Show rows that have fewer fields than the header row
To check if any rows are incomplete
awk 'NR==1 {x=NF}; NF < x' filename
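The command above relies on awk's default whitespace splitting; for tab- or comma-delimited files, add the field separator as in the other examples:
awk -F'\t' 'NR==1 {x=NF}; NF < x' filename
awk -F, 'NR==1 {x=NF}; NF < x' filename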
Remove lines with “foo” in any field and save the rest into a new file
sed '/foo/d' filename > newfile
Remove lines with ‘foo’ in field #18 and save the rest into a new file (The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)
awk -F'\t' '$18 != "foo"' filename > newfile
awk -F, '$18 != "foo"' filename > newfile
awk '$18 != "foo"' filename > newfile
Remove the first line and save the rest into a new file
Great for stripping a header row before further processing
sed '1d' filename > newfile
Remove the first 8 lines and save the rest into a new file
sed '1,8d' filename > newfile
Remove line #42 and save the rest into a new file
sed '42d' filename > newfile
Remove lines 233 to 718 and save the rest into a new file
sed '233,718d' filename > newfile
Remove the last line and save the rest into a new file
sed '$d' filename > newfile
Remove the last 8 lines and save the rest into a new file
sed -e :a -e '$d;N;2,8ba' -e 'P;D' filename > newfile
(Ugh! Let me know if you know of a better way.)
[7/1/2017 update: Prem Swaroop suggested a better way. While it is a 2-liner, it is much nicer than the sed monstrosity above]
num=`cat filename | wc -l`; awk "NR <= $num-8" filename > newfile
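If your system has GNU coreutils (most Linux distributions do; stock macOS does not), head accepts a negative line count and does this in one go:
head -n -8 filename > newfile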
Remove blank lines from the file and save the rest into a new file
sed '/^$/d' filename > newfile
Remove duplicate lines and save the rest into a new file
(if you want the original order preserved)
awk '!seen[$0]++' filename > newfile
(if you don’t need the original order preserved)
sort -u filename > newfile
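Relatedly, if you want to see which lines are duplicated rather than remove them, uniq -d shows one copy of each repeated line:
sort filename | uniq -d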
Remove lines with a missing value in field #18 and save the rest into a new file
(The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)
awk -F'\t' '$18 != ""' filename > newfile
awk -F, '$18 != ""' filename > newfile
awk '$18 != ""' filename > newfile
Show just col #42
(The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files)
cut -f42 filename
cut -d, -f42 filename
cut -d' ' -f42 filename
Show the unique values in column #42 with counts
Useful for understanding a categorical field. A histogram, essentially. The first command is for tab-delimited files, the second for comma-delimited files, and the third for space-delimited files.
cut -f42 filename | sort | uniq -c
cut -d, -f42 filename | sort | uniq -c
cut -d' ' -f42 filename | sort | uniq -c
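To see the most frequent values first, tack on a reverse numeric sort (tab-delimited version shown; adjust the cut as above):
cut -f42 filename | sort | uniq -c | sort -rn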
(The three examples below assume that the file is tab-delimited. For comma and space delimited files, modify as shown in other earlier examples)
Remove the 1st field and save the rest into a new file
cut -f2- filename > newfile
Remove field #42 and save the rest into a new file
cut -f1-41,43- filename > newfile
Remove fields #19–42 and save the rest into a new file
cut -f1-18,43- filename > newfile
Stack files row-wise
Useful if you have two or more files with the same columns and you need to ‘pancake’ stack them. Assumes file2 doesn’t have a header row. If it does, first remove it using a one-liner :-)
cat file1 file2 > newfile
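If file2 does have a header row, a subshell lets you strip it on the fly; a small sketch combining this with the “remove the first line” one-liner above:
(cat file1; sed '1d' file2) > newfile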
Stack files column-wise
Useful if you have two or more files with the same rows but different sets of columns and you need to combine them side-by-side
paste file1 file2 > newfile
Randomly shuffle the rows of a file and save to a new file
awk 'BEGIN{srand();}{print rand()"\t"$0}' filename | sort -k1 -n | cut -f2- > newfile
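If GNU coreutils is available (again, Linux yes, stock macOS no), shuf does the same thing with far less typing:
shuf filename > newfile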
Randomly choose X% of rows and save to a new file (X= 10% in the code snippet below)
awk -v X=10 'BEGIN {srand()} rand() <= 0.01*X' filename > newfile
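Note that the awk approach keeps each row independently with probability X/100, so the sample size is only approximately X%. If you need an exactly sized sample and have GNU coreutils, shuf can pick a fixed number of random rows (100k here, echoing the example at the top):
shuf -n 100000 filename > newfile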
Select every 10th row and save to a new file
awk 'NR % 10 == 0' filename > newfile
Split a file into two files with 3000 rows in the first file and the rest in the second
Useful for train/test splitting
csplit -sf prefix filename 3001
(the resulting files will be prefixed with whatever you specify as ‘prefix’)
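csplit splits sequentially, so if the file is ordered (by date, say) the split won't be random. If you need a random split, one way is to chain the shuffle one-liner into csplit, which reads from standard input when given '-' as the file:
awk 'BEGIN{srand();}{print rand()"\t"$0}' filename | sort -k1 -n | cut -f2- | csplit -sf prefix - 3001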
[6/5/2017 update: Here’s a PDF version of the one-liners above — Data-Science-One-Liners.pdf]
This barely scratches the surface of what’s possible with the command-line. Even just the ones listed above can be chained to do some pretty complex tasks in a single sequence.
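For example, a pipeline like this one (on a comma-delimited file) strips the header row, keeps only the rows mentioning ‘foo’ in any field, and shows the five most common values in column #3:
sed '1d' filename | grep -i 'foo' | cut -d, -f3 | sort | uniq -c | sort -rn | head -5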
There are numerous resources if you want to learn more. Here are a few to start with: