How to automate a quick data report with shell scripting

Jordi Lucas · Geek Culture · Apr 27, 2022

Time is precious when our stakeholders need to see the results of our analyses on a frequent basis, so it's important to automate tasks to save time and energy. As you know, there are plenty of technologies, programming languages and libraries for building in-depth analyses and awesome visualizations. But what happens when we need the results quickly, to deliver a presentation or to make a decision? We can always do an unforgettable EDA later.

Shell scripting is a good fit if you want to achieve this. With bash, awk, sed and regular expressions we can do a preliminary analysis of a dataset, perform basic (or advanced) feature engineering, calculate some statistics and show the results in a static HTML file, for example, all quickly and without installing any extra software.

Hands On, Please: A Health Issue

I have to admit that the shell was a black hole for me for some years, until I discovered its powerful capabilities. Nowadays I use it as often as possible.

In order to show you how to create a fast ETL pipeline with shell scripting alone, I have chosen the Heart Failure Prediction Kaggle dataset. It contains 11 clinical features for predicting heart disease events in men and women, and we will use it to discover differences between the two genders. The goal is to present a static report showing the final results.

Heart Failure Prediction Kaggle dataset

First of all, I have developed a simple info.sh file to extract our data and display basic information such as the number of rows and columns and the column names; if we add the -v argument (info.sh -v) we'll see more detailed information about the dataset file, like the download URL on our system and the file format metadata.

info.sh -v file

With the -v argument we will see the file metadata in detail:

File information in detail with -v argument
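The article doesn't reproduce info.sh itself, but a minimal sketch of what it could look like is shown below; the exact implementation in the repository may differ, and here file and stat stand in for the verbose metadata part:

#!/bin/bash
# info.sh — basic information about a csv dataset; -v adds file metadata
# Usage: ./info.sh [-v] file.csv

verbose=false
if [ "$1" = "-v" ]; then
    verbose=true
    shift
fi

dataset="$1"

echo "Rows (without header): $(( $(wc -l < "$dataset") - 1 ))"
echo "Columns: $(head -1 "$dataset" | awk -F',' '{print NF}')"
echo "Column names: $(head -1 "$dataset")"

# More detailed metadata with -v
if $verbose; then
    file "$dataset"    # file format information
    stat "$dataset"    # size, location and timestamps on our system
fi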

Data Understanding Process


The next step is to understand our data and preprocess it to create two clean csv files separated by gender. I have used the stream editor sed to translate the acronyms into human-readable expressions and finally, with all transformations done, I have created two different datasets with regular expressions (grep): heart_women.csv and heart_men.csv.

# Rename acronyms with human-readable expressions using sed
sed -E '{
# Sex: sex of the patient [M: Male, F: Female]
s/,F,/,Female,/g
s/,M,/,Male,/g

# ChestPainType: chest pain type [TA: Typical Angina, ATA: Atypical Angina, NAP: Non-Anginal Pain, ASY: Asymptomatic]
s/,TA,/,Typical Angina,/g
s/,ATA,/,Atypical Angina,/g
s/,NAP,/,Non-Anginal Pain,/g
s/,ASY,/,Asymptomatic,/g

...

}' heart.csv
# Separate into two individual files by gender (Female and Male)
file_by_gender(){
    grep -E ',Female,' "$1" > heart_women.csv
    grep -E ',Male,' "$1" > heart_men.csv
}

# Call file_by_gender function with file name parameter
file_by_gender heart_temp.csv

This is the result, one csv file per gender:

heart_women.csv clean file.
heart_men.csv clean file.

Statistical Data Analysis


The next step in our shell scripting ETL pipeline, as part of the transformation layer, is to take the per-gender files created before and do some basic statistics and data analysis with them. We'll need the average, maximum and minimum values of the columns of each dataset, and we'll add the results as a single row to a third file called results.csv.

For that, I have used bash for the heart_men.csv data analysis and awk for heart_women.csv. Both are excellent tools for reading a file and processing data in a very fast and clean way. In the bash file I've created the function calculate_avg, passing it an array with the data to calculate as a parameter:

# Function to calculate the average of an array
function calculate_avg(){
    arr=("$@")
    tot=0

    # Sum all the (integer) values
    for i in "${arr[@]}"; do
        let tot+=$i
    done

    # Divide by the number of elements, keeping two decimals
    size=${#arr[@]}
    echo "scale=2; $tot/$size" | bc
}
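# The arrays passed to calculate_avg (age, asy, ta, maxhr, chol, restbp...)
# are filled beforehand; that part is not shown in the article. A minimal,
# assumed way to load one of them is to read a column of heart_men.csv into
# a bash array — Age is the first column and the split files have no header.
mapfile -t age < <(cut -d',' -f1 heart_men.csv)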

# Get the max and min age values
max_age=$(printf "%d\n" "${age[@]}" | sort -n | tail -1)
min_age=$(printf "%d\n" "${age[@]}" | sort -n | head -1)

# Create a row with the calculated stats and write it to the results.csv file
echo "M,\
$(calculate_avg "${age[@]}"),\
$max_age,\
$min_age,\
$(calculate_avg "${asy[@]}"),\
$(calculate_avg "${ta[@]}"),\
$(calculate_avg "${maxhr[@]}"),\
$(calculate_avg "${chol[@]}"),\
$(calculate_avg "${restbp[@]}"),\
$men,\
$men_1,\
$(echo "scale=2; $men_1*100/$men" | bc)" > results.csv
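The women's row is produced with awk instead of bash; that script isn't reproduced in the article, but a shortened sketch of the same idea could look like this, with column positions assumed from the original Kaggle layout (Age in column 1, RestingBP in 4, Cholesterol in 5, MaxHR in 8) and only part of the columns calculated:

# Sketch: averages, max and min for heart_women.csv in a single awk pass,
# appended to results.csv as the women's row
awk -F',' '
{
    sum_age += $1; sum_bp += $4; sum_chol += $5; sum_hr += $8
    if (NR == 1 || $1 > max_age) max_age = $1
    if (NR == 1 || $1 < min_age) min_age = $1
}
END {
    printf "F,%.2f,%d,%d,%.2f,%.2f,%.2f\n",
           sum_age/NR, max_age, min_age, sum_hr/NR, sum_chol/NR, sum_bp/NR
}' heart_women.csv >> results.csv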

The final outcome will be a csv file with the calculated results for both genders, each one in a different row.

Show me the results


At the end of the process we have to calculate and load the results.csv data. In our case we don't need a database; we display the results of the analysis in a static HTML file that can be viewed on a laptop and/or printed and handed to the doctors (in this case).

I have developed the last step in bash. The goal is to calculate and show differences in the analyzed data and display a simple, readable report with three rows: the men's data, the women's data, and the difference between them, in order to visualize interesting contrasts. The strategy here is to compute the element-wise difference between the two arrays in a nested for loop.
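The men and women arrays used below hold the two rows of results.csv; loading them isn't shown in the article, but a minimal sketch could be the following, assuming the men's row comes first (in practice the first element, the gender label, is skipped before subtracting):

# Split each results.csv row into a bash array on commas
IFS=',' read -r -a men < <(sed -n '1p' results.csv)
IFS=',' read -r -a women < <(sed -n '2p' results.csv)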

# Walk both arrays and, where the indexes match, subtract the women's
# value from the men's value, emitting one HTML table cell per result
for i in "${!men[@]}"; do
    for w in "${!women[@]}"; do
        if (($i == $w)); then
            a=$(echo "${men[$i]}-${women[$i]}" | bc -l)
            echo "<td>"
            echo "$a"
            echo "</td>"
        fi
    done
done
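The loop above only emits the cells of the difference row; a minimal sketch of how the whole table might be assembled around it (structure assumed, not necessarily the author's exact markup) could be:

# Build results.html: one table row per gender plus the difference row
{
    echo "<html><body><table border='1'>"
    echo "<tr>"; for v in "${men[@]}"; do echo "<td>$v</td>"; done; echo "</tr>"
    echo "<tr>"; for v in "${women[@]}"; do echo "<td>$v</td>"; done; echo "</tr>"
    echo "<tr>"
    # ... the nested difference loop shown above goes here ...
    echo "</tr>"
    echo "</table></body></html>"
} > results.html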

Finally, we have the results.html file:

results.html file

With only six lines: A Pipeline

The most interesting part is that all the code can be executed with only six lines in a run.sh file. It is a very fast way to achieve our objective: extracting the necessary information from the raw data, transforming it into valuable statistics and finally displaying the results in a static file.
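The exact run.sh is in the repository; a sketch of what those six lines could look like, assuming one script per stage (info.sh is the only name shown in the article, the rest are hypothetical):

#!/bin/bash
./info.sh -v heart.csv             # extract: basic dataset information
./preprocess.sh heart.csv          # transform: rename values, split by gender (hypothetical name)
./stats_men.sh heart_men.csv       # transform: men's statistics row (hypothetical name)
./stats_women.sh heart_women.csv   # transform: women's statistics row (hypothetical name)
./report.sh results.csv            # load: build the results.html report (hypothetical name)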

Conclusions


As you can see, developing a pipeline with shell scripting is a good way to automate the process and create a simple informational report. Scripting can be difficult to grasp at the beginning, but it becomes an excellent tool once you have a little practice with it.

You can see and test the code in my GitHub repository.
