No Code Needed: Efficiently Download, Clean, and Insert Files into Databases Using Command Line

How to write data from large files into a database?

Oleh Teslenko
T-slen
4 min readJul 26, 2024

--

Photo by John Matychuk on Unsplash

Although it sounds like a common task for modern developers, I want to show another view on this point. Instead of immediately starting to write code using your favorite language, importing libraries, and “reinventing the wheel”, I suggest thinking about alternative ways and trying to do it by writing none of the lines of code. Let me show how to use only the command line tools for this task. You would be surprised how powerful it could be sometimes.

Benefits of Using Command Line for Data Tasks

  1. Speed and Efficiency: Command-line tools execute tasks directly and with minimal overhead, often faster than equivalent scripts in Python or JavaScript.
  2. Simplicity: Many command-line commands are concise and straightforward, allowing you to accomplish tasks with minimal code.
  3. Resource Usage: Command-line tools are lightweight and consume less memory, eliminating the need to manage external libraries and dependencies.

Download the File

The easiest way to do this is to use the command curl. We use curl to download the file from a specified URL:

curl -o output_file.tar.gz "https://s3.amazonaws.com/"

As curl params, we set -o output_file.tar.gz (specifies the output filename) and URL.

Extract the File

We use tar command to extract the file:

tar -xzf output_file.tar.gz

As tar params, we set -xzf (to extract a gzipped tar file) and the output file path.

Clean the Data

We use sed (a stream editor) to remove single quotes from the CSV file:

sed "s/'//g" test.csv > cleaned_test.csv

As sed params, we set s/'//g (a regular expression, which you can change for your purpose) command that removes all single quotes from test.csv. As an output file for cleaned data, we use cleaned_test.csv.

Insert Data into MySQL

  1. The mysql command is used to load data into a MySQL database:
mysql -u user_name -p your_database_name -e "LOAD DATA LOCAL INFILE '/home/user/cleaned_test.csv' INTO TABLE test FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;"

Replace user_name, your_database_name, and other placeholders with your MySQL credentials and database details.

Automation

But how could it work together as well as our hand-made written code? The answer is a Bash (Unix shell).

As a command processor, Bash operates within a text window where users input commands to execute various tasks. It also supports the execution of commands from files, known as shell scripts, facilitating automation — Wikipedia

Here’s a full code snippet of the shell script that performs the tasks outlined:

#!/bin/bash

# URL of the file to download
URL="https://s3.amazonaws.com/"

# Desired output file name
OUTPUT_FILE="archive.tar.gz"

# Download the file with the specified name
curl -o $OUTPUT_FILE "$URL"

# Extract the file
tar -xzf $OUTPUT_FILE

# Use stream editor to remove single quotes from the file
sed "s/'//g" test.csv > cleaned_test.csv

# Insert the data into the database CH
clickhouse-client --host=200.12.12.12 --user=user_name --password='password' --query="INSERT INTO test.test_csv_table FORMAT CSVWithNames" < cleaned_test.csv

# Insert the data into the database MySQL
mysql -u user_name -p your_database_name -e "
LOAD DATA LOCAL INFILE '/home/user/cleaned_test.csv'
INTO TABLE test
FIELDS TERMINATED BY ','
ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;"

This shell script provides a straightforward way to automate the process of downloading a file, cleaning its content, and inserting it into the MySQL database. To execute this script run the next commands:

chmod +x index.sh
./index.sh

Useful links:

  1. My GitHub repository with the complete script and additional details.
  2. Bash Shell Scripting by Wikibooks Contributors.

Conclusion

By leveraging bash scripting and command-line tools, you can efficiently handle tasks like downloading, cleaning, and inserting data into a database without writing a single line of application code. This method not only saves development time but also utilizes the powerful features of Unix-based systems, making your workflow simpler and more efficient. So, the next time you face a similar challenge, consider reaching for your terminal instead of your IDE.

--

--