No Code Needed: Efficiently Download, Clean, and Insert Files into Databases Using Command Line
How to write data from large files into a database?
Although it sounds like a common task for modern developers, I want to show another view on this point. Instead of immediately starting to write code using your favorite language, importing libraries, and “reinventing the wheel”, I suggest thinking about alternative ways and trying to do it by writing none of the lines of code. Let me show how to use only the command line tools for this task. You would be surprised how powerful it could be sometimes.
Benefits of Using Command Line for Data Tasks
- Speed and Efficiency: Command-line tools execute tasks directly and with minimal overhead, often faster than equivalent scripts in Python or JavaScript.
- Simplicity: Many command-line commands are concise and straightforward, allowing you to accomplish tasks with minimal code.
- Resource Usage: Command-line tools are lightweight and consume less memory, eliminating the need to manage external libraries and dependencies.
Download the File
The easiest way to do this is to use the command curl
. We use curl
to download the file from a specified URL:
curl -o output_file.tar.gz "https://s3.amazonaws.com/"
As curl
params, we set -o output_file.tar.gz
(specifies the output filename) and URL.
Extract the File
We use tar
command to extract the file:
tar -xzf output_file.tar.gz
As tar
params, we set -xzf
(to extract a gzipped tar file) and the output file path.
Clean the Data
We use sed
(a stream editor) to remove single quotes from the CSV file:
sed "s/'//g" test.csv > cleaned_test.csv
As sed
params, we set s/'//g
(a regular expression, which you can change for your purpose) command that removes all single quotes from test.csv
. As an output file for cleaned data, we use cleaned_test.csv
.
Insert Data into MySQL
- The
mysql
command is used to load data into a MySQL database:
mysql -u user_name -p your_database_name -e "LOAD DATA LOCAL INFILE '/home/user/cleaned_test.csv' INTO TABLE test FIELDS TERMINATED BY ',' ENCLOSED BY '\"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;"
Replace user_name
, your_database_name
, and other placeholders with your MySQL credentials and database details.
Automation
But how could it work together as well as our hand-made written code? The answer is a Bash (Unix shell).
As a command processor, Bash operates within a text window where users input commands to execute various tasks. It also supports the execution of commands from files, known as shell scripts, facilitating automation — Wikipedia
Here’s a full code snippet of the shell script that performs the tasks outlined:
#!/bin/bash
# URL of the file to download
URL="https://s3.amazonaws.com/"
# Desired output file name
OUTPUT_FILE="archive.tar.gz"
# Download the file with the specified name
curl -o $OUTPUT_FILE "$URL"
# Extract the file
tar -xzf $OUTPUT_FILE
# Use stream editor to remove single quotes from the file
sed "s/'//g" test.csv > cleaned_test.csv
# Insert the data into the database CH
clickhouse-client --host=200.12.12.12 --user=user_name --password='password' --query="INSERT INTO test.test_csv_table FORMAT CSVWithNames" < cleaned_test.csv
# Insert the data into the database MySQL
mysql -u user_name -p your_database_name -e "
LOAD DATA LOCAL INFILE '/home/user/cleaned_test.csv'
INTO TABLE test
FIELDS TERMINATED BY ','
ENCLOSED BY '\"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;"
This shell script provides a straightforward way to automate the process of downloading a file, cleaning its content, and inserting it into the MySQL database. To execute this script run the next commands:
chmod +x index.sh
./index.sh
Useful links:
- My GitHub repository with the complete script and additional details.
- Bash Shell Scripting by Wikibooks Contributors.
Conclusion
By leveraging bash scripting and command-line tools, you can efficiently handle tasks like downloading, cleaning, and inserting data into a database without writing a single line of application code. This method not only saves development time but also utilizes the powerful features of Unix-based systems, making your workflow simpler and more efficient. So, the next time you face a similar challenge, consider reaching for your terminal instead of your IDE.
Thank you for reading. If you have any questions you can find me on LinkedIn. Feel free to leave comments. You can also read my previous articles: