5 min BASICS

How to export a MySQL table with HTML pages into CSV and load them into Pandas

Today, you get the data sample from the production MySQL database. Tomorrow, you present the outcomes of your analysis. Check what the journey of the data sample looks like.

Bartłomiej Poniecki-Klotz
Data And Beyond

--

Three steps: install MySQL server, export data from MySQL database table to CSV, read CSV into Pandas data frame
Export from MySQL in CSV format and load into Pandas

Data Scientists love CSV. This format is easy to use — all you need is a text editor or a simple CLI tool like head. Yet we have all been there: the number of rows in the CSV differs from the number of rows in the Pandas data frame. We love CSV and hate it.

Data Engineers hate it. It has a few problems — size, encoding, and missing schema. Datasets stored in CSV format take a few times more disk space than binary formats like Parquet, and at petabyte scale that difference translates into significant cost. CSV does not provide a schema or column types, sometimes not even column names. On the other hand, splitting the data into chunks is effortless.

MySQL is the most frequently used Relational Database in the world. It sits behind many websites and transactional systems. WordPress, a leading blogging and website platform, uses MySQL as its database.

MySQL is the most popular database. Source: https://marketsplash.com/database-software-statistics/

Python is a key player in Data Science, while Pandas is the first choice for loading and scrubbing data. It has a flexible and powerful API, which allows data manipulation, cleaning and transformations. It also supports reading and writing in multiple formats, including CSV and Parquet.

In this blog, we follow the path of the data sample. First, set up a MySQL server and create a table with data. Next, extract data from it in CSV format. Last, ingest data from CSV into the Pandas data frame.

All the above in 5 minutes.

MySQL

Focus on MySQL from three steps: MySQL, CSV, Pandas in the sequence flow
First step — install MySQL server and insert data sample

The operating system I’m using is Ubuntu 22.04 LTS. If you use a different one, the main change is how you install MySQL.

Install the MySQL server and check if it is running correctly. The expected status is “Server is operational”. If you use a new VM in the cloud, you might need to start with an apt update.

$ sudo apt install mysql-server -y


$ sudo systemctl status mysql.service
● mysql.service - MySQL Community Server
Loaded: loaded (/lib/systemd/system/mysql.service; enabled; vendor preset: enabled)
Active: active (running) since Sun 2023-09-17 13:25:20 CEST; 42s ago
Process: 13688 ExecStartPre=/usr/share/mysql/mysql-systemd-start pre (code=exited, status=0/SUCCESS)
Main PID: 13696 (mysqld)
Status: "Server is operational"
Tasks: 39 (limit: 38007)
Memory: 365.9M
CPU: 739ms
CGroup: /system.slice/mysql.service
└─13696 /usr/sbin/mysqld


Sep 17 13:25:20 barteus-xps systemd[1]: Starting MySQL Community Server...
Sep 17 13:25:20 barteus-xps systemd[1]: Started MySQL Community Server

After installation, the user root is created without a password. You can log in using your OS authentication. If you want to log in from a different machine than the one you installed MySQL on, you must set a password for the root user.

$ sudo mysql

If the MySQL server is empty, create a database and tell MySQL CLI to use it for the next commands. Additionally, create a table with:

  • URL — the link to the website
  • htmlcontent — the HTML content of the page

The HTML content contains special characters like new lines and “/”.

mysql> CREATE DATABASE IF NOT EXISTS htmltest;
mysql> USE htmltest;


mysql> CREATE TABLE htmltest.websites (
-> URL VARCHAR(255) NOT NULL,
-> htmlcontent TEXT NOT NULL,
-> PRIMARY KEY (URL)
-> );

To extract a database table dump, you first need data. Insert a single row with URL and htmlcontent into MySQL. The table now holds exactly one row, so after loading the data into Pandas, the row count should also be one.

mysql> INSERT INTO htmltest.websites (URL, htmlcontent)
-> VALUES (
-> 'https://www.example.com',
-> '<html>
'> <head>
'> <title>Example Website</title>
'> </head>
'> <body>
'> <h1>Welcome to our website</h1>
'> <p>This is a sample HTML content with special characters like &lt; and &gt;.</p>
'> <p>You can visit our <a href="/about">About Us</a> page.</p>
'> </body>
'> </html>'
-> );


mysql> SELECT count(*) from htmltest.websites;
+----------+
| count(*) |
+----------+
| 1 |
+----------+
1 row in set (0.00 sec)
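The count(*) check above is the baseline for the whole journey: whatever the CSV and Pandas steps do, the row count at the end should match it. As a minimal sketch, here is the same round-trip sanity check in Python with a toy one-row frame (not the real dump yet — the CSV stays in memory):

```python
import io

import pandas as pd

# Toy one-row frame standing in for the websites table.
df = pd.DataFrame({
    "URL": ["https://www.example.com"],
    "htmlcontent": ["<html><body>Welcome</body></html>"],
})

# Round-trip through CSV in memory and compare row counts.
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

assert len(reloaded) == len(df)  # both sides report one row
```

With a single-line HTML snippet the round trip is lossless; the rest of the article shows what changes once the content spans multiple lines.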

CSV

Focus on CSV from three steps: MySQL, CSV, Pandas in the sequence flow, MySQL is marked green as finished
Second step — dump MySQL table into CSV format

You set up the database and filled it with some data. Now, extract the data from it into CSV. This process is usually called a dump, and you will use the mysqldump CLI program. Dumps generally come as text files with SQL statements or as CSV files, and they can take multiple GB of storage space.

Before you execute the command, you need to know that this process has two layers of security measures. First, your MySQL user needs access to the table you dump. Second, your OS-level user needs access to the folder where the dump is saved.

Check which folder is configured as the allowed storage location for dumps. Dumping into any other folder fails.

mysql> SELECT @@global.secure_file_priv;
+---------------------------+
| @@global.secure_file_priv |
+---------------------------+
| /var/lib/mysql-files/ |
+---------------------------+
1 row in set (0.00 sec)

You are ready to execute the mysqldump command. The below statement works only when executed on the machine that runs MySQL. The “-T” flag writes a tab-separated data file per table into the given folder, and the “-t” flag outputs only the data, without the SQL statement that creates the table; remove “-t” if you also want the table creation statement. The execution time depends on the amount of data in the table.

$ sudo mysqldump -t -T /var/lib/mysql-files/ htmltest websites

Pandas

Focus on Pandas from three steps: MySQL, CSV, Pandas in the sequence flow, MySQL and CSV are marked green as finished
Third step — load CSV dump into the Pandas data frame

The CSV file is ready and can be shared with the Data Scientist, who cannot wait to get their hands on it. One of the first choices is loading the data into Pandas. You can install it using pip.

There is a single row in the database, but nine are in the Pandas data frame. Something is wrong.

The CSV format, when mishandled, has issues with new lines: they are interpreted as new rows of data. During the dumping process, special characters are “escaped” with an escape character, which is “\\” by default. Pandas supports the same mechanism but uses a different default.
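This escaping behaviour is easy to reproduce in memory, without the database. A minimal sketch, assuming a string that mimics the tab-separated mysqldump output, with a backslash before each embedded new line:

```python
import io

import pandas as pd

# Two tab-separated fields; the second field contains new lines,
# each escaped with a backslash, as mysqldump writes TEXT columns.
raw = "https://www.example.com\t<html>\\\n<body>Welcome</body>\\\n</html>\n"

# Naive read: every physical line becomes a row (three rows here).
naive = pd.read_csv(io.StringIO(raw), sep="\t", header=None)

# With escapechar, the backslash keeps the new lines inside one field.
fixed = pd.read_csv(io.StringIO(raw), sep="\t", header=None, escapechar="\\")

print(naive.shape)  # (3, 2)
print(fixed.shape)  # (1, 2)
```

The escapechar="\\" read collapses the three physical lines back into a single logical record, which is exactly what the real dump file needs below.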

$ sudo apt install python3-pip -y

$ sudo pip install pandas

$ sudo python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.read_csv("/var/lib/mysql-files/websites.txt")
>>> df.shape
(9, 1)
>>> df.head()
https://www.example.com\t<html>\
0 <head>\
1 <title>Example Website</title>\
2 </head>\
3 <body>\
4 <h1>Welcome to our website</h1>\

This time, the effect is quite the opposite: zero rows instead of too many. The extracted CSV did not have a heading row, but Pandas, by default, treats the first row as a header, so the single data row was consumed as column names.

>>> df = pd.read_csv("/var/lib/mysql-files/websites.txt", escapechar="\\")
>>> df.shape
(0, 2)
>>> df.head()
Empty DataFrame
Columns: [https://www.example.com, <html>
<head>
<title>Example Website</title>
</head>
<body>
<h1>Welcome to our website</h1>
<p>This is a sample HTML content with special characters like &lt; and &gt;.</p>
<p>You can visit our <a href="/about">About Us</a> page.</p>
</body>
</html>]
Index: []
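The header default is easy to demonstrate on a tiny in-memory CSV. The same two lines give one data row with the default settings and two rows with header=None, whose columns are then auto-named 0 and 1:

```python
import io

import pandas as pd

csv_text = "https://www.example.com,<html>...</html>\nhttps://example.org,<p>hi</p>\n"

# Default: the first line is consumed as column names.
with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: every line is data; columns are auto-named 0 and 1.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.shape)  # (1, 2)
print(no_header.shape)    # (2, 2)
print(list(no_header.columns))  # [0, 1]
```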

There are two ways to fix it. While loading the data, you can declare that there is no header row, but then your columns will be named 0 and 1. Alternatively, you can provide the column names manually, like below.

>>> df = pd.read_csv("/var/lib/mysql-files/websites.txt", escapechar="\\", header=None, names=["URL", "htmlcontent"])
>>> df.shape
(1, 2)
>>> df.head()
URL htmlcontent
0 https://www.example.com <html>\n <head>\n <title>Exa...

The Pandas data frame is ready for experimentation.

Non-root user

What if I use a user other than root?

It’s a bit trickier. You need to:

  • Create a MySQL user with a password
  • Allow dumps into a folder that both the MySQL and OS users have access to
  • Change the ownership or permissions of the dump file so your Python program can open it

SQL dump to CSV

What if you get a MySQL dump in SQL format and want to change it into a CSV file? On the web, you can find a few scripts that help you do this, but you can also follow the guide above — instead of creating a new table manually, load the SQL dump.

Cleanup

When you finish playing with MySQL, it’s time to clean up. Remove the installed deb package and the folders with configurations and dumps.

$ sudo apt remove mysql-server -y
$ sudo rm -rf /etc/mysql
$ sudo rm -rf /var/lib/mysql

Summary

From MySQL to CSV, from CSV to Pandas

In many organisations, duties are separated between Database Admins (DBAs) and Data Scientists (DSs). DBAs are the only ones who execute queries on the running databases, while DSs work on extracted data.

In this tutorial, you did the job of both. Knowing both sides helps you understand each other and fosters collaboration.

More reading:

For more MLOps and Data Engineering Hands-on guides, tutorials and code examples, follow me on Medium and contact me via social media.
