Python & MySQL: baby-steps with pymysql

Pyariksha Tiluk · The Startup · Jun 29, 2019

In my current gig, a main focus is data processing and analysis. Source files are acquired from various systems in csv format, pre-processed and finally loaded through an off-the-shelf ETL tool.

I decided to play around with an alternative method for my data analysis, one that uses Jupyter Notebook alongside MySQL Workbench. The aim is a simple way to get data from raw csv files into a secure database for storage and added data control, whilst harnessing the power and ease of Python to analyse and pre-process that data.

Now to the point of this story: I have included seven steps for using the pymysql module, which I hope will assist any beginner along their journey with data analysis.

Before we get to the code: it is extremely important to “get a feel” for the data in your csv file and understand, for example, what the relevant field names would be, which fields should be dropped or transformed, whether calculated fields are required and, most importantly, what the values mean in your business context. This understanding will form the foundation of your data analysis going forward. If you are a business analyst like I am, you would usually get this information from a workshop with your business stakeholders or from the user stories and requirement specifications.

In this example, I have used generic column names “field1, field2, field3, field4, field5” with generic data types for simplicity.

I will only touch on the step which involves creating a MySQL database and table in MySQL Workbench (which I refer to as “myDB” and “table1”, respectively, in this example). If you are new to creating databases and tables, the query below should suffice, but it would be extremely helpful to read further on this topic and, in particular, to focus on choosing relevant data types in MySQL. There are many MySQL courses available online for free, and Google is your friend!
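A minimal sketch of such a query, using the generic field names from this example; the data types here are assumptions, so adjust them to match your actual csv:

CREATE DATABASE myDB;
USE myDB;

-- Generic columns and data types purely for illustration
CREATE TABLE table1 (
    field1 VARCHAR(100),
    field2 VARCHAR(100),
    field3 INT,
    field4 DECIMAL(10, 2),
    field5 DATE
);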

Before I start with the pymysql module in Jupyter Notebook, it is important to mention the basics of reading the csv file into a pandas dataframe which I have named “df_new_obs” in the example below. Don’t forget to import pandas!
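A minimal sketch of that step; the csv file name here is an assumption for illustration:

import pandas as pd

# Read the raw csv file into a pandas dataframe
df_new_obs = pd.read_csv('new_observations.csv')

# Quick sanity checks: preview the first rows and list the columns
print(df_new_obs.head())
print(df_new_obs.columns)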

Now to the main-course, bon appétit!

The steps below can be followed to create a connection to your MySQL database (“myDB” in this example) and insert data from your csv file, which was read into the pandas dataframe (“df_new_obs”), into the table you have created in MySQL (“table1”).

I will dissect each step in further detail:

Steps 1 & 2: Connecting to the DB and .execute()

Firstly, we need to import the pymysql module in Jupyter Notebook. If you have not inferred this already, you will need to install pymysql first, for example on Anaconda Prompt, by running the command: pip install pymysql.

Secondly, importing the json module will allow us to use a json file to pass credentials in a secure and private manner when we access the database connection. Alternatively, you can just enter your username and password directly in the connection.

Once we have run the code to extract the user and password from the json file into variables using the json.load() method, we can create the connection (“conn”) between Jupyter and MySQL. Aaahh, success!

Finally, the cursor needs to be created so that we can execute queries in MySQL. Below is a sketch of the code you can run to execute MySQL queries directly from Jupyter, either by writing the select statement in directly or by assigning it to a variable like “query” and then passing it as an argument to cursor.execute().
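A minimal sketch of steps 1 and 2; the credentials file name, its key names and the host are assumptions for illustration:

import pymysql
import json

# Load the database credentials from a local json file
# (file name and key names are assumptions for illustration)
with open('credentials.json') as f:
    creds = json.load(f)

user = creds['user']
password = creds['password']

# Step 1: create the connection between Jupyter and MySQL ("myDB")
conn = pymysql.connect(host='localhost',
                       user=user,
                       password=password,
                       db='myDB')

# Step 2: create the cursor so we can execute queries
cursor = conn.cursor()

# Either write the select statement in directly...
cursor.execute('SELECT * FROM table1')

# ...or assign it to a variable and pass that to cursor.execute()
query = 'SELECT * FROM table1'
cursor.execute(query)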

The output of the executed query will be the number of records retrieved from MySQL; however, in this example the output will be 0, as our table is empty until we insert our csv data from df_new_obs! Let's get to it…

Step 3: Insert Query

OK! So this is less of a “baby-step” and more of a “fully-grown adult leap”… one seemingly made by an Olympic, gold-medal-worthy long-jump athlete if you have not had enough exposure to loops in coding. I suggest a Python programming course for beginner to intermediate level if the code below is unfamiliar.

This, however, will be invaluable in getting any data from pandas dataframes into a SQL database so it is worth the effort.

The MySQL statement to insert data follows the pattern “INSERT INTO tablename VALUES (value1, value2, …), (value1, value2, …), (…and so on)”.

My example csv has only 40 rows, but most csvs will have thousands or hundreds of thousands of rows. It would be highly inefficient to enter each value into this query manually, and the result would be riddled with human error. Therefore, we should extract the rows from “df_new_obs” and use a for loop in Python, shown below, to add the SQL syntax that we need for this insert_query to execute in MySQL: the comma separators, apostrophes and brackets.
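A minimal sketch of such a loop, assuming every value can simply be wrapped in apostrophes (for real data you would format numbers, dates and strings according to their MySQL data types):

# Build the VALUES portion of the insert statement row by row
insert_query = 'INSERT INTO table1 VALUES '

for index, row in df_new_obs.iterrows():
    # Wrap each value in apostrophes and separate the values with commas
    values = ", ".join("'" + str(value) + "'" for value in row)
    insert_query += '(' + values + '), '

# Remove the trailing comma and space left by the final loop iteration
insert_query = insert_query.rstrip(', ')

print(insert_query)

Building the statement as a plain string keeps the example simple; if your values may contain apostrophes or come from untrusted sources, pymysql's parameterised cursor.executemany() with %s placeholders is the safer route.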

Once the for loop is complete and you print “insert_query”, your output will look something like this:

Printed insert_query
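With the generic fields in this example, the printed query would look roughly like the line below (the values are invented purely for illustration):

INSERT INTO table1 VALUES ('A', 'B', '1', '2.5', '2019-06-29'), ('C', 'D', '2', '3.7', '2019-06-30'), … and so on for all 40 rows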

Perfect SQL syntax for the INSERT statement! This is now ready to execute against table1 in MySQL.

Steps 4, 5, 6 & 7: execute query, commit and close connections

Once the data is inserted into table1 in myDB, you can run a simple select * statement to check that the records retrieved are no longer 0 but now equal the number of rows in your csv/dataframe. In my case this would be 40.

Furthermore, it is important to commit your changes to the database, otherwise the insertion of data will not be permanent.

Finally, close the cursor and the database connection. This is important as we don’t want to leave connections open, and if you want to play around with your data in MySQL Workbench afterwards you will need to have closed the connection in Jupyter.
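Putting steps 4 to 7 together, a minimal sketch that continues with the conn, cursor and insert_query names from above:

# Step 4: execute the insert statement built in the previous step
cursor.execute(insert_query)

# Check that the records retrieved now match the rows in the dataframe;
# cursor.execute() returns the number of rows retrieved (40 in this example)
print(cursor.execute('SELECT * FROM table1'))

# Step 5: commit so the inserted data is permanent
conn.commit()

# Steps 6 & 7: close the cursor and the database connection
cursor.close()
conn.close()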

Final thoughts…

In addition to pymysql, there are cx_Oracle for Oracle databases, sqlite3 for SQLite and psycopg2 for PostgreSQL, among many other modules (SQLAlchemy is another popular option that works across several of these databases). There are slight differences between them but overall they work in much the same way.

Working with data in relational databases is a vital skill for any analyst, and integrating this with Jupyter Notebook gives you access to the benefits of databases as well as the analysis power of Python. Most companies still operate with relational databases and require SQL as a skill for most data analyst candidates, so I believe it is a valuable skill set to obtain and to merge with your existing knowledge.

Go forth and analyse!
