Transforming Categorical Values to Numerics with Python-SQL and Pandas

In order to use machine learning packages like Python’s scikit learn, we have to ensure that all our data values are numerical or vectorized.

This guide will walk through how we :

  1. Connect and extract data from a DB using a Python JDBC call
  2. Transform the data with Pandas
  3. Write it back to the DB

Step 1: Connecting to DB

Create a script that connects to a DB. I am using Teradata as my database but the code is easily modifiable to be used for MySQL. Below code requires that you username and password are stored in your bash_profile or zshell.

Step 2: Extracting Data from DB

Now that we have a connection to the DB, we can start writing SQL queries to pull the relevant column values we want to modify.

Example Table:

We want to transform col1 and col2 into numerical values and keep the mapping in another table.

In the above code, we specify the original table that has the string values in the columns. We also specify a new table that is a copy of the original table for testing. A mapping table that will tell us which string the numerical values are mapped is also required.

In the function, we run a SQL query that selects all the distinct string values from each column and puts them into a list. It is also important to note that the returned values are unicoded and we have to convert them to string formatting for when we do inserting into the table later on.

The uniq_list looks like this:

>>> print uniq_list
>>> ['cat','dog','mouse']

Step 3: Mapping String to Integer Values with Pandas

Now that we have the unique list of strings, lets map that to integers and update the table (test_table) to replace the string values with the integers. We are using pandas factorize function to map the strings to integers.

Printing out s_enc gives us the result we want nicely. Note that the returned value is an array where the first item in the array is a list of integers and the second item in the array is a tuple.

>>> print s_enc
>>> (array([ 0, 1, 2]), Index(['cat', 'dog', 'mouse'], dtype = 'object'))

Step 4: Updating the Test_Table with Integer Values (efficiently)

Now we know we can write a simple SQL query to update a single row, but how can we do it for multiple rows AND multiple columns while using a CASE statement?

a typical update query to update one column will look like this:

update my_table SET my_col = CASE 
when my_col = 'some_string1' THEN 1
when my_col = 'some_string2' Then 2
.
.
when my_col = 'some_string1000' Then 1000..;

Assume we have hundreds of unique string values in uniq_list that we want to update and for multiple columns. Surely we don’t want to write the above manually for each column. Below, we will find out how to get around it.

Nice, we just generated a huge SQL query without having to manually type each column value for each CASE WHEN statements!

The Updated Table:

STEP 5: Create A Mapped Table for Reference

Remember the zipped list of tuples we had before that mapped the integers to their string values? We can now insert the tuples into our mapped reference table.

Reference Table:

Thats it for this tutorial on how to vectorize strings to integers and update Tables efficiently with python-sql :-)

If you have checked that the integers in the test_table correspond to the initial string values in the original table, you can update the original table easily by replacing instances of test_table above with orig_table.