Transforming Categorical Values to Numerics with Python-SQL and Pandas
In order to use machine learning packages like Python’s scikit learn, we have to ensure that all our data values are numerical or vectorized.
This guide will walk through how we :
- Connect and extract data from a DB using a Python JDBC call
- Transform the data with Pandas
- Write it back to the DB
Step 1: Connecting to DB
Create a script that connects to a DB. I am using Teradata as my database but the code is easily modifiable to be used for MySQL. Below code requires that you username and password are stored in your
Step 2: Extracting Data from DB
Now that we have a connection to the DB, we can start writing SQL queries to pull the relevant column values we want to modify.
We want to transform col1 and col2 into numerical values and keep the mapping in another table.
In the above code, we specify the original table that has the string values in the columns. We also specify a new table that is a copy of the original table for testing. A mapping table that will tell us which string the numerical values are mapped is also required.
In the function, we run a SQL query that selects all the distinct string values from each column and puts them into a list. It is also important to note that the returned values are unicoded and we have to convert them to string formatting for when we do inserting into the table later on.
uniq_list looks like this:
>>> print uniq_list
Step 3: Mapping String to Integer Values with Pandas
Now that we have the unique list of strings, lets map that to integers and update the table (
test_table) to replace the string values with the integers. We are using pandas
factorize function to map the strings to integers.
s_enc gives us the result we want nicely. Note that the returned value is an array where the first item in the array is a list of integers and the second item in the array is a tuple.
>>> print s_enc
>>> (array([ 0, 1, 2]), Index(['cat', 'dog', 'mouse'], dtype = 'object'))
Step 4: Updating the
Test_Table with Integer Values (efficiently)
Now we know we can write a simple SQL query to update a single row, but how can we do it for multiple rows AND multiple columns while using a
a typical update query to update one column will look like this:
update my_table SET my_col = CASE
when my_col = 'some_string1' THEN 1
when my_col = 'some_string2' Then 2
when my_col = 'some_string1000' Then 1000..;
Assume we have hundreds of unique string values in
uniq_list that we want to update and for multiple columns. Surely we don’t want to write the above manually for each column. Below, we will find out how to get around it.
Nice, we just generated a huge SQL query without having to manually type each column value for each
CASE WHEN statements!
The Updated Table:
STEP 5: Create A Mapped Table for Reference
Remember the zipped list of tuples we had before that mapped the integers to their string values? We can now insert the tuples into our mapped reference table.
Thats it for this tutorial on how to vectorize strings to integers and update Tables efficiently with python-sql :-)
If you have checked that the integers in the test_table correspond to the initial string values in the original table, you can update the original table easily by replacing instances of
test_table above with