Extracting and Transforming Data in Python

In order to get insights from data, you have to play with it a little.

--

It is important to be able to extract, filter, and transform data from DataFrames in order to drill into the data that really matters. The pandas library has many techniques that make this process efficient and intuitive, and in today's article I will walk through those techniques with code samples and explanations. Let's get started.

For this article I've created a sample DataFrame filled with random numbers to play with. We will use this data in the examples throughout the article.

import pandas as pd
import numpy as np
cols = ['col0', 'col1', 'col2', 'col3', 'col4']
rows = ['row0', 'row1', 'row2', 'row3', 'row4']
data = np.random.randint(0, 100, size=(5, 5))
df = pd.DataFrame(data, columns=cols, index=rows)
df.head()
Out[2]:
col0 col1 col2 col3 col4
row0 24 78 42 7 96
row1 40 4 80 12 84
row2 83 17 80 26 15
row3 92 68 58 93 33
row4 78 63 35 70 95

Indexing DataFrames

To extract data from a pandas DataFrame we can use direct indexing or accessors. We can select the rows and columns we need using their labels:

df['col1']['row1']
Out[3]: 4

Please note the order in this type of indexing: first you specify the column label and then the row. But the truth is, such small datasets are rare, and in real life we work with much heavier machinery. It is much better to select data using the accessors .loc and .iloc. The difference between them is that .loc accepts labels while .iloc accepts integer positions. Also, when we use accessors, we specify rows first and then columns. It took me a while to get used to that in the beginning (SQL background, what else can you say).

So, to select a single value using accessors you’d do the following:

df.loc['row4', 'col2']
Out[4]: 35
df.iloc[4, 2]
Out[5]: 35

Using indexing we can select a single value, a Series, or a DataFrame from a DataFrame (sorry for the tautology). Above I have shown how to select a single value.

To select a few columns, just pass a nested list of their labels and a DataFrame will be returned:

df_new = df[['col1','col2']]
df_new.head(3)

Out[6]:
col1 col2
row0 78 42
row1 4 80
row2 17 80

If you also want to select specific rows, add their indexes and you will get a DataFrame again. This technique is called slicing, and it is covered in more detail below.

df_new = df[['col1','col2']][1:4]
df_new.head(3)

Out[7]:
col1 col2
row1 4 80
row2 17 80
row3 68 58

To select a Series you have to select a single column with all rows or a range of rows. Each line of code below produces the same output:

df['col0']
df.loc[:,'col0']
df.iloc[:, 0]

Out[8]:
row0 24
row1 40
row2 83
row3 92
row4 78
Name: col0, dtype: int32

The colon means that we want to select all rows or all columns: df.loc[:, :] or df.iloc[:, :] will return all values. And that slowly brings us to slicing, selecting specific ranges from our data. To slice a Series you just add a range of the rows you want to select, using their indexes:

df['col3'][2:5]

Out[12]:
row2 26
row3 93
row4 70
Name: col3, dtype: int32

And don't forget about ranges in Python: the first element is included, the second excluded, and indexes start from 0. So the code above returns the rows with indexes 2, 3 and 4.

Slicing DataFrames works the same way, with just one nuance: when using .loc (labels), both borders are included. For example, select rows from label 'row1' to label 'row4', or from row index 1 to index 4, and all columns:

df.loc['row1':'row4', :]
Out[20]:
col0 col1 col2 col3 col4
row1 40 4 80 12 84
row2 83 17 80 26 15
row3 92 68 58 93 33
row4 78 63 35 70 95

df.iloc[1:4, :]
Out[21]:
col0 col1 col2 col3 col4
row1 40 4 80 12 84
row2 83 17 80 26 15
row3 92 68 58 93 33

The first line of code above selected row1, row2, row3 and row4, while the second selected only row1, row2 and row3. A few more examples below.

Select columns from label ‘col1’ to label ‘col4’ or from column index 1 to index 4 and all rows:

df.loc[:, 'col1':'col4']
Out[22]:
col1 col2 col3 col4
row0 78 42 7 96
row1 4 80 12 84
row2 17 80 26 15
row3 68 58 93 33
row4 63 35 70 95
df.iloc[:, 1:4]
Out[23]:
col1 col2 col3
row0 78 42 7
row1 4 80 12
row2 17 80 26
row3 68 58 93
row4 63 35 70

Select rows from label ‘row1’ to label ‘row4’ or from row index 1 to index 4 and columns from label ‘col1’ to label ‘col4’ or from column index 1 to index 4:

df.loc['row1':'row4', 'col1':'col4']
Out[24]:
col1 col2 col3 col4
row1 4 80 12 84
row2 17 80 26 15
row3 68 58 93 33
row4 63 35 70 95

df.iloc[1:4,1:4]
Out[25]:
col1 col2 col3
row1 4 80 12
row2 17 80 26
row3 68 58 93

Use a list to select specific columns or rows that are not in a range.

df.loc['row2':'row4', ['col1','col3']]
Out[28]:
col1 col3
row2 17 26
row3 68 93
row4 63 70
df.iloc[[2,4], 0:4]
Out[30]:
col0 col1 col2 col3
row2 83 17 80 26
row4 78 63 35 70
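When you know a label but need its integer position (or vice versa), Index.get_loc bridges .loc and .iloc. A small self-contained sketch (with its own deterministic DataFrame rather than the random one above):

```python
import numpy as np
import pandas as pd

cols = ['col0', 'col1', 'col2', 'col3', 'col4']
rows = ['row0', 'row1', 'row2', 'row3', 'row4']
demo = pd.DataFrame(np.arange(25).reshape(5, 5), columns=cols, index=rows)

# Convert a column label to its integer position
pos = demo.columns.get_loc('col2')   # -> 2

# The same cell is now reachable through either accessor
assert demo.loc['row4', 'col2'] == demo.iloc[4, pos]
```

The same trick works on the row axis via demo.index.get_loc('row4').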

Filtering DataFrames

Filtering is a more general tool that selects parts of the data based on properties of the data itself rather than on indexes or labels. DataFrames have several methods for filtering; the underlying idea for all of them is a Boolean Series. The expression df['col1'] > 20 (we assume col1 is of integer type) returns a Boolean Series that is True wherever the condition holds. I will repeat the output of the .head() method here, so you don't need to scroll up to match the numbers.

Out[2]: 
col0 col1 col2 col3 col4
row0 24 78 42 7 96
row1 40 4 80 12 84
row2 83 17 80 26 15
row3 92 68 58 93 33
row4 78 63 35 70 95
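To see what such a Boolean Series looks like on its own, here is a minimal sketch with fixed numbers (the col1 values from above) instead of random ones:

```python
import pandas as pd

# A small fixed DataFrame with the col1 values used above
demo = pd.DataFrame({'col1': [78, 4, 17, 68, 63]},
                    index=['row0', 'row1', 'row2', 'row3', 'row4'])

# The comparison itself returns a Boolean Series aligned with the index
mask = demo['col1'] > 20   # True for row0, row3, row4

# Passing the mask back into the DataFrame keeps only the True rows
filtered = demo[mask]
```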

So, to select part of a DataFrame where values of col1 are bigger than 20 we will use the following code:

df[df['col1'] > 20]
# assigning variable also works
condition = df['col1'] > 20
df[condition]

Out[31]:
col0 col1 col2 col3 col4
row0 24 78 42 7 96
row3 92 68 58 93 33
row4 78 63 35 70 95

We can combine these filters using the standard logical operators (and: &, or: |, not: ~). Note the use of parentheses in these operations.

df[(df['col1'] > 25) & (df['col3'] < 30)] # logical and
Out[33]:
col0 col1 col2 col3 col4
row0 24 78 42 7 96

df[(df['col1'] > 25) | (df['col3'] < 30)] # logical or
Out[34]:
col0 col1 col2 col3 col4
row0 24 78 42 7 96
row1 40 4 80 12 84
row2 83 17 80 26 15
row3 92 68 58 93 33
row4 78 63 35 70 95
df[~(df['col1'] > 25)] # logical not
Out[35]:
col0 col1 col2 col3 col4
row1 40 4 80 12 84
row2 83 17 80 26 15
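One more filtering tool worth knowing, although it is not used elsewhere in this article, is .isin(), which keeps the rows whose values appear in a given list:

```python
import pandas as pd

demo = pd.DataFrame({'col1': [78, 4, 17, 68, 63]},
                    index=['row0', 'row1', 'row2', 'row3', 'row4'])

# Keep only the rows whose col1 value appears in the list
subset = demo[demo['col1'].isin([4, 63])]
print(list(subset.index))   # ['row1', 'row4']
```

This is much cleaner than chaining many == comparisons with |.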

Dealing with 0 and NaN values

Almost every dataset has zero or NaN values, and we definitely want to know where they are. Ours is too tidy, so we will modify it a little:

df.iloc[3, 3] = 0
df.iloc[1, 2] = np.nan
df.iloc[4, 0] = np.nan
df['col5'] = 0
df['col6'] = np.nan
df.head()

Out[57]:
col0 col1 col2 col3 col4 col5 col6
row0 24.0 78 42.0 7 96 0 NaN
row1 40.0 4 NaN 12 84 0 NaN
row2 83.0 17 80.0 26 15 0 NaN
row3 92.0 68 58.0 0 33 0 NaN
row4 NaN 63 35.0 70 95 0 NaN

To select columns that don't have any zero values we can use the .all() method (every value is truthy). Note that col6 survives even though it is all NaNs: .all() skips NaNs by default.

df.loc[:, df.all()]

Out[43]:
col0 col1 col2 col4 col6
row0 24.0 78 42.0 96 NaN
row1 40.0 4 NaN 84 NaN
row2 83.0 17 80.0 15 NaN
row3 92.0 68 58.0 33 NaN
row4 NaN 63 35.0 95 NaN

If we want to find columns that have at least one nonzero (any) value, this will help:

df.loc[:, df.any()]

Out[47]:
col0 col1 col2 col3 col4
row0 24.0 78 42.0 7 96
row1 40.0 4 NaN 12 84
row2 83.0 17 80.0 26 15
row3 92.0 68 58.0 0 33
row4 NaN 63 35.0 70 95

To select columns with any NaN:

df.loc[:, df.isnull().any()]

Out[48]:
col0 col2 col6
row0 24.0 42.0 NaN
row1 40.0 NaN NaN
row2 83.0 80.0 NaN
row3 92.0 58.0 NaN
row4 NaN 35.0 NaN

Select columns without NaNs:

df.loc[:, df.notnull().all()]

Out[49]:
col1 col3 col4 col5
row0 78 7 96 0
row1 4 12 84 0
row2 17 26 15 0
row3 68 0 33 0
row4 63 70 95 0

We can drop the rows or columns containing NaNs, but it's a dangerous game: dropping usually isn't a solution. You have to understand your data and deal with such values wisely. I warned you.

Are you sure you wanna know it? OK.. 😀

df.dropna(how='all', axis=1) # if all values in a column are NaN it will be dropped
Out[69]:
col0 col1 col2 col3 col4 col5
row0 24.0 78 42.0 7 96 0
row1 40.0 4 NaN 12 84 0
row2 83.0 17 80.0 26 15 0
row3 92.0 68 58.0 0 33 0
row4 NaN 63 35.0 70 95 0

df.dropna(how='any', axis=1) # if any value in a column is NaN the column will be dropped
Out[71]:
col1 col3 col4 col5
row0 78 7 96 0
row1 4 12 84 0
row2 17 26 15 0
row3 68 0 33 0
row4 63 70 95 0

These methods do not modify the original DataFrame, so to continue working with the filtered data you have to assign the result to a new DataFrame (we will keep the original df intact for the examples below) or reassign it to the existing one:

df_clean = df.dropna(how='any', axis=1)
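A gentler alternative to dropping is filling the gaps: .fillna() replaces every NaN with a value of your choice (0 here purely for illustration; the right replacement depends on your data). A minimal self-contained sketch:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'col0': [24.0, 40.0, np.nan],
                     'col2': [42.0, np.nan, 80.0]})

# Replace every NaN with 0 instead of dropping rows or columns
filled = demo.fillna(0)

# No NaNs remain after filling
assert filled.isnull().sum().sum() == 0
```

Like .dropna(), .fillna() returns a new DataFrame rather than modifying the original in place.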

The beauty of filtering is that we can actually select or modify the values of one column based on another. For example, we can select the values from col1 where col2 is greater than 35 and update those values by adding 5 to each:

# Find a column based on another
df['col1'][df['col2'] > 35]
Out[74]:
row0 78
row2 17
row3 68
Name: col1, dtype: int32


# Modify a column based on another
df['col1'][df['col2'] > 35] += 5
df['col1']
Out[77]:
row0 83
row1 4
row2 22
row3 73
row4 63
Name: col1, dtype: int32
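A note of caution: the chained assignment above (df['col1'][...] += 5) worked here, but depending on your pandas version it can trigger a SettingWithCopyWarning or even fail to update the original DataFrame. A safer equivalent, sketched on its own small DataFrame, is a single .loc call:

```python
import pandas as pd

demo = pd.DataFrame({'col1': [78, 4, 17, 68, 63],
                     'col2': [42, 80, 80, 58, 35]},
                    index=['row0', 'row1', 'row2', 'row3', 'row4'])

# One .loc call: rows where col2 > 35, column col1, updated in place
demo.loc[demo['col2'] > 35, 'col1'] += 5

print(demo['col1'].tolist())   # [83, 9, 22, 73, 63]
```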

And this brings us to the next part –

Transforming DataFrames

Once we have selected or filtered our data, we want to transform it somehow. The best way to do this is with the methods built into DataFrames or with NumPy universal functions, which transform an entire column of data element-wise. Examples would be pandas' .floordiv() method (from the documentation: 'Integer division of dataframe and other, element-wise') or NumPy's .floor_divide() (doc: 'Return the largest integer smaller or equal to the division of the inputs.').
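As a quick sketch of those two (on a small DataFrame of my own), both lines below divide every element by 10 and round down, producing identical results:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({'a': [24, 40, 83], 'b': [7, 12, 26]})

pd_result = demo.floordiv(10)          # pandas method
np_result = np.floor_divide(demo, 10)  # NumPy ufunc applied to the DataFrame

# Both produce the same element-wise floor division
assert pd_result.equals(np_result)
```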

If such a function were not available, we could write our own and use it with the .apply() method.

def some_func(x):
    return x * 2

df.apply(some_func)        # update each entry of a DataFrame without any loops
df.apply(lambda n: n * 2)  # a lambda also works

These methods do not modify the DataFrame in place; they return the transformed data, so we have to store it explicitly:

df['new_col'] = df['col4'].apply(lambda n: n*2)
df.head()

Out[82]:
col0 col1 col2 col3 col4 col5 col6 new_col
row0 24.0 83 42.0 7 96 0 NaN 192
row1 40.0 4 NaN 12 84 0 NaN 168
row2 83.0 22 80.0 26 15 0 NaN 30
row3 92.0 73 58.0 0 33 0 NaN 66
row4 NaN 63 35.0 70 95 0 NaN 190

If the index is made of strings, it has a .str accessor that lets us modify the entire index at once:

df.index.str.upper()
Out[83]: Index(['ROW0', 'ROW1', 'ROW2', 'ROW3', 'ROW4'], dtype='object')

Also, we cannot use the .apply() method on an index; the alternative is .map():

df.index = df.index.map(str.lower)
df.index
Out[85]: Index(['row0', 'row1', 'row2', 'row3', 'row4'], dtype='object')

But .map() can be used on columns as well. For example:

# Create the dictionary: red_vs_blue
red_vs_blue = {0:'blue', 12:'red'}

# Use the dictionary to map the 'col3' column to the new column df['color']
df['color'] = df['col3'].map(red_vs_blue)
df.head()

Out[92]:
col0 col1 col2 col3 col4 col5 col6 new_col color
row0 24.0 83 42.0 7 96 0 NaN 192 NaN
row1 40.0 4 NaN 12 84 0 NaN 168 red
row2 83.0 22 80.0 26 15 0 NaN 30 NaN
row3 92.0 73 58.0 0 33 0 NaN 66 blue
row4 NaN 63 35.0 70 95 0 NaN 190 NaN

Arithmetic operations on Series and DataFrames work element-wise out of the box. The expression below creates a new column where each value with index n is the sum of the values with index n from 'col3' and 'col4'.

df['col7'] = df['col3'] + df['col4'] 
df.head()

Out[94]:
col0 col1 col2 col3 col4 col5 col6 new_col color col7
row0 24.0 83 42.0 7 96 0 NaN 192 NaN 103
row1 40.0 4 NaN 12 84 0 NaN 168 red 96
row2 83.0 22 80.0 26 15 0 NaN 30 NaN 41
row3 92.0 73 58.0 0 33 0 NaN 66 blue 33
row4 NaN 63 35.0 70 95 0 NaN 190 NaN 165

This is the second version of the article, because the first one was a complete mess: errors in the code, no examples and a few other things. Thanks to the feedback, I went through the article one more time and I think it looks much better now. I have covered the basics of extracting and transforming data in Python with code snippets and examples, and hopefully it will be useful for people who are just starting their path in this field.

Meanwhile, love data science and smile more. We have to stay positive, as we have the sexiest job of the 21st century 😀

Originally published at sergilehkyi.com.

Published in Towards Data Science

Written by Sergi Lehkyi