Why “df.to_csv” Could Be a Mistake?

By Elfao, published in Analytics Vidhya, Apr 8, 2021 (5 min read)

Photo by Varvara Grabova on Unsplash

For a data scientist, and in data analysis more generally, loading and saving data (DataFrames) is something you do almost systematically.

Usually, I use df.to_csv('path/df.csv'), and I think almost all Pandas users do the same, or have done so at least once.

Most Python users know that saving data in CSV format is practical and easy, but it has some drawbacks, detailed below, that can turn this usage into a bad habit.
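To make this concrete, here is a minimal sketch of one commonly cited drawback: CSV is plain text, so column types are not stored in the file. The toy DataFrame and its column names (`when`, `label`, `value`) are hypothetical and chosen only for illustration, and an in-memory buffer stands in for a real path such as 'path/df.csv'.

```python
import io

import pandas as pd

# Hypothetical toy frame, used only for this illustration.
df = pd.DataFrame(
    {
        "when": pd.to_datetime(["2021-04-08", "2021-04-09"]),
        "label": pd.Categorical(["a", "b"]),
        "value": [1.5, 2.5],
    }
)

# Write to an in-memory buffer instead of a file on disk,
# then read it back with the default read_csv settings.
buffer = io.StringIO()
df.to_csv(buffer)
buffer.seek(0)
df_back = pd.read_csv(buffer)

# Compare the column types before and after the round trip.
print(df.dtypes)
print(df_back.dtypes)

# The index was written as an unnamed column and comes back as data.
print(df_back.columns.tolist())
```

With the default arguments, the datetime and categorical columns are not restored to their original types, and the index comes back as an extra column named 'Unnamed: 0'. Passing index=False to to_csv, or parse_dates and dtype to read_csv, works around this, but the type information has to be supplied by hand on every load.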

For the rest of this article, I will use a Pandas DataFrame, generated as follows, as an example.

from uuid import uuid4
import numpy as np
import pandas as pd

def generate_strings(n_rows, n_cols, taux_nan):
    """
    Generate a DataFrame of random string columns with missing values.
    taux_nan is the upper bound on the fraction of missing values per column.
    """
    df_ = pd.DataFrame()
    for col in range(n_cols):
        name = f'str_{col}'
        # One random UUID string per row
        cats = [str(uuid4()) for _ in range(n_rows)]
        values = np.array(cats, dtype=object)
        # Pick how many values to blank out (at least 1), then which rows
        nan_cnt = np.random.randint(1, int(taux_nan*n_rows))
        index = np.random.choice(n_rows, nan_cnt, replace=False)
        values[index] = np.nan
        df_[name] = values
    return df_


def generate_numeric(n_rows, n_cols, taux_nan):
    """
    Generate a DataFrame of numeric columns with missing values.
    """
    df_ = pd.DataFrame()
    for col in range(n_cols):
        name = f'num_{col}'
        nums…


Data scientist with 4 years of experience. I have worked in different fields, such as digital marketing and consulting, and I currently work for a start-up in finance.