Why “df.to_csv” Could Be a Mistake?
For a data scientist, and in data analysis more broadly, loading and saving data (DataFrames) is an almost systematic task.
Usually, I use df.to_csv('path/df.csv')
and I think that almost all Pandas users do the same, or have done so at least once.
Most Python users know that saving data in CSV format is practical and easy, but it has some drawbacks, detailed below, that can turn this usage into a bad habit.
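To make the habit concrete, here is a minimal sketch of a CSV save-and-reload round trip. The column names and the temporary path are made up for illustration. It only shows two behaviours of the pandas defaults: the index is written as an extra column, and a datetime column does not come back as a datetime.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy DataFrame, used only for this illustration
df = pd.DataFrame({
    "when": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "value": [1.5, 2.5],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "df.csv")
    df.to_csv(path)               # default index=True also writes the index
    reloaded = pd.read_csv(path)  # default settings, no parse_dates

# The index comes back as an extra column named "Unnamed: 0"
print(list(reloaded.columns))

# "when" was datetime64 before saving and is no longer datetime64 after reloading
print(df["when"].dtype, reloaded["when"].dtype)
```

Passing index=False to to_csv and parse_dates=["when"] to read_csv avoids both effects here, but it means the reader has to know the schema in advance.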
For the rest of this article, I will generate a Pandas DataFrame as follows and use it as a running example.
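from uuid import uuid4

import numpy as np
import pandas as pd


def generate_strings(n_rows, n_cols, taux_nan):
    """
    Generate a DataFrame of random string columns with some values set to NaN.
    taux_nan is the upper bound on the fraction of NaN values per column.
    """
    df_ = pd.DataFrame()
    for col in range(n_cols):
        name = f'str_{col}'
        # One random UUID string per row
        cats = [str(uuid4()) for _ in range(n_rows)]
        values = np.array(cats, dtype=object)
        # Number of NaNs to inject; assumes int(taux_nan*n_rows) >= 2,
        # otherwise np.random.randint raises a ValueError
        nan_cnt = np.random.randint(1, int(taux_nan*n_rows))
        # Pick distinct row positions and overwrite them with NaN
        index = np.random.choice(n_rows, nan_cnt, replace=False)
        values[index] = np.nan
        df_[name] = values
    return df_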

def generate_numeric(n_rows, n_cols, taux_nan):
    """
    Generate a DataFrame of numeric columns.
    """
    df_ = pd.DataFrame()
    for col in range(n_cols):
        name = f'num_{col}'
        nums…