Missing Value Imputation with Python: EDA

Bagiyalakshmi
Published in featurepreneur
3 min read · Dec 11, 2022

Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can then be analyzed using standard techniques for complete data.

Missing data can skew anything for data scientists, from economic analysis to clinical trials. So imputation of missing values holds a very important position in data analysis.

For convenience, I’m proceeding with JupyterLab.

JupyterLab installation on Ubuntu

To create and activate a new environment, run the commands below:

conda create -n py38 -y python=3.8
conda activate py38
# when you have finished working, leave the environment with:
conda deactivate

Install JupyterLab inside the activated Conda environment:

pip install jupyterlab

Once you have installed JupyterLab, use this command in the same environment to open it:

jupyter lab

In this session:

  • Import Dataset & Headers
  • Identify Missing Data
  • Replace Missing Data

1. Import Dataset & Headers

First import the packages you need, such as numpy and pandas, then load the dataset into a dataframe with the following syntax.

import pandas as pd
data = pd.read_csv("<location of the dataset>")

If the dataset is in the same folder as the notebook, just give the file name in place of the location.
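As a minimal sketch of the loading step, the example below reads a small CSV from an in-memory string instead of a real file. The column names num and award follow the dataset described in this article, but the rows themselves are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file; only the column
# names "num" and "award" come from the article, the values are invented.
csv_text = """name,num,award
A,10,Gold
B,,Silver
C,30,
"""

# pd.read_csv accepts a file path or any file-like object.
# For a file sitting next to the notebook, pass just its file name.
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)
print(list(data.columns))
```

Empty fields in the CSV are read as NaN, which is what pandas treats as a missing value.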

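2. Identify the Missing Values

data.head() returns the first 5 rows of the dataframe.

To find the null values, use data.info(), which reports the number of non-null values in each column. Any column whose non-null count is lower than the total number of rows contains missing values.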
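The inspection step can be sketched as follows, using a small made-up dataframe. The call isnull().sum() is not used in this article; it is added here as one common way to count the missing values per column directly.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column names come from the article.
data = pd.DataFrame({
    "num": [10.0, np.nan, 30.0, 40.0],
    "award": ["Gold", "Silver", None, None],
})

# First rows of the dataframe (up to 5).
print(data.head())

# Non-null count and dtype for every column.
data.info()

# Number of missing values in each column.
missing_per_column = data.isnull().sum()
print(missing_per_column)
```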

3. Replace the Missing Data

data.fillna() fills the null values with the respective input.

In this dataset, we see null values in the columns num and award.

The nulls in the num column can be replaced by the mean of the other values in the column.

The nulls in the award column can be replaced with “No award”.
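Here is a short sketch of both replacements, assuming num is numeric and award holds text. The values are invented; only the two column names and the “No award” label come from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number and two missing awards.
data = pd.DataFrame({
    "num": [10.0, np.nan, 30.0, 50.0],
    "award": ["Gold", None, "Silver", None],
})

# mean() skips NaN by default, so this is the mean of the known values.
data["num"] = data["num"].fillna(data["num"].mean())

# Replace missing awards with a fixed label.
data["award"] = data["award"].fillna("No award")

print(data)
```

Mean imputation keeps the column average unchanged but reduces its variance, so it is worth checking whether that is acceptable for your analysis.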

But here we see that ‘?’ is not recognized as a null value, so it must first be replaced with null by using replace(). Only then will fillna() pick it up.
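A possible way to do this is sketched below with made-up values. The pd.to_numeric step is an addition that this article does not mention: a column containing ‘?’ is read as text, so converting it back to numbers is usually needed before taking a mean.

```python
import numpy as np
import pandas as pd

# Hypothetical data where "?" marks a missing entry.
data = pd.DataFrame({
    "num": ["10", "?", "30"],
    "award": ["Gold", "?", "Silver"],
})

# Turn every "?" cell into a real missing value.
data = data.replace("?", np.nan)

# The num column was text because of the "?"; convert it to numbers.
data["num"] = pd.to_numeric(data["num"])

print(data.isnull().sum())
```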

value_counts() counts how many times each distinct entry occurs in a column.

Here we can calculate how many people did not get an award.
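For instance, assuming the missing awards have already been filled with “No award”, the count could be read off like this (the rows are made up):

```python
import pandas as pd

# Hypothetical award column after imputation.
data = pd.DataFrame({
    "award": ["Gold", "No award", "Silver", "No award", "No award"],
})

# Frequency of each distinct value, most common first.
counts = data["award"].value_counts()
print(counts)

# Number of people without an award.
no_award = counts["No award"]
print(no_award)
```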

In this dataset, the name of the first column is very long, so it is tedious to type every time we want to make changes. We can rename the column with the rename() function.
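A small sketch of the renaming step follows. Both the long column name and the short replacement are hypothetical, since the actual names depend on your dataset.

```python
import pandas as pd

# Hypothetical dataframe whose first column has an unwieldy name.
long_name = "A very long descriptive first column name"
data = pd.DataFrame({long_name: ["A", "B"], "num": [3, 4]})

# rename() returns a new dataframe; the mapping is {old_name: new_name}.
data = data.rename(columns={long_name: "name"})

print(list(data.columns))
```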

In some cases, the numerical columns are the main concern, so pandas provides a function, select_dtypes(), to include or exclude columns by datatype.

Here, we will exclude the object datatypes.

Finally, the cleaned dataframe can be saved as CSV or in other formats.
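One way to do this is with select_dtypes(), sketched below on invented data. The text column is given an explicit object dtype here, because newer pandas versions may infer a dedicated string dtype for text instead.

```python
import pandas as pd

# Hypothetical data: one text column and two numeric columns.
data = pd.DataFrame({
    "name": pd.Series(["A", "B"], dtype="object"),
    "num": [1.5, 2.5],
    "count": [1, 2],
})

# Drop every column stored with the object dtype.
numeric_only = data.select_dtypes(exclude=["object"])

print(list(numeric_only.columns))
```

If the goal is simply "numeric columns only", select_dtypes(include="number") is an alternative that does not depend on how text columns are stored.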

data.to_csv("file.csv", header=False, index=False)

Syntax:

<dataframe_name>.to_csv("<new_file>", header=False, index=False)
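To see what these two flags do, the sketch below writes a small made-up dataframe to a string rather than a file. When no path is passed, to_csv() returns the CSV text.

```python
import pandas as pd

# Hypothetical cleaned data.
data = pd.DataFrame({"num": [10, 20], "award": ["Gold", "No award"]})

# With no path, to_csv returns the CSV content as a string.
# header=False drops the column names, index=False drops the row index.
csv_text = data.to_csv(header=False, index=False)

print(csv_text)
```

Note that with header=False the column names are not saved, so keep the default header=True if you plan to read the file back with its column names.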

Stay tuned for more updates.

Thank you!
