Missing Value Imputation with Python: EDA
Imputation preserves all cases by replacing missing data with an estimated value based on other available information. Once all missing values have been imputed, the data set can then be analyzed using standard techniques for complete data.
Missing data can skew results in any field, from economic analysis to clinical trials, so imputing missing values holds an important place in data analysis.
For convenience, I'm working in JupyterLab.
JupyterLab installation on Ubuntu
To create and activate a new environment, run the commands below (run conda deactivate later, when you want to leave the environment):
conda create -n py38 -y python=3.8
conda activate py38
Install JupyterLab inside the conda environment:
pip install jupyterlab
Once JupyterLab is installed, start it from the same environment with:
jupyter lab
In this session:
- Import Dataset & Headers
- Identify Missing Data
- Replace Missing Data
1. Import Dataset & Headers
First import the packages you need, such as numpy and pandas, then load the dataset into a dataframe with the following syntax.
import pandas as pd
data = pd.read_csv("<location of the dataset>")
If the dataset is in the same folder as the notebook, just give the file name in place of the location.
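As a rough sketch of the import step, the snippet below loads a tiny made-up CSV. The column names (name, num, award) and the values are only assumptions for illustration, and io.StringIO stands in for a real file so the example runs anywhere.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk; names and values are made up.
csv_text = """name,num,award
Asha,10,Gold
Ben,,Silver
Chen,30,
"""

# pd.read_csv accepts a path such as "dataset.csv" or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

print(data.columns.tolist())  # the first row of the file becomes the header
print(data.shape)             # (rows, columns)
```

Empty fields in the file are read as NaN, which is what the later steps look for.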
2. Identify the Missing Values
data.head() returns the first 5 rows of the dataframe.
To find the null values, use data.info(), which reports the number of non-null values in each column.
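Here is a small sketch of the inspection step on a made-up dataframe (the columns are assumptions, not the real dataset). Besides head() and info(), it also uses isna().sum(), which is my addition: it counts the nulls per column directly.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing entries.
data = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dina"],
    "num": [10.0, np.nan, 30.0, 20.0],
    "award": ["Gold", "Silver", None, None],
})

print(data.head())  # first 5 rows by default (all 4 rows here)
data.info()         # prints non-null counts and dtypes for each column

# Count the missing values in each column.
missing_per_column = data.isna().sum()
print(missing_per_column)
```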
3. Replace the Missing Data
data.fillna() fills the null values with the value you supply.
In this dataset, the columns num and award contain null values.
Nulls in the num column can be replaced with the mean of the other values in the column.
Nulls in the award column can be replaced with "No award".
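The two replacements above could look roughly like this. The dataframe is a made-up stand-in with the same column names, so treat the values as assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing num, two missing awards.
data = pd.DataFrame({
    "num": [10.0, np.nan, 30.0, 20.0],
    "award": ["Gold", "Silver", None, None],
})

# mean() skips NaN, so this is the mean of the values that are present.
num_mean = data["num"].mean()
data["num"] = data["num"].fillna(num_mean)

# Replace missing awards with a fixed label.
data["award"] = data["award"].fillna("No award")

print(data)
```

Assigning the result back to the column keeps the change explicit and avoids relying on in-place modification.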
However, '?' is not recognized as a null value, so it must first be replaced with null using replace().
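A minimal sketch of that replacement, again on made-up values. The pd.to_numeric call is my own addition: a column that contained '?' is read as text, so it has to be converted back to numbers before a mean can be computed.

```python
import numpy as np
import pandas as pd

# Hypothetical data where '?' marks a missing entry.
data = pd.DataFrame({
    "num": ["10", "?", "30"],
    "award": ["Gold", "?", None],
})

# Turn every '?' into a real null so pandas can detect it.
data = data.replace("?", np.nan)

# The num column still holds text; convert it to numbers.
data["num"] = pd.to_numeric(data["num"])

print(data.isna().sum())
```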
value_counts() counts how many times each value occurs in a column.
Here we can use it to calculate how many people did not get an award.
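For example, assuming the missing awards have already been filled with "No award", the count could be read off like this (the values are made up):

```python
import pandas as pd

# Hypothetical award column with some missing entries.
data = pd.DataFrame({"award": ["Gold", None, "Silver", None, "Gold", None]})
data["award"] = data["award"].fillna("No award")

# Frequency of each distinct value in the column.
counts = data["award"].value_counts()
print(counts)

# Number of people without an award.
no_award = counts["No award"]
print(no_award)
```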
In this dataset, the first column name is very long, so it is tedious to type every time we make changes. We can shorten it with the rename() function.
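A quick sketch of the renaming step. The long column name below is invented for illustration, since the real one is not shown here.

```python
import pandas as pd

# Hypothetical dataframe with an unwieldy first column name.
data = pd.DataFrame({
    "name_of_the_person_nominated": ["Asha", "Ben"],
    "num": [10, 20],
})

# Map the old name to the new one; other columns are left untouched.
data = data.rename(columns={"name_of_the_person_nominated": "name"})

print(data.columns.tolist())
```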
In some cases, the numerical columns are the main concern, so pandas provides a function to include or exclude columns by datatype.
Here, we will exclude the object datatype.
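I'm assuming the function meant here is select_dtypes(), which is the pandas method for this. The sketch below uses made-up columns, and the text columns are given dtype=object explicitly so they are object columns whatever the pandas defaults are.

```python
import pandas as pd

# Hypothetical dataframe mixing text and numeric columns.
data = pd.DataFrame({
    "name": pd.Series(["Asha", "Ben"], dtype=object),
    "num": [10.0, 20.0],
    "year": [2019, 2020],
    "award": pd.Series(["Gold", "No award"], dtype=object),
})

# Drop the object (text) columns and keep everything else.
numeric_only = data.select_dtypes(exclude=["object"])

print(numeric_only.columns.tolist())
```

The same method takes an include argument, for instance include=["number"], if it is easier to name what you want to keep.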
Finally, the cleaned dataframe can be saved as CSV or in other formats.
data.to_csv("file.csv", header=False, index=False)
Syntax:
<dataframe_name>.to_csv("<new_file>", header=False, index=False)
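To see what these arguments do, here is a small sketch that writes a made-up dataframe to a temporary folder and reads the file back as text. The path and the data are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned dataframe.
data = pd.DataFrame({"name": ["Asha", "Ben"], "num": [10.0, 20.0]})

# Write to a temporary folder so the example leaves your working directory alone.
out_path = os.path.join(tempfile.mkdtemp(), "file.csv")
data.to_csv(out_path, header=False, index=False)

# Read the raw text back to see exactly what was saved.
with open(out_path) as f:
    saved = f.read()
print(saved)
```

With header=False the column names are not written, and with index=False the row labels are left out too. If you want the column names in the file, keep the default header=True.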
Stay tuned for more updates.
Thank you!!