Intro to Pandas: Reading Files and Handling Missing Values
Pandas is an open-source library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It is based on the dataframe concept found in the R programming language.
This section uses the Auto MPG dataset from the StatLib library. It contains data for 398 cars, including mpg, cylinders, displacement, horsepower, weight, acceleration, model year, origin, and the car’s name. Pandas can load a dataset like this into a data frame with a single function call, such as read_csv for CSV files.
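A minimal sketch of loading the data follows. The location and exact layout of the real file are not given here, so the example reads a few hand-typed rows in the Auto MPG column layout from an in-memory buffer. The column names and the use of "?" and "NA" as missing-value markers are assumptions.

```python
import io

import pandas as pd

# Illustrative stand-in for the real file: a few rows in the Auto MPG layout.
# With the actual dataset, pass its path or URL to pd.read_csv instead.
SAMPLE_CSV = io.StringIO(
    "mpg,cylinders,displacement,horsepower,weight,acceleration,year,origin,name\n"
    "18.0,8,307.0,130,3504,12.0,70,1,chevrolet chevelle malibu\n"
    "15.0,8,350.0,165,3693,11.5,70,1,buick skylark 320\n"
    "25.0,4,98.0,?,2046,19.0,71,1,ford pinto\n"
)

# na_values lists the strings that pandas should treat as missing (assumed markers)
df = pd.read_csv(SAMPLE_CSV, na_values=["NA", "?"])

print(df.shape)   # (rows, columns)
print(df.head())  # first few rows
```

Declaring the missing-value markers at load time keeps horsepower numeric, with NaN in place of the marker, rather than leaving the whole column as text.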
The display function, provided by IPython and available by default in Jupyter notebooks, renders a data frame as a formatted table, which is cleaner than merely printing it.
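The sketch below shows one way to use display on a small, made-up data frame. The fallback to print when IPython is not installed and the specific row and column limits are assumptions chosen for illustration.

```python
import pandas as pd

try:
    from IPython.display import display  # rich table rendering inside Jupyter
except ImportError:
    display = print  # plain-text fallback when IPython is unavailable

# Small made-up data frame for illustration
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0],
    "cylinders": [8, 8, 4],
    "name": ["car a", "car b", "car c"],
})

# Limit how much of a large frame is shown; the rest is elided
pd.set_option("display.max_columns", 7)
pd.set_option("display.max_rows", 5)

display(df)
```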
It is possible to generate a second data frame to display statistical information about the first data frame.
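The statistics can first be collected in a JSON-like structure, such as a list of dictionaries with one entry per column. The Python program can then convert this JSON-like information to a data frame for better display.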
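One possible sketch of this two-step approach is shown here on a small, made-up data frame. The choice of statistics (mean, variance, standard deviation) and the dictionary keys are assumptions.

```python
import pandas as pd

# Small made-up data frame for illustration
df = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 30.0],
    "cylinders": [8, 8, 4, 4],
    "name": ["car a", "car b", "car c", "car d"],
})

# Keep only numeric columns; a mean of car names is undefined
numeric = df.select_dtypes(include="number")

# One dictionary per column: the JSON-like intermediate form
fields = []
for col in numeric.columns:
    fields.append({
        "name": col,
        "mean": numeric[col].mean(),
        "var": numeric[col].var(),   # sample variance (ddof=1)
        "sdev": numeric[col].std(),  # sample standard deviation (ddof=1)
    })

# Convert the list of dictionaries into a second data frame
stats = pd.DataFrame(fields)
print(stats)
```

For a quick summary without the intermediate step, the built-in df.describe() also returns a data frame of per-column statistics.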
Missing values are a reality of machine learning. Ideally, every row of data will have values for all columns. However, this is rarely the case. Most of the values are present in the MPG dataset, but the horsepower column has some missing values. A common practice is to replace missing values with the median value for that column. The median is the middle value of the column once its values are sorted (or the average of the two middle values when the count is even), so it is less sensitive to extreme values than the mean. Any NA values in horsepower can then be replaced with that median.
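A minimal sketch of the median replacement follows, using a made-up horsepower column with one missing entry rather than the real dataset.

```python
import numpy as np
import pandas as pd

# Made-up horsepower values with one missing entry
df = pd.DataFrame({"horsepower": [130.0, 165.0, np.nan, 150.0, 95.0]})

# median() skips NaN by default, so it is computed from the four known values
med = df["horsepower"].median()

# Replace every NaN in the column with the median
df["horsepower"] = df["horsepower"].fillna(med)

print(med)
print(df["horsepower"].isna().sum())  # number of missing values remaining
```

Assigning the result of fillna back to the column avoids relying on in-place modification of a column selection.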
Dealing with Outliers
Outliers are values that are unusually high or low. Sometimes outliers are simply errors that result from faulty observation. Outliers can also be genuinely large or small values that may be difficult to address. We typically consider an outlier to be a value that is several standard deviations from the mean. Such values can be removed with a short function.
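One way such a function might look is sketched below. The name remove_outliers, its parameters, and the choice to modify the data frame in place are assumptions. It uses the sample standard deviation, which is the pandas default.

```python
import numpy as np
import pandas as pd


def remove_outliers(df, name, sd):
    """Drop, in place, rows where df[name] lies sd or more standard
    deviations away from the column mean."""
    distance = np.abs(df[name] - df[name].mean())
    drop_rows = df.index[distance >= sd * df[name].std()]
    df.drop(drop_rows, axis=0, inplace=True)


# Made-up column with one extreme value
df = pd.DataFrame({"value": [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]})
remove_outliers(df, "value", 2)
print(len(df))  # rows remaining after the extreme value is dropped
```

Because the mean and standard deviation are themselves pulled toward extreme values, a single pass may leave some milder outliers in place.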
For example, the same approach can drop every row from the Auto MPG dataset where the horsepower is more than two standard deviations above or below the mean.
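The two-standard-deviation filter can be sketched with a boolean mask. The horsepower values here are made up for illustration and are not taken from the real dataset. Returning a new data frame instead of modifying the original is a design choice of this sketch.

```python
import pandas as pd

# Made-up horsepower values with one extreme entry
df = pd.DataFrame(
    {"horsepower": [90, 95, 100, 105, 110, 100, 95, 105, 100, 230]}
)

mean = df["horsepower"].mean()
std = df["horsepower"].std()  # sample standard deviation (ddof=1)

# True for rows within two standard deviations of the mean
within = (df["horsepower"] - mean).abs() < 2 * std

# Keep only those rows and renumber the index
trimmed = df[within].reset_index(drop=True)

print(f"Length before: {len(df)}")
print(f"Length after: {len(trimmed)}")
```

Comparisons against NaN evaluate to False, so this mask would also discard rows with a missing horsepower. Filling missing values first avoids losing those rows unintentionally.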