How I Am Learning Machine Learning — Week 1: Python and Pandas (Part One)

Gabriele Boccarusso
Feb 24 · 6 min read

Introduction

With our environment set up, we can start the actual work. To do that, we need to learn Python and pandas, its library dedicated to data analysis.

Learning Python

Python is easy to learn, and if you have any experience with another programming language you will pick it up quickly. I won’t cover it myself, because you can find all you need here. Or, if you want to go overboard, try Learn Python the Hard Way.

What we will mostly use here are lists and matrices, and nothing at a difficult level. This will be just an overview of the various ways to display data in pandas, so don’t be afraid; we are all newbies here.

Pandas datatypes

As we saw in the setup, we’ll do everything in a Jupyter notebook, where you should already have imported all the packages. For this example, I’ll create a new notebook with just pandas in it.

There are two main data types in pandas. The first is the Series, a one-dimensional labeled array that you can think of as the pandas version of a list.

Series

Now let’s create another series so we can introduce the second data type.
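As a minimal sketch, two Series can be created like this. The variable names and values are made up for illustration and do not come from the dataset used later.

```python
import pandas as pd

# Hypothetical sample values, used only to show the syntax.
names = pd.Series(["Aaron", "Ruth", "Mays"])
home_runs = pd.Series([755, 714, 660])

# Each Series prints its values next to an automatic index (0, 1, 2).
print(names)
print(home_runs)
```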

DataFrames

Remember when we talked about matrices? For our purposes that is simply the technical name for a table, which pandas calls a DataFrame.
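One way to build a DataFrame, sketched here with made-up values, is to pass a dictionary of Series, where each key becomes a column name:

```python
import pandas as pd

# Hypothetical sample values, used only to show the syntax.
names = pd.Series(["Aaron", "Ruth", "Mays"])
home_runs = pd.Series([755, 714, 660])

# Each dictionary key becomes a column name; each Series becomes a column.
players = pd.DataFrame({"name": names, "home_runs": home_runs})
print(players)
```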

Importing data

But always writing data by hand is tedious and inefficient. Most of the time we’ll already have the data, sample or not, and all we’ll have to do is import it.

The most common file format used to get data is .csv, a plain-text file of comma-separated values that can be opened like an Excel spreadsheet.
I have already put the csv file “baseball_players” into the main folder, so I can see it here:

Now, to have the data to work on, I just have to type:
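A minimal sketch of the import with read_csv. The file name baseball_players.csv and its contents are assumptions; the snippet first writes a tiny stand-in file to a temporary folder so that it runs anywhere.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the real file: write a tiny CSV with invented rows.
csv_text = "name,team,height\nAdam,BAL,74\nBrian,CHC,72\n"
csv_path = os.path.join(tempfile.mkdtemp(), "baseball_players.csv")
with open(csv_path, "w") as f:
    f.write(csv_text)

# read_csv parses the file into a DataFrame and uses the first line as column names.
players = pd.read_csv(csv_path)
print(players)
```

With the file already sitting next to the notebook, the last two lines are all that is needed, with the path replaced by the file name.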

Exporting data

Once we have worked with our data we may want to export it, which is very simple.

But we have a problem: the exported file has an extra column holding the row index, as if it were one of the DataFrame’s columns.
To remedy this we can modify the export call by adding a parameter that says

index = False
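The difference can be sketched by exporting the same DataFrame twice with to_csv and reading both files back. The file names and values here are hypothetical.

```python
import os
import tempfile

import pandas as pd

players = pd.DataFrame({"name": ["Adam", "Brian"], "height": [74, 72]})
out_dir = tempfile.mkdtemp()
with_index = os.path.join(out_dir, "with_index.csv")
without_index = os.path.join(out_dir, "without_index.csv")

# Default export: the row index is written as an extra, unnamed first column.
players.to_csv(with_index)
# index=False leaves the row index out of the file.
players.to_csv(without_index, index=False)

reloaded_with = pd.read_csv(with_index)
reloaded_without = pd.read_csv(without_index)
print(list(reloaded_with.columns))     # three columns, the first one is the old index
print(list(reloaded_without.columns))  # only name and height
```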

Describing data

Before describing data we need to know one small detail: the difference between a function and an attribute.
A function (called a method when it belongs to an object such as a DataFrame) is a piece of code that may or may not require parameters and that can compute something from the data or change it; it is called with () at the end.
An attribute is a value stored on the object, such as its column types. It is accessed without brackets, even if pandas may do some work behind the scenes to produce it.
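A small sketch of the difference, using a made-up DataFrame: shape is an attribute, head is a method.

```python
import pandas as pd

# Hypothetical sample values.
players = pd.DataFrame({"name": ["Adam", "Brian", "Carl"], "height": [74, 72, 76]})

# Attribute: no parentheses, it just reads a value from the object.
shape = players.shape

# Method: called with parentheses, and it can take parameters.
first_two = players.head(2)

print(shape)
print(first_two)
```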

dtypes attribute

Using this attribute we can notice two things:

First, there is an error in the sample: the column names are wrapped in quotation marks.
Second, we now know the types of data we are using.

Note: I fixed the quoted values by hand, which was feasible only because this dataset has just 10 rows. In a dataset with thousands of rows, this kind of error can be a serious problem.
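For reference, a minimal sketch of reading dtypes on a made-up DataFrame. The column names and values are assumptions, and the exact name shown for text columns depends on the pandas version.

```python
import pandas as pd

# Hypothetical sample values: one text, one integer and one float column.
players = pd.DataFrame(
    {"name": ["Adam", "Brian"], "height": [74, 72], "weight": [180.0, 215.5]}
)

# dtypes is an attribute: a Series mapping each column name to its data type.
types = players.dtypes
print(types)
```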

Columns attribute

This attribute shows us all the columns of the DataFrame.

Instead of always calling this attribute, we can assign it to a variable that we can use when needed.
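Sketched on a made-up DataFrame, storing the columns in a variable looks like this:

```python
import pandas as pd

# Hypothetical sample values.
players = pd.DataFrame({"name": ["Adam", "Brian"], "height": [74, 72]})

# columns is an attribute holding the column labels.
columns = players.columns
print(columns)

# The saved labels can be reused later, for example to select the first column.
first_column = players[columns[0]]
print(first_column)
```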

Info function

This function gives us a summary of the dataset we are working on: the columns, their types, and how many non-null values each one has.

The summary also includes the memory usage.
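A minimal sketch on a made-up DataFrame. The memory_usage call is an addition for illustration: it returns the per-column figures that info only summarizes.

```python
import pandas as pd

# Hypothetical sample values.
players = pd.DataFrame({"name": ["Adam", "Brian"], "height": [74, 72]})

# info() prints its summary directly and returns None.
players.info()

# memory_usage() returns the bytes used by the index and by each column.
memory = players.memory_usage()
print(memory)
```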

Mean function

This function returns the mean (average) of each numeric column of the DataFrame. For more options you can see the docs.
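A short sketch with made-up values. Passing numeric_only=True is a choice made here so that the text column is skipped; recent pandas versions raise an error on it otherwise.

```python
import pandas as pd

# Hypothetical sample values.
players = pd.DataFrame(
    {"name": ["Adam", "Brian", "Carl"], "height": [74, 72, 76], "weight": [180, 215, 190]}
)

# Average of every numeric column; the text column "name" is ignored.
averages = players.mean(numeric_only=True)
print(averages)
```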

Viewing and selecting data

Pandas offers a lot of useful functions to display and select data; the most useful are head and tail.

Head function

Calling the head function on our DataFrame shows us the first 5 rows. It also accepts a number, so that we can view the first n rows of what we are working on:

This is useful for a quick look at a big DataFrame with thousands of rows: just by viewing the first 3, 5 or 7 we can get an idea of what we are going to work on.
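Both forms of head, sketched on a made-up DataFrame of seven rows:

```python
import pandas as pd

# Hypothetical sample values: seven players.
players = pd.DataFrame(
    {
        "name": ["Adam", "Brian", "Carl", "Dave", "Evan", "Frank", "Greg"],
        "height": [74, 72, 76, 71, 73, 75, 70],
    }
)

first_five = players.head()    # no argument: the first 5 rows
first_three = players.head(3)  # with a number: the first 3 rows
print(first_five)
print(first_three)
```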

Tail function

Very similar to the head function, but instead of the first rows it shows the last rows of a DataFrame.
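The same made-up DataFrame of seven rows, this time viewed from the end:

```python
import pandas as pd

# Hypothetical sample values: seven players.
players = pd.DataFrame(
    {
        "name": ["Adam", "Brian", "Carl", "Dave", "Evan", "Frank", "Greg"],
        "height": [74, 72, 76, 71, 73, 75, 70],
    }
)

last_five = players.tail()   # no argument: the last 5 rows
last_two = players.tail(2)   # with a number: the last 2 rows
print(last_five)
print(last_two)
```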

Loc function

Let’s create a Series to illustrate this function

Now let’s call it. loc selects elements by their index label rather than by their position:

Very strange and situational, but still good to know.
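One situation where loc can look strange is a Series whose index labels are out of order and repeated. The Series below is a hypothetical example built to show that, not necessarily the one used originally.

```python
import pandas as pd

# Hypothetical Series with a custom index; note that the label 3 appears twice.
animals = pd.Series(
    ["cat", "dog", "bird", "panda", "snake"],
    index=[0, 3, 9, 8, 3],
)

# loc looks up by label, so label 3 matches two elements.
by_label = animals.loc[3]
print(by_label)

# A label that appears once returns a single value.
print(animals.loc[9])
```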

Iloc function

We’ll use the same Series as before to illustrate what iloc does

It returns the fourth element of the Series: iloc counts positions starting from 0 and refers to the actual position in the Series, ignoring the index labels.
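A sketch of that lookup on a hypothetical Series with a custom index:

```python
import pandas as pd

# Hypothetical Series; the index labels are deliberately not 0, 1, 2, ...
animals = pd.Series(
    ["cat", "dog", "bird", "panda", "snake"],
    index=[0, 3, 9, 8, 3],
)

# iloc ignores the labels: position 3 is the fourth element, counting from 0.
fourth = animals.iloc[3]
print(fourth)
```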

Both loc and iloc accept slices, just like a Python string or list followed by []: up to three values, written [start:stop:step].
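A sketch of slicing with both, on made-up Series. One difference worth knowing: iloc slices by position and excludes the stop, like plain Python, while loc slices by label and includes the stop label.

```python
import pandas as pd

# Hypothetical sample values.
animals = pd.Series(["cat", "dog", "bird", "panda", "snake"])
scores = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

# iloc[start:stop:step] works by position; stop is excluded.
every_other = animals.iloc[0:4:2]
first_three = animals.iloc[:3]

# loc[start:stop] works by label; the stop label "c" is included.
a_to_c = scores.loc["a":"c"]

print(every_other)
print(first_three)
print(a_to_c)
```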

Boolean operators

To see a specific column we can use two notations:
the bracket notation

or the dot notation

Both behave the same, so it’s just a matter of preference. They are important because, combined with boolean operators, they let us display only certain rows.
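Both notations and a simple row filter, sketched on a made-up DataFrame. The column names and the threshold of 73 are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample values.
players = pd.DataFrame(
    {"name": ["Adam", "Brian", "Carl", "Dave"], "height": [74, 72, 76, 71]}
)

# Bracket notation and dot notation return the same column.
heights_bracket = players["height"]
heights_dot = players.height

# A comparison on a column gives one True/False value per row...
is_tall = players["height"] > 73

# ...and passing that to the DataFrame keeps only the rows marked True.
tall_players = players[is_tall]
print(tall_players)
```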

This works with any comparison operator and lets us search for a row, or a group of rows, with a specific feature.

Final thoughts

Next week I’ll write the second part on Python and pandas, and then begin looking at numpy.
See you next time.

originally published on dev.to

The Startup
