The hunt for outliers… by a group of beginners

Brenda Leyva · Published in MCD-UNISON · Dec 2, 2020

A step by step recount, with a livestock dataset.

source: 7NewsCQ/Twitter

As first-semester students of a Data Science program, we took our first dive into the world of outliers, and this is what we came up with.

First steps

Working on data science projects, one gets used to a few simple steps that feel like a warm-up before a workout, so natural that they become almost subconscious. Let’s go over them anyway:

1. Get ’em libraries set up

We highly recommend the dfply package; similar to dplyr in R, it makes your data-handling steps a lot smoother. As shown below, we also included the most commonly used libraries.
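A minimal setup cell along these lines covers what we mean (the exact list in our notebook may have had a few extras):

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from dfply import *  # dplyr-style piping for pandas
```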

2. Use pandas to read your file

As sometimes happens, this one needed an explicit encoding argument for the file to be read correctly.
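Something along these lines did the trick; the file name below is a placeholder, and ‘latin-1’ stands in for whichever encoding your file actually needs:

```python
# File name and encoding are placeholders for our actual file
df = pd.read_csv("ganado.csv", encoding="latin-1")
```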

3. Take a quick look

Use .head() to get the general idea of how your dataset looks and what to expect during the analysis.
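In pandas this is simply:

```python
# Preview the first few rows of the dataset
df.head()
```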

4. Shape, missing values, data types

First of all, let’s find out the shape of the dataset.
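One attribute call gives us both dimensions:

```python
# (number of records, number of columns)
df.shape
```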

We now know that there are 1,515 records and 13 columns or attributes in total. But how many fields are null?
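A per-column count of nulls answers that:

```python
# Count missing values in each column
df.isnull().sum()
```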

This is where knowing your data and where it comes from pays off for the first time. We knew that the ‘observaciones’ column, which holds observations and comments, doesn’t need to be filled out, so it makes sense that it had so many null values. The same goes for ‘fecha_salida’, the exit date: for cattle still on the ranch this field won’t be filled out, and that’s also acceptable. However, the first column, ‘siniiga’, is the unique ID of each cow or bull and should never be missing, so we decided to remove the records where this field was null.

We also made sure to remove the remaining records with null values in fields that should have been recorded correctly.
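A sketch of both removals; the list of required columns here is illustrative rather than the exact one we used:

```python
# Drop records with no 'siniiga' ID
df = df.dropna(subset=["siniiga"])

# Drop remaining rows with nulls in columns that should always be filled out
# ('observaciones' and 'fecha_salida' are legitimately allowed to be empty)
required = [c for c in df.columns if c not in ("observaciones", "fecha_salida")]
df = df.dropna(subset=required)
```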

This looks better. It is important to note that the data comes from a manually filled-out system, and the missing values were due to mistakes made while entering new files.

Now, we can move on to the data types.

This is where we detect things that could complicate the analysis later and prevent that by correcting them where necessary. In this case, a couple of fields are dates but were read in as objects, and some fields that are actually numbers and should be floats are showing up as objects as well. On the other hand, ‘siniiga’, although numeric, is the unique ID and is better off stored as a string.

The following lines fixed all of those issues for us:
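Roughly, they looked like this; apart from ‘fecha_salida’, ‘siniiga’ and ‘conversion’, the column names below are stand-ins for the actual fields:

```python
# Parse date columns that were read in as objects
df["fecha_entrada"] = pd.to_datetime(df["fecha_entrada"])  # entry date (name assumed)
df["fecha_salida"] = pd.to_datetime(df["fecha_salida"])

# Numeric fields stored as objects go back to float
df["conversion"] = df["conversion"].astype(float)

# The unique ID works better as a string than as a number
df["siniiga"] = df["siniiga"].astype(str)
```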

Further exploration can be done here, but at this point we consider the dataset to be much more user-friendly and useful for analysis. Just as an example, one way to detect anomalies in the dataset is to look at the unique values of a certain column. Here we took a look at the ‘productor’ attribute and didn’t find anything odd-looking, but if something were off with the data, this would be one way to quickly detect the issue.
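For instance:

```python
# Distinct producers; typos or duplicate spellings would show up here
df["productor"].unique()
```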

5. .describe() for the win

Once we have a better understanding of the dataset and have cleaned it up a bit, it is very useful to take a look at quick stats via the .describe() method. This gives us a better view of the overall state of things when it comes to numeric values. When searching for outliers, this is the very first step toward any immediate findings.
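One call gives counts, means, quartiles and extremes for every numeric column:

```python
# Quick summary statistics for the numeric fields
df.describe()
```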

So there it is, our first victim: the ‘conversion’ column. This field holds the number of kilos of feed an animal receives divided by the number of kilos of weight it has gained. The values in this column usually fall between 3 and 10, with rare exceptions. Again, we know this from being familiar with the data and the process it comes from; never underestimate the value of really understanding the problem at hand.

Let the hunt begin

With a nice-looking dataset and a first look at the overall stats, we now know where to begin. When it comes to detecting outliers, it is not always necessary to use very fancy algorithms; sometimes it’s about easily detecting and removing human error. In the case of the ‘conversion’ column, the minimum and maximum values are not possible within the context, so we took a closer look and then removed them manually:
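A sketch of that manual step; the cutoffs below are illustrative assumptions, since in practice we judged the suspicious rows one by one:

```python
# Inspect the physically impossible values before dropping them
# (cutoffs are illustrative, not the exact ones we used)
suspects = df[(df["conversion"] <= 0) | (df["conversion"] > 20)]
print(suspects)

# Remove them from the working dataset
df = df.drop(suspects.index)
```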

If we generate a couple of visuals for the column, we can see that there are additional outliers within the data.
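A boxplot and a histogram are enough to make them visible (not necessarily the exact plots we produced):

```python
# Distribution of the 'conversion' column
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x=df["conversion"], ax=axes[0])
sns.histplot(df["conversion"], ax=axes[1])
plt.show()
```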

We will look at them in more detail by exploring possible relationships between the conversion rate and the cattle’s origin.
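One way to do that is to break the conversion rate down by producer and by sex; ‘sexo’ is our guess at the sex column’s name:

```python
# Conversion rate by producer and sex ('sexo' is an assumed column name)
plt.figure(figsize=(12, 5))
sns.boxplot(data=df, x="productor", y="conversion", hue="sexo")
plt.xticks(rotation=45)
plt.show()
```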

We have now narrowed a big part of the issue down to the producer ‘Benjamin Valdez’, and specifically to the male population. We found two rows that are not admissible and proceeded to remove them:
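In code, the removal looks roughly like this; the threshold and the sex label are assumptions, since we actually identified the two rows by inspection:

```python
# Flag the inadmissible rows for this producer's male animals
# (the 'M' label and the cutoff are illustrative assumptions)
mask = (
    (df["productor"] == "Benjamin Valdez")
    & (df["sexo"] == "M")
    & (df["conversion"] > 15)
)
df = df[~mask]
```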

This is how everything looks at this point:

From here, outlier detection needs to become a little less case-by-case, with the help of specialized algorithms.

Data distribution is important

Something to consider when detecting outliers is the way that our data is distributed. One way to establish this is by generating the histograms for the numeric fields as follows:
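pandas can do this in one call:

```python
# Histograms for every numeric column
df.hist(figsize=(12, 8), bins=30)
plt.tight_layout()
plt.show()
```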

We continued the analysis by creating a new dataset that only includes these numeric attributes:
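For example:

```python
# Keep only the numeric attributes for the z-score step
df_num = df.select_dtypes(include=np.number)
```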

The big guns

We chose to work with the z-score to detect anomalies. The first try used z > 4 as the threshold for outliers:
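A sketch of that filter, using scipy’s zscore on the numeric-only dataset:

```python
from scipy import stats

# Absolute z-score of every value in the numeric columns
z = np.abs(stats.zscore(df_num))

# Rows where any attribute is more than 4 standard deviations from its mean
outliers_z4 = df_num[(z > 4).any(axis=1)]
outliers_z4
```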

At a quick glance, the values detected by our filter look like very good candidates for outlier behavior. This can be easily confirmed by the visuals:

Now, we try a filter of z > 5 for outliers.

We decided to apply this filter to the dataset and remove all rows with z > 5.
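Applying it can look like this:

```python
# Keep only rows where every numeric attribute stays within 5 standard deviations
z = np.abs(stats.zscore(df_num))
df_clean = df_num[(z < 5).all(axis=1)]
```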

Finally, we take a look at the new, outlier-free (we hope) dataset.
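A final .describe() (or another round of histograms) closes the loop:

```python
# Summary statistics after removing the z > 5 rows
df_clean.describe()
```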

Note that this exercise was purely demonstrative, and we only found anomalies within one column. Outlier detection can become a much more daunting task depending on the data, and one may then need the more sophisticated algorithms and elegant solutions provided by tools like sklearn. However, for a first exploration of a relatively small and manageable dataset, the techniques presented here are a good approach and part of a well-done exploratory data analysis.

Team members:

Aaron Lara

Brenda Leyva

Enrique Alvarado

Victor Ibarra
