Data Analysis with Elixir

Exploring the Explorer library

João Pedro Lima
Elemental Elixir
Jan 25, 2024


Data wrangling is the process of cleaning, structuring, and transforming raw data into a more usable and understandable format. In a world where powerhouse tools such as Python’s pandas library combined with the Jupyter Notebook, or R combined with RStudio, often take center stage for data manipulation, Elixir introduces an exciting alternative: the Explorer library.

Explorer offers an efficient way to explore (duh!), manipulate, and transform data frames, and it pairs seamlessly with Livebook, a tool for crafting interactive and collaborative code notebooks that also provides a platform for experimentation and prototyping with ML models. Together, they turn data analysis from a solitary, code-heavy process into an immersive and visually engaging experience.

Disclaimer: I probably should’ve said it before, but I’m no Data Scientist. I’m just an Elixir enthusiast sharing some data tricks I’ve picked up along the way. Consider it a data playground pass, not a masterclass 😉.

Let’s begin!

Our data

For the purposes of this article, we’ll be using this DataWars challenge, and we’ll be answering a couple of questions based on a Premier League data set.

The data set is also available here.

We’ll use Explorer to do some data cleaning: filling in missing values, identifying outliers, and correcting wrong data. We’ll also perform some data analysis, which will allow us to find, for example, the average number of goals per match, the biggest goal difference in a match, and so forth.

Livebook and setup

Running Livebook is straightforward and shouldn’t present any major challenges. There are several ways to do it, and the official documentation provides helpful details on each of them.

With Livebook up and running, we are going to create a new notebook, give it a cool name, and we’ll set up the dependencies. We’ll be using three libraries:

  • req, an HTTP client library that will be used to load the data set
  • explorer, of course
  • kino_explorer to help with the visualization of the data frames

To install those libraries, click on Notebook dependencies and setup and add the packages manually.

The dependencies should look like this:

Mix.install([
  {:req, "~> 0.4.8"},
  {:explorer, "~> 0.8.0"},
  {:kino_explorer, "~> 0.1.18"}
])

Also, make sure to run

require Explorer.DataFrame, as: DF
require Explorer.Series, as: Series

since they’ll be constantly used.

To retrieve the data from our data set, simply use

response = Req.get!("https://raw.githubusercontent.com/joaorlima/data-sets/master/premier-league-data.txt")

and to load our data frame, we’ll use df = DF.load_csv!(response.body). And there we have it: our data frame.

The kino_explorer library lets us visualize data interactively, making it easier to identify trends and patterns. We can already see, for example, that the data frame has 4560 rows, and that the average of away goals scored is 1.11, while the average of home goals is 1.52, and so forth.
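If you’d rather compute those figures in code than read them off the Kino table, standard Explorer calls will do it. A quick sketch:

# number of rows, plus the average home and away goals
{DF.n_rows(df), Series.mean(df[:home_goals]), Series.mean(df[:away_goals])}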

So let’s explore!

Exploring Explorer

  • Data Cleaning

Looking at the season column above, we can see some question marks in there. How many exactly are we talking about? To find that out, we could do something like:

df
|> DF.lazy()
|> DF.group_by(:season)
|> DF.summarise(count: count(season))

and that would give us the number we’re looking for.

We have 31 occurrences of ? in the season column, but actually, I don’t need to write any code to find that out. Livebook offers an incredible automation toolkit: the Smart Cells. With drag-and-drop simplicity, smart cells handle advanced tasks like filtering, querying, and even charting your data, right within the UI.

Simply select the Smart cell option when creating a new block and use it as you wish.

Alright! Now, let’s replace these invalid values with an "Unknown season" string, shall we? We could do:

df = df
  |> DF.put(
    :season,
    Series.transform(df[:season], fn n ->
      if n == "?" do
        "Unknown season"
      else
        n
      end
    end)
  )

Analyzing our data, we notice some invalid data in other columns as well. For example, the home_goals and away_goals columns both have a minimum value of -4. What does that even mean? How can a team score -4 goals? I mean, I’m Brazilian, so I can see how a team might score 7 or more goals in a match, but -4?

I manually checked some of the matches’ results on the Premier League’s official website, and the recorded results aren’t simply off by one: the differences vary between zero, two, and even three in some cases. Just to keep things tidy in this article, let’s pretend all the negative numbers are actually zeros, okay?

So let’s replace all negative values in both the home_goals and away_goals columns with 0.
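One way to do that, mirroring the put/transform approach we used for the season column (clamping to zero is my simplification here, not something the data set prescribes):

df = df
  # replace negative home goal counts with 0
  |> DF.put(:home_goals, Series.transform(df[:home_goals], fn goals -> max(goals, 0) end))
  # same for away goals
  |> DF.put(:away_goals, Series.transform(df[:away_goals], fn goals -> max(goals, 0) end))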

Now let’s take a look at the result column. It contains four distinct values: A for an away win, H for a home win, D for a draw, and ?. You don’t have to be a football expert to know that there are only 3 possible outcomes for a football match, so the ? is invalid data.

We can check that there are 43 occurrences of ? in the result column.
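Analogous to the season count from earlier, if you want to do it in code rather than with a smart cell:

df
|> DF.lazy()
|> DF.group_by(:result)
|> DF.summarise(count: count(result))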

Analogous to what we did for the season data, let’s replace these values with "Unknown result", right? No! Instead, we can compute the results. For any given match, we have the number of home_goals and the number of away_goals. Again, you don’t need to be a football specialist to understand that if the number of home goals is greater than the number of away goals, the home team wins. If the number of away goals is greater than the number of home goals, the away team wins. Otherwise, it’s a draw.

With that in mind, let’s override the result column. But instead of using the put function, like we did before, we’ll use mutate/2.

df = df
  |> DF.mutate(
    result:
      select(
        home_goals > away_goals,
        from_list(["H"]),
        select(home_goals < away_goals, from_list(["A"]), from_list(["D"]))
      )
  )

Ta-dah!

And that should be it for data cleaning! We’re now able to answer some questions, like: what’s the biggest goal difference in a match? Which team has the most away wins? Which team has the most home losses?

  • Data Analysis

Biggest goal difference

To discover the biggest goal difference in a match, we could simply subtract home_goals from away_goals, take their absolute value, and select the largest resulting number, something like:

Series.subtract(df[:home_goals], df[:away_goals])
|> Series.abs()
|> Series.max()

But let’s actually see which games had the largest goal difference. To do so, we’ll add a new column for goal difference and then sort it to see the data we’re looking for.

df
|> DF.mutate_with(fn d ->
  diff = Series.subtract(d[:home_goals], d[:away_goals]) |> Series.abs()
  [diff: diff]
end)
|> DF.sort_by(desc: diff)

So there we have it: 4 games with an 8-goal difference.

Most away wins

Let’s use the Smart cells for this one. Remember that an away win is represented by the value A in the result column.
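(If you’re curious, the equivalent query in code would look roughly like this; I’m naming the count column away_team_count to match what the smart cell produces.)

df
|> DF.filter(result == "A")
|> DF.group_by(:away_team)
|> DF.summarise(away_team_count: count(away_team))
|> DF.sort_by(desc: away_team_count)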

And there we have it: Chelsea was the team with the most away wins! No need to write a single line of code to find that out. As a bonus, you can rename the away_team_count column by piping the resulting data frame into

DF.rename(away_team_count: "away wins")

Top 5 teams with most home goals scored

Another great thing about Smart cells, which I intentionally left for the end of this article, is that you can convert the cell into a code block: it gives you the actual code used, which you can then combine with other functions!

That would give us the following editable code block:

Let’s rename the home_goals_sum column and retrieve the top 5 teams using the head/2 function:
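The code the smart cell generates may differ a bit, but the full pipeline ends up looking roughly like this (the "home goals scored" label is just my choice of name):

df
|> DF.group_by(:home_team)
|> DF.summarise(home_goals_sum: sum(home_goals))
# sort before renaming so we can refer to the column by its original name
|> DF.sort_by(desc: home_goals_sum)
|> DF.rename(home_goals_sum: "home goals scored")
|> DF.head(5)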

Team with the lowest rate of goals conceded at home

This is a tricky question, and the challenge itself says so. We’re not looking for the total number of goals conceded at home; that would be simple to compute. We’re looking for the team that was most efficient at keeping their home net clean relative to the number of games they played.

For example, let’s see the teams that conceded the fewest goals playing at home.
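Something along these lines does the trick (a sketch; a smart cell works just as well):

df
|> DF.lazy()
|> DF.group_by(:home_team)
|> DF.summarise(home_goals_conceded: sum(away_goals))
|> DF.sort_by(asc: home_goals_conceded)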

That’d be Charlton Athletic. They received only 20 goals playing at home. But how many games did they play? We already know how to do that, so I’ll just leave the answer.
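For the record, one way to get that count:

# matches where Charlton Athletic played at home
df
|> DF.filter(home_team == "Charlton Athletic")
|> DF.n_rows()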

They played 19 games, so their ratio would be 20/19 ≈ 1.05 goals conceded per home game. That’s pretty good in my opinion, but I bet there are even more efficient teams in there. To find out, let’s group by home team, then aggregate into a single row per team with their total home games and the goals they conceded at home.

df
|> DF.lazy()
|> DF.group_by(:home_team)
|> DF.summarise(
  home_games_played: count(home_team),
  home_goals_conceded: sum(away_goals)
)
|> DF.sort_by(desc: home_games_played, asc: home_goals_conceded)

Also, let’s filter for Charlton Athletic’s data, since we’re already familiar with them.
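A filter appended to the same pipeline does it (sketch):

df
|> DF.lazy()
|> DF.group_by(:home_team)
|> DF.summarise(
  home_games_played: count(home_team),
  home_goals_conceded: sum(away_goals)
)
|> DF.filter(home_team == "Charlton Athletic")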

That’s exactly what we had before. So now the only thing left to do is divide the values in the home_goals_conceded column by the values in home_games_played, and we get our ratio.

df
|> DF.lazy()
|> DF.group_by(:home_team)
|> DF.summarise(
  home_games_played: count(home_team),
  home_goals_conceded: sum(away_goals)
)
|> DF.sort_by(desc: home_games_played, asc: home_goals_conceded)
|> DF.mutate_with(fn d ->
  ratio = Series.divide(d[:home_goals_conceded], d[:home_games_played])
  [ratio: ratio]
end)
|> DF.sort_by(asc: ratio)

Alright then! Manchester United was the most efficient team at keeping their home net clean. Good old days, huh?

And that’s pretty much it for our data analysis.

Conclusion

The advent of Explorer and Livebook within the Numerical Elixir project marks a significant stride toward revolutionizing data analysis in the realm of Elixir programming. These tools not only offer an exciting alternative for data analysis but also exemplify the commitment of the open-source community to enhancing the capabilities of Elixir in the domain of numerical computing.

I’m also writing another article with a comparative analysis of Elixir and Explorer against the formidable Python and pandas duo.

By the way, a new version of Explorer was released just a couple of days ago; feel free to check out what’s new.

More to come.
