Stories by Felipe Mezzarana on Medium

Python + Excel - Generating Highly Customized Reports

Felipe Mezzarana — Sat, 08 Apr 2023 20:43:27 GMT

Learn how to use Python to create Excel reports with proper format, formulas and images!

As a data engineer I’ve had many opportunities to drive value by creating an Excel file at the end of an ETL pipeline.

Data won’t always be used by data analysts and scientists only. Other areas often need a tool that lets them manipulate and change data, but they probably won’t know enough SQL to query the data warehouse directly. That’s where Excel reports come in.

First of all, it is important to understand if it is really necessary to use Python and increase the complexity of your pipeline. Excel allows connection to several external data sources, which is enough for many situations.

You should consider using Python to generate reports in cases where:

Version control is important - User recurrently needs access to historical data
Formulas matter - User needs to actively change the data
You have security concerns - Files need to be sent to external clients/users without access to data sources
Multiple queries/data sources - Updates take a long time for the user and it is difficult to control possible changes

1 º Dataset

Before starting, let’s quickly define and introduce the data set to be used for testing purposes. I’m going to use two datasets about Twitch information.

The first contains data about the top games on Twitch from 2016 to 2021. It is publicly available on Kaggle.

import pandas as pd

twitch_games_df = pd.read_csv('Twitch_game_data.csv')
twitch_games_df.head(5)

The second contains data on the top 1000 streamers from the past year, also available on Kaggle.

twitch_streamers_df = pd.read_csv('twitchdata-update.csv')
twitch_streamers_df.head(5)

2 º Generating and Formatting

As you probably already know, generating an Excel file from a DataFrame can be as simple as running DataFrame.to_excel(‘file_name.xlsx’). However, there is a better way that allows for a high degree of customization.

Please note, although you do not need to import extra libraries other than pandas, you will need to install xlsxwriter with “pip install xlsxwriter”

Let’s start creating an Excel with multiple tabs.

https://medium.com/media/778ece60ccecaa56f2caf67cd20dc930/href
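As a rough idea of what this step can look like, here is a minimal sketch (not the original embedded code) that writes two DataFrames to separate tabs with pd.ExcelWriter and the xlsxwriter engine. The function name write_report, the sheet name "streamers_sheet", the output path and the tiny sample frames are all assumptions made for illustration; only "games_sheet" comes from the text.

```python
import os
import tempfile

import pandas as pd


def write_report(path, games_df, streamers_df):
    """Write each DataFrame to its own tab of a single .xlsx file."""
    # The context manager saves and closes the workbook on exit.
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        games_df.to_excel(writer, sheet_name="games_sheet", index=False)
        streamers_df.to_excel(writer, sheet_name="streamers_sheet", index=False)


# Tiny made-up frames standing in for the two Kaggle datasets.
games = pd.DataFrame({"game": ["A", "B"], "hours_watched": [1200, 950]})
streamers = pd.DataFrame({"channel": ["x", "y"], "followers": [10, 20]})

# Hypothetical output location; any writable path works.
report_path = os.path.join(tempfile.mkdtemp(), "twitch_report.xlsx")
write_report(report_path, games, streamers)
```

Keeping the writer open inside the with block is what makes further customization possible, since the workbook and worksheet objects are still available before the file is saved.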

We created the “base” file, so now it’s time for formatting! To avoid repetition (we just want to learn!), let’s format only the first tab, “games_sheet”.

Also note that the following code is a continuation of the previous one.

https://medium.com/media/c57146b1b5e4aa91c476b4bdf5087a99/href

Let’s go a little further and format the header. To improve the header we will format the cells and rewrite the column names, removing the snake_case style:

https://medium.com/media/78a161117d5ebe1c18a70d106e71ba02/href
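To make the idea concrete, here is a small self-contained sketch of column and header formatting with xlsxwriter through pandas. It is not the original code: the helper names pretty_header and write_formatted_sheet, the colors, the column width and the sample data are assumptions chosen for illustration.

```python
import os
import tempfile

import pandas as pd


def pretty_header(name):
    """Turn a snake_case column name into a title, e.g. 'hours_watched' -> 'Hours Watched'."""
    return name.replace("_", " ").title()


def write_formatted_sheet(path, df, sheet_name="games_sheet"):
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        # Skip the pandas header and start the data on row 1,
        # so the header row can be written with our own format.
        df.to_excel(writer, sheet_name=sheet_name, index=False, header=False, startrow=1)
        workbook = writer.book
        worksheet = writer.sheets[sheet_name]

        number_fmt = workbook.add_format({"num_format": "#,##0", "align": "center"})
        header_fmt = workbook.add_format(
            {"bold": True, "font_color": "#FFFFFF", "bg_color": "#6441A5",
             "align": "center", "border": 1}
        )

        # Same width and number format for every column.
        worksheet.set_column(0, len(df.columns) - 1, 18, number_fmt)
        # Rewrite the header row with friendlier names.
        for col_idx, col_name in enumerate(df.columns):
            worksheet.write(0, col_idx, pretty_header(col_name), header_fmt)


games = pd.DataFrame({"game": ["A", "B"], "hours_watched": [1200000, 950000]})
formatted_path = os.path.join(tempfile.mkdtemp(), "formatted_report.xlsx")
write_formatted_sheet(formatted_path, games)
```

A column-level format only applies to cells that have no format of their own, which is why the data is written without the default pandas header styling.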

Let’s take a look at the Excel file generated after running the three blocks of code above:

Way better than just running df.to_excel(), right?

Tip: You can try to improve the formatting even more by adding a special formatting for text columns, with left alignment and larger width

3 º Adding Images/Charts

Adding an image to a worksheet is a very straightforward process:

https://medium.com/media/55fbbcfb29a4bda68c407ded5ea43dd6/href

However, I would like to go one step further to show how we can use this feature to create amazing reports.

The trick here is that you can create any visualization with matplotlib, seaborn, or your preferred library, save the image temporarily, insert it into an Excel file and then delete it.

Let’s start by creating a function that will save a bar plot of the average watch time of the top 10 Twitch streamers.

P.S.: the goal here is not to teach how to create good visualizations, but if you’re interested in the subject, take a look at one of my other articles: Get your bar chart to the next level with Python

https://medium.com/media/fcf4959aba26ad2e4adfb6392c935b4c/href

This function will save the plot below as img.png in the current directory

Now we can add it to our Excel report:

https://medium.com/media/a7c6bd3002e7cffa603844a6d32609bc/href
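The save, insert and delete trick can be sketched roughly as follows. This is an illustration under assumptions, not the original snippet: the column names "channel" and "avg_watch_time", the function names, the sheet name and the sample data are all made up.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render to files, no display needed
import matplotlib.pyplot as plt
import pandas as pd


def save_bar_plot(df, img_path):
    """Save a simple bar plot of average watch time per channel as a PNG."""
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(df["channel"], df["avg_watch_time"], color="#6441A5")
    ax.set_title("Average watch time")
    fig.savefig(img_path, dpi=100, bbox_inches="tight")
    plt.close(fig)


def write_report_with_chart(xlsx_path, df, img_path):
    save_bar_plot(df, img_path)
    with pd.ExcelWriter(xlsx_path, engine="xlsxwriter") as writer:
        df.to_excel(writer, sheet_name="streamers_sheet", index=False)
        worksheet = writer.sheets["streamers_sheet"]
        # Place the image one empty column to the right of the data.
        worksheet.insert_image(1, len(df.columns) + 1, img_path)
    # Delete the temporary image only after the workbook has been closed,
    # since the file may still be needed when the workbook is saved.
    os.remove(img_path)


streamers = pd.DataFrame({"channel": ["a", "b", "c"], "avg_watch_time": [320, 280, 150]})
out_dir = tempfile.mkdtemp()
xlsx_path = os.path.join(out_dir, "report_with_chart.xlsx")
img_path = os.path.join(out_dir, "img.png")
write_report_with_chart(xlsx_path, streamers, img_path)
```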

Final result:

Cool right? With this trick I’m sure you will be able to create impressive reports.

4 º Adding Formulas

There are two ways to insert formulas in Excel. The first and easiest is using array formulas. Personally I’m not a fan of this option because it creates a formula for the entire column, and users are not used to this format.

Even so, because it is simpler, I believe it is worth demonstrating. To do so, let’s create a simple feature in the twitch_streamers_df:

watch_time_rate = Watch time(Minutes)/Stream time(minutes)

https://medium.com/media/6c2c0d05aa0bf14e803429fe613cde06/href

It will work, but note that the formula will be written as an array, and cannot be partially changed:

The second option, which I particularly prefer, is to write the formula individually in each cell. To do so, it will be necessary to loop through each value:

https://medium.com/media/fe0cf39e99425f648c717ecb0929d356/href
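A possible shape for that loop, shown as a sketch rather than the original code: each row gets its own formula through worksheet.write_formula, and xl_rowcol_to_cell converts zero-based indexes into references such as B2. The function name, the sheet name, the column names and the sample data are assumptions.

```python
import os
import tempfile

import pandas as pd
from xlsxwriter.utility import xl_rowcol_to_cell


def write_with_row_formulas(path, df, numerator, denominator, new_col="watch_time_rate"):
    """Write df and add one ratio formula per row in a new column. Returns the formulas."""
    num_idx = df.columns.get_loc(numerator)
    den_idx = df.columns.get_loc(denominator)
    formula_col = len(df.columns)  # first empty column to the right of the data
    formulas = []
    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        df.to_excel(writer, sheet_name="streamers_sheet", index=False)
        worksheet = writer.sheets["streamers_sheet"]
        worksheet.write(0, formula_col, new_col)
        for i in range(len(df)):
            row = i + 1  # row 0 holds the header
            formula = f"={xl_rowcol_to_cell(row, num_idx)}/{xl_rowcol_to_cell(row, den_idx)}"
            worksheet.write_formula(row, formula_col, formula)
            formulas.append(formula)
    return formulas


streamers = pd.DataFrame({
    "channel": ["a", "b"],
    "watch_time_minutes": [6000, 3000],
    "stream_time_minutes": [200, 150],
})
formulas_path = os.path.join(tempfile.mkdtemp(), "streamers_formulas.xlsx")
written = write_with_row_formulas(
    formulas_path, streamers, "watch_time_minutes", "stream_time_minutes"
)
print(written)
```

Because every cell holds an ordinary formula, the user can edit or overwrite a single row without touching the rest of the column.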

The result:

Exactly as the user expects! Which certainly facilitates the use of your report, adding value to your work.

Extra! Name your files according to the current date

This is a quick and simple tip, but extremely useful. When generating files recurrently, it’s a good idea to add the generation date to the file name:

https://medium.com/media/c53797404c0daf8c80c123fb7c1849c8/href
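One simple way to do this, sketched with the standard library only. The function name dated_filename and the prefix are assumptions; the date format is just one reasonable choice because it sorts chronologically.

```python
from datetime import date


def dated_filename(prefix, day=None, extension="xlsx"):
    """Build a file name such as 'twitch_report_2023-04-08.xlsx'."""
    day = day or date.today()
    return f"{prefix}_{day:%Y-%m-%d}.{extension}"


print(dated_filename("twitch_report"))
```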

It is a simple and useful way to avoid generating duplicate files.

Final Thoughts

Data engineers often avoid working with Excel for a number of very good reasons. However, it doesn’t matter how good your data pipelines or your data model are if they aren’t being used.

Generating good reports in Excel is an excellent way to universalize data, create value, and give visibility to your work.

I hope this guide can help you create professional reports. If you have any questions or feedback, please let me know in the comments.

Thanks for reading!!

Overcoming Web Scraping Challenges

Felipe Mezzarana — Sun, 08 Jan 2023 13:57:36 GMT

Learn how to create an organic header, find out the best way to rotate your IP, and how to reduce the crawler runtime.

Photo by Chase McBride on Unsplash

In this article I am going to talk about some of the biggest challenges that you may face while building a web crawler, and explore how we can overcome them, providing solutions that work at scale.

I would like to address two challenges: how to avoid being blocked and how to decrease the crawler runtime. The solutions will be discussed across three topics:

How to Create an Organic Header
The Best Way to Rotate Your IP
Run Functions Concurrently

1º How to Create an Organic Header

Usually, the most common tip people come across when looking for “how to avoid being blocked” is to send a header with a real user-agent, right? It’s good advice. However, to create a more organic header, i.e. a header that mimics the behavior of organic users, a step further is needed.

The first thing to do is to send a random user-agent instead of always the same one. We can easily do this with the help of the library fake-useragent. This awesome lib allows us to generate random and valid user-agents. Take a look:

https://medium.com/media/9d3c1c75ab6501c1728bdf47f89fb1f1/href

Sending random user-agents will often be enough, but there is still room for improvement.

First of all, it’s important to understand which headers a user (like you) is actually sending when accessing a site. To do so, go to the site you want to scrape, open the developer tools panel (inspect button) and open the Network tab.

In the Network tab you will see all the requests made to build the webpage (you may need to refresh it). Go to the first one (it should be the request for the address you entered, marked with a blue icon). There you can find the headers you just sent.

What you want to do now is to add these keys and values to your header. There are no rules here, and sometimes less is more. Each site checks different points, so I encourage you to try different combinations and see what works.

A good start is to add a platform and version that matches the user-agent (you can extract this information with some regex). Adding a referer can also be a good idea.

Let’s create a function that implements those concepts:

https://medium.com/media/319f2afefe23e8fabe92b1d35fd2991a/href
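As an illustration of the idea (not the original function), the sketch below takes a user-agent string, for example one generated by fake-useragent, and derives a matching platform and browser version with regex. The function names, the default referer, the Accept-Language value and the exact client-hint strings are assumptions; check what your own browser sends before relying on them.

```python
import re

# Checked in order: Android and iOS user-agents also mention Linux / Mac OS X.
PLATFORM_PATTERNS = [
    ("Windows", r"Windows NT"),
    ("Android", r"Android"),
    ("iOS", r"iPhone|iPad"),
    ("macOS", r"Mac OS X"),
    ("Linux", r"Linux|X11"),
]


def detect_platform(user_agent):
    """Return a platform name inferred from the user-agent, or None."""
    for platform, pattern in PLATFORM_PATTERNS:
        if re.search(pattern, user_agent):
            return platform
    return None


def build_headers(user_agent, referer="https://www.google.com/"):
    """Build request headers that stay consistent with the given user-agent."""
    headers = {
        "User-Agent": user_agent,
        "Referer": referer,
        "Accept-Language": "en-US,en;q=0.9",
    }
    chrome = re.search(r"Chrome/(\d+)", user_agent)
    platform = detect_platform(user_agent)
    # Client hints are typically sent by Chromium-based browsers only,
    # so they are added only when the user-agent looks like Chrome.
    if chrome and platform:
        version = chrome.group(1)
        headers["sec-ch-ua"] = f'"Chromium";v="{version}", "Google Chrome";v="{version}"'
        headers["sec-ch-ua-platform"] = f'"{platform}"'
    return headers


chrome_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36"
)
print(build_headers(chrome_ua))
```

The resulting dict can be passed as the headers argument of a requests call. Keeping the hints consistent with the user-agent matters more than adding many keys.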

That should get you off to a good start! Just avoid copying cookies and authorization keys. In case you really need to send cookies, it’s better to use a session or try to retrieve a “fresh cookie” with another request.

2º The Best Way to Rotate Your IP

A good header often isn’t enough. Many sites detect web scrapers by examining their IP. Therefore, we need a workaround. The most common solution is to rotate proxies. A proxy server assigns a new IP address from the proxy pool for every connection, so problem solved, right?

Well… not really. Free proxies are problematic for a few reasons, but most importantly they are simply not safe. There’s risk of malware, cookie theft, some proxy servers are even set up to be a front for data mining and identity theft. You should never use free proxies!

Paid proxies, on the other hand, are reliable and normally can solve the problem, although prices can escalate quickly, making medium and large projects completely unfeasible.

There is a little-known solution capable of solving all these problems. Basically, we can use the AWS API Gateway service together with the library requests-ip-rotator to send requests from a different IP each time.

You will have to create an AWS account with a valid credit card, but you don’t need to spend any money since the first million requests will be free, and after that you will only be charged ~$3 per million requests.

You will also need to retrieve credentials to use AWS services. To do so, just follow the steps here. I highly recommend that you don’t use these credentials directly in your code. Instead, you can export them as environment variables:

# On Windows CMD
setx AWS_ACCESS_KEY_ID your_access
setx AWS_SECRET_ACCESS_KEY your_key

# On Linux - current session
export AWS_ACCESS_KEY_ID=your_access
export AWS_SECRET_ACCESS_KEY=your_key

# On Linux - permanently
nano ~/.bashrc
# Opens a script file that is executed whenever a new interactive shell starts
# Add at the end of the file
export AWS_ACCESS_KEY_ID=your_access
export AWS_SECRET_ACCESS_KEY=your_key
# Save and restart the terminal

Now that you have the AWS credentials defined, the process to rotate the sent IP is simple:

https://medium.com/media/a8d196c645d3c061caad1b7208438361/href

Please note that this method won’t always work. From the docs: “these requests can be easily identified and blocked, since they are sent with unique AWS headers (i.e. “X-Amzn-Trace-Id”).” Nevertheless, in my experience this method has a good success rate, and it is such an efficient and simple solution that it is worth trying.

Important:

From the docs: “Please remember that if gateways are not shutdown via the shutdown() method you may be charged in future.” When we use the with statement (as in the example), there is no need to call the shutdown method.

However, if you have any connection issue during the requests, the gateway will remain open and you will need to close it manually, either through the AWS console or by opening a gateway with the same name again and closing it.

3º Run Functions Concurrently

As soon as your projects start to scale, and requests go from tens to hundreds and thousands, you will inevitably realize that web scraping is a very time-consuming job. In larger projects, a long runtime can even make it unfeasible.

Luckily, there are a few solutions out there, usually involving some kind of concurrency. I don’t want to go into the technical details of each option because this might deviate too much from our goal. Instead, I will present the async solution in a practical way, since it usually gives good results in this kind of task.

Basically, we will use the packages asyncio and aiohttp to run requests asynchronously. For comparison purposes, let’s first create a function that executes requests synchronously.

https://medium.com/media/1139ab99ea545e3591c1900010751bce/href

It took approx. 30s. Let’s see how to do the same thing asynchronously:

https://medium.com/media/5444f5d58bce7f00a79193bed61144b7/href
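To show the concurrency pattern in isolation, here is a sketch that uses only asyncio and simulates network latency with asyncio.sleep instead of making real aiohttp calls. The names fake_fetch and fetch_all, the URLs, the delay and the concurrency limit are assumptions. With aiohttp, the stand-in would be replaced by a coroutine that awaits a session.get call.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.05):
    """Stand-in for an HTTP call: waits without blocking, then returns a fake body."""
    await asyncio.sleep(delay)
    return f"response from {url}"


async def fetch_all(urls, fetch=fake_fetch, max_concurrency=5):
    """Run fetch for every URL concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch(url)

    # gather returns the results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))


urls = [f"https://example.com/page/{i}" for i in range(10)]
start = time.perf_counter()
results = asyncio.run(fetch_all(urls))
elapsed = time.perf_counter() - start
print(f"{len(results)} pages in {elapsed:.2f}s")
```

The semaphore is a design choice worth keeping in a real crawler, since it caps the number of simultaneous requests and keeps the load on the target site under control.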

It took approx. 8s, 3.55 times faster! Writing functions that run with concurrency may be tricky at first, but it’s definitely worth the effort.

Just keep in mind that intensive scraping can slow the website down, so try to use this tactic only when you need to scrape different websites at the same time. If you really have no other option, scrape at times when the site receives less traffic.

Final Thoughts

Web scraping is an amazing tool to collect data available on the web, and comes in handy in many situations. The topics discussed here should give you a good start if you intend to develop scalable web crawlers.

I hope this guide helps you build solid solutions. If you have any questions or feedback, please let me know in the comments.

Thanks for reading!!

5 Cool Python Tricks to Make Your Life Easier

Felipe Mezzarana — Sun, 13 Nov 2022 17:34:20 GMT

Improve your productivity and code quality with these quick tips.

Photo by Julius Drost on Unsplash

Who doesn’t like a quick trick to get something done faster, more efficiently or simply more elegantly?

I made this list to share five random tricks that I learned or developed while working with data and using Python every day. Let’s get to it!

1 º Quickly renaming all columns

In many situations, changing column names may be a bad choice. However, when your goal is simply to analyze data or generate visualizations, using a DataFrame with column names containing whitespace, accents or symbols can be quite annoying.

It’s easier to select columns with standardized names, not to mention when someone left that unwanted whitespace at the end of an Excel column name. So this is how we can quickly transform all column names to the snake_case pattern:

https://medium.com/media/8bf19f36246c22fa41fc0849f8a7ca18/href
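One way to write such a function, shown as a sketch and not as the original code. The helper to_snake_case is a hypothetical name; rename_columns matches the name used later in this article, but the body here is an assumption. It relies only on the standard library and on DataFrame.rename accepting a callable.

```python
import re
import unicodedata


def to_snake_case(name):
    """Convert a column name to snake_case: no accents, symbols or stray whitespace."""
    text = unicodedata.normalize("NFKD", str(name))
    text = text.encode("ascii", "ignore").decode("ascii")  # drop accents
    text = re.sub(r"[^0-9a-zA-Z]+", "_", text)  # symbols and whitespace -> "_"
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", text)  # split camelCase
    return text.strip("_").lower()


def rename_columns(df):
    """Return a copy of a pandas DataFrame with every column name in snake_case."""
    return df.rename(columns=to_snake_case)


print(to_snake_case(" Watch time(Minutes) "))
```

Usage would be df = rename_columns(df). Note that two different original names can collapse into the same snake_case name, so it is worth checking for duplicates afterwards.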

2º Creating your own modules

In the first trick we created a nice function, now we just need to copy and paste it into all the scripts we intend to use it… right?

Please, don’t! How about just importing the function exactly like you usually do with any other library? It’s easier than it looks and it can be very useful, both to make your life easier and to share your work with others.

You just need to create a python file (.py) containing your function, and copy that file to the path where Python checks for modules and packages. To find out what path Python checks, just run in your favorite IDE:

import sys
print(sys.path)

Now that you have copied the file to one of the printed paths (let’s say you named it ‘my_first_module.py’), you will be able to import it like any library:

import my_first_module

# Calling the function
my_first_module.rename_columns(any_df)

3º Generating fake data

Whether for testing or learning purposes, we often need to create fake variables. For example, you might need fake data to write unit tests or to play with a new library.

Of course, we can always import or create these variables from scratch, but there is a much easier and more versatile solution! I’m talking about the Faker library. With it we can generate random data of different types and characteristics. Take a look:

https://medium.com/media/2fa316de9f7bc3b584153b9c2aaa37e4/href

I showed only a few examples. There is still much more to explore in the documentation, like selecting a specific language and other data options.

4º Using f-strings like a pro

If you study or work with Python you probably already use f-strings. If you don’t know them yet, no problem, it’s a simple concept. Basically, putting the letter “f” before the string allows variables to be inserted into the text, inside {}. Something like this:

age = 29
print(f'My age is {age}')

f-strings are awesome! But we can make them even more elegant and professional with a series of formatting options, allowing, for example, to improve user interfaces or generate better custom logs. Take a look:

https://medium.com/media/2737d69a016b8cd91fa18903d8d882e5/href
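A few of the formatting options available inside the braces, as a short sketch. The variable names and values are made up; the format specifiers themselves are standard Python, and the {ratio=} form requires Python 3.8 or later.

```python
from datetime import datetime

value = 1234567.891
ratio = 0.4567
name = "Twitch"
when = datetime(2022, 11, 13, 17, 34)

lines = [
    f"{value:,.2f}",           # thousands separator and two decimals
    f"{ratio:.1%}",            # percentage with one decimal
    f"{name:>10}|",            # right-aligned in a 10-character field
    f"{name:*^12}",            # centered and padded with "*"
    f"{when:%Y-%m-%d %H:%M}",  # date formatting
    f"{ratio=}",               # variable name and value, handy for logs
]

for line in lines:
    print(line)
```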

5º Loading bar

Have you ever needed to run code that goes through a long loop and been left wondering whether there was still a lot of time to go or whether everything was going as expected?

Your problems are over! How about developing an elegant and reusable solution? Here is how I did it:

https://medium.com/media/0e35b75e6406c63e250d845733da51f5/href
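For reference, a loading bar of this kind can be sketched with the standard library alone. This is not the original implementation: the names render_bar and track, the bar characters and the default width are assumptions.

```python
import sys
import time


def render_bar(done, total, width=30):
    """Return a text bar such as '[#####-----]  50.0% (5/10)'."""
    filled = width * done // total
    bar = "#" * filled + "-" * (width - filled)
    return f"[{bar}] {done / total:6.1%} ({done}/{total})"


def track(items, width=30):
    """Yield each item of a sized sequence, redrawing the bar after it is processed."""
    total = len(items)
    for done, item in enumerate(items, start=1):
        yield item
        # "\r" moves the cursor back so the bar is redrawn on the same line.
        sys.stdout.write("\r" + render_bar(done, total, width))
        sys.stdout.flush()
    sys.stdout.write("\n")


for _ in track(range(20)):
    time.sleep(0.01)  # placeholder for the real work of each iteration
```

Wrapping the iterable in a generator keeps the loop body untouched, so the bar can be added to or removed from existing code by changing a single line.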

Extra!! A function to analyze any DataFrame

There are some basic analyses that need to be done on any imported/generated DataFrame. How about creating a function to perform these repetitive steps and just importing it (as shown in the second trick)?

I’ve already developed this function for my personal use, and I thought it would be cool to make it available here for anyone who wants to use, modify, or have as a basis to create their own!

https://medium.com/media/71b4b5c6355bdee182d4e6be04d0e6cb/href
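If you prefer to start from something smaller and build your own, a reduced sketch could look like this. It is a hypothetical stand-in, not the function embedded above: the chosen checks (shape, duplicated rows, dtypes, nulls, unique values) and the returned summary table are assumptions about what a basic analysis might include.

```python
import pandas as pd


def analyse_df(df):
    """Print and return a per-column summary: dtype, null count, null % and unique values."""
    duplicated = int(df.duplicated().sum())
    print(f"Rows: {len(df):,} | Columns: {df.shape[1]} | Duplicated rows: {duplicated:,}")
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "nulls": df.isna().sum(),
            "null_pct": (df.isna().mean() * 100).round(1),
            "unique": df.nunique(),
        }
    )
    print(summary)
    return summary


# Tiny made-up frame, only to show the call.
sample = pd.DataFrame({"a": [1, 2, 2, None], "b": ["x", "y", "y", None]})
report = analyse_df(sample)
```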

To exemplify, I will apply this function in a dataframe that contains information about customer characteristics and payment history of a credit card

credit_card_history_df = pd.read_csv('credit_card_history.csv')
analyse_df(credit_card_history_df)

That’s it! Thanks for reading, I hope you enjoyed and learned something new. If you know any other tricks or have any suggestions, please let me know in the comments!

How to Deploy a Machine Learning Model to the Cloud

Felipe Mezzarana — Thu, 20 Oct 2022 23:55:07 GMT

Deploying an ML model with FastAPI, Docker and AWS EC2 to make it available to end users or a production environment.

Photo by Jelleke Vanooteghem on Unsplash

After a lot of study and hours of coding you developed a ML model! That’s great, but the work doesn’t end here. It’s still necessary to ask how this model will be used, and this is where the deployment phase comes in.

When you don’t need immediate results, like in a periodic job, the best deployment solution will probably be batch predictions. This can be done by simply calling the predict function or using scheduling tools like Airflow.

However, in many cases we need an on-demand service with near real-time predictions. As an example, we can mention recommendation systems, fraud detection, search tools, medical diagnostic, etc. In this article, we will cover how to create a web service for prediction, solving this kind of problem.

To do so, we will go through three steps:

Create a REST API with FastAPI web framework.
Build a Docker image to run the server through a container.
Hosting the Docker container in an AWS EC2 instance.

The Model

The purpose of this article is not to teach you how to build an ML model. Therefore, we will use an already developed XGBoost model capable of predicting the shipping cost of a product.

You may find it interesting to take a look at how this model was built, since during its development I covered a series of important subjects such as: data processing, feature engineering, metric definition, hyperparameter tuning, model selection, model evaluation, and much more. You can find all the steps (as well as the source code used in this article) in the repository below.

GitHub - FelipeMezzarana/shipping_price_estimate: ML Model to estimate the shipping price of an order, based on one e-commerce dataset + Deploy with FastAPI, Docker and AWS

However, if you are only interested in deploying, what you really need to know is that after building your model, you will need to dump it into a file. This can be easily done with the pickle package:

https://medium.com/media/2e1ef5a29fd068e60f1c1edcbf8a6350/href
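The dump and load round trip looks roughly like this. In the sketch, a plain dictionary stands in for the trained XGBoost model (an assumption made so the example runs without training anything), and the temporary directory is also just for illustration. The file name matches the one used later in the Dockerfile.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model; any picklable Python object works the same way.
model = {"name": "shipping_estimate", "coefficients": [0.12, 3.4]}

model_path = os.path.join(tempfile.mkdtemp(), "shipping_estimate_model.pkl")

# Dump the model to a file (binary mode is required).
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Later, for example inside the API server, load it back.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)
```

Only unpickle files you created yourself or fully trust, since loading a pickle can execute arbitrary code.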

1º Step: Create a REST API

Now that we have a model, the first step to deploy it to the cloud is to create the REST API. At this point, many choose to use the Flask framework, mainly because it is an older tool and people are just used to it.

Yet, although FastAPI is a younger framework it has a number of advantages over Flask, such as a higher performance, native concurrency support, inbuilt data validation and an automatic documentation system. For these reasons, I prefer FastAPI over Flask.

To start, create a .py file, mine will be named “server.py”. Then, we can create the FastAPI object and load the previously saved model. Be sure to have everything saved in the same path.

https://medium.com/media/46502c501109db2acdc4c0a9b68c15ca/href

Tip: In this example the model is loaded on every request. You may want to define a function that loads the model into a global variable only once. It’s a trade-off: you gain speed but need to allocate more memory.

As said before, FastAPI supports inbuilt data validation, so we need to define what kind of data will serve as input for our predict, and what data will be outputted. To do so, we will define a class using BaseModel with all the input variables and another for the output. It’s easier than it sounds:

https://medium.com/media/81cc1635c2692bc7817872ffc4effc39/href

Note that “price”, “product_weight_g”, “product_height_cm”, “delivery_distance_km” and “product_volume_cm3” are all the input needed to predict the shipping price.

Now we just need to define our API endpoints! In our case we will only have two endpoints, one in the main page to check if the server is working, and another to make the predictions. This is the way to do so:

https://medium.com/media/96a109c5f4e743a2b6e4552144863442/href

There are some important things going on here.

When defining the endpoint you need to say whether it will use a “get” or “post” method. For prediction we will need user input, so it is important to use a “post” method.
We need to define the input and output data type as shown. Default is string, that’s why we don’t need to declare the output in home_page().
We defined the functions using “async def”, which lets the server handle other requests concurrently while a function is waiting on I/O.

The final server.py file should look like this:

https://medium.com/media/f67a79c9ac6f7bf5ad421c5c0486f374/href

You can test the server locally using uvicorn. After installing the package, just run in the command line:

uvicorn server:app --host 0.0.0.0 --port 80

You should see something like this:

Now we can make requests directly to ‘http://localhost:80', so it will be possible to perform tests in a jupyter notebook as follows:

https://medium.com/media/f23186000fed9f3007c2e20b19c44eec/href

2º Step: Build a Docker image

Docker came to solve the oldest problem in a programmer’s life!

“But it works on my machine…”

With Docker, it is possible to create a “place” (container) with everything needed for your application to work, exactly the way you developed it, so it will be guaranteed that the application will run on any machine!

It also comes in handy when deploying an ML model to the cloud: we can put everything our model needs to work inside a Docker image, test it locally, and if everything is OK, upload the image to the cloud!

To do this, the first step is to create a Python virtual environment (it’s not exactly a mandatory step, but it makes our life a lot easier). In this guide, I won’t cover how to create a virtual environment, there are hundreds of tutorials on the internet that you can use.

Having your virtual environment ready, be sure to install in it only the necessary libraries for your application to work. Now, we will generate a file containing all dependencies of this virtual env. This file will be very useful when creating our Docker image.

On the command line, activate the virtual env you just created. You can check the env path with the first command and activate it with the second:

conda info --env
conda activate  your_env_path

Now, to generate the file just enter:

cd path_for_your_file
pip list --format=freeze > requirements.txt

This command will create a requirements.txt file in the path you defined. The file should look like this:

requirements.txt example

We are ready to create our Dockerfile! A Dockerfile is simply a text file that contains the build instructions.

A Dockerfile has no extension. If you’re using Docker on Windows, use Notepad++ to write the instructions; when saving, select “All types” and name the file “Dockerfile”. On Linux, just use “vim Dockerfile”.

Our Dockerfile must have these commands:

# FROM defines the "starting point" of your image
FROM python:3.9.13

# We need to copy all the files that will be used in our container 
# Note, /deploy/ is a created folder, it could have any name
COPY ./requirements.txt /deploy/
COPY ./server.py /deploy/
COPY ./shipping_estimate_model.pkl /deploy/

# Define where instructions perform their tasks
WORKDIR /deploy/

# Remember the file created earlier? 
# Here we install all the libs listed in it
RUN pip install -r requirements.txt

# execute the command only when we create the container
CMD ["uvicorn","server:app","--host", "0.0.0.0", "--port", "80"]

Now we have everything we need to build our docker image! Be sure to have the files “shipping_estimate_model.pkl”, “requirements.txt”, “server.py”, and “Dockerfile” in the same directory. Then, to finally create our docker image (mine will be named “app-shipping”), enter on the command line:

cd files_path
docker image build -t app-shipping .

Tip: The last argument, “.”, sets the build context, which is also where Docker looks for the Dockerfile by default. We use the dot because we are already in the correct directory (accessed with the cd command).

This will execute all commands defined in the Dockerfile, creating the Docker image named “app-shipping” ! Now we just need to run the image to create the container and start the local server. Enter on the command line:

docker run -p 80:80 app-shipping

You should see this output:

It looks and works just like the local server created in the first step of this guide. To confirm, you can again make requests to ‘http://localhost:80' from a Jupyter notebook or from a web page (only get methods).

The only (and very important) difference, is that now the application is running completely isolated from the rest of your machine, which will be very useful in the third and last step of our guide.

3º Step: Hosting the Docker container in an AWS EC2 instance

We finally have everything ready to make our model available to the rest of the world!

First, you will need to create an AWS account. The process is very straightforward, but you will be required to enter a valid credit card. Don’t worry though, in the first year you will have access to the AWS free tier, which grants free access to a number of AWS services, including everything needed to complete this guide.

Now that you have an AWS account, we will create a virtual machine that runs on the AWS Cloud. In the search bar, type “ec2” and go to “Dashboard”. On the new page, click on Launch instance.

You will need to select the virtual machine system and specifications. We will use an Amazon Linux system, and you may choose any virtual machine with the “Free tier eligible” seal.

To access the virtual machine, you will need a key. This key is a file, and whenever you want to access the VM, you will need to pass the key file path as an argument. You will have the option to go ahead without using a key, but don’t choose this option! After all, without a key anyone can access your virtual machine. That doesn’t seem like a good idea, does it?

So, if you don’t have a key, you will need to generate one. There is no secret here: just click on “Create new key pair”, choose a name for the key, click on “Create key pair” and a file “chosen_name.pem” will be downloaded. Keep it in a safe place.

The last necessary configuration is in the “Network settings” tab. It is very important to check all the options, as this will allow our API to connect to the internet.

Tip: With this configuration, any IP will be allowed to make requests to your API. In a production environment it may be interesting to restrict this access, which can be done in this phase.

We have everything settled! Click on start an instance, and after the message indicating that the instance has been created, go to the “instance tab”. You should be able to see that your instance is running. Take the opportunity to copy the virtual machine IP.

Now it’s time to connect to the virtual machine we just created with SSH protocol, enter on the command line:

ssh -i pem_file_path ec2-user@virtual_machine_ip

If everything went as expected, you are now connected to your Amazon Linux virtual machine!

The VM is empty, so we will start installing Docker, starting it, giving user permission to use docker, and finally exiting the virtual machine (next steps will be performed from your machine).

# Installing Docker on VM
sudo amazon-linux-extras install docker

# Starting Docker
sudo service docker start

# Giving permission to the default user
sudo usermod -a -G docker ec2-user

# Returning to local machine
exit

Now we are going to use SCP protocol to copy the required files to the VM.

# Copying 4 files to /home/ec2-user (the default user's home directory)
# The ^ at the end of each line is the Windows CMD line-continuation character
scp -i pem_file_path ^
path\Dockerfile ^
path\requirements.txt ^
path\server.py ^
path\shipping_estimate_model.pkl ^
ec2-user@virtual_machine_ip:/home/ec2-user

All copied! Time to reconnect to the virtual machine and finally build our Docker image and run the container, enter:

# Connect to VM
ssh -i pem_file_path ec2-user@virtual_machine_ip

# Build Docker image
docker image build -t app-shipping .

# Run container
docker run -p 80:80 app-shipping

Congratulations, you managed to create a web service!!

Now your API is available for the whole world to make requests. To get the API address, go back to the instance panel on the AWS website and search for Public IPv4 DNS.

That’s it, now you can make requests to this address from anywhere! For testing purposes, let’s use a Jupyter notebook:

https://medium.com/media/2bbea7835d2b401e117d3faccf1437a4/href

You can also test directly on a web page (“get” methods only). For example, you might want to take a look at the amazing documentation automatically generated by FastAPI

Important!

Now that you’ve tested your API and ensured that everything is running as it should, don’t forget to terminate the instance, otherwise there may be charges.

Where to go from here?

It is important to say that this guide is a basic example. Deployment is a complex subject and there is still much more to it than what I have presented here.

If you want to keep learning, here are some topics that you might want to take a look at:

How to control metrics over time;
How to keep the model up to date (concept drift, model retraining);
How to store results over time;
How to increase the security of your deployment.

Thank you so much for getting this far! I hope I’ve made myself clear, but feel free to contact me with any questions or feedback!


Get your bar chart to the next level with Python

Felipe Mezzarana — Mon, 03 Oct 2022 23:17:43 GMT

How a few lines of code and some good practice standards can help you create beautiful and informative bar charts.

Photo by Isaac Smith on Unsplash

We all know that creating good visualizations is vital for anyone working with data. This article aims to teach some good practices in data viz and show how to accomplish them using Python (Matplotlib & Seaborn).

Without further ado, let’s get started!

Dataset

For this article I will use a dataset containing a variety of information about Pokemon! I chose this dataset because it contains different types of features: continuous (Pokemon specs like attack, defense, etc.), categorical (types, name and gen.) and boolean (legendary). Thus, we will have several visualization options to explore.

You can download this dataset directly from the repository with the source code used in this article. Let’s take a quick look at our data:

https://medium.com/media/2cf59ff6acba9760e308f66ea21f3113/href

Defining which question will be answered

To generate good visualizations, the first step is to define the direction of our analyses, that is, which questions we want to answer with the data we have at hand.

We can think of dozens of questions that this data can answer. However, our goal is to generate a good bar plot, so we will select a simple question involving categorical values, like the Pokemon type:

What types of Pokemon have the highest attack?

The Bar Plot

Let’s start by preparing our data and creating the first “basic” bar plot. Using a pandas groupby we can extract the average attack by Pokemon type, and with Seaborn we can quickly plot the data.

https://medium.com/media/bde1f724f7b0410a8d3f5cde8f51d7ec/href
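As a rough idea of what this step involves, here is a minimal sketch. The rows are made up for illustration and the column names `type` and `attack` are assumptions, so adjust them to match the real dataset.

```python
import pandas as pd
import seaborn as sns

# Made-up rows standing in for the real dataset; column names are assumptions.
pokemon_df = pd.DataFrame({
    "type": ["Fire", "Fire", "Water", "Water", "Grass", "Dragon"],
    "attack": [52, 84, 48, 65, 49, 134],
})

# Average attack per type; as_index=False keeps "type" as a regular column.
avg_attack = pokemon_df.groupby("type", as_index=False)["attack"].mean()

# Default bar plot: one bar per type, in the order the groupby returns them.
ax = sns.barplot(data=avg_attack, x="type", y="attack")
```

Note that the groupby returns the categories in alphabetical order, not ordered by value, which is part of why a default plot like this is hard to read.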

Simply by looking at this chart, is it possible to answer the question about which type of Pokemon has the highest attack? Well… maybe, but the information is confusing, and there are several elements that make it very difficult to interpret. To be honest, this chart is a mess!

Let’s try to improve it. First we need to organize our data. We can start by ordering it. Whenever categorical data has no natural order, we should present it sorted in ascending or descending order of value. We can also limit the number of categories shown. By keeping only the top 10, for example, we can declutter the chart and make it easier to interpret.

Let’s also take the opportunity to improve some basic elements of the chart. First of all, we need to select a single color! When generating a visualization, colors must be used very consciously: they can totally change the interpretation of a chart by drawing attention to a specific point. In this case, the use of different colors is distracting and does not help at all. With just a few more lines of code we can also modify the image size, add a title and change the font size.

Tip: Colors can be specified by hex code, and you can easily customize your colors with the help of a site like this

Let’s see how to write the code to generate our next view:

https://medium.com/media/781d45bdbb8387934bd8e82bd12a4a20/href
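The improvements described above (sorting, keeping the top 10, a single color, a larger figure, a title and bigger fonts) could be sketched roughly as follows. The averages are made-up numbers, and the column names, hex color, title text and horizontal orientation are all assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up average attack per type; names and values are illustrative only.
avg_attack = pd.DataFrame({
    "type": ["Bug", "Dragon", "Electric", "Fighting", "Fire", "Ghost",
             "Grass", "Ground", "Ice", "Psychic", "Rock", "Water"],
    "attack": [71, 112, 69, 97, 85, 74, 73, 96, 72, 70, 93, 75],
})

# Sort in descending order and keep only the 10 highest averages.
top_types = avg_attack.sort_values("attack", ascending=False).head(10)

# A larger figure, a single hex color for every bar, and a readable title.
fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=top_types, x="attack", y="type", color="#4c72b0", ax=ax)
ax.set_title("Average attack by Pokemon type (top 10)", fontsize=16)
ax.tick_params(labelsize=12)  # larger tick labels on both axes
```

Because the type column holds plain strings, Seaborn draws the bars in the order the rows appear, so sorting the DataFrame is enough to sort the chart.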

Much better, right? With the new organization, it has become easy to identify the types of Pokemon with the highest attack. Also, modifying the dimensions makes the information clearer, and a good title makes people more likely to be interested in your view.

However, we still have some problems and points to improve. First of all, we have some elements that aren’t adding informative value. Quoting Knaflic:

“Every single element you add to that page or screen takes up cognitive load on the part of your audience — in other words, takes them brain power to process. Therefore, we want to take a discerning look at the visual elements that we allow into our communications.” (Storytelling with data, p.71)

In our chart we can identify at least three useless elements: the borders, the x axis title and the y axis title. Note that the chart title already explains what each axis means, so there is no need to repeat the information.

There is another interesting point regarding the alignment of information. There is a theory that claims that people usually scan images from left to right and, to a lesser extent, from top to bottom. This is called the Z-pattern. It is interesting because knowing which information will be read first can be used to our advantage: we can show our audience how to read the graph before they get to the data itself.

Z-Pattern layout

Thinking about this layout, we can make two more changes to our graph: move the title to the left so that it is read first, and move the x axis to the top, for the same reason.

Let’s see what our code will look like!

https://medium.com/media/25aa76f1278b57d0848e93168a9366b7/href
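For reference, the cleanup steps described above (dropping the borders and axis titles, left-aligning the title and moving the x axis to the top) can be sketched with plain Matplotlib calls on the axes object. The data, column names, color and title are again made-up assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up, already sorted averages; names and values are illustrative only.
top_types = pd.DataFrame({
    "type": ["Dragon", "Fighting", "Ground", "Rock", "Fire"],
    "attack": [112, 97, 96, 93, 85],
})

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(data=top_types, x="attack", y="type", color="#4c72b0", ax=ax)

# The title already says what the axes mean, so drop both axis titles.
ax.set_xlabel("")
ax.set_ylabel("")

# Hide all four borders (spines) around the plot area.
for spine in ax.spines.values():
    spine.set_visible(False)

# Move the x axis ticks and labels to the top of the chart.
ax.xaxis.tick_top()

# Left-align the title so it is the first thing read in a Z-pattern scan.
ax.set_title("Average attack by Pokemon type", loc="left", fontsize=16)
```

Seaborn also offers `sns.despine` for removing borders; the loop over `ax.spines` is used here only to keep every step explicit.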

Did you notice the difference? Now we have a much cleaner and more pleasant graph. The information is well organized and easy to interpret. Certainly people will be more willing to read this graph!

This concludes my guide to quickly improving a bar chart! Thanks for reading, I hope you enjoyed it 😄