# DS 02. Pandas, DataFrames and plots
In my last Medium story, I described how I discovered the world of data science. In the next posts, I’ll describe what I have learned about it.
=>GitHub repository<=
https://github.com/saulotp/ds02
I’m focusing on Python because it is a simple but powerful language; in fact, I believe Python is the ‘entrance door’ to the programming world. While looking for study material about Python in data science, I found out about Spyder and Jupyter, which are amazing tools for data analysis. These tools will also help me show how my project evolves.
First, I’ll describe the project. I got an Excel file (available in the GitHub repository) somewhere on the internet. It contains sales data from some malls scattered across Brazil. With this data, I want to answer questions like:
- Which mall had the most sales?
- What is the total value of all sales from all malls?
- What is the mean of all sales from the malls?
- Verify some data over a date interval
- Plot some data
First, I have to import the pandas library and make Python read the .xlsx file.
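A minimal sketch of this step. The file name `Vendas.xlsx`, the helper `load_sales`, and the columns `Data` and `Produto` are assumptions for illustration; only ‘dfmain’, ‘ID Loja’ and ‘Valor Final’ appear in this post. If the file is not found, the sketch falls back to a couple of made-up rows so it still runs.

```python
from pathlib import Path

import pandas as pd

# Hypothetical file name; replace it with the name of the file in the repository.
EXCEL_PATH = Path("Vendas.xlsx")


def load_sales(path: Path) -> pd.DataFrame:
    """Read the sales spreadsheet, or fall back to made-up sample rows."""
    if path.exists():
        # read_excel needs an Excel engine such as openpyxl to be installed.
        return pd.read_excel(path)
    # Made-up rows (not real data) so the sketch runs without the file.
    return pd.DataFrame(
        {
            "Data": ["2019-01-05", "2019-02-10"],
            "ID Loja": ["Shopping Morumbi", "Shopping Vila Velha"],
            "Produto": ["Terno", "Camisa"],
            "Valor Final": [1400.0, 150.0],
        }
    )


dfmain = load_sales(EXCEL_PATH)
```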
After reading the Excel file and creating the DataFrame ‘dfmain’, we can print it to see the result.
We can also use the command ‘display’ to style the output (it looks cooler, so I’ll use display instead of print).
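A small sketch comparing the two. In Jupyter, `display` is available without an import; the explicit import and the fallback to `print` are only there so the sketch also runs outside a notebook. The rows are made up.

```python
import pandas as pd

try:
    # Available inside Jupyter and other IPython sessions.
    from IPython.display import display
except ImportError:
    # Plain-Python fallback so the sketch still runs outside a notebook.
    display = print

# Made-up rows standing in for the real spreadsheet.
dfmain = pd.DataFrame(
    {
        "ID Loja": ["Shopping Morumbi", "Shopping Vila Velha"],
        "Valor Final": [1400.0, 150.0],
    }
)

print(dfmain)    # plain-text table
display(dfmain)  # rendered as a styled HTML table inside Jupyter
```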
Next, I convert the date column from a string to a format that pandas reads as a date. After this, I create a new DataFrame that shows only the mall ID and the total value of all sales for each mall, sorted in descending order.
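One way this could look, assuming the date column is called `Data` and using the hypothetical name `dfsales` for the new DataFrame. The rows are made up.

```python
import pandas as pd

# Made-up rows standing in for the real spreadsheet.
dfmain = pd.DataFrame(
    {
        "Data": ["2019-01-05", "2019-01-20", "2019-02-10"],
        "ID Loja": ["Shopping Morumbi", "Shopping Vila Velha", "Shopping Morumbi"],
        "Valor Final": [1400.0, 150.0, 450.0],
    }
)

# Convert the date strings into pandas datetime values.
dfmain["Data"] = pd.to_datetime(dfmain["Data"])

# Total sales value per mall, largest first.
dfsales = (
    dfmain.groupby("ID Loja", as_index=False)["Valor Final"]
    .sum()
    .sort_values("Valor Final", ascending=False)
)
print(dfsales)
```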
Result below:
Next, I wanted to know the sum of all sales values from all malls, so I used a method that can sum one or more columns and summed the ‘Valor Final’ column. After this, I created another column in the DataFrame to show the values in percent format.
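A sketch of both steps under one assumption: that the percent column holds each mall's share of the grand total. The column name `Percentual` and the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows standing in for the real spreadsheet.
dfmain = pd.DataFrame(
    {
        "ID Loja": ["Shopping Morumbi", "Shopping Vila Velha", "Shopping Morumbi"],
        "Valor Final": [1000.0, 500.0, 500.0],
    }
)

# Sum of the 'Valor Final' column over every sale in every mall.
total = dfmain["Valor Final"].sum()

# Per-mall totals plus each mall's share of the grand total as a percent string.
dfsales = dfmain.groupby("ID Loja", as_index=False)["Valor Final"].sum()
dfsales["Percentual"] = (dfsales["Valor Final"] / total).map("{:.2%}".format)

print(total)
print(dfsales)
```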
To answer the question about the mean of sales, I used the same code but with the ‘mean()’ method.
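A sketch that averages the per-mall totals. That reading is an assumption, but the figures quoted at the end of this post are consistent with it (R$ 38.959.752,00 divided by R$ 1.558.390,08 gives 25). The rows are made up.

```python
import pandas as pd

# Made-up rows standing in for the real spreadsheet.
dfmain = pd.DataFrame(
    {
        "ID Loja": ["Shopping Morumbi", "Shopping Vila Velha", "Shopping Morumbi"],
        "Valor Final": [1000.0, 500.0, 500.0],
    }
)

# Total per mall first, then the mean of those totals.
per_mall = dfmain.groupby("ID Loja")["Valor Final"].sum()
mean_per_mall = per_mall.mean()

print(mean_per_mall)
```

Calling `dfmain["Valor Final"].mean()` instead would give the mean value of a single sale, which is a different number.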
In the next section, I decided to write a script that can extract the data for each mall just by changing the [‘ID Loja’] == ‘Selected Mall’ condition.
To know how much of each product was sold, I created another DataFrame from (dfmall).
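A minimal sketch of that filter using a boolean mask. The variable `selected_mall` is a hypothetical name, and the rows are made up.

```python
import pandas as pd

# Made-up rows standing in for the real spreadsheet.
dfmain = pd.DataFrame(
    {
        "ID Loja": ["Shopping Morumbi", "Shopping Vila Velha", "Shopping Morumbi"],
        "Valor Final": [1000.0, 500.0, 500.0],
    }
)

# Change this value to extract the rows of a different mall.
selected_mall = "Shopping Morumbi"

# Keep only the rows whose 'ID Loja' matches the selected mall.
dfmall = dfmain[dfmain["ID Loja"] == selected_mall]
print(dfmall)
```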
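A possible version of this step, assuming the spreadsheet has `Produto` (product) and `Quantidade` (quantity) columns; both names, `dfproducts`, and the rows are assumptions for illustration.

```python
import pandas as pd

# Made-up rows standing in for the real spreadsheet.
dfmain = pd.DataFrame(
    {
        "ID Loja": ["Shopping Morumbi"] * 3 + ["Shopping Vila Velha"],
        "Produto": ["Terno", "Camisa", "Terno", "Camisa"],
        "Quantidade": [2, 4, 1, 1],
        "Valor Final": [1400.0, 600.0, 700.0, 150.0],
    }
)
dfmall = dfmain[dfmain["ID Loja"] == "Shopping Morumbi"]

# Units sold and sales value per product in the selected mall.
dfproducts = (
    dfmall.groupby("Produto", as_index=False)[["Quantidade", "Valor Final"]]
    .sum()
    .sort_values("Quantidade", ascending=False)
)
print(dfproducts)
```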
If we want to analyze a specific period, we can create a filter for that date range and then extract the data more precisely. This way I found which product sold the most in January, but the dates can be changed to any period, for example from `2019-02-18` until `2019-07-23`.
As a result, we have:
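A sketch of such a date filter with `Series.between`, ranking products by sales value. The `Data` and `Produto` column names and the rows are assumptions; `start` and `end` are the only values to change for another period.

```python
import pandas as pd

# Made-up rows for one mall; 'Data' is already a datetime column here.
dfmall = pd.DataFrame(
    {
        "Data": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-10"]),
        "Produto": ["Terno", "Camisa", "Terno"],
        "Valor Final": [1400.0, 150.0, 700.0],
    }
)

# Period to analyze (both ends inclusive).
start, end = pd.Timestamp("2019-01-01"), pd.Timestamp("2019-01-31")

# Keep the rows inside the period, then total the value per product.
in_period = dfmall[dfmall["Data"].between(start, end)]
by_product = in_period.groupby("Produto")["Valor Final"].sum()

top_product = by_product.idxmax()
print(top_product, by_product.max())
```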
To view the total value of sales per month, we can use the command “resample(parameters)” and select M (month) as the parameter on the “data” (date) column. This creates another DataFrame with the selected data.
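A sketch of the monthly resample, assuming the date column is called `Data` and using the hypothetical name `dfmonth`. It uses `"MS"` (month start) instead of `"M"`, because newer pandas releases deprecate `"M"` in favor of `"ME"`; the only visible difference is that each month is labeled by its first day rather than its last.

```python
import pandas as pd

# Made-up rows standing in for the real spreadsheet.
dfmain = pd.DataFrame(
    {
        "Data": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-10"]),
        "Valor Final": [1000.0, 500.0, 500.0],
    }
)

# Group the rows by calendar month of 'Data' and sum the sales value.
dfmonth = dfmain.resample("MS", on="Data")["Valor Final"].sum().reset_index()
print(dfmonth)
```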
Finally, plot time. To plot, we have to import another Python library.
As a result, we have an image showing the sum of all sales for each month.
We can do the same to plot the mean of sales.
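A minimal plotting sketch, assuming the library is matplotlib and choosing a bar chart; the chart type, labels, column name `Data`, and rows are all assumptions for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the real spreadsheet.
dfmain = pd.DataFrame(
    {
        "Data": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-10"]),
        "Valor Final": [1000.0, 500.0, 500.0],
    }
)

# Monthly totals ("MS" labels each month by its first day).
dfmonth = dfmain.resample("MS", on="Data")["Valor Final"].sum().reset_index()

# One bar per month, labeled as year-month.
fig, ax = plt.subplots()
ax.bar(dfmonth["Data"].dt.strftime("%Y-%m"), dfmonth["Valor Final"])
ax.set_xlabel("Month")
ax.set_ylabel("Total sales (R$)")
ax.set_title("Sum of sales per month")
# In Jupyter the figure appears at the end of the cell; in a script, call plt.show().
```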
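A sketch of the same idea with the aggregation as a parameter, so that sum and mean share one function. The helper `plot_monthly` is hypothetical, matplotlib is an assumed library choice, and the rows are made up. Here the mean is taken over the individual sales in each month.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the real spreadsheet.
dfmain = pd.DataFrame(
    {
        "Data": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-10"]),
        "Valor Final": [1000.0, 500.0, 500.0],
    }
)


def plot_monthly(df, how, title):
    """Bar chart of 'Valor Final' aggregated per month with `how` ('sum' or 'mean')."""
    monthly = df.resample("MS", on="Data")["Valor Final"].agg(how)
    fig, ax = plt.subplots()
    ax.bar(list(monthly.index.strftime("%Y-%m")), monthly.values)
    ax.set_xlabel("Month")
    ax.set_ylabel("Valor Final (R$)")
    ax.set_title(title)
    return ax


ax = plot_monthly(dfmain, "mean", "Mean sale value per month")
```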
Well, “This is it”.
Studying alone can be hard sometimes, but it is possible.
Now, answering the previous questions:
Q: Which mall had the most sales?
A: Shopping Vila Velha, R$ 1.615.271,00
Q: What is the total value of all sales from all malls?
A: R$ 38.959.752,00
Q: What is the mean of all sales from the malls?
A: R$ 1.558.390,08
Q: Verify some data over a date interval
A: In January (2019-01-01 to 2019-01-31), Shopping Morumbi sold a total of R$ 9.100,00 in “Terno”.
Q: Plot some data
A:
=>GitHub repository<=
https://github.com/saulotp/ds02
Contact me:
- saulodetp@gmail.com
- Ig: saulodetp
- LinkedIn: saulodetp