# Visual representations

This is a follow-up to my last publication about learning business statistics in order to find a data science job. If you didn’t read my first post, please check it out here.

So, this text will cover the first chapter and half of the second chapter of the book Statistics for Business and Economics by David R. Anderson, Dennis J. Sweeney, and Thomas A. Williams. If you want to follow along, I’m using the eleventh edition.

If you want to follow this path, you will need to rely more on learning from documentation than on free internet content.

Why? Well, this text will cover a little bit of some very early concepts, but mostly visual representations, and doing that in Python can be very tricky. The trickiest part is figuring out how to input your data and how that input will produce the desired output. I looked into some content online to try to help me, and a lot of it just makes you reproduce exactly what the person on the other side of the screen is doing, which is very far from reality (even the exercises meant to make you think are pretty much copy and paste).

Given that situation, I had to choose my tools and figure out how they work together. As my computer is very old (I’m running on 4 GB of RAM, so pray for me), I’m going with Jupyter Notebooks as my IDE, so I can run each individual cell; pandas for data manipulation, just because I have more affinity with it now, though I’m learning to switch to NumPy when necessary; and Plotly for data visualization because, apparently, Matplotlib doesn’t work so well with Jupyter, and I hit an annoying bug running it.

Another thing that really surprised me already in the first chapter was the number of concepts covered by the book that you often hear associated with data science. Inference, data mining, time series, quantitative data, all that jazz, really well explained and really giving me some ground on all of that.

So now that you are up to speed, let’s get into what I’ve done so far.

## Numerical facts and statistics

The point of the first chapter is to ground you on what statistics is really about, in a very clever way: hooking you with the difference between the discipline itself and what you read in the newspaper as “statistics” about some subject.

So what is statistics? Collecting, analyzing, presenting, and interpreting data. But is that a science? Well, I’m definitely not the one you should ask, but according to this book it is considered an art, because data can have more than one interpretation, which is often the source of discussions over a data set, or over which way to find the answer to a problem more objectively.

So, now I’m going to talk about some concepts and clarifications I didn’t see out there, but that are very important when solving a problem.

First, let’s talk about time series. A time series is a data set collected over time, while cross-sectional data is collected at a single point in time. And if you are like me and have already looked into some things about machine learning, you know that time series forecasting is very hard to achieve, so there is a high chance that most of what you see, up to the point where you learn something like Prophet, is going to be cross-sectional data.
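To make the distinction concrete, here is a tiny sketch in pandas with made-up numbers (the sales figures and store names are just illustrative, not from the book):

```python
import pandas as pd

# Time series: the SAME thing measured repeatedly over time
# (hypothetical monthly sales of one store)
time_series = pd.Series(
    [120, 135, 128, 150],
    index=pd.period_range("2021-01", periods=4, freq="M"),
    name="monthly_sales",
)

# Cross-sectional: DIFFERENT elements measured at one point in time
# (hypothetical sales of four stores, all in January 2021)
cross_section = pd.Series(
    [120, 98, 143, 110],
    index=["Store A", "Store B", "Store C", "Store D"],
    name="sales_jan_2021",
)

print(time_series)
print(cross_section)
```

Same kind of numbers in both, but in the first one the index is time, and in the second one the index is the elements observed at a single moment.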

Second, population, which is the set of all the elements you’re working on. A population is basically the blob of all your data, and it has a shape and characteristics. Now, a sample is a smaller part of your population, and a good sample is one that has the same shape and characteristics as your blob. There will be some deviations, it is a smaller blob after all, but it is good to try something out on a sample before applying it to your whole population.
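In pandas this idea is one method call away. A minimal sketch, assuming a made-up population of 1,000 values (the numbers are illustrative only):

```python
import pandas as pd

# Hypothetical "population": every value we care about
population = pd.Series(range(1, 1001), name="values")

# A sample is a smaller subset drawn from it; random_state fixes
# the draw so the result is reproducible.
sample = population.sample(n=100, random_state=42)

# A good sample keeps the population's shape and characteristics,
# so its mean should land close to the population mean.
print(population.mean())
print(sample.mean())
```

The two means won’t match exactly (it’s a smaller blob after all), but they should be close, which is exactly the deviation the chapter talks about.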

And lastly: everybody has played some kind of lottery at least once, for fun or not, and also asked themselves, “Well, with the right amount of data, would I hit the right numbers?”. That, my friends, is a very good example of statistical inference, where you use data to estimate something or to test a hypothesis. Basically something a data scientist needs to be very familiar with.

## How do I see my blob?

The second and third chapters are focused on how you can summarize your data, which can be done in a number of ways. The second chapter focuses on the visual representations of descriptive statistics, which is the set of techniques you can use to make a summary of your data.
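Before the graphs, it’s worth knowing that pandas already gives you the basic descriptive summaries in one call. A small sketch with made-up hour counts (not the book’s data):

```python
import pandas as pd

# Hypothetical data set, just to show pandas' built-in summaries
df = pd.DataFrame({"hours": [2, 5, 7, 3, 9, 12, 4, 6]})

# describe() returns count, mean, std, min, quartiles and max at once
print(df["hours"].describe())

# value_counts() gives the frequency of each value, another
# basic descriptive summary the chapter builds graphs from
print(df["hours"].value_counts().sort_index())
```

These numbers are the raw material; the chapter’s visual representations are mostly ways of drawing them.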

Summarizing your data is important because it can give you the shape and characteristics of your population, a very important step if you want to apply some algorithm to it. And as I said in the first part, I got very swamped here, starting very confident about my pace, but then it got really slow. Just to show you, here is my first graph, from question 14 of chapter 1; it took me three days to come up with this, a lot more than I expected:

```python
# 14.
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go

csm_data = {"Manufacturer": ['General Motors', 'Ford', 'DaimlerChrysler', 'Toyota'],
            "2004": [8.9, 7.8, 4.1, 7.8],
            "2005": [9, 7.7, 4.2, 8.3],
            "2006": [8.9, 7.8, 4.3, 9.1],
            "2007": [8.8, 7.9, 4.6, 9.6]}
df_csm_data = pd.DataFrame(csm_data)
print(df_csm_data)

# A)
"""Setup data: transpose so years become rows"""
tdf_csm_data = df_csm_data.set_index('Manufacturer').T.reset_index()
year = tdf_csm_data.iloc[:, 0]
car_production_values = [tdf_csm_data["General Motors"],
                         tdf_csm_data["Ford"],
                         tdf_csm_data["DaimlerChrysler"],
                         tdf_csm_data["Toyota"]]
print('\n', tdf_csm_data)

"""Build the line + scatter figure"""
fig1 = px.line(tdf_csm_data,
               x=year,
               y=car_production_values)
fig2 = px.scatter(tdf_csm_data,
                  x=year,
                  y=car_production_values,
                  size=[5, 5, 5, 5],
                  opacity=1)
fig3 = go.Figure(data=fig1.data + fig2.data)
fig3.update_layout(title='Global Auto Production',
                   xaxis_title='Year',
                   yaxis_title='Car production (millions)',
                   legend_title='Automotive Brands')
fig3.show()
```

And here is my last graph made in question 21, chapter 2:

```python
# 21.
computer = pd.read_excel('~/Documentos/projetos_ds/livro_1/excel_files/ch_02/Computer.xlsx')

# A/B) build the frequency distribution by class
interval_1 = (computer['Hours'] <= 3).sum()
interval_2 = ((computer['Hours'] >= 4) & (computer['Hours'] <= 7)).sum()
interval_3 = ((computer['Hours'] >= 8) & (computer['Hours'] <= 11)).sum()
interval_4 = ((computer['Hours'] >= 12) & (computer['Hours'] <= 14)).sum()  # up to 14, so no hour falls between classes
interval_5 = (computer['Hours'] >= 15).sum()
computer_freq = {'Intervals': ['<= 3', '4-7', '8-11', '12-14', '>= 15'],
                 'Frequency': [interval_1, interval_2, interval_3, interval_4, interval_5]}
df_computer_freq = pd.DataFrame(computer_freq)
df_computer_freq['Relative Frequency'] = (df_computer_freq['Frequency'] / df_computer_freq['Frequency'].sum())
df_computer_freq['Cumulative Frequency'] = df_computer_freq['Frequency'].cumsum()
print(df_computer_freq)

# C) histogram of the frequencies
fig = px.histogram(x=df_computer_freq['Intervals'],
                   y=df_computer_freq['Frequency'],
                   title='Hours of computer home usage',
                   labels={'x': 'Intervals', 'y': 'Frequency'})
fig.show()

# D) ogive: cumulative frequency at each class upper limit
fig = px.scatter(x=[3, 7, 11, 14, 17],
                 y=df_computer_freq['Cumulative Frequency'])
fig2 = px.line(x=[3, 7, 11, 14, 17],
               y=df_computer_freq['Cumulative Frequency'])
fig3 = go.Figure(data=fig.data + fig2.data)
fig3.update_layout(title='Ogive',
                   xaxis_title='Classes',
                   yaxis_title='Cumulative Frequency')
fig3.show()
```

Much cleaner than the first one, right? Another thing: the first graph is a time series, and the last one is an ogive. An ogive is a line-and-scatter graph that displays the cumulative frequencies of a data set, which is what you’re mostly displaying in the first half of the second chapter: histograms, bar charts, scatter plots, dot charts, all those things that display frequencies of elements in your population, helping you visualize a mode, a mean, or things like that.

Also, they are all one-dimensional representations, meaning these graphs show only one variable from your data set: no correlation, no variable relationships. Pretty good stuff for learning how to make a graph in Python and, most importantly, how to improve it.

## What’s next?

Basically, the next post is going to be about the last part of chapter two, where more dimensions are added to the graphs, and some things we can do with them, plus the third chapter, which deals with numerical summaries of data, pretty much what you see on the news every day.

I’m looking forward to it. Thank you if you read this far; check out the last post about how it all started, leave a clap, and feel free to comment if you want to add something to this text.

I haven’t put this on GitHub yet, but I’m leaving this here, and my next update to this post should include the code. If you’re reading this, you’re probably reading it very early, thanks for the support!

Take care!


## Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by

## Rômulo Peixoto

Learning Data Science in a humorous way.