Tell Compelling Stories With Your Data

Published in The Deep Hub · 15 min read · Jul 19, 2024

If you have any experience in data science, machine learning, or anything related to plotting and displaying data, you have heard the title of this article a lot. Displaying data means showing numbers to people who (in my opinion), in the moment you are showing them, kind of don’t want to see them. A normal human doesn’t like numbers. Our goal as data people (haha, I will call us data people, since it covers everyone who deals with data) is to make people like complex numbers and feel attracted to the data. They should feel like they are best friends with the data after they see it.

So why am I writing this? Well, the big reason is to improve my own data displaying skills, and also to help both you and me learn important data displaying techniques and how to find out a lot about our data in just a few plots. I envision this letting me (and you, if you read the whole thing, of course) go into any problem that deals with data and, after typing a few lines of code, immediately get to know the data. I also see this as helpful for my statistical knowledge. As I am about to go and take the SAT, the more I can analyze graphs, the better. So let us begin!!!!

Plotting

I know Python really well, so I will be using Python (maybe there is some Python-to-R translator out there so you can translate the code to R). I know R is really popular with data people, but Python’s syntax is very readable and really easy to understand. What in Python are we going to use? How are we going to approach this learning? Well, here is the plan:

Display Regression Data (of course make it very pretty)
Display Classification Data (of course make it very pretty)
Display Image Data (of course make it very pretty)
Display a bunch of popular statistical plots of some types of data (of course make it really pretty)

I feel like each step gets a little more challenging. I want it that way so that I struggle a bit, but I also think that, concept-wise, this is much easier to understand than anything I have done before. So let us begin. For each section, before I actually plot anything, I will try to pick a plot (without looking things up) and explain why I think it is a good plot; then I will look up popular plots (depending on the problem, of course) and plot those. My two main libraries will be matplotlib and seaborn. I don’t know of any other plotting libraries in Python, but if I do discover one, I will of course use it.

Displaying Regression Data

I’m not sure if regression data is the right term; I kind of just called it that. I think the right term might be continuous data. What I mean is data with an x and a y that you could fit a line to. I think we can use a scatter plot for this. A scatter plot just plots each point, which is an x and y coordinate, on a Cartesian plane. Just think dots on a graph. So here is the code for the X and y data, which I generated with ChatGPT. You can do this with your X_train and y_train data, or with pretty much any X and y data.

import numpy as np
import matplotlib.pyplot as plt

X = np.array([ 0.        ,  0.1010101 ,  0.2020202 ,  0.3030303 ,  0.4040404 ,
         0.50505051,  0.60606061,  0.70707071,  0.80808081,  0.90909091,
         1.01010101,  1.11111111,  1.21212121,  1.31313131,  1.41414141,
         1.51515152,  1.61616162,  1.71717172,  1.81818182,  1.91919192,
         2.02020202,  2.12121212,  2.22222222,  2.32323232,  2.42424242,
         2.52525253,  2.62626263,  2.72727273,  2.82828283,  2.92929293,
         3.03030303,  3.13131313,  3.23232323,  3.33333333,  3.43434343,
         3.53535354,  3.63636364,  3.73737374,  3.83838384,  3.93939394,
         4.04040404,  4.14141414,  4.24242424,  4.34343434,  4.44444444,
         4.54545455,  4.64646465,  4.74747475,  4.84848485,  4.94949495,
         5.05050505,  5.15151515,  5.25252525,  5.35353535,  5.45454545,
         5.55555556,  5.65656566,  5.75757576,  5.85858586,  5.95959596,
         6.06060606,  6.16161616,  6.26262626,  6.36363636,  6.46464646,
         6.56565657,  6.66666667,  6.76767677,  6.86868687,  6.96969697,
         7.07070707,  7.17171717,  7.27272727,  7.37373737,  7.47474747,
         7.57575758,  7.67676768,  7.77777778,  7.87878788,  7.97979798,
         8.08080808,  8.18181818,  8.28282828,  8.38383838,  8.48484848,
         8.58585859,  8.68686869,  8.78787879,  8.88888889,  8.98989899,
         9.09090909,  9.19191919,  9.29292929,  9.39393939,  9.49494949,
         9.5959596 ,  9.6969697 ,  9.7979798 ,  9.8989899 , 10.        ])

y = np.array([-0.56788088,  1.26589747,  1.08627425,  0.60354599,  0.60452786,
         1.39564374,  2.12172136,  1.59556151,  1.30226343,  2.44866732,
         1.95718929,  2.9761534 ,  2.52881873,  2.05636421,  3.34146315,
         2.13329713,  2.3495311 ,  2.97422536,  2.43056355,  1.67422831,
         1.78733175,  1.60921901,  1.84900907,  2.79998462,  1.72041244,
         0.16908494,  1.86163872,  0.62143103,  0.65862417,  1.07016857,
         0.45292663,  0.489355  , -0.58044053, -0.59595413, -0.51806914,
        -1.30623871, -1.32535287, -1.20818851, -2.8850232 , -1.93288002,
        -3.21166794, -2.79179644, -2.11404187, -2.1899814 , -2.17178458,
        -2.80072544, -2.46907737, -3.28491015, -2.97664276, -2.49937246,
        -2.54051741, -3.71927225, -2.12097507, -1.70392433, -1.73857926,
        -1.12087076, -1.54266643, -1.33019865,  0.02482211, -0.33741036,
        -0.29987011,  0.09406606, -0.35278573,  0.20988918, -0.08249019,
         0.50139092,  1.31406254,  1.62319323,  1.29241875,  1.40747192,
         1.96862544,  2.21011652,  2.20199745,  2.12302654,  2.56086233,
         1.76075256,  1.95540709,  2.50184777,  2.67419324,  2.26974669,
         1.96658097,  2.08010722,  2.7751688 ,  2.25591548,  1.88968532,
         1.60349614,  0.99669517,  1.58909715,  2.4751188 ,  1.93359603,
         0.18003145,  0.57448863, -0.36243992, -0.02639957, -0.39099368,
        -0.60032941, -1.15431415, -0.7900333 , -0.0547462 , -1.04741739])

It gave me a lot of data to say the least, but here is how I would plot it (this will be a pretty plot).

plt.scatter(X, y)
plt.title("Relationship Between X and Y")
plt.xlabel("X Values")
plt.ylabel("Y Values")
plt.show()

It is a very simple plot, and it is the prettiest I could think to make it, but I wonder what else I could do to make it prettier.

We need to make this prettier; we need to make it spicy and attractive to the eye. For normal data displaying purposes, this graph already shows us the relationship (for some reason it’s an M). But for showing your data to important people who are bored, you need some spice. Matplotlib offers a lot of customization. We need to show more detail, so why don’t we show some smaller increments on the axes? Also, let’s make the background black and the dots yellow. Why don’t we also try to show the mean and the median with a line on the plot? And if I find anything else that does something cool, we could do that too!!!

So, I was doing some reading, and I have finally figured out what this ax and fig mean. If you already know what I am talking about, you can skip the next few lines. For those who don’t: every time I want to change the plot, I call something like plt.scatter or plt.xlabel. These functions all act on the current plot, and this interface is called the pyplot API. The ax and fig belong to the object-oriented (OO) API, where you call methods on the figure and axes objects directly. I think the OO API is much, much more customizable. I will redo this plot using the OO API.

## OO API
fig, ax = plt.subplots()

fig.set_size_inches(9,9)

ax.scatter(X, y)

# Title and Axis Names
ax.set_title("Graph of Relationship Between X and Y")
ax.set_ylabel("Dependent Variable")
ax.set_xlabel("Independent Variable")

plt.show()

This code outputs essentially the same plot (just a bit bigger, with slightly different labels). What is good about it is that I now understand more about ax, which is an Axes object (one plotting area together with its x and y axes), and fig, which is the whole figure itself. I believe fig controls the entire canvas, while ax controls what is inside the plotting area (i.e. the data points) and the axes themselves (color, width, increments, all that).
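To make the split between fig and ax more concrete, here is a minimal sketch with two plotting areas on one figure. It assumes made-up data (a seeded noisy sine wave, not the ChatGPT data above), and the names make_two_panel_figure, x_demo, and y_demo are placeholders of my own.

```python
import numpy as np
import matplotlib.pyplot as plt


def make_two_panel_figure(x_values, y_values):
    """Build one figure holding two axes drawn from the same data."""
    # fig is the whole canvas; axes is an array with one Axes per panel
    fig, axes = plt.subplots(nrows=1, ncols=2)

    # Figure-level settings affect everything on the canvas
    fig.set_size_inches(12, 5)
    fig.suptitle("One Figure, Two Axes")

    # Axes-level settings only affect the panel they are called on
    axes[0].scatter(x_values, y_values)
    axes[0].set_title("Scatter")
    axes[0].set_xlabel("Independent Variable")
    axes[0].set_ylabel("Dependent Variable")

    axes[1].plot(x_values, y_values)
    axes[1].set_title("Line")
    axes[1].set_xlabel("Independent Variable")

    return fig, axes


# Made-up demo data: a noisy sine wave (an assumption, not the article's data)
rng = np.random.default_rng(0)
x_demo = np.linspace(0, 10, 100)
y_demo = np.sin(x_demo) + rng.normal(scale=0.3, size=x_demo.size)

fig, axes = make_two_panel_figure(x_demo, y_demo)
```

Calling plt.show() afterwards (or letting a notebook render the figure) displays both panels. The size and super title are set once on fig, while each title and label is set on one specific ax.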

Actually, now that I think about it, the ax and fig (the OO API) are, I think, most useful when you have multiple plots on the page, because you can control the layout and everything else for the whole set of plots. When you have a single plot, it can be done using the pyplot API. I will use the pyplot API (plus one ax call for the background color) to make this graph prettier.

ax = plt.axes()
ax.set_facecolor("black")

plt.scatter(X, y, c='yellow')

# Plot title
plt.suptitle("Relationship Between X and Y\n", y=1.0001, fontsize=20, color='black')
plt.title("Data from Chat-GPT", fontsize=15, fontweight=0, color='black', loc='center', style='italic')
plt.xlabel("X Values")
plt.ylabel("Y Values")


plt.grid(visible=True)
plt.show()

This plot is much, much better. I still don’t love how it looks, but I think this is a good final look: there is a super title and then the normal title, and I also like the color scheme. If you want more detail, you could add minor ticks on both axes and maybe change the background color outside the plotting area.
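Here is a rough sketch of those extra touches: minor ticks, a background color outside the plotting area, and lines for the mean and median. It is one way to do it, not the only one. The data is a seeded stand-in that I made up, and styled_scatter and the chosen colors are my own assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt


def styled_scatter(x_values, y_values):
    """Scatter plot with a dark theme, minor ticks, and mean/median lines."""
    fig, ax = plt.subplots(figsize=(9, 6))

    # Background of the plotting area and of the area around it
    ax.set_facecolor("black")
    fig.patch.set_facecolor("lightgray")

    ax.scatter(x_values, y_values, c="yellow")

    # Horizontal reference lines for the mean and median of y
    y_mean = float(np.mean(y_values))
    y_median = float(np.median(y_values))
    ax.axhline(y_mean, color="cyan", linestyle="--", label=f"Mean = {y_mean:.2f}")
    ax.axhline(y_median, color="magenta", linestyle=":", label=f"Median = {y_median:.2f}")

    # Minor ticks give the smaller increments on both axes
    ax.minorticks_on()
    ax.grid(which="major", alpha=0.5)
    ax.grid(which="minor", alpha=0.2)

    ax.set_title("Relationship Between X and Y")
    ax.set_xlabel("X Values")
    ax.set_ylabel("Y Values")
    ax.legend(loc="upper right")
    return fig, ax, y_mean, y_median


# Stand-in data (an assumption): a noisy sine wave with a fixed seed
rng = np.random.default_rng(1)
x_demo = np.linspace(0, 10, 100)
y_demo = 2.5 * np.sin(x_demo) + rng.normal(scale=0.5, size=x_demo.size)

fig, ax, y_mean, y_median = styled_scatter(x_demo, y_demo)
```

With your own data you would call styled_scatter(X, y) and then plt.show(). The legend carries the actual mean and median values, so the reader does not have to guess them from the lines.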

Display Classification Data

What do I mean by classification data? Well, I will show you. I will be using the famous iris dataset, and I will display it the ugly way first, with so little information that your boss will hate you. This is what I sometimes do. I just plot like this:

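from sklearn.datasets import load_iris

# Load the iris dataset (features in data.data, class ids in data.target)
data = load_iris()

fig, ax = plt.subplots()

scatter = ax.scatter(data.data[:,0], data.data[:,1], c = data.target)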

And out comes this hideous thing:

There is no legend, so we don’t know which color means what. We have no title, no subtitle, no labels on either axis, the background color is ugly, and I think we need a grid. So let’s get to work.

fig, ax = plt.subplots()
scatter = ax.scatter(data.data[:,0], data.data[:,1], c = data.target)
ax.set(title="Iris Dataset", xlabel=data.feature_names[0], ylabel=data.feature_names[1], frame_on= True, xmargin=0.2, ymargin=0.2)

ax.grid()


# Changing the figure (outer) and axes (inner) background colors
fig.patch.set_facecolor('xkcd:mint green')
ax.set_facecolor('orange')


_ = ax.legend(scatter.legend_elements()[0], data.target_names, loc = "lower right", title = "Classes")


# Saving the plot
fig.savefig("test.png")

I realize that this plot doesn’t look too fancy, but it looks good. There is a limit to how much you can customize a plot before it becomes annoying to look at. I added a face color (that is the term) for the plotting area and a background color for the figure. In this case we are using fig and ax because getting the legend entries from the scatter object is natural that way (pyplot’s plt.scatter accepts the same c argument, so it can color points by class too), and also because it’s easier to understand the actual components that make up the plot when you use fig and ax. The method names make sense without you needing to think about them.

See, it’s so much better, so much more attractive. The colors draw your eyes to it. I do think, though, that maybe I should have left the margins on the axes alone, and I realize this now because the default margins make outliers easier to see. I have learned that there is a fine line between a plot where the distribution of the data still looks the same (we did manipulate the plot, after all) and one where the outliers are way too far out, because then your whole plot looks really weird. That wasn’t really the case with our data here. Let’s move on to the next type of data.

Display Image Data

Displaying image data is very simple. I don’t really know how I would make it prettier; maybe there is a library that displays images better. We are of course going to use the famous MNIST dataset for this. While these are very simple images, displaying more complex ones works (mostly) the same way.

from tensorflow.keras.datasets import mnist
import matplotlib.pyplot as plt

# Loading Data
(X_train, y_train), (X_test, y_test) = mnist.load_data()

#Show a single image
plt.imshow(X_train[0])

And we simply get this ugly thing. But when you think about it, when you look at an image (at least in Python) you aren’t really trying to display an analysis of the data; you are just trying to show the contents of the data file, if that makes sense. We can make it prettier, and I will try, but not much can be done or needs to be done.

We can add a super title and a subtitle, we can add another title saying what the actual label of the image is, and then we can make it a bit bigger. I’m not doing too much with the color.

# Making it a bit bigger
plt.figure(figsize=(8,8))


# Titles
plt.suptitle("First Number in MNIST", y=0.93)
plt.title("Data from MNIST Image Dataset")
plt.title(f"Actual Label:  {y_train[0]}", loc='right')


# Axis Labels (with metrics)
plt.xlabel("Width (pixels)")
plt.ylabel("Height (pixels)")



plt.imshow(X_train[0])

We get this thing:

Something I am just noticing about plt.figure is that it gives you the same thing as the fig you get back from plt.subplots in the OO API. I never realized that they both control the whole canvas: the overall dimensions and everything outside the plotting area, rather than what is inside it. What I want to do next is display 5 random images and make each one look like the image above.

import random

def plot_images(datax, datay):
    # randrange never returns len(datax), so the index is always valid
    random_index = random.randrange(len(datax))

    # Draw on the current axes instead of opening a new figure
    plt.title(f"Actual Label:  {datay[random_index]}")

    plt.xlabel("Width (pixels)")
    plt.ylabel("Height (pixels)")

    plt.imshow(datax[random_index])

Above is a function that displays a random image with its label. It draws on whichever axes is currently active instead of creating its own figure, so we can use it with subplots. I don’t know if it is the best code possible, and in this case I don’t think it really matters, but the function is pretty simple.

fig = plt.figure(figsize=(20,5))
fig.suptitle("Random Numbers in MNIST")

# add_subplot(rows, columns, position): one row, five columns
ax=fig.add_subplot(1,5,1)
plot_images(X_train, y_train)
ax=fig.add_subplot(1,5,2)
plot_images(X_train, y_train)
ax=fig.add_subplot(1,5,3)
plot_images(X_train, y_train)
ax=fig.add_subplot(1,5,4)
plot_images(X_train, y_train)
ax=fig.add_subplot(1,5,5)
plot_images(X_train, y_train)

plt.show()

Now this code looks hideous, and I think there has to be a better way to do it. But it achieved what we wanted. Think about it: your audience isn’t going to see your code, they just want to know what the images look like. And for personal use, I just want to know what the images look like. Do I need to resize them? Do I need to change their colors? What do I need to do? Getting a random sample gives you an idea of all of that and more.
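One possible "better way" is a loop over a row of subplots. This is only a sketch: plot_random_images is a helper name I made up, and the demo data is random noise with MNIST’s 28x28 shape (an assumption, so the block runs without downloading anything).

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_random_images(images, labels, count=5, seed=None):
    """Show `count` randomly chosen images in one row, each titled with its label."""
    rng = np.random.default_rng(seed)
    # Sample distinct indices so the same image is not shown twice
    chosen = rng.choice(len(images), size=count, replace=False)

    # One row of `count` axes (count should be at least 2 for this loop)
    fig, axes = plt.subplots(1, count, figsize=(3 * count, 3.5))
    fig.suptitle("Random Sample of Images")

    for ax, index in zip(axes, chosen):
        ax.imshow(images[index], cmap="gray")
        ax.set_title(f"Actual Label: {labels[index]}")
        ax.axis("off")  # pixel ticks add little on small thumbnails

    return fig, axes, chosen


# Stand-in data with MNIST's image shape; random noise, not real digits
demo_rng = np.random.default_rng(0)
demo_images = demo_rng.integers(0, 256, size=(20, 28, 28), dtype=np.uint8)
demo_labels = demo_rng.integers(0, 10, size=20)

fig, axes, chosen = plot_random_images(demo_images, demo_labels, count=5, seed=42)
```

With the real dataset the call would be plot_random_images(X_train, y_train), followed by plt.show(). Passing a seed is optional; it just makes the "random" sample repeatable.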

There are more images than this, just didn’t want to show you

Next, let us move onto displaying statistical plots and all of that, this is where some learning in the math department happens.

Display a Bunch of Popular Statistical Plots of Some Types of Data

So, our first task is to look up 5 useful statistical plots. I want to do five because it is my lucky number, and also because I feel like five plots will give you enough information about any data to continue on (in my case) to building a model around it with TensorFlow or PyTorch. So let us begin. Here are our 5 plots:

Of course, we have the famous histogram. We can learn about the distribution of our data (we can even add a curve for the distribution), and we can also add lines marking the median and mode.
We also have the pie chart. It shows visually how much of each thing (we could say category) you have in your data. It is useful for seeing what your model will train on and whether it will be overly trained on one type (label) versus another.
We can also do a box plot. We get so much information: the range, the median, the quartiles. We also get to know whether the dataset (or the column you are plotting) has any outliers. And it’s very fun to look at.
The violin plot (best instrument, by the way). It is extremely similar to the box plot, but it also shows the distribution of the data. In real-world scenarios you could just use a violin plot instead of a box plot plus a histogram, but it doesn’t matter that much.
For natural language processing, if you would like a better visualization of word frequency, you can use a word cloud. It’s a colorful, less mathy way of showing word counts.

With those 5 plots chosen, let us begin the plotting. I will immediately try to make these plots simple yet nice, not crazy with the colors like before, because think about it: when you want to see the data, you don’t want to code too much, you want to look at the data more. As a shortcut, I will be using ChatGPT to make the data.

Histogram

We have this dataset of scores, and with a histogram we can see how many scores fall into each range. We can also see the distribution.

scores = [55, 62, 75, 80, 85, 90, 45, 70, 76, 82, 88, 91, 94, 96, 60, 72, 65, 58, 78, 83, 87, 93, 98, 100, 68, 74, 81, 89, 66, 77]

# Set the style first so it applies to the figure created below
plt.style.use("bmh")

plt.figure(figsize=(8,8))

plt.suptitle("Student Test Scores", y=0.94)
plt.title("Data from GPT")
plt.xlabel("Scores")
plt.ylabel("Frequency")

# plt.hist uses 10 equal-width bins by default
plt.hist(scores)

plt.show()

The code here is nice and simple, and it displays a clean histogram. You can see that the scores are bunched a bit toward the right (the higher end). I will not talk about what that means statistically here, but if we were to make a model, it might be trained too much on the higher scores, or tend to predict higher scores.
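As a small extension, here is a sketch of the same histogram with vertical lines for the mean and the median. I left the mode out because every score in this list appears exactly once, so a mode would not tell us anything. The colors and line styles are my own choices.

```python
import statistics
import matplotlib.pyplot as plt

scores = [55, 62, 75, 80, 85, 90, 45, 70, 76, 82, 88, 91, 94, 96, 60,
          72, 65, 58, 78, 83, 87, 93, 98, 100, 68, 74, 81, 89, 66, 77]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)

fig, ax = plt.subplots(figsize=(8, 6))

# bins=10 is the same as the default, written out to make it visible
counts, bin_edges, _ = ax.hist(scores, bins=10, edgecolor="black")

# Vertical reference lines for the two summary statistics
ax.axvline(mean_score, color="red", linestyle="--",
           label=f"Mean = {mean_score:.1f}")
ax.axvline(median_score, color="green", linestyle=":",
           label=f"Median = {median_score:.1f}")

ax.set_title("Student Test Scores")
ax.set_xlabel("Scores")
ax.set_ylabel("Frequency")
ax.legend()
```

Calling plt.show() displays it. The mean lands slightly below the median here, which fits the picture of a few low scores pulling the average down while most scores sit at the higher end.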

Pie Chart

Now, a pie chart is also something that can tell you the distribution of something, but not the same way the histogram does. It can tell you how much of what categories make up a certain dataset. This can let you see whether you have a high amount of one type of data over the other.

# Sample data
labels = ['Apple', 'Samsung', 'Huawei', 'Xiaomi', 'Oppo', 'Others']
sizes = [30, 25, 15, 10, 10, 10]
colors = ['#ff9999','#66b3ff','#99ff99','#ffcc99','#c2c2f0','#ffb3e6']
explode = (0.1, 0, 0, 0, 0, 0)  # explode the first slice (Apple)

plt.pie(sizes, labels=labels, explode=explode, colors=colors, shadow=True)
plt.title("Market Share (by %)")
plt.show()

This code outputs a simple pie chart. The data here isn’t exactly what real data would look like, but you can think of each label as one of your target classes and each size as that class’s percentage of all the target labels (of course, this is all really obvious).

I think you can all tell what the explode thing does: it pulls that slice out from the center so it stands out. I think for real-world data you would have to do some of the calculations (like the percentages) yourself.
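For real-world labels, the counting can be done in code rather than by hand. Below is a minimal sketch that assumes a made-up list of target labels (raw_labels is hypothetical); Counter does the counting and autopct lets matplotlib print the percentages.

```python
from collections import Counter
import matplotlib.pyplot as plt

# Hypothetical raw target labels, standing in for a real label column
raw_labels = ["cat"] * 6 + ["dog"] * 3 + ["fox"] * 1

# Count how many times each class appears
label_counts = Counter(raw_labels)
names = list(label_counts.keys())
sizes = list(label_counts.values())

# Pull the largest slice away from the center
largest = max(sizes)
explode = [0.1 if size == largest else 0 for size in sizes]

fig, ax = plt.subplots()
# pie() turns the raw counts into shares; autopct prints them as percentages
wedges, texts, autotexts = ax.pie(
    sizes, labels=names, explode=explode, autopct="%1.1f%%"
)
ax.set_title("Share of Each Class")
```

Because pie works from raw counts, the sizes do not have to add up to 100. A chart like this makes a class imbalance (here, lots of cats and barely any foxes) obvious at a glance.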

Box Plot

I think we can use the same data for the box plot as we did for the histogram. While this version of the box plot doesn’t give us as much information as it could, it is still useful. There are ways to make many box plots appear on one graph, but we don’t need to do that.

plt.style.use("bmh")  # set the style before drawing

plt.boxplot(scores)

plt.title("Box Plot of Scores")
plt.show()

And simply, the good output, a simple output, is the box plot itself:

With this box plot you can tell the median, the range, the outliers (if there are any), and also the 25th and 75th percentiles. Unfortunately, you don’t get the shape of the distribution, which would be very helpful in this case.
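If you want the actual numbers behind the box, they can be computed directly. This sketch uses NumPy percentiles and the common 1.5 x IQR rule for outliers, which is also matplotlib’s default whisker setting. The variable names are my own.

```python
import numpy as np

scores = [55, 62, 75, 80, 85, 90, 45, 70, 76, 82, 88, 91, 94, 96, 60,
          72, 65, 58, 78, 83, 87, 93, 98, 100, 68, 74, 81, 89, 66, 77]

# The three values that define the box: 25th, 50th, and 75th percentiles
q1, median, q3 = np.percentile(scores, [25, 50, 75])

# Interquartile range and the fences beyond which points count as outliers
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in scores if s < lower_fence or s > upper_fence]

print(f"Q1 = {q1}, median = {median}, Q3 = {q3}, IQR = {iqr}")
print(f"Fences: {lower_fence} to {upper_fence}")
print(f"Outliers: {outliers}")
```

For these scores no value falls outside the fences, which is why the box plot shows no separate outlier points.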

Violin Plot

Box plot with distribution attached to it, it is very simple, nothing else. I don’t know how they think it looks like a violin though.

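import seaborn as sns

plt.style.use("bmh")  # set the style before drawing

sns.violinplot(x=scores)  # keyword form works across seaborn versions

plt.title("Violin Plot of Scores")
plt.show()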

And our simple output is a nice violin plot. We use seaborn because matplotlib’s own violin plot doesn’t show the median and the quartiles by default. We can see the distribution, but the shape extends past the minimum and maximum of the data, which gives the distribution a weird look (seaborn’s cut=0 argument trims it at the extremes).

Let’s move on to our last one.

Word Cloud

Nice way of getting to know how many counts of each word there are (without actually knowing how many of each word there are).


import matplotlib.pyplot as plt
from wordcloud import WordCloud

text = 'cat cat cat cat cat cat dog dog dog dog dog panda panda panda panda koala koala koala rabbit rabbit fox'

# collocations=False counts single words only, not pairs of words
wordcloud = WordCloud(collocations=False).generate(text)

plt.imshow(wordcloud)
plt.title("Word Cloud of Word Frequency")
plt.axis("off")
plt.show()

And this simply outputs a nice beautiful word cloud chart/plot (I don’t know what you call it).
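If you do want to know the actual counts behind the cloud, a quick sketch with Counter gives them, and a horizontal bar chart shows them in order. Splitting on whitespace is a simplifying assumption that works for this toy text; real text would need proper tokenizing.

```python
from collections import Counter
import matplotlib.pyplot as plt

text = ('cat cat cat cat cat cat dog dog dog dog dog panda panda panda '
        'panda koala koala koala rabbit rabbit fox')

# Split on whitespace and count each word
word_counts = Counter(text.split())

# most_common() sorts from the most frequent word to the least frequent
ranked = word_counts.most_common()
words = [word for word, _ in ranked]
counts = [count for _, count in ranked]

fig, ax = plt.subplots()
ax.barh(words, counts)
ax.invert_yaxis()  # put the most frequent word at the top
ax.set_title("Word Counts")
ax.set_xlabel("Count")
```

The same dictionary of counts can also be handed to the word cloud through WordCloud().generate_from_frequencies(word_counts), so that the cloud and the bar chart are built from exactly the same numbers.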

What I learned

This is a bit different. It isn’t like what I normally write. I learned many things about plots. I now know how to make many plots, how to analyze them, and which ones to use in which cases. I wanted you guys to learn too, and hopefully you did. My next article will come out in September. I am going away for a month and won’t be able to write during that time, but I will come back with some more exciting machine learning topics. Bye for now!!! (As usual, if you have any recommendations, please comment!!!) Thanks!!!