First steps into Data Science

A story about an inferential statistics project

This story won’t be a step-by-step guide to becoming a Data Scientist, because I have no idea what I should do after completing the following project. So I’ve decided to write a story each time I reach some accomplishment on my way.

Something about my background: I’m a full-stack software developer in the banking sector, so my primary tech is Java 8, Oracle SQL and modern front-end frameworks like React/Vue. I’ve also finished a bachelor’s degree in Applied Mathematics and a master’s in Computer Science, so I have some math background, but I need some time to recall the concepts, as my current work does not involve advanced math and algorithms. So I’ve decided to pursue my goal of learning more about Data Science and landing a data engineer job in California, USA.

I’ve recently finished the following courses:

So, to wrap up what I’ve learned, I have to do a capstone project on my own. I won’t have anyone to check my results, so I’ve decided to learn some side technologies while doing my analysis.

Obtaining the Data Set

I’m really interested in human relationships, so I decided it would be an interesting experience to analyse a dating group from the social network Vkontakte. I think this process is called data cleaning and data wrangling, but I’m not sure.

This social network has a Java SDK, so I’ve decided to pick up the Kotlin language for this task, and, as the data comes in JSON format, I’ve decided to use MongoDB. You can look at the code, but I’ll just summarise what data I’ve got:

  • Comments count
  • Likes count
  • Reposts count
  • Views count
  • Message text; I could use it for future natural language processing courses, but for this project I’ve only used the text length and extracted some data for the samples
  • Attachments count
  • Anonymous post or not
  • Source of the post — Android, iPhone, not detected
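
The fetching itself lives in the Kotlin code mentioned above, so here is only a rough sketch of how such documents could then be loaded from MongoDB into pandas for the analysis below; the database, collection and column names are illustrative, not the exact ones from the project.

# Rough sketch: load the collected posts from MongoDB into a pandas DataFrame.
# Database, collection and field names are illustrative.
from pymongo import MongoClient
import pandas as pd

client = MongoClient('mongodb://localhost:27017')
posts = client['vk']['dating_posts']

df = pd.DataFrame(list(posts.find({}, {'_id': 0})))
df['text_length'] = df['text'].str.len()  # the analysis uses text length, not the text itself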

Test tools and Visualization

I did some Python projects back in university, but I’ve almost forgotten the language. So I thought it would be a good idea to learn some Python data science libraries and tools with this project.

I thought it would also be a good idea to learn how to use Jupyter Notebooks, but it seems that this tool is meant for end-to-end research and would slow me down. So I’ve put it away for a time when I get a shorter and simpler project.

Overall data plot

I’ve learned inferential statistics mostly on normal distributions, so it was a bit confusing to see a plot like this. It’s definitely not a classic normal distribution. But I have to work with what I’ve got.
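
The plot in question is essentially a histogram of likes per post; a minimal sketch of how to draw it (the bin count here is arbitrary):

# Minimal sketch: histogram of likes per post.
import matplotlib.pyplot as plt

df['likes_count'].plot.hist(bins=50)
plt.xlabel('likes_count')
plt.ylabel('frequency')
plt.show()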

It seems that there are several outliers on the right side, so I took a peek into the data to see what they have in common. They were people of the ‘right’ age, between 17 and 19 years old, who are open to conversation with people both older and younger than themselves. They’re also from the biggest cities in the country, and most of their posts are anonymous, so ‘likes’ were the only way to connect with them. That’s why I decided to cut them off.

Q1 = df['likes_count'].quantile(0.25)  # first quartile
Q3 = df['likes_count'].quantile(0.75)  # third quartile
IQR = Q3 - Q1  # interquartile range

# keep only the rows that fall within 1.5 * IQR of the quartiles
df = df.query('(@Q1-1.5*@IQR) <= likes_count <= (@Q3+1.5*@IQR)')
Plot after IQR filtering

Much better! Now there are fewer outliers on the right side, and I can clearly see that there is some unexplained high frequency between 8 and 15. I think it’s caused by hidden variables like the time of publishing, or maybe dates when a lot of people are less/more active. But I don’t want to investigate this further.

z-test and t-test

Let’s do some tests. The z-test was part of the descriptive statistics course, so I thought it would be nice to apply it in my work. But, according to the following article, I need normally distributed samples. I’ve tried to fetch some random samples from the data set, but they weren’t even slightly normal. So I picked the top and bottom 50 records and analysed the largest and smallest values.
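
The selection code isn’t shown in the article, but with pandas it boils down to something like this (the random sample is used later for the t-test; its size here is a guess):

# Sketch: the 50 most and least liked posts, plus a random sample for later.
high_likable = df.nlargest(50, 'likes_count')
low_likable = df.nsmallest(50, 'likes_count')
random_sample = df.sample(n=50)  # size is an assumption, not stated in the article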

#%% Z-index of the most liked posts
max_val = high_likable['likes_count'].max()
min_val = high_likable['likes_count'].min()
mean_val = high_likable['likes_count'].mean()
# z-scores are taken relative to the whole dataset
z_index_max = (max_val - df['likes_count'].mean()) / df['likes_count'].std()
z_index_min = (min_val - df['likes_count'].mean()) / df['likes_count'].std()
z_index_mean = (mean_val - df['likes_count'].mean()) / df['likes_count'].std()
print('Max value is %f with z-index %f\nMin value is %f with z-index %f\nMean %f with z-index %f' % (max_val, z_index_max, min_val, z_index_min, mean_val, z_index_mean))

#%% Z-index of the least liked posts
max_val = low_likable['likes_count'].max()
min_val = low_likable['likes_count'].min()
mean_val = low_likable['likes_count'].mean()
z_index_max = (max_val - df['likes_count'].mean()) / df['likes_count'].std()
z_index_min = (min_val - df['likes_count'].mean()) / df['likes_count'].std()
z_index_mean = (mean_val - df['likes_count'].mean()) / df['likes_count'].std()
print('Max value is %f with z-index %f\nMin value is %f with z-index %f\nMean %f with z-index %f' % (max_val, z_index_max, min_val, z_index_min, mean_val, z_index_mean))

I got the following results for the N = 50 largest and smallest values:

  • Max value is 52 with z-index 2.860468
    Min value is 42 with z-index 1.950164
  • Max value is 5 with z-index -1.417963
    Min value is 0 with z-index -1.873116

With an alpha level of 0.05 and a two-tailed test, the critical z-values from the z-table are -1.96 and 1.96. So the smallest values don’t reject the null, but the largest do. I can conclude from this that receiving a lot of likes isn’t a common event and is unlikely to happen by chance. On the other hand, it’s quite common to receive zero feedback.
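
As a side note, the ±1.96 cut-offs don’t have to come from a printed z-table; scipy gives them directly:

# Critical z-values for a two-tailed test at alpha = 0.05.
from scipy.stats import norm

alpha = 0.05
print(norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2))  # about -1.96 and 1.96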

It’s useless to apply a t-test to the smallest vs largest samples, so I think it would be nice to compare a pandas random sample against the largest. I’ll do it to be sure that the N largest posts are unique and really couldn’t happen by chance.

And this was my mistake, because this type of test requires a normal distribution too. But at least I’ve learned how to write a t-test in Python with the scipy library.

from scipy.stats import ttest_ind
# Welch's t-test (equal_var=False): top-50 posts vs a random sample from the dataset
ttest_ind(high_likable['likes_count'], random_sample['likes_count'], equal_var=False)
Out[9]: Ttest_indResult(statistic=17.98439692586785, pvalue=4.982895901908361e-24)

Such huge values don’t make sense, so I can’t use the t-statistic to answer my question. Uhm. Quite a bad assumption for a first article in Data Science, but, well, I can deal with it and continue my research.
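
In hindsight, the normality assumption could have been checked up front instead of after the fact; a minimal sketch with scipy, assuming the same samples as above:

# D'Agostino-Pearson normality test: a small p-value means the sample
# is unlikely to come from a normal distribution.
from scipy.stats import normaltest

stat, p = normaltest(random_sample['likes_count'])
print('statistic=%f, p-value=%f' % (stat, p))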

Correlation

This is quite an interesting part of the research.

Likes / Comments correlation
Text length / Comments correlation

And pandas can calculate Pearson’s r (the call itself is sketched after the list), so these are my takeaways from the result:

  • The high correlation between date and id tells us that ids are generated from the date. Quite an obvious move, but I never thought about post ids that way.
  • From the second plot I can assume that the perfect message size is somewhere between 3800 and 4300 characters. It was also quite funny to see a weak but negative correlation between the number of comments and text length. Well, it’s quite obvious that in our fast-paced world long texts are overrated, and I don’t think they get the right amount of attention even on Medium.
  • And, as I expected, there is quite a strong correlation between views, comments and likes. The strongest of them is between comments and likes, which is what the beautiful and accurate first plot shows.
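
The correlation matrix behind these observations is a single pandas call; the column names below are illustrative, based on the fields listed earlier, and assume date is stored as a unix timestamp:

# Pairwise Pearson's r between the numeric columns (column names are illustrative).
columns = ['id', 'date', 'likes_count', 'comments_count', 'views_count', 'reposts_count', 'text_length']
print(df[columns].corr(method='pearson'))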

It would also be interesting to investigate the outliers further, as they could tell me more about creating the perfect submission, but they’re not so important from a mathematical statistics point of view.

ANOVA

The final step of this research is a comparison of two groups. I’ve fetched data from a more popular group with less strict rules: it has 20% more users and almost twice as many published posts. So my null hypothesis is that there is no difference between them. And, well, I’m not at all confident that the null will be rejected.

I’ve followed the lecture and used pandas dataframes to calculate the ANOVA:

# one-way ANOVA by hand for the likes of the two groups
hf, ff = hfdf['likes_count'], ffdf['likes_count']
grand_mean = (hf.sum()+ff.sum())/(hf.size+ff.size)
print('Grand mean: ', grand_mean)
# sum of squares between groups (variation explained by group membership)
ss_between = (hf.size * ((hf.mean() - grand_mean) ** 2)) + (ff.size * ((ff.mean() - grand_mean) ** 2))
print('SS between: ', ss_between)
# sum of squares within groups (variation inside each group)
ss_within = hf.apply(lambda x: (x - hf.mean()) ** 2).sum() + ff.apply(lambda x: (x - ff.mean()) ** 2).sum()
print('SS within: ', ss_within)
# degrees of freedom: k - 1 between and N - k within, with k = 2 groups
df_between = 1
df_within = hf.size + ff.size - 2
print('Degrees of freedom: between (%f), within (%f)' % (df_between, df_within))
ms_between = ss_between / df_between
ms_within = ss_within / df_within
f = ms_between / ms_within
print('Mean square: between(%f), within (%f)' % (ms_between, ms_within))
print('f-statistic: ', f)

And got the following output:

Grand mean:  14.833064516129033
SS between: 105009.49223173113
SS within: 486796.7295424624
Degrees of freedom: between (1), within (6198)
Mean square: between (105009.492232), within (78.540937)
f-statistic: 1337.0032979966788
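
As a sanity check (not part of the original calculation), scipy can reproduce the same f-statistic in one line and also gives a p-value:

# Cross-check of the manual ANOVA with scipy's one-way ANOVA.
from scipy.stats import f_oneway

result = f_oneway(hf, ff)
print('f-statistic: %f, p-value: %g' % (result.statistic, result.pvalue))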

Even without an f-table I can see that the f-statistic rejects the null. Though the numbers alone aren’t enough to understand what’s going on, which is why I’ve drawn some plots for the bigger group:

I don’t think the difference in group population really affects the rating, but more frequent posts and lighter moderation do. It seems that it’s quite rare to receive more than 30 likes.

I could also make the artificial assumption that you can’t use the same post in both groups. Even if it’s accepted and highly appreciated in the first group, it could be totally ignored in the other one.

Conclusion

It was a very fun project. I’ve learned a lot: I brushed up on basic Python syntax that will be useful for future courses, set up Python and its virtual environment on my laptop, and got a basic understanding of Jupyter Notebooks, even though I haven’t explicitly used them in this project.

Though I can’t verify my conclusions and results, because there is no such option in free courses. I’ve also got a lot of questions I could ask of this dataset, and maybe I’ll return to it in the future. And I have a bad feeling that I’ve made a mistake in the ANOVA and got the wrong ideas.

So, I should set a few goals at the end of this step: 1) try to find someone on the Udacity forums to verify this article, though I doubt I’ll manage it; 2) continue my Data Science education and take a course on Udemy, as I’m not sure I could handle a Nanodegree alongside a full-time job.