The beginning of an adventuRe

Alan Marazzi
The Data Experience
4 min read · Nov 3, 2015

A few years ago I had to present some research on social media to the board of an online magazine. I hardly knew where to start: Twitter was pretty new in Italy, but I felt it could be good for our zero-budget magazine. After some digging I discovered a paper about Twitter mining and sentiment analysis using a consonant: R.

That was my first encounter with the R project. It didn’t end well. Back then my laptop wasn’t very good, and I managed to crash it at least 5 times before I could perform some basic analysis. To be fair, I also had a problem with coding. My only experience was in HTML and CSS, and I wasn’t even that good at them.

What I didn’t understand was that I had fallen in love. Suddenly I realized that the web wasn’t just a huge noisy mess, but that it could be organized, analyzed and exploited. I fell in love with data science.

At the time data science wasn’t such a buzzword; I would say almost nobody knew it existed. So my life went on. I was living, like 90% of us, in an Excel world: the tables for my master’s thesis weighed about 40 MB. I almost started learning Visual Basic, but luckily enough I didn’t.

Data has always intrigued me: I am constantly looking for new insights, charts, visualizations, studies, etc. I even got a job that is mostly about data retrieval and analysis across many different countries (the next time you feel the urge to mock HR people, you may want to think twice about it: the work has changed a lot in recent years). But the process with Excel was tedious and annoying. What could I do?

R! A spark.

I remembered that tiny little mess I had made a while before. I could remember those little arrows <- assigning variables, the way I queried Twitter’s API and how many times my laptop crashed. I made a decision: learn to code in R.

Fun fact: in high school I was bad at math. Really bad. I mean it. VERY BAD. But weirdly enough I was good at physics, economics and statistics. So I decided to take on the challenge and study statistics as well.

I was lucky enough to find a few nice books on the subject, which I recommend to anyone interested in statistics and machine learning:

  • “Naked Statistics” by Charles Wheelan
  • “The Signal and the Noise” by Nate Silver
  • “Data Science for Business” by Foster Provost & Tom Fawcett
  • “Discovering Statistics Using R” by Andy Field, Jeremy Miles and Zoe Field

These books won’t make you a data scientist, but they are a good foundation for understanding what the job is about.

[Figure: A regression showing the relationship between headcount and labor cost]

If you would like to learn R, statistics, data science and such, don’t be afraid. I didn’t have a strong formal quantitative education, but I have knowledge in areas where many engineers and/or mathematicians don’t.

Storytelling, business knowledge and experience are more important than the ability to perform an ANOVA or fit a complicated Bayesian model. You can learn how to run these tests in no time if you have a sufficient grounding in coding and statistics, but the former skills take years of pouring out “blood, toil, tears and sweat” (W. Churchill).

You can even limit your data knowledge to a few tasks that can make your everyday life easier. Think charts. If you use Excel you’ll find yourself clicking more or less a hundred times trying to reproduce the previous plot. With R you can reproduce it with:
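A minimal sketch of what that could look like, assuming the ggplot2 package is installed. The data frame `hr`, its columns `headcount` and `labor_cost`, and the output file name are made up for illustration; they are not the data behind the plot above.

```r
library(ggplot2)

# Simulated data for illustration only (not the data from the original plot)
set.seed(42)
hr <- data.frame(headcount = round(runif(50, min = 10, max = 500)))
hr$labor_cost <- 40000 * hr$headcount + rnorm(50, mean = 0, sd = 1500000)

# Scatter plot with a fitted linear regression line and its confidence band
p <- ggplot(hr, aes(x = headcount, y = labor_cost)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "Headcount", y = "Labor cost",
       title = "Labor cost vs. headcount")

print(p)

# Export the chart to a high-resolution PNG (file name is hypothetical)
ggsave("labor_cost_vs_headcount.png", plot = p,
       width = 8, height = 5, dpi = 300)
```

Swapping in your own data frame and column names is all it takes to redraw the chart when the numbers change.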

With a few lines of code you’ll have a publication-ready chart, while exporting a chart from Excel for use outside the Office suite can be a bit annoying. Moreover, if you have many more data points, drawing and formatting a chart with Excel can become impractical.

Focusing on charts is the best way to learn R: you’ll get nice-looking results from the beginning and, incidentally, you’ll have to learn how to “wrestle” with your data. A common misconception about data scientists is that they work with very fancy models all day long.

Wrong. As soon as you get close to data science you’ll understand that 90% of the job is about data collection, data cleaning, transformations of variables, subsetting, exploratory analysis, and so on. To visualize your data you’ll have to learn how to perform these tasks; this is why I think ggplot2 could be used as the best “R tutorial”.
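To give a rough idea of what those everyday tasks look like, here is a small sketch in base R. It uses `airquality`, a dataset that ships with R, purely as a stand-in; the variable names `clean`, `summer` and `monthly_ozone` are my own choices for illustration.

```r
# Built-in dataset: daily air quality measurements with some missing values
data(airquality)

# Data cleaning: drop rows that contain missing values
clean <- na.omit(airquality)

# Transformation: add temperature in Celsius (Temp is stored in Fahrenheit)
clean$TempC <- (clean$Temp - 32) * 5 / 9

# Subsetting: keep only June, July and August
summer <- subset(clean, Month %in% 6:8)

# Exploratory analysis: average ozone level per month
monthly_ozone <- aggregate(Ozone ~ Month, data = summer, FUN = mean)
print(monthly_ozone)
```

None of this involves a model, yet each step is something you would need before drawing even a simple chart of ozone by month.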

So don’t be afraid: R and statistics are not so difficult to learn. The upsides far outweigh the downsides, and even if you don’t work with them on a daily basis you’ll be able to better understand the world around you.

If you enjoyed this post, or if you didn’t, please let me know in the comments. In my next posts I’ll cover some basic R code, so if you’re willing to learn R or refresh your R programming, keep following me.

Data scientist, articles about machine learning, data science process and studies. https://www.rdisorder.eu