Getting Started in Data Science: A Beginner’s Perspective

A rookie giving advice to fellow rookies.

Thanks again to wocintech for the photo!

Hey everyone, hope you’re having a good day. Today’s post is dedicated to Julia, another subscriber with a question for me. She asks, “Data science is a relatively new field, how do young people even start to learn about it?” This is a fun and somewhat challenging question for me, considering that I’m not a professional yet. Even though I’m not a pro, I’m still in data science and have been for the last 5 months (Woot!). So let’s get started.

Data science is such a buzzword right now, but what is it exactly? For starters, the industry is pretty young, around 8 years old. But the things that make it up are not. Data science to me is an interdisciplinary field that extracts knowledge and insights from data using processes and techniques from computer science, statistics, business analysis, and mathematics. Seems like a doozy, right? Let me just say that to start learning this field, you don’t need to be an expert in ANY of the underlying fields. Knowing them a little can certainly help a lot, but you don’t need multivariate calculus to get started, so don’t worry.

Now that you know what it is, where do you actually begin? I’ll answer based on my journey thus far and share the many resources I’ve collected over the past 6 months.

1. Choose a Tool

You’re going to need a tool or two to start working with data. This could be spreadsheet software like Excel or Google Sheets, or a programming language like R or Python. Spreadsheet software is a good place to start if you’re already familiar with it.

A great introductory book is “Data Smart” by John Foreman, a data scientist at MailChimp. It’s really well done and clearly written, so you can follow along and experiment. The book teaches a bunch of different data science techniques in a business context, like k-means clustering, naive Bayes, and more. I won’t say you’ll become the office hero if you read this book, but you can certainly benefit from learning a few skills.

If you want a free online resource, I recommend schoolofdata.org. They’re an organization that aims to empower people of different backgrounds to learn how to understand and work with data.

2. Choose a Programming Language

Depending on your background, 1 and 2 are interchangeable. Having programmed before, I started with 2. There are two primary programming languages used in the data science field. R is specifically a statistics language, whereas Python is a general-purpose programming language. I chose R because its multitude of packages helps to simplify programming. R is open source and the community is really active and helpful. Python is just as popular and has a very nice community as well. I recommend trying both and figuring out which one you prefer.

The resource that I used when I first started out was DataCamp. I’ve written about them before in a few blog posts. In summary, DataCamp is an interactive website that teaches R and Python and how to use them in the context of data science. The lessons are incredibly engaging, which helped me retain knowledge. However, like I said in my last post, never get caught up in just finishing a course. Try to get your hands dirty too. You can use the editor DataCamp provides to test your own creations! You can learn and try new things at your own pace, so take advantage of that.

Also on the R track is Sharp Sight Labs. This is a great website for beginners that gets you up and running with good explanations that aren’t too tech heavy. R for Data Science is a book by Hadley Wickham & Garrett Grolemund. Hadley Wickham has created packages that have made life easier for R users of all skill levels. The physical book comes out this summer, but you can read it online in the meantime.

Other websites of note are Udacity, Coursera, edX and Code School. They all have courses you can audit or take for free, as well as paid options. They all teach Python and R and have a good variety of data science courses.

  1. Udacity: Data Analysis with R, Programming Foundations with Python
  2. edX: Foundations of Data Analysis 1 & 2, Introduction to Python for Data Science
  3. Coursera: R Programming, Python for Everyone
  4. Code School: Try R, Try Python

Special shout out goes to Bill Kimler, whose blog Dreaming of Data is a treasure trove. He’s been learning data science for almost a year now and has thoroughly reviewed all the courses he’s taken.

3. Get and Play with Data

After completing step 1, step 2, or both, welcome to the fun part! We can’t exactly call ourselves data scientists without data, can we? Thankfully, we’re in luck. There’s so much data all over the web. Here are a few places to get you started:

  1. Data is Plural Newsletter
  2. GitHub Repository
  3. KDNuggets
  4. Data Science Guide
  5. Quora

Those are more than enough to get you going for now. Before you download a data set and choose one to explore for a project, you should be strategic. Think of a topic or industry that you’re interested in and search for websites that cover it. Learning is really enjoyable if you’re studying something you’re interested in. That’s why I looked for video game related data for my first project.

Now that you’ve got a data set, what do you do with it? You’ll go through a process that looks like this: tidy → manipulate → analyze → draw conclusions, repeating until you’re satisfied. I’ll go into a little detail.

  1. When you get a data set, sometimes it isn’t “clean”. Ideally, you want your data structured in a way that conveys what you’re seeing in an easy-to-understand way. In the R community, this means that every variable belongs in its own column and every observation belongs in its own row. Here’s an example of a dirty data set:
Eric    Qiana   Christina   Jacky
32      19      26          23
Male    Female  Female      Male
A       B       A           A

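This is a basic data set with patient names and some values. While it looks simple and nice, it can be improved: each row holds one variable and each column holds one patient, which is the reverse of what we want. Let’s change it with the tidy rules: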

Name       Age  Gender  Treatment
Eric       32   M       A
Christina  26   F       A
Qiana      19   F       B
Jacky      23   M       A

Can you see the difference? The data is organized better and in my opinion easier on the eyes. This is what we ideally want to aim for before we do anything fancy. This guide was helpful for me.
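If you’d like to see that reshaping in code, here is a minimal sketch in plain Python (no extra packages needed). It is only an illustration, not the way I did it in R: the variable names `wide_rows`, `columns`, and `tidy` are my own, and I kept the gender values spelled out as they appear in the dirty table.

```python
# The "dirty" layout: one row per variable, one column per patient.
wide_rows = [
    ["Eric", "Qiana", "Christina", "Jacky"],   # names
    [32, 19, 26, 23],                           # ages
    ["Male", "Female", "Female", "Male"],       # genders
    ["A", "B", "A", "A"],                       # treatments
]

# Column labels for the tidy version (assumed labels, matching the table above).
columns = ["Name", "Age", "Gender", "Treatment"]

# zip(*wide_rows) transposes the data, so each tuple now describes one patient.
# Pairing each tuple with the column labels gives one record per observation.
tidy = [dict(zip(columns, patient)) for patient in zip(*wide_rows)]

for row in tidy:
    print(row)
```

Each printed dictionary is one observation (one patient) with one entry per variable, which is the tidy shape described above.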

2. Manipulate data

Sometimes you don’t need to analyze all of the data. You might just need to look at a subset based on certain variables or observations. So when you manipulate a data set, you change its structure to suit your needs. You’ll be doing this a lot. In my post about my project, I demonstrate this.
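To make that concrete, here is a small sketch in plain Python using the tidy patient table from earlier. It picks a subset of observations (rows), then a subset of variables (columns), then sorts the result. The names `patients`, `treatment_a`, `names_and_ages`, and `by_age` are just ones I made up for the example.

```python
# The tidy patient table from earlier, one dictionary per observation.
patients = [
    {"Name": "Eric", "Age": 32, "Gender": "M", "Treatment": "A"},
    {"Name": "Christina", "Age": 26, "Gender": "F", "Treatment": "A"},
    {"Name": "Qiana", "Age": 19, "Gender": "F", "Treatment": "B"},
    {"Name": "Jacky", "Age": 23, "Gender": "M", "Treatment": "A"},
]

# Subset by observation: keep only the patients on treatment A.
treatment_a = [p for p in patients if p["Treatment"] == "A"]

# Subset by variable: keep only the name and age of each remaining patient.
names_and_ages = [{"Name": p["Name"], "Age": p["Age"]} for p in treatment_a]

# Reorder the rows from youngest to oldest.
by_age = sorted(names_and_ages, key=lambda p: p["Age"])

for row in by_age:
    print(row)
```

In R or with a Python data library you would usually do these steps with dedicated filter, select, and sort functions, but the idea is the same.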

3. Analyze

The main course. This is where your hypotheses are challenged and new ones are formed. In this stage you’ll usually visualize your data first so you can start seeing correlations and things of interest. You may need basic math and statistics here to investigate your data further for more insights. Don’t worry if your statistics skills aren’t strong right now. Practicing on data sets always helps!
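As a tiny taste of this step, here is a sketch in plain Python that groups the same patient table by treatment, counts each group, and computes the average age. The text “bar chart” is only a stand-in for a real plot, and the names `ages_by_treatment` and `summary` are my own. With four patients you can’t draw real conclusions, so treat it as a demonstration of the mechanics only.

```python
from collections import defaultdict
from statistics import mean

# The tidy patient table from earlier, one dictionary per observation.
patients = [
    {"Name": "Eric", "Age": 32, "Gender": "M", "Treatment": "A"},
    {"Name": "Christina", "Age": 26, "Gender": "F", "Treatment": "A"},
    {"Name": "Qiana", "Age": 19, "Gender": "F", "Treatment": "B"},
    {"Name": "Jacky", "Age": 23, "Gender": "M", "Treatment": "A"},
]

# Group the ages by treatment.
ages_by_treatment = defaultdict(list)
for p in patients:
    ages_by_treatment[p["Treatment"]].append(p["Age"])

# Summarize each group: number of patients and their average age.
summary = {
    treatment: {"count": len(ages), "mean_age": mean(ages)}
    for treatment, ages in ages_by_treatment.items()
}

# A crude text "bar chart" of group sizes, standing in for a real visualization.
for treatment, stats in sorted(summary.items()):
    bar = "#" * stats["count"]
    print(f"{treatment}: {bar} (mean age {stats['mean_age']})")
```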

4. Stay Up To Date

STEM fields are changing every second. Data science is no different. Make sure you’re reading to keep up with what’s new. Here are some resources to help:

  1. Analytics Vidhya
  2. KDnuggets
  3. R-Bloggers
  4. Data Science 101
  5. Data Science Central

Bill comes again to the rescue with a list of solid podcasts. I’m a big fan of Becoming a Data Scientist myself.

This is all for now. I hope you see data science as something a little less scary and more approachable. You don’t have to be an expert math whiz or super programming ninja. You just have to be interested enough to learn and get your hands dirty. I’ll probably update this as I get more experience or remember something else that can be helpful. If you have any questions, reach out to me on Twitter or follow my Facebook fan page.

If you liked this article and learned something new, hit the recommend button!
