I Want To Be a Data Scientist, Part 1
My name is Tobias. I studied Economics in college and have worked as everything from a grocery store bagger to a hotel manager. I’ve never studied or done anything in what one might consider a “technical field”. A couple months ago, I wouldn’t have been able to tell you what a compiler is, but all the same I made a decision to pursue a career in Data Science. As well, I’ve decided to document my learning progress so that I can:
a) Give back to the community of — I don’t even know what to call it — vast online resources where 99% of your questions are answered with a quick Google Search
b) Externalize my progress and give myself a little social pressure to see it through
c) Do a mental review of the most important things I’ve learned
Here is the beginning of my path from not knowing how many bits are in a byte (or is it bytes in a bit?) to being a productive member of the tech community:
“What IS a data scientist?”
It’s a question asked more and more frequently nowadays, and one that has been answered, debated, and speculated about by countless people, blogs, and news articles, so I won’t even try to get into it here. From what I know, the best way I can summarize the field of Data Science is as a cross between Statistics, Computer Science, and Domain Expertise: an interdisciplinary approach to answering questions like “How does the general TwitterSphere feel about the colossal fuck-up at the Oscars last night?” or “Would Jennifer prefer to watch Narcos or Orange Is the New Black?” or more serious questions like “How likely is Bob to suffer a heart attack given his current Big Mac consumption level?”
“Where do I start?”
The process of learning something like Data Science on your own, from scratch, seemed daunting to say the least. Just like some of you out there thinking about beginning your forays into data science, I had no idea where to start. But once I buckled down and started looking for places to start, I was slowly able to find resources that not only helped build my knowledge but pointed me in the direction of what to learn next.
Here’s a good place to start. This free Springboard curriculum (more about them later) will give you an idea of the concepts and tools you’ll be using in Data Science.
“What did I learn first?”
1. Basic CS Concepts
I have to admit that before all this hoopla, I’d spent a couple months learning the basics of computer science while living in Brazil. I enrolled in (but sadly never finished) a well-known online edX course called “CS50”, taught by Harvard’s David J. Malan, which you can find here. It taught me basic programming concepts like variables, loops, and functions, as well as some basic computer science concepts like binary, compilers, and basic cryptography. The course is pretty comprehensive (it’s Harvard), and taught in C. I found that as a true beginner, I had to spend hours dissecting basic concepts just to wrap my head around them, so while there was plenty of frustration and hair-pulling when I was doing it, I learned a great deal of fundamentals that would expedite my learning process with other languages later. The biggest takeaways were learning what a programming language is, and the idea behind loops and iteration. I was able to write a basic program after a week, and more importantly, it gave me a better perspective on all the complexities and hard work that go into Computer Science.
2. Python
The second thing I started to learn was Python. It took me a little while to choose what programming language to start with, but I landed on Python because of its widespread use in Data Science, and because it’s a good introductory language for beginners. Python is an object-oriented, high-level programming language (often called a scripting language) and one of the easier ones to learn. If you don’t know what object-oriented programming is, or how high-level and low-level languages differ, you’ll probably need to brush up on computer science basics. Essentially, Python takes care of some of the details (such as memory management and type declarations) that lower-level languages like C make you handle yourself.
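To make the "high-level" point a bit more concrete, here is a minimal sketch of a loop in Python. The function name `average` and the sample scores are made up for illustration. In a lower-level language like C, the same task would typically also need type declarations and an explicit array length.

```python
def average(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    total = 0
    for value in values:  # iterate directly over the items, no index needed
        total += value
    return total / len(values)


scores = [70, 85, 90]  # made-up sample data
print(average(scores))
```

Python also has built-ins like `sum()` that would shorten this further, but writing the loop out shows the iteration idea itself.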
To learn Python, I started with a couple Lynda courses. This course on programming fundamentals was a good starting point for me (remember I had learned some basics prior to this). It’ll cover some of the concepts I mentioned above, and give you a good base from which to delve into Python. If you don’t know about Lynda, it’s an awesome platform for online learning, owned by LinkedIn. It’s about $25/mo and totally worth it (given that you utilize it). I was lucky enough to have a housemate who let me use his Lynda login credentials, but I’d totally invest in a couple months’ worth of access, especially since you get access to over 5,500 courses, covering everything from 3D Animation to Business Analytics.
I followed that up with a couple more Lynda courses, Up and Running with Python and Python 3 Essential Training. It took me a couple weeks to get through them, accounting for Googling and YouTube time (related to Python and programming). ***Interesting note: If you google ‘Python List Comprehension’, Google may open up a one-time-only programming challenge, which I’ve read is a sort of secret interview process.
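In case "list comprehension" is a new term, here is a small sketch of what it looks like. The variable names and numbers are just examples I picked. A comprehension builds a list in one line that would otherwise take a loop.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Loop version: collect the squares of the even numbers.
even_squares_loop = []
for n in numbers:
    if n % 2 == 0:
        even_squares_loop.append(n * n)

# Comprehension version: same result in a single expression.
even_squares = [n * n for n in numbers if n % 2 == 0]

print(even_squares)  # [4, 16, 36]
```

Both versions produce the same list; the comprehension is just the more compact way to write it.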
I had a decent understanding of the basics at this point. I followed up the Lynda courses with this Udacity course on designing programs (it’s free). It got real difficult real quick (around Lesson 3), but I powered through most of it, keeping in mind that you don’t always need a full understanding of the material immediately; you just need enough exposure to know what to look for when you come across similar problems again. The real learning comes from the millions of Google searches, Stack Overflow explanations, and YouTube videos. You can find a lot of good tutorials by this guy Sentdex, here as well.
After another week or two of deteriorating eyesight and non-stop coffee consumption, I decided to take a break from Python to look into other subjects, and once again look at my Data Science learning progress from a bird’s-eye view. (It’s always good to take breaks.)
3. Basic Math & Statistics
While I was learning Python, I was also reviewing math concepts every now and again just to keep things fresh (lol). These were concepts like sine and cosine, or limits and continuity: stuff you might have been taught in high school or college, but at the time couldn’t be bothered to remember past midterms or finals. Khan Academy is a great place to start. And I mean awesome. In a couple hours I re-learned concepts that took an entire semester to teach in college. Thanks to Sal Khan for putting it all together. I really mean that. Check it out.
With a few outside YouTube videos sprinkled in, I started with Trigonometry and worked my way through Calculus up to Statistics and Probability, which is where I am now. I recommend watching these (and all other learning videos, including Lynda) at 1.3x to 1.5x speed. As I developed my listening abilities, I was able to watch some of these videos at 2.0x speed, essentially cutting the learning time in half. Be sure you don’t watch too fast, or you’ll have to go back and listen to parts of the video again, which defeats the purpose.
I find it’s a lot easier to be interested in Math when you have a purpose behind it, when you know you’re going to use it and apply the concepts to something tangible. That might have been why I almost failed my first semester of Calculus in college ;D. These skills are not only invaluable, but have been essential to understanding a lot of data modeling and tools that I’ll briefly get into in the next section.
“What else have I been learning?”
In addition to math (pun intended), I watched hours of tutorials on Excel, R, SQL, and the command line interface. I’ll get into these and what I’ve done in my next I Want to Be a Data Scientist post, but here’s a quick summary of what they are and why I want to learn them:
Excel — Microsoft Excel is used extensively in business analysis and is a simple, but fairly powerful tool. You can do basic data computations, and it’s got a wide range of functions from simple ones like “SUM” to more complex ones like the function literally named “COMPLEX”. You can also easily make nice graphs and charts, which is part of the reason it’s in such widespread use today.
SQL, or “sequel”, stands for Structured Query Language. It’s the standard language for relational database management systems; in other words, a language for manipulating and retrieving data. Sites like Facebook and Google rely on SQL databases (MySQL, for example) to function properly. More on this in the next post.
R is “a programming language and software environment for statistical computing and graphics” (Wikipedia). I won’t get into much detail here, but it’s used extensively for Statistics and Data Visualization (think graphs and whatnot).
Command Line Interface — The command line is one of the oldest interfaces. It allows you to essentially talk to your computer, and it’s good to know the basics in any sort of computer science context. This video will give you a better understanding of what’s what.
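To give a feel for what "manipulate and retrieve data" means for SQL, here is a rough sketch using Python’s built-in `sqlite3` module. SQLite is my stand-in here, not MySQL, and the `viewers` table and its rows are invented for the example. The SQL statements are the quoted strings; Python just sends them to the database.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Manipulate: create a table and insert a few made-up rows.
conn.execute("CREATE TABLE viewers (name TEXT, favorite_show TEXT)")
conn.executemany(
    "INSERT INTO viewers (name, favorite_show) VALUES (?, ?)",
    [
        ("Jennifer", "Narcos"),
        ("Bob", "Orange Is the New Black"),
        ("Ana", "Narcos"),
    ],
)

# Retrieve: count how many viewers picked each show.
rows = conn.execute(
    "SELECT favorite_show, COUNT(*) FROM viewers "
    "GROUP BY favorite_show ORDER BY favorite_show"
).fetchall()

print(rows)
conn.close()
```

Other database systems use slightly different dialects, but `CREATE TABLE`, `INSERT`, and `SELECT ... GROUP BY` look much the same across them.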
“What am I doing now?”
I’ve recently signed up for 2 courses on Springboard, the Foundations of Data Science and Data Analytics for Business programs. Springboard is a relatively new online platform for learning Data Science/Analytics, and it’s even got a course on User Experience (UX) Design, which my housemate took and told me was extremely beneficial.
They have five programs total (Data Analytics for Business, Foundations of Data Science, Data Science Intensive, UX Design, and a Data Science Career Track). The Career Track is an application-only program that helps you find work while it takes you through the course. I met a PhD student at a Python MeetUp who was taking the course, and he said he was enjoying it so far. More on the course offerings here. (Just scroll down a little.)
Today is officially Day 1 of the course (although I admit that I’ve been getting a head start on the material), so I can’t say too much about the overall experience, but I’m pretty happy with it so far. Up to this point, the Student Advisors have been extremely helpful and well-coordinated.
The course is a curation of readings, tutorials, and courses from other places like DataCamp and Khan Academy (yay!). It creates a sequential learning path for you to follow and build on what you’ve learned previously. Even more valuable, supposedly, are the weekly one-on-one meetings with a mentor, an industry professional currently working in the field who offers insight and guidance on your learning experience, as well as the projects that must be completed to his/her satisfaction before you can say you passed the course. I meet one of my mentors tomorrow (a Data Engineer at Uber), and the other (a manager in Pandora Media’s Financial Planning Department) the day after that. I’m pretty excited about it. You best believe that I’ll be hitting them with a barrage of questions about everything from their role in the company to their mother’s maiden name. Updates on my Springboard experience in my next post.
I’ll go ahead and end this post here with a few valuable tips that I’ve learned from the last couple months:
1. Use YouTube and Google extensively. If you’re just starting out (a beginner like me), everything you need to know at this point is on the web. Stack Overflow, YouTube, blogs: these are all ways of learning from your peers that are not only immensely helpful, but sometimes the only way to get an understanding of the problem or the course material. Just be careful not to follow your tangents too far. Those of you who have been trapped by YouTube’s Autoplay feature (everyone) know what I’m talking about. Write down what you want to know and stay focused on that.
2. Make sure your environment is feng-shui and conducive to your learning, e.g. get a bigger computer monitor and keep a water bottle nearby. Set up your desk in a way that feels right. Use some nice headphones (noise-cancelling is amazing) and maybe get an actual mouse so you don’t have to use the trackpad. All these little benefits accumulate over the hours you’ll be sitting there on your ass. Small improvements that seem inconsequential can go a long way over time.
3. Find the right music to study to when you’re not watching videos. For me, this was Lo-Fi Hip Hop. It not only sets the mood, but keeps your mind going at a consistent pace (and keeps you from losing it). Here is a cool article about what engineers at places like Snapchat and Pinterest are listening to while they put in work. While sometimes you’ll need some dead silence to internalize a difficult concept or just rest your ears, I’ve found that the right music helps immensely with concentration and creative rhythm.
Hope you enjoyed my first post (hopefully first of many) on the path to getting into the Data Science industry. Next time, I’ll be talking more about MOOCs (Massive Open Online Courses), Springboard, R & SQL, Excel, and other resources and tidbits about Data Science that I’ve picked up since this post.
If you enjoyed the article, please-please-please, show a brother some love, and hit the like button (that’s that little green heart icon below), or hit the subscribe button to keep updated with my progress :) Cheers.