If you want a more traditional breakdown, search for “How to Be a Data Scientist”. There are thousands of posts like this one: https://www.innoarchitech.com/blog/what-is-data-science-does-data-scientist-do
These are detailed, but do they actually teach you what it is like to do Data Science day-to-day? I am not so sure. Here is my personal attempt at such a task.
A Bit About The Term “Data Science”
Some guy defined Data Science. I’ve heard his name a bunch but it never stuck. The term came about in the early 2000s or so and is used to describe the space between computer science and statistics, where probabilities and programmatic logic meet. Formally (if that’s your thing), Data Science is the use of logic and mathematics to understand how systems work based on data. In practice, a human asks a series of questions, commonly through the lens of statistics, and computation carries them out.
Put more abstractly: Data Science is like exploring a foreign country to find a specific person. This person has the answers to everything you desire. You use your computer to find people who may know this deity figure. You use statistics to talk with the people you find. It’s the only language they can talk to you in. You have to go from person to person and find clues about where this final person is. Different groups of people have accents though, so sometimes you need to learn different versions of statistics, otherwise the message is unclear. Some people also live in different regions, so you need different computer technologies to travel to them. As you travel this land, you begin to feel like you know what this deity might say if you were to meet.
Data Scientists build systems that try to explain this world without ever getting the chance to talk with this person.
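To make the formal definition above a little more concrete, here is a minimal sketch of one question ("do these two groups really differ?") asked through statistics and carried out by computation. The numbers, the variable names, and the choice of a permutation test are all assumptions made for illustration, not something the definition prescribes.

```python
import random
from statistics import mean

def permutation_test(group_a, group_b, n_shuffles=2000, seed=0):
    """Return the observed gap in means and the share of random
    relabelings that produce a gap at least as large."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_shuffles):
        # Shuffle the labels: if the groups do not differ, any split is as good as another.
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed:
            hits += 1
    return observed, hits / n_shuffles

# Made-up page load times in seconds (hypothetical data).
old_design = [12.1, 11.8, 12.5, 12.0, 11.9, 12.3]
new_design = [11.2, 11.0, 11.5, 11.1, 11.4, 10.9]

observed_gap, p_value = permutation_test(old_design, new_design)
print(f"gap in means: {observed_gap:.2f}s, share of shuffles at least as extreme: {p_value:.4f}")
```

A small share suggests the gap is unlikely to be an accident of which measurements landed in which group. The statistics supply the question, and the computer does the thousands of reshuffles nobody would do by hand.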
How Does It Feel to Be a Data Scientist?
Being a data scientist feels unique. You are constantly switching up how you think in order to solve a problem: thinking about new algorithms that may improve current systems, reading about how those algorithms actually work, implementing them, and then coming up with a way to measure whether you were successful. In this way, I think the word “Science” has its rightful place in the term “Data Science”. Missed details and incorrect calculations cause issues in this field, so there are not many places to hide. You have to understand what you are doing, and if you don’t, you have to learn on the fly. It can be pretty stressful, but when you embrace it, you feel like you can solve almost anything. You just know that it will take a hell of a lot of time and effort.
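The "measure if you were successful" step is easier to picture with an example. The sketch below is one possible setup, with synthetic data and hypothetical names throughout: hold some data back, fit a simple model on the rest, and compare its error with a naive baseline that always predicts the average.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    slope = covariance / variance
    return slope, y_bar - slope * x_bar

def mean_absolute_error(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))

# Synthetic data (an assumption): a straight line plus random noise.
rng = random.Random(42)
xs = [float(i) for i in range(100)]
ys = [3.0 * x + 5.0 + rng.gauss(0, 10) for x in xs]

# Hold out 20% of the points so the model is judged on data it never saw.
indices = list(range(len(xs)))
rng.shuffle(indices)
train_idx, test_idx = indices[:80], indices[80:]
train_x = [xs[i] for i in train_idx]
train_y = [ys[i] for i in train_idx]
test_x = [xs[i] for i in test_idx]
test_y = [ys[i] for i in test_idx]

# Baseline: always predict the training average.
baseline_error = mean_absolute_error(test_y, [mean(train_y)] * len(test_y))

# Candidate: the fitted line.
slope, intercept = fit_line(train_x, train_y)
model_error = mean_absolute_error(test_y, [slope * x + intercept for x in test_x])

print(f"baseline error: {baseline_error:.1f}, model error: {model_error:.1f}")
```

If the candidate cannot beat the baseline on held-out data, the extra algorithm has not earned its place. Real projects need more careful evaluation than this, but the habit is the same.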
Data Science Jobs
A lot of “Data Science” jobs are probably better described with some other term. Even those that are best described as “Data Science” jobs are still varied. Here are some common things to know about different potential roles.
- If you are building pipelines to efficiently move large amounts of data between systems for other programmers and researchers, you are probably closer to a Data Engineer.
- If you spend all your time putting together reports and data visualizations, you are closer to a Data Analyst or Business Intelligence Analyst.
- Machine Learning Engineers are those who worry about the computational complexity and systems design of fancy math when it’s going to be run non-stop to do something important.
- A Data Scientist at a start-up is probably going to be doing all of these things, but not deeply enough to become an expert in any of them.
So You Wanna Learn The Technical Skills?
The best data scientists I know have a deep understanding of computing and mathematical analysis. It doesn’t really matter what tool they use, because they have moved past tools and are able to think on a theoretical level.
To start on the path to such an understanding, it is best to learn the most popular tools first. This way there will be plenty of information and support as you learn. A popular tool is also more likely to stick around, so you can avoid having to learn a new language early in your career. When you advance past beginner, the language and technology fade away. You become an individual interacting with the data in a much more organic way. That should be the goal of using computation, so don’t get distracted! Richard Feynman, an amazing teacher and Nobel Prize-winning physicist, put it this way in his book “The Pleasure of Finding Things Out”:
Well, Mr. Frankel started this program and began to suffer from a disease, the computer disease, that anybody who works with computers now knows about. It’s a very serious disease and it interferes completely with the work. It was a serious problem that we were trying to do. The disease with computers is that you play with them. They are so wonderful. You have these x switches that determine, if it’s an even number you do this, if it’s an odd number you do that, and pretty soon you can do more and more elaborate things if you are clever enough, on one machine. And so after a while the whole system broke down. He wasn’t paying any attention; he wasn’t supervising anybody. The system was going very, very slowly. The real problem was he was sitting in a room figuring out how to make one tabulator automatically print arc-tangent x, and then it would start and it would print out columns and then bitsi, bitsi, bitsi and calculate the arc-tangents automatically by integrating as it went along and make a whole table in one operation. Absolutely useless. We HAD tables of arc-tangents, but if you’ve ever worked with computers you understand the disease. The DELIGHT to be able to see how much you can do. But he got the disease for the first time, the fellow who invented the thing got the disease.
Keeping that in mind, here are the best tools to know if you want to perform data science. Also, I’ve included information to hopefully save beginners from developing unhealthy idealizations about “learning to code”.
- Python : A programming language that is easy to learn, write and read. Python is very popular in the data science community. As a result of its popularity, there are open source libraries (usable code written and freely released by other people) that allow one to apply almost any technical idea with very little engineering work. Python isn’t the fastest or most efficient computer language out of the box, but it is becoming the lingua franca: many smart engineers build efficient backends in faster languages and expose them through Python. You learn Python by using it to solve many different problems. In trying to do different things (statistics, building a website, performing calculations, automating trivial tasks) you come to understand how the language works and how to solve problems efficiently. Learning a computer language is hard, so stick with it and try to solve problems! Programming isn’t about knowing the syntax, it’s about knowing how to transform a problem into code. Great programming is knowing how to transform it into efficient and readable code that other people can use and change easily.
- SQL : A language designed specifically for querying and transforming data stored in database tables. Tables are the most common way data is stored, so SQL is a very necessary skill. You become really good at SQL by using it a lot and wrangling different types of data in many different ways. Even after 5 years and thousands of hours, I still discover new ways to use it. I’ve seen many a punk come in thinking they didn’t need SQL. They were just going to do everything in Python. I was one of the worst offenders. Anyone who thinks Python is the solution to every problem doesn’t understand the problems deeply enough. One of my favorite sources for learning SQL is W3Schools. I owe them for getting me through more than one interview. https://www.w3schools.com/sql/sql_select.asp
- Computer Science : Data Science is only possible due to computers. The more deeply you understand computers, the more deeply you understand what happens every time you take action as a data scientist. The smartest data scientists I know understand computers in such a deep way that they can do almost anything with them.
- Statistics : The more statistics you know, the more questions you can ask.
- Data Mining / Machine Learning : These are all the algorithms birthed from the union of computer science and statistics. This space is what happens when you can calculate billions of things very quickly. You can do some truly amazing things here, like teach a computer to identify faces, translate languages, and forecast the weather. This is the sexy part of the job that everyone wants to do, but only the privileged few with access to lots of data get to.
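To give a feel for how the SQL and Python bullets above relate, here is a small sketch that answers the same question both ways: total spend per customer, largest first. The `orders` table, its columns, and the rows are made up for illustration, and the in-memory SQLite database from Python's standard library stands in for whatever database you would actually use.

```python
import sqlite3
from collections import defaultdict

# Hypothetical order data: (customer, category, amount).
rows = [
    ("alice", "books", 12.5),
    ("bob", "games", 30.0),
    ("alice", "music", 7.25),
    ("carol", "books", 20.0),
    ("bob", "books", 5.5),
]

# The SQL way: load the rows into a table and let the database aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, category TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
sql_totals = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

# The pure-Python way: the same grouping and sorting, written by hand.
totals = defaultdict(float)
for customer, _category, amount in rows:
    totals[customer] += amount
python_totals = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)

print(sql_totals)
print(python_totals)
```

Both approaches produce the same totals here. The difference shows up when the data already lives in a database and is too large to pull into memory: the SQL version states what you want and leaves the how to the database, while the Python version needs every row shipped over first.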
Where To Start?
My favorite book for learning Data Science style Python
Python for Data Analysis Book
Intro to Statistics Book
Famous beginner book I quite like