How-To Accelerate Your Data Science Career Through Real-World Experience
Six skills you need to make an impact and how to acquire them.
To swim or not to swim
You cannot learn how to swim just by reading about it in books and on the Internet, not even if you memorize every swimming style and technique. You need to get into the water to really learn how to swim.
The same is true for data science (and any other field). You can take all the courses and certifications out there (and some of them are pretty good), but until you get your hands dirty on real problems in the real world, you won’t learn what is really needed.
In this article, Phenyo Phemelo Moletsane, Arafat Bin Hossain, and I share our experiences of learning data science in the real world.
Start early on to get your hands dirty
In my senior year of college, studying Computer Science, I did an internship at a telecommunications company, where I worked in the application development area.
It was my first time doing real work outside the classroom.
The first few weeks were pretty boring, to be honest.
Finally, a new project about developing a system for retail stores came in.
Before that, I had never done any real project from scratch, with actual requirements, an actual goal and an actual due date.
I didn’t even know how to start!
Luckily, I had a mentor (and a good one) who taught me all the steps needed for a project like this. Long story short, with his help we were able to deliver the project on time and meet all the requirements.
I was extremely happy about this.
I learned more in this project than in all my previous years in college.
Don’t get me wrong! I used all the knowledge I had acquired in classes, but until then I didn’t really know how to apply it in the real world. From that point on, everything was different: now I really knew how to start, develop, and deliver a real project.
Six important real-world data science skills
Although you cannot put together a one-size-fits-all list of data science skills, some are common to most scenarios.
All of them were reinforced or learned by being part of Omdena’s AI challenges.
1. Love for reading (or at least discipline to read)
You need to read a lot, and trust me when I say a lot: everything from research papers to Google Colab / Jupyter notebooks, articles on new approaches (often called state-of-the-art), and of course a lot of course material.
If you don’t like reading, at least try to set yourself a reading goal. If you are working on a specific project, let’s say computer vision, you can start by gathering a reading list of the most relevant content you can find, set a deadline, and split the content across the remaining days. This way you can keep track of what you have left to read.
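As a rough illustration of that planning step, here is a minimal Python sketch that spreads a reading list evenly over the days left until a deadline. The function name `build_reading_plan`, the titles, and the dates are all made up for the example; treat it as one possible way to keep track, not a prescribed method.

```python
from datetime import date, timedelta


def build_reading_plan(items, start, deadline):
    """Spread items as evenly as possible over the days from start to deadline (inclusive)."""
    days = (deadline - start).days + 1
    if days < 1:
        raise ValueError("deadline must not be before start")
    # Every day gets `base` items; the first `extra` days get one more.
    base, extra = divmod(len(items), days)
    plan = {}
    index = 0
    for offset in range(days):
        count = base + (1 if offset < extra else 0)
        plan[start + timedelta(days=offset)] = items[index:index + count]
        index += count
    return plan


# Hypothetical reading list for a computer vision project
reading_list = [
    "Survey paper on object detection",
    "Tutorial notebook on image augmentation",
    "Blog post on transfer learning",
    "Paper on semantic segmentation",
    "Course chapter on convolutional networks",
]

plan = build_reading_plan(reading_list, date(2024, 1, 1), date(2024, 1, 3))
for day, todo in plan.items():
    print(day.isoformat(), "->", todo)
```

Anything you do not finish on a given day can simply be fed back into the function with the new start date.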
2. Basic Calculus / Statistics
This one is pretty obvious: all data science work is in some way based on calculus and statistics. If you think you might be lacking specific knowledge in one of these areas (or you might be a little rusty), it will help you a lot to go over it again before starting to work on a specific problem. Trust me, if you don’t, the problem itself will ask you for it.
3. How to really collaborate and communicate (The 2 C’s)
Data science is not just about data wrangling and building machine learning models. Communication also forms a big part of it! One of the skills that you acquire and exercise when working with the Omdena community is communication. This includes both written and verbal communication.
The weekly meetings with collaborators from different parts of the world greatly contribute to building communication skills. However complex your methodology and algorithms are, you need to be able to clearly communicate your findings to both technical and non-technical people.
A good data scientist is able to write up and present results, ideas, and approaches through effective data storytelling. Don’t simply show tables of your data; tell a story with it! Communication also means being able to listen and interpret what others are saying so that you can understand the problem at hand.
Working on Omdena projects in a hyper-distributed team (people in more than 3 time zones) makes collaboration a necessity. Being an expert in a specific area is worth nothing if you are not able to work with your whole team (which could be more than 20 people).
The best way to improve/learn this is by always asking yourself the following questions:
- “How can I help my team succeed, even if I am not the one doing the work?”
- “What is my team (or a specific team member) expecting from me?” (it could be an email, a piece of code, a message on Slack, or just a link to an article)
- When you need something from someone: “What is the best format/way to ask for [code/information/links], and who is the best person to provide it to me?” and “Where is this person located?” (you might need to wait until the next day to get what you are asking for).
4. Ability to derive questions from the problem statement
In data science, when you are dealing with a problem statement, what do you do first? Think about the solution, right?
Do you ever question the problem itself? For example, if you are asked to detect fake news in a series of emails, the tendency is to dive straight into learning several algorithms, gathering relevant code, and then gluing all these tools together.
Nothing sounds wrong here, right?
In fact, this is the wrong approach. You spend too much of your initial time gathering material that you would pick up along the way anyway and that is generic rather than specific to your context. Learning anomaly detection algorithms in general is not what the solution requires. Rather, learning how to detect anomalies in the context of emails is what I would be interested in. I would read papers on that and look at related code on GitHub. How do you decide which specifics you need to work on? It is not difficult: ask questions!
Rearrange the problem statement in terms of questions and answers, such as:
Can I detect anomalies if I am given a list of emails?
In this way, you will be more confident about what you want to do, and less likely to get overwhelmed by other things and distracted from the context. Honestly speaking, this is not a big struggle when the problem can readily be interpreted as a machine learning problem. But it can come in very handy when the problem calls for a more statistical approach than an ML one.
Let’s briefly discuss such a scenario.
Suppose you have a research dataset in your hands with tons of information about the academic results of many students from different areas, and you are wondering what to do with the data. Trust me, there are two buzz phrases in data science that are both catchy and handy:
a) Tell stories through data
b) Ask questions from the data
While I have been talking about (b), it is actually (a) that gives you a quick kickstart in formulating the problem statement. In the research described above, you may have no research question at hand, which is quite usual, but what you do have is the tool called “tell stories through data”. For example, you can start telling stories from the data above.
Let’s take a simple walkthrough: “Among all the locations, location X has more prominent students while location Y is lagging behind. Wow, the GPA distribution in location Z is not normal. Why is that? Hmm, an interesting fact: students are concentrated in the metro city rather than in the remote areas, and the skewness is striking. That may reflect socio-economic divisions in society. Most interesting of all, there are certain remote locations where the quality of students is very good. Maybe I can use this realization to help the authorities identify regions whose students have good potential but are lagging behind because of a lack of infrastructural development.” So, what if my problem statement is: “If I am given the academic records of many students from many regions, can I find the regions that are remote and yet contain many high-potential students? What other categories of regions can I come up with?”
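To make that final problem statement concrete, here is a minimal Python sketch of how the question could be asked of the data. Everything in it is an assumption for illustration: the records are made up, the `is_remote` flag stands in for whatever remoteness measure the real data offers, and mean GPA with a fixed threshold is only a crude proxy for "potential".

```python
from collections import defaultdict
from statistics import mean

# Made-up records purely for illustration: (region, is_remote, gpa)
records = [
    ("X", False, 3.8), ("X", False, 3.6),
    ("Y", False, 2.4), ("Y", False, 2.6),
    ("Z", True, 3.7), ("Z", True, 3.5),
    ("W", True, 2.2), ("W", True, 2.8),
]


def summarize_regions(rows):
    """Return the mean GPA and the remoteness flag for each region."""
    gpas = defaultdict(list)
    remote = {}
    for region, is_remote, gpa in rows:
        gpas[region].append(gpa)
        remote[region] = is_remote
    return {
        region: {"mean_gpa": mean(values), "remote": remote[region]}
        for region, values in gpas.items()
    }


def remote_high_potential(summary, gpa_threshold):
    """List remote regions whose mean GPA reaches the (assumed) threshold."""
    return sorted(
        region
        for region, stats in summary.items()
        if stats["remote"] and stats["mean_gpa"] >= gpa_threshold
    )


summary = summarize_regions(records)
candidates = remote_high_potential(summary, gpa_threshold=3.5)
print(candidates)
```

Changing the filter (for example, non-remote regions with a low mean GPA) is one way to explore the "other categories of regions" the question asks about.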
5. Real-world data problems: Messy data
The reality is that real-world data is messy and far from perfect, so a considerable amount of time goes into understanding, cleaning, and preparing it. A commonly cited estimate is that 70–80% of your time is spent on collecting and cleaning data. Working with real data is often more challenging than working with synthetic data. It is very common for students to have important technical skills and knowledge but lack real-world experience, because courses often provide synthetic data. Taking on real-world data problems like the Omdena challenges lets you work with real data and answer real-world questions. Real-world experience means understanding the problem, extracting the data, and going through the steps of cleaning the data, generating features, choosing performance metrics, choosing a model, and actually building it.
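For a small taste of what that cleaning step can look like, here is a minimal Python sketch. The field names, the messy rows, and the 0 to 4 GPA scale are all assumptions made up for the example; real projects will need their own rules.

```python
def clean_records(raw_rows):
    """Normalize text fields, parse GPA values, and drop duplicate or unusable rows."""
    cleaned = []
    seen = set()
    for row in raw_rows:
        # Collapse stray whitespace and normalize capitalization.
        name = " ".join((row.get("name") or "").split()).title()
        region = (row.get("region") or "").strip().upper()
        # Accept a decimal comma ("3,7") as well as a decimal point.
        gpa_text = str(row.get("gpa", "")).strip().replace(",", ".")
        try:
            gpa = float(gpa_text)
        except ValueError:
            continue  # GPA cannot be parsed (e.g. "N/A" or empty)
        if not name or not (0.0 <= gpa <= 4.0):
            continue  # missing name or GPA outside the assumed 0-4 scale
        key = (name, region)
        if key in seen:
            continue  # same student already recorded for this region
        seen.add(key)
        cleaned.append({"name": name, "region": region, "gpa": gpa})
    return cleaned


# Made-up messy input for illustration
raw = [
    {"name": "  alice   smith ", "region": "x ", "gpa": "3,7"},
    {"name": "Alice Smith", "region": "X", "gpa": "3.7"},   # duplicate
    {"name": "bob jones", "region": "y", "gpa": "N/A"},     # unparseable GPA
    {"name": "Carol Diaz", "region": " z", "gpa": "35"},    # out of range
    {"name": "dan lee", "region": "Y", "gpa": 2.9},
]

cleaned = clean_records(raw)
for row in cleaned:
    print(row)
```

Silently dropping rows is a design choice that keeps the sketch short; in a real project you would usually log or count what was discarded and why.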
6. Patience
Patience is indispensable for a data scientist. Most of your project time will go into polishing and preparing the dataset, and you may easily get demotivated. Trust me, everyone has faced this. So, do not try to ignore it: bypassing this step will come back to bite you in the long run. To be good at data science, learn as many data engineering tools as you can.
Learn how to use different databases and query languages to prepare different types of datasets. Learn the 3 ‘V’s of Big Data and understand why storage (volume) is also an essential concern in data science and how to deal with it. To become a polished data scientist, it always helps to know what is happening in the world of Big Data engineering and how large companies are bringing Big Data engineering and data science together. Data preparation and management is a huge part of data science; it can distract you and give you the impression that you are doing something other than data science, but that’s the catch.
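As one small example of using a query language to prepare a dataset, here is a minimal Python sketch with the standard library's `sqlite3` module. The `students` table, its columns, and its rows are made up for illustration; the same idea applies to whichever database your project actually uses.

```python
import sqlite3

# In-memory database with a made-up table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, region TEXT, gpa REAL)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [("Alice", "X", 3.7), ("Bob", "X", 3.1), ("Carol", "Y", 2.9), ("Dan", "Y", None)],
)

# Let the database filter out missing values and aggregate per region,
# so the analysis code receives a dataset that is already prepared.
rows = conn.execute(
    """
    SELECT region, COUNT(gpa) AS n_students, ROUND(AVG(gpa), 2) AS mean_gpa
    FROM students
    WHERE gpa IS NOT NULL
    GROUP BY region
    ORDER BY region
    """
).fetchall()
conn.close()

for region, n_students, mean_gpa in rows:
    print(region, n_students, mean_gpa)
```

Pushing the filtering and aggregation into the query means far less data has to travel into your analysis code, which matters more and more as volume grows.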
All of the mentioned skills were reinforced or learned by being part of Omdena’s AI challenges.