Busting Common Data Science Myths

AIQ Episode 4

Mehul Gupta
Data Science in your pocket

--

In this episode of AIQ, an AI podcast for beginners in ML and AI, we try to debunk some common myths circulating among beginners. The entire episode is available here

The summarized version can be read below:

Host: The first and foremost question that I get asked a lot is: In a machine learning job role, does the data scientist or the machine learning engineer just build models, or do they do other tasks as well? What’s your perspective on this?

Guest: I guess since the primary role of a machine learning engineer or a data scientist is to solve problems with data, they must create models or solutions, which involves creating data pipelines and such. So, I think playing with data and using it to create models or predictive analyses should be the majority of their work.

Host: This answer is actually not in sync with how a data science job role works. Building and deploying models is just 10–20% of the entire job. You would also be acquiring data, analyzing it, meeting with the business team to understand metrics, building dashboards, and more. Model building takes the least time because most functions are predefined. So, remember that a data science role is not just about model building.

Host: Moving on to our next question, which is also very interesting: What languages do you think are required to work in a data science role? Do you think all languages like Java, C++, Go, etc., are used, or is there a particular set that is commonly used?

Guest: There should be a basic stack required, starting from C, C++, Java, JavaScript, and also Python. Some other languages might be required occasionally, like Swift for Apple’s ecosystem and Go for Google’s ecosystem. So, a couple of languages should be required to make tasks easier on a day-to-day basis.

Host: This is a complete myth. Data science really needs only one of two languages: Python or R. Python has the bigger community, so it's often preferred, but R is equally beneficial. Knowing either language at a basic level, along with a few major libraries, is sufficient for an entire career. It's not as difficult as people think.
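To illustrate the "basic level with some major libraries" point, here is a minimal sketch of everyday data work using nothing but pandas (the tiny dataset is made up for the example):

```python
# A few lines of pandas cover a surprising amount of day-to-day analysis:
# build a table, then summarize a metric per group.
import pandas as pd

# Hypothetical revenue figures, just for illustration
df = pd.DataFrame({
    "team": ["A", "A", "B", "B"],
    "revenue": [10, 12, 7, 9],
})

# Average revenue per team
print(df.groupby("team")["revenue"].mean())
```

The same few-lines pattern (load, group, summarize) carries over to R with dplyr, which is part of why either language alone is enough.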

Host: Now, on to our next query that I receive frequently, especially on Reddit: Do you require an MS or PhD degree to enter the data science field, or can a 12th pass, diploma holder, or bachelor’s degree holder get into data science and AI?

Guest: Based on my connections and people in the field, most are pursuing advanced degrees like an MS or PhD, or postdoctoral research. So, I guess the higher your education level, the more impactful the roles, such as data scientist or machine learning engineer, you can pursue.

Host: This is one of the biggest myths circulating in the field. It is not mandatory to have a higher degree, although it’s beneficial. I’ve seen colleagues from various backgrounds, not even computer science or mathematics, enter data science. For research roles, higher degrees might be required, but most openings are for applied scientists, where even a diploma holder can enter. So, you don’t need to worry about your degree.

Host: Moving on, you must have heard of neural networks. Do you think data scientists use technologies like neural networks frequently, given their complexity?

Guest: Based on my previous experiences and Andrew Ng's machine learning course, neural networks seem very important. So, I guess people use transformers and neural networks at least every other day, if not daily, to stay impactful.

Host: Before I give the final answer, I’ll share a short story. I’ve been in the field for nearly five years and haven’t deployed a single neural network yet. Most of my models are classification models. Companies avoid deploying neural networks because they’re too complicated to explain. Neural networks sound fancy, but most of the time, you avoid them in real-world scenarios. For beginners, neural networks aren’t typically required, either in job roles or in interviews.
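The explainability point is worth seeing concretely. A sketch of why simple classifiers are easier to defend to a business team than a neural network: a logistic regression exposes one coefficient per feature, which directly shows how each input pushes the prediction (the dataset here is just scikit-learn's built-in example):

```python
# A simple classifier you can explain: each feature gets one learned
# coefficient whose sign and size show its influence on the prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# Scale features so coefficient magnitudes are comparable
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# Rank features by how strongly they drive the prediction
coefs = clf.named_steps["logisticregression"].coef_[0]
ranked = sorted(zip(X.columns, coefs), key=lambda t: abs(t[1]), reverse=True)
for name, weight in ranked[:3]:
    print(f"{name}: {weight:+.2f}")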

Host: Moving on to the next section, this is a frequent question on Reddit: How much mathematics do you feel you need to know before jumping into a data science or machine learning role? Do you think you need to know a lot of complex math?

Guest: Initially, I thought I should know differentiation, integration, derivatives, complex algorithms, and probability and stats to a good level. Real-world challenges are complex, so knowing more math would be helpful.

Host: Math is very important, no matter what some influencers say. You need to understand the logic behind algorithms. While Python provides predefined functions to implement algorithms, you need to choose which algorithm to use, requiring a solid understanding of the math behind it. However, this is a one-time read. Once you understand the math behind common algorithms, you won’t need to revisit it frequently. Basic concepts should suffice for most of your job.

Host: Now, moving ahead, how much do you think a data scientist codes? Can you quantify the number of lines you think they write for a project?

Guest: From my learning experience, I understand that some things are made shorter with tools like Jupyter and PyCharm, but I still feel a lot of lines of code are required. I’d guess even a basic model would have at least 100–200 lines of code.

Host: Recently, I built a classification model that took me five to six lines of code for the entire pipeline, and the project was delivered. AutoML handles everything: loading the data, cleaning and preprocessing it, feeding it into the model, reporting metrics, and saving the model. Data science is more theoretical than practical; you code very little but read a lot. A typical project might have 20–30 lines if done manually, and the more you use AutoML, the less you code.
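AutoML libraries differ in their APIs, so as a neutral sketch, even plain scikit-learn gets close to the few-lines claim: a `Pipeline` compresses preprocessing, modeling, and evaluation into a handful of lines (using a built-in toy dataset here):

```python
# A whole classification pipeline in a few lines: scaling, a model,
# and cross-validated accuracy, all chained together.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))
print(f"mean CV accuracy: {cross_val_score(pipe, X, y, cv=5).mean():.2f}")
```

An AutoML tool goes one step further by also searching over preprocessing steps and model families, which is how the host's pipeline stayed so short.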

Host: One more important question: How do you think the JIRA system works for data scientists? For those who don’t know, a Sprint is a 15-day period where you deliver specific tasks. How do you think Sprints work in data science, and would they be successful?

Guest: I’m aware of Sprints and tickets. The product manager would check with the data scientist about the feasibility of deploying a model, taking perhaps a week for research, another week for data collection, and a third week for deploying the model. So, I think a Sprint might be around three weeks.

Host: Most companies don’t have Sprints for data scientists; in my five years of experience, I haven’t encountered any. Data science is more research-based, and you can’t expect deliverables in seven days. You need time to check the data, preprocess it, and run multiple experiments, so fixed Sprint deadlines rarely fit the workflow.

Host: Moving on, a common question is: What’s the success rate for converting POCs (Proof of Concepts) to real projects? If you start with 10 POCs, how many do you think get converted into actual projects?

Guest: I guess companies wouldn’t let resources go forward with random POCs. They’d give specific areas of research and problems to solve. Depending on how well the POC works, the success rate might be around 80–90%.

Host: The success rate is actually less than 5%. Most POCs don’t get made into projects. From the start of the year, I’ve worked on 10–15 POCs, and only one or two are going into production. Many random problems get thrown at you, and most POCs are rejected. So, the success rate is very low.

--