How to say No to Data Science Projects? — Interview with Saikat, Data Scientist at Fractal Analytics
Saikat explains a framework he uses to determine the value proposition of Data Science projects and to say ‘No’ to them.
Saikat Kumar Dey is a Data Scientist at Fractal Analytics. His story of going from Software Engineer to Data Scientist is fascinating. I discovered Saikat through his website, http://saikatkumardey.com, which features some great projects. In the interview, he detailed his thought process for saying ‘no’ to data science projects. His answer was both well organized and actionable. Please read the interview to know more…
Vimarsh Karbhari (VK): What top three books about AI/ML/DS have you liked the most? What books have had the most impact in your career?
Saikat Kumar Dey (SD): I learn by doing, so I like to write code along with the books I read. The three books that I’ve liked the most are: Programming Collective Intelligence, Machine Learning in Action and Think Stats.
VK: What tool/tools (software/hardware/habit) that you have as a Data Scientist has the most impact on your work?
SD: Jupyter notebook. Creating reproducible work that can be shared with colleagues or with the world has never been easier. Jupyter notebook makes it really easy to write, code and present your work to anyone. I like to write small notes on my thought process in Markdown while I work on a project. When I need to go back to my old projects, I can easily recollect what I was thinking. It’s all there. Also, while reading a new book on ML, I try to find code on GitHub for that book. This helps me reproduce the ideas in the book.
VK: Can you share about the Data Science related failures/projects/experiments that you have learned from the most?
SD: I remember working on a competitor analysis project for my previous company. It was my first Data Science project there. The idea was to find similar companies based on their FourSquare tags: similar companies in the same neighbourhood could be competitors. I got the data from FourSquare and used a similarity-based approach to solve the problem.
However, I never thought to ask: who will be using this model? How is it going to be deployed? As it turned out, most of the SMBs (small and mid-size businesses) who were our customers were not listed on FourSquare. Most of them were not present in Google Places either, which we needed in order to establish the proximity of any two companies. The project was later scrapped.
I saw an interesting problem and went on to solve it. The accuracy/performance of the solution was deemed irrelevant after we figured out that we couldn’t use it for our customer base. Huge lesson learnt. :)
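To make the "similarity-based approach" concrete, here is a minimal sketch that compares companies by the overlap of their tag sets. The interview does not say which similarity measure was used, so Jaccard similarity is an assumption here, and the company names and tags are made up for illustration.

```python
def jaccard(tags_a, tags_b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # two empty tag sets carry no evidence of similarity
    return len(a & b) / len(a | b)


def most_similar(target, companies, top_n=3):
    """Rank every other company by tag similarity to `target`."""
    scores = [
        (name, jaccard(companies[target], tags))
        for name, tags in companies.items()
        if name != target
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical data: company name -> set of tags
companies = {
    "Cafe A": {"coffee", "bakery", "wifi"},
    "Cafe B": {"coffee", "wifi", "breakfast"},
    "Gym C": {"fitness", "yoga"},
}

print(most_similar("Cafe A", companies, top_n=1))  # [('Cafe B', 0.5)]
```

A sketch like this only works for companies that actually have tags, which is exactly the coverage problem described above.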
VK: If you were to write a book what would be the title of the book? What would be the main topics you would cover in the book?
SD: “Applied Machine Learning” — The book would take the readers on a journey of building projects end-to-end. It will take a top-down approach to learning. Most of the books/blogs/MOOCs build proof-of-concepts when demonstrating application of ML. These are useful to beginners. However, most people do not know where to go next. They learn it the hard way. For an advanced learner, it is important to know:
- How to ask the right questions?
- How to gather quality data?
- How to build an efficient data-storage strategy (if quality data is not available)?
- How to build an automated pipeline to train/validate/deploy/monitor ML models?
- How to build an engineering pipeline to allow others to use your ML application?
- How to build an MVP and plan how to iterate on it?
VK: In terms of time, money or energy what are the best investments you have made which have given you compounded rewards in your career?
SD: Approaching a professor in my department to work on some interesting problems under him was the best decision that I made at university. I was preparing myself to be a Software Engineer. However, working on those projects sparked my interest in ML and, with a bit of luck, I got to start my career as a data scientist.
Working on side projects, which I put up on GitHub from time to time.
Volunteering for The Fifth Elephant, a data science conference. I got to meet so many interesting people there. I met people who had a deep understanding of ML/DL algorithms, which inspired me to strengthen my foundations.
VK: What are some absurd ideas around data science experiments/projects that are not intuitive to people looking from outside in?
SD: Expectations of ML-based applications are high, thanks to the AI hype in recent years. People think that ML can create something out of nothing. The GIGO (Garbage In, Garbage Out) principle is apt in this context.
I remember a particular incident. We were working on building a chatbot. We built it to solve a set of problems constrained to a particular domain. People’s expectations of the chatbot were as high as for Siri/Google Assistant/Alexa. People hardly understood that we were building it from scratch :). I had colleagues (software engineers) who would occasionally drop by and advise me to use Deep Learning (LSTMs, particularly). It was important that we conveyed the capabilities of the system up front.
VK: In the last year, what has improved your work life which could benefit others?
SD: Taking notes on decisions made at various stages of the project life-cycle (mostly in Google Docs) and sharing them with the team. This helps keep everyone on the same page regarding the status of the work.
VK: What advice would you give to someone starting in this field? What advice should they ignore?
SD: Focus on building cool stuff. Then drill down and learn the algorithms/techniques used in building it.
Ignore people/books/videos that promise to:
- Teach Data Science or ML with no math. Math is important. You should learn how an algorithm works, the assumptions it makes and why it works. Having a solid foundation in linear algebra and statistics will take you a long way.
- Teach Data Science in X weeks/months. It’s a huge field that takes years to become really good at. By the time you are close to catching up, the field will have advanced further. I urge you to read “Teach Yourself Programming in Ten Years”. This article is apt for learning of any kind, in any field.
VK: What are bad recommendations given in data science, in your opinion?
SD: More emphasis is placed on algorithms than on data. As I mentioned earlier, the “garbage in, garbage out” principle is quite apt. Deep learning is not applicable everywhere. Brute-forcing your way through every available algorithm doesn’t work well if you don’t stop and think about what’s going on. Many of the problems that you’d solve in your company might have to be solved from scratch, with little or no data available. What would you use then?
VK: How do you decide when to say No to experiments/projects?
SD: I start by ordering the projects based on their Value Proposition. Then I work through the important questions listed below. Selecting the important tasks is then easy, as you’re rating them objectively. Most of the time, building a feature/application is a business decision and you have to comply. In those cases, it is important that you communicate the limitations and set expectations early on.
The following questions should be asked before taking on any data science project:
- Is this problem worth solving?
- Who will be using our application?
- Do we have the required data to solve this problem right now?
- What are our data sources?
- If we don’t have any data right now, can we build a pipeline to collect data now so that we could use Data Science in future?
- Will a heuristic work here instead of ML?
- What kind of engineering effort do we require to support this application?
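As a rough illustration of how these questions could be turned into an objective rating, here is a small sketch. The `Proposal` fields, the 1-5 value scale, the decision order in `triage` and the example projects are all assumptions made for this example; they are not Saikat’s actual scoring scheme, and the engineering-effort question is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    name: str
    value: int              # hypothetical 1-5 rating of the value proposition
    has_users: bool         # do we know who will use the application?
    has_data: bool          # is the required data available right now?
    can_collect_data: bool  # could a pipeline start collecting it now?
    heuristic_works: bool   # would a simple heuristic do instead of ML?


def triage(p: Proposal) -> str:
    """Map the checklist answers to a recommendation (illustrative rules only)."""
    if p.value <= 1 or not p.has_users:
        return "say no"
    if p.heuristic_works:
        return "start with a heuristic"
    if p.has_data:
        return "build with ML"
    if p.can_collect_data:
        return "build the data pipeline first"
    return "say no"


def rank(proposals):
    """Order proposals by value proposition, highest first."""
    return sorted(proposals, key=lambda p: p.value, reverse=True)


# Made-up proposals, purely for illustration
proposals = [
    Proposal("Project A", 3, has_users=False, has_data=True,
             can_collect_data=True, heuristic_works=False),
    Proposal("Project B", 4, has_users=True, has_data=False,
             can_collect_data=True, heuristic_works=False),
]

for p in rank(proposals):
    print(f"{p.name}: {triage(p)}")
```

Writing the answers down in a structure like this also gives you something concrete to share when you need to explain why a project was declined or deferred.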
VK: Do you ever feel overwhelmed by the amount of data or size of the experiment or a data problem? If yes what do you do to clear your mind?
SD: Most of the problems that I’ve tackled so far didn’t involve overwhelming amounts of data. However, a project can be overwhelming because of too many unknowns (especially if you have to go ahead and do it anyway). In those cases, I try to simplify my design and build out the first workable version (MVP), with the thought that I’ll iterate and improve it further over time.
VK: How do you think about presenting your hypothesis/outcomes once you have reached a solution/finding?
SD: Following a systematic way of building out the project reduces the extra effort of presenting it. I mostly work in Jupyter Notebooks, which can be presented as slides at any point. The framework for presenting the outcomes is:
- Problem Statement
- Value Proposition
- Assumptions made
- Interesting insights from exploratory analysis (in the form of visualizations/aggregations/stats).
- Explanation of the model predictions (if a predictive model was built).
- Examples of False Positives/False Negatives and strategies on how those could be reduced in the next iteration (depending on the business use-case).
- Future considerations.
VK: What is the role of intuition in your day to day job and in making big decisions at work?
SD: Intuition helps you estimate the effort required to solve a particular problem. Sometimes building an application might seem simple. However, intuition might help you gauge the effort required to manage the application at scale. Intuition also helps in anticipating problems that might arise in the future due to decisions taken in the present. It only gets better with experience, so I brainstorm with my colleagues (who are more experienced than me) before making big decisions. This helps me see things from different perspectives.
VK: In your opinion what is the ideal Organizational placement for a data team?
SD: A data team should report directly to the CEO and work closely with the product and engineering teams. It’s important for the data team (more than any other team) to be aligned with the vision of the company. To put things into perspective, there might not be any data to begin with. Having data science at the centre of a product helps in deciding the automation/intelligence plans early on. This helps in prioritizing the pipeline in such a way that when we have enough data, we can make proper use of it.
VK: If you could redo your career today, what would you do?
SD: I would read up more on Statistics (Bayesian and Frequentist) and Linear Algebra. There are plenty of new ideas in Machine Learning expressed in research papers. Having a strong mathematical foundation would help me understand the intuition behind those ideas and reproduce them.
VK: What are your filters to reduce bias in an experiment?
SD: I use stratified sampling to divide my dataset into train/validate/test sets so that the samples in each set are proportional to the distribution of sub-groups in the original dataset. Boosting techniques also help reduce bias.
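For readers who have not seen it before, a stratified train/validate/test split can be sketched in plain Python as follows. The function name `stratified_split`, the 60/20/20 fractions and the toy labels are assumptions for illustration, not details from the interview.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, validate, test) index lists that preserve label proportions."""
    rng = random.Random(seed)

    # Group row indices by their label (the sub-group to stratify on).
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, validate, test = [], [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_validate = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        validate.extend(indices[n_train:n_train + n_validate])
        test.extend(indices[n_train + n_validate:])  # whatever is left over
    return train, validate, test


# Toy imbalanced dataset: 20% "churn", 80% "stay"
labels = ["churn"] * 20 + ["stay"] * 80
train, validate, test = stratified_split(labels)

for name, split in [("train", train), ("validate", validate), ("test", test)]:
    churn_share = sum(labels[i] == "churn" for i in split) / len(split)
    print(name, len(split), churn_share)
```

Each split keeps the same 20% share of the minority class as the full dataset. In practice, scikit-learn’s `train_test_split` accepts a `stratify` argument that does the same for a single split.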
VK: When you hire Data Scientists or Data Engineers or ML Engineers, what are the top three technical/non-technical skills you are looking for?
SD: If I had to hire a Data Scientist, the top three skills I would look for are:
- Strong Problem Solving/Coding skills.
- Strong statistical foundations.
- Good communication skills — Ability to explain concepts at various levels of abstraction, depending on the audience.
For Data Engineers/ML Engineers, strong statistical foundations are good to have, whereas problem-solving/coding skills and communication skills are must-haves.
I would like to work with people whose skill sets are diverse in nature. Having a curious nature helps, as this makes sure that you get to learn from each other on a day-to-day basis.
VK: What online blogs/people do you follow for getting advice/ learning more about DS?
SD: I read a lot on arXiv. It’s the best resource for staying up to date on advances in the field.
Reading Kaggle kernels helps me learn ways of analysing diverse datasets.
Datatau, KDnuggets and Reddit (/r/MachineLearning, /r/DataScience, etc.) also help in discovering the latest resources/tutorials about the field.
People can follow Saikat’s work on his website: http://saikatkumardey.com