Could Code-Writing AI (e.g. GitHub’s Copilot) lead to a resurgence of the “art” of Data Science?

Amogh Borkar
Published in CodeX
4 min read · Aug 15, 2021
The key focus areas for a Data Scientist

GitHub and OpenAI’s launch of Copilot (i.e. AI that can write code) has drawn mixed reactions, ranging from scepticism to fear. While I’ve read a lot of posts on first impressions or on what it could mean for developers, here’s my take on one path it could go down, and how that path could make data scientists’ lives better in the near future.

Data science practitioners who have been around long enough have seen the field transition from Statistics → Predictive Modelling → Machine Learning → Deep Learning. The traditional data scientist has been more of an artist, or a “jack of all trades, master of some”, given that they needed to strike a balance between various fields (as illustrated in the diagram above) to solve business problems. For the more technical people who grew into data scientists, understanding the business angle has been a hard skill to master.

As the field and industry matured, we ended up with thousands of great data science models that worked only in a lab and couldn’t be moved to real-world business applications. There were numerous issues, such as incompatible infrastructure, old ways of working, or a lack of software engineering skills across programming languages. Because of this, the key pain point for data science over the last couple of years has been ‘MLOps’, and particularly ‘model deployment’.

What is MLOps & model deployment?

In common terms, MLOps (Machine Learning Operations) is about managing the entire data science lifecycle better and making it repeatable. As a consequence, it also makes models that work under standard test conditions easier to move to the real world (i.e. deployment). (I found a more detailed overview of MLOps here) However, MLOps has its own learning curve. This is partly because of the additional technical complexities and jargon one has to learn, such as telemetry, container orchestration and AI operationalization.

Where is MLOps today?

While some organisations have developed machine learning engineers to handle the heavy lifting of MLOps, a lot of startups have also sprung up in this area. Their main aim is to build platforms that reduce MLOps complexity for data scientists and make them more efficient.

This has somewhat shifted the focus of the data scientist away from their core expertise and reduced the amount of time they can spend on understanding the business problem and asking the question ‘Why?’ Further, we are seeing a lot of automated ML/AI models and no-code ML platforms being launched. However, many don’t realise that selecting and running the model is only a small part (roughly 10–20%) of what a data scientist does or should do. Major chunks of time are needed for upstream activities such as understanding the business problem, translating it into a data science problem, and exploring and analysing the data thoroughly.

The downside of data scientists spending less time on these upstream aspects is that today we are seeing models deployed that were trained without anyone understanding or cleaning the data. Sometimes, accurate models are deployed that predict a factor that was already known, or that contribute nothing towards answering the business question.

Bill Gates’s quote on automation seems apt to describe the risk here: “The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency.”

What is GitHub’s Copilot?

GitHub partnered with OpenAI to take its state-of-the-art natural language processing model, GPT-3, and point it at the code repositories on GitHub (the code-trained descendant of GPT-3 that powers Copilot is called Codex). As a result, we now have a model that has learned from millions of lines of code and can be asked to write code. I am still on the waiting list and haven’t been able to trial it, but the initial reactions from people who have used it are very interesting.

Its ambition is to become a pair programmer, i.e. it watches and corrects as you write code, and you do the same when it writes or completes your code. With GPT-4 anticipated in late 2021, this is set to get even better.

What could this mean for Data Science?

This could lead us to the exact thing that is needed to bring the balance back to data science! Code-writing AI might reduce the emphasis on code and technicalities in data science. As a result, data scientists could go back to the thing they were very good at: “the art”, or the fine balance needed to creatively find a path towards a solution to a problem. In the near future, code-writing AI might be able to take a model developed in the lab and scale it to a real-world application, thereby removing the need for the heavy lifting that deployment requires. Data scientists would then be able to focus more heavily on business and math considerations and solve more complex, domain-specific business problems without worrying too much about shoving the model over the wall and into the real world.

There is of course a lot of scepticism about how well AI can code and I do believe we are not there yet. However, things change fast in the world of AI and this could be one path it might go down in the near future.
