AI replaces Data Analyst?!

Quick EDA using GitHub Copilot

Meg Patakota
Geek Culture
5 min read · Jun 28, 2022

I came across GitHub Copilot a few months ago and I was amazed by this new tool. I quickly joined their waitlist and a few weeks later…I became a member of the GitHub Copilot community!

I realised that for some new coders, it might be slightly overwhelming to use a tool like this. Hence, I am writing this blog post to give a quick overview of Copilot by demonstrating how it can be used for exploratory data analysis (EDA). Hopefully, this will serve as a guide for beginners out there, making their coding experience more exciting!

Note: This article is based on my coding experience with Python in VS Code as a data analyst/data scientist.

  1. Introduction
  2. Overview of the dataset
  3. Further analysis
  4. Conclusion
  5. Pointers

1. Introduction

Let’s begin by understanding a few things about GitHub Copilot. It is a collaborative effort between GitHub and OpenAI. Copilot is powered by a language model trained on billions of lines of code written by human programmers. Thanks to this training, Copilot can generate code in several languages when given instructions written in natural language. For instance, you can type “Write a function to invert a binary search tree”, and it’ll do it for you. Copilot is available as an autocomplete extension for the following editors and IDEs:

Editors and IDEs: VS Code, Neovim, JetBrains IDEs (including PyCharm)

2. Overview of the dataset

This dataset has been downloaded from the UCI Machine Learning Repository. It includes data for the estimation of obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition (Palechor & Manotas, 2019).

I find using Copilot quite straightforward. As you can see below, every time I type something or just go to the next line, Copilot begins suggesting code in grey. About 95% of the time, I simply hit tab or enter and the code gets auto-filled.

Below, you can see me using Copilot to import a dataset and get a quick overview of it. It can also help us make pretty plots. All I did here was type the comment:

I hit tab, then sat back and watched it generate the rest for me, like you’re doing right now ;)
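For readers who cannot see the recording, here is a minimal sketch of the kind of overview code a comment like this tends to produce. It is not the exact code Copilot generated for me: the three-row sample, the column names and the use of `io.StringIO` in place of a real file are all assumptions made so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for the real CSV file.
# The column names and values are illustrative assumptions.
sample_csv = io.StringIO(
    "Age,Height,Weight,fam_his,high_cal\n"
    "21,1.62,64.0,yes,no\n"
    "27,1.80,87.0,no,yes\n"
    "23,1.75,95.5,yes,yes\n"
)

# With the real dataset you would pass the path of your CSV file instead.
df = pd.read_csv(sample_csv)

print(df.head())        # first rows of the table
print(df.shape)         # (number of rows, number of columns)
print(df.dtypes)        # data type of each column
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics for the numeric columns
```

These five calls are usually enough to spot wrong data types, missing values and odd ranges before any plotting starts.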

Copilot doesn’t just have knowledge about programming. It also has general knowledge about the world. In this piece of code below, I typed:

Copilot then did the rest, correctly implementing the BMI formula in code! Beyond this, you can easily plot distribution charts and heat maps using Copilot’s brilliant suggestions. For these I typed:

and,
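As a reference for the BMI step mentioned above, here is a minimal sketch of the formula (weight in kilograms divided by the square of height in metres). The function name `bmi`, the sample rows and the `Weight`/`Height` keys are assumptions for illustration, not the code Copilot wrote for me.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height must be positive")
    return weight_kg / height_m ** 2


# Hypothetical rows (weight in kg, height in m).
people = [
    {"Weight": 64.0, "Height": 1.60},
    {"Weight": 72.0, "Height": 1.80},
]

bmis = [round(bmi(p["Weight"], p["Height"]), 1) for p in people]
print(bmis)

# On a pandas DataFrame with the same two columns, the equivalent one-liner is:
# df["BMI"] = df["Weight"] / df["Height"] ** 2
```

A heat map of the numeric columns is typically drawn from the correlation matrix returned by `df.corr()`, so the new BMI column shows up there as well.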

3. Further analysis

Let’s try using a lambda function to encode some of our categorical variables. Copilot can generate almost-correct code for this. However, I had to make a few tweaks, such as changing “Yes” to “yes”. The question is, is this a problem with Copilot itself, or with the instructions I have given it? With Copilot, there is a sense of needing to learn to communicate with the system to get the best out of it. Perhaps if I had specified how strings should be formatted, it would have done what I wanted.
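To make the “Yes” versus “yes” pitfall concrete, here is a small sketch under assumed data. The sample values and the helper `encode_yes_no` are hypothetical; only the column names `fam_his` and `high_cal` come from this article.

```python
import pandas as pd

# Hypothetical sample with inconsistent spellings of yes/no.
df = pd.DataFrame({
    "fam_his": ["yes", "no", "Yes", " no "],
    "high_cal": ["no", "yes", "yes", "NO"],
})

# A literal comparison is case-sensitive, so "Yes" is silently encoded as 0.
naive = df["fam_his"].apply(lambda v: 1 if v == "yes" else 0)


def encode_yes_no(column: pd.Series) -> pd.Series:
    """Strip whitespace and lower-case the values, then map yes to 1 and anything else to 0."""
    cleaned = column.str.strip().str.lower()
    return cleaned.apply(lambda v: 1 if v == "yes" else 0)


for name in ["fam_his", "high_cal"]:
    df[name + "_enc"] = encode_yes_no(df[name])

print(naive.tolist())
print(df)
```

Normalising the strings before the lambda runs means the encoding no longer depends on how the suggestion happened to spell the category.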

One thing that fascinated me is how sensitive Copilot is to the word “gender”. In my experience, typing “gender” in any cell stops Copilot from providing suggestions! So, I guess you need to avoid the word “gender” when you want Copilot to work.

Now, let’s analyse the relationship between the obesity groups and the high_cal and fam_his variables. Using this comment:

Copilot gives us the code we need. It also gives us code to neatly plot these results. This is similar to the snippet you see below.
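One way such an analysis could look is sketched here, assuming the two variables are already encoded as 0/1. The column name `obesity_group` and the six sample rows are assumptions for illustration; this is not the code Copilot suggested to me.

```python
import pandas as pd

# Hypothetical sample; "obesity_group" is an assumed column name.
df = pd.DataFrame({
    "obesity_group": ["Normal", "Normal", "Overweight", "Obese", "Obese", "Obese"],
    "high_cal": [0, 1, 1, 1, 1, 0],
    "fam_his": [0, 0, 1, 1, 1, 1],
})

# The mean of a 0/1 column is the share of people in each group with that attribute.
shares = df.groupby("obesity_group")[["high_cal", "fam_his"]].mean()
print(shares)

# Raw counts of one variable against the obesity groups.
counts = pd.crosstab(df["obesity_group"], df["high_cal"])
print(counts)
```

If matplotlib is installed, `shares.plot(kind="bar")` turns the first table into a grouped bar plot.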

Bar plots & Scatter plots

At the end of our coding session, we try something less formal. With a comment,

We can see a very colorful and interpretable scatter plot (see above). This shows that a comment that is not very informative can still give us precise results.

4. Conclusion

In all, this is just a tiny glimpse into what Copilot can do. It can generate helpful short snippets of code. It can also generate several lines of good quality code. Of course, it is not perfect, and there is a learning curve in knowing how to write good instructions, but it definitely helped me start somewhere instead of getting stuck with my next steps.

I use Copilot on a daily basis, and it definitely saves me time. I no longer have to spend too much time on Stack Overflow to find answers to minor pandas questions like “how to use .agg on multiple columns”. So, if you think this is the tool for you, welcome to a new coding experience!
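For reference, the “.agg on multiple columns” question mentioned above can be answered with a short sketch like this one. The sample data and the `obesity_group` column name are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({
    "obesity_group": ["Normal", "Normal", "Obese", "Obese"],
    "Age": [21, 25, 30, 40],
    "Weight": [60.0, 64.0, 95.0, 105.0],
})

# A dict maps each column to one aggregation or a list of aggregations.
summary = df.groupby("obesity_group").agg({
    "Age": ["mean", "max"],
    "Weight": "mean",
})
print(summary)

# Named aggregation gives flat, readable column names instead of a MultiIndex.
named = df.groupby("obesity_group").agg(
    mean_age=("Age", "mean"),
    mean_weight=("Weight", "mean"),
)
print(named)
```

The dict form is handy for quick exploration, while named aggregation is easier to work with when the result feeds into further steps.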

5. Pointers

  • I believe Copilot can be a great tool for beginners as long as you make a real effort to actually LEARN the language. Don’t let Copilot become a crutch.
  • Link to download Visual Studio Code: https://code.visualstudio.com. You can add extensions for Python, Jupyter Notebook, Copilot and so on.
  • You can go to GitHub Copilot’s website to learn more about the tool and to get access to Copilot: https://github.com/features/copilot/
  • Copilot was free until a few weeks ago. Now, you get a 60-day free trial. After that, it costs $10 a month or $100 a year per user.

Check out the YouTube link below, if you’d like a slower version of the gifs above. Do subscribe to my Medium and YouTube channel, as your encouragement would definitely help me make more interesting content so we can learn and share together ❤

References:

Palechor & Manotas, 2019, Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico, Data in Brief, Volume 25, 104344, ISSN 2352–3409, https://doi.org/10.1016/j.dib.2019.104344.

Meg Patakota
Data Scientist @Mishcon de Reya & Teaching Assistant @UCL, London. About me: www.linkedin.com/in/megPatakota