A detailed explanation of the coding capabilities of ChatGPT and GitHub Copilot And how they work .

Abdullah Afify
14 min readOct 6, 2023

In the name of Allah, we begin a series of articles discussing Artificial Intelligence (AI) under the title “! Back in my day we had jobs.” The first article is titled “! When code writes code.” If you like the article, please consider reposting it. It may be a bit long, but it is informative and enjoyable, God willing.

We will discuss the article using the following scheme:

  1. General Introduction
  2. Introduction to the Tools
  3. How They Work
  4. Data Used
  5. Current Capabilities
  6. Limitations and Errors
  7. The Future
  8. Does It Replace Programmers…?
  9. Ownership Rights
  10. Conclusion

👨 💻 General Introduction

We have all noticed the entry of Artificial Intelligence into various aspects of life and most professions that use computers. What if Artificial Intelligence (code) is used to create Artificial Intelligence itself (code is also intelligence)? As is clear from the article’s title, we will discuss Artificial Intelligence capable of producing, analyzing, modifying, and converting code from one language to another, from one framework to another, and more, in detail.

The tools I will rely on for explanation and examples are the two most famous tools that excel in this task, despite their differences in purpose and means:

  1. ChatGPT from OpenAI
  2. Copilot from GitHub (Microsoft) & OpenAI

This will allow us to understand the capabilities and reach of OpenAI, the company that dominates most of the current Artificial Intelligence technologies.

As for ChatGPT, I will discuss its role in handling codes and other features and capabilities in a forthcoming article, God willing.

For your information, you can use ChatGPT’s capabilities in code within the VS Code editor through the “codGPT” extension, which is designed for processing and analyzing code based on ChatGPT’s capabilities and OpenAI services.

👨 💻 Introduction to the Tools

  1. GitHub Copilot 🤖

GitHub Copilot is a cloud-based artificial intelligence tool developed by GitHub and OpenAI to assist programmers and users of Integrated Development Environments (IDEs) such as Visual Studio, VS Code, JetBrains, and others. It does this by suggesting modifications and additions to code while it’s being written and by converting natural language comments into code.

The service is currently available for a subscription fee and works efficiently. Its performance varies depending on the programming language, with full support for languages like Python, JavaScript, TypeScript, Go, and others. The technology relies on the capabilities of the Codex Model, which can generate code solutions, understand natural languages, and convert them into code by writing comments in English that describe the code.

It includes an autocomplete feature with an accuracy ranging from 43% to 57%, depending on the code it encounters in the project and the number of attempts made. The more attempts, the higher the accuracy.

2. ChatGPT (codeGPT) 🤖

ChatGPT (codeGPT) is a cloud-based chatbot powered by artificial intelligence, launched by OpenAI and built on top of the OpenAI’s GPT-3 family. It offers various services, including assistance in code production, understanding, analysis, and other processes that help programmers in general.

It represents an evolution from its predecessor, InstructGPT, with improved content quality and reduced harmful and deceitful responses. In just over a month, it has garnered more than 100 million users, making it one of the fastest consumer applications to reach this milestone in such a short period, not exceeding two months.

👨 💻 How They Work

If I were to explain how they work in precise detail, it would require more than a complete article for the subject. However, I will summarize the explanation into three key points that help us understand the fundamental principles that enable us to obtain this type of AI. I will explain each one briefly and simplistically, God willing.

This type of AI primarily relies on three main technologies. I will explain each of them briefly, followed by a summary of how they are used together to efficiently produce AI.

  1. Natural Language Processing (NLP): NLP is responsible for enabling computers to understand human language. It aims to build machines capable of comprehending and responding to text and voice data, which is often converted to text. NLP technologies power various tools we use in our daily lives, from search engines to language translators to AI systems like ChatGPT and GitHub Copilot. To achieve this, NLP involves complex processes on text data, such as lemmatization, morphological segmentation, POS tagging, stemming, and more. One crucial advancement in NLP is the Transformer architecture, which can handle long text sequences accurately by using self-attention mechanisms to focus on specific parts of the text while considering the overall context.
  2. Supervised Learning in AI: Supervised learning is a branch of machine learning used to solve problems where data is available and consists of labeled examples. Each data point includes features and corresponding labels. This approach involves building a function that maps feature vectors (inputs) to output labels based on the input-output pairs present in the dataset. Fine-tuning is a critical step in supervised learning. Instead of starting from scratch for each new task or dataset, fine-tuning allows reusing pre-existing models, which significantly speeds up the process.
  3. Reinforcement Learning: Reinforcement learning is a more complex branch of AI that aims to achieve true intelligence. It operates through a reward mechanism, where a specialist creates a reward algorithm to provide feedback to the machine at each step toward the correct solution. The machine learns by receiving positive rewards for moving closer to the correct answer and negative rewards for moving away from it. The key challenge is to design accurate reward functions tailored to specific tasks, making reinforcement learning a complex and expert-driven field.

In summary, this type of AI combines NLP, supervised learning, and reinforcement learning technologies to generate sophisticated AI models. It starts with a baseline model fine-tuned through supervised learning, followed by the creation of a reward model based on human preferences. This reward model is constructed by comparing outputs from multiple baseline models and is used to train a policy model through proximal policy optimization. This process is iterative and helps align the AI’s responses with human-like language, allowing the AI to understand context, improve its responses, and adapt to various tasks. With human trainers, supervised learning, reinforcement learning techniques, and some programming skills, this AI produces highly capable models that combine the strengths of various models and technologies.

👨 💻 Data Used

Regarding GitHub Copilot, it was trained on snippets of the English language, public GitHub repositories, and any publicly available source code. Additionally, it was trained on a filtered dataset of around 159 gigabytes of Python code extracted from 54 million public GitHub repositories. In essence, it has been trained on millions, if not billions, of lines of code from various programming languages that are publicly available online, whether on GitHub or elsewhere.

As for ChatGPT, data for it was collected from all over the internet, including social media, Wikipedia, forums, comments, and virtually everything on the internet. This data is obtained through web scraping on a massive scale, to the extent that it can be considered a snapshot of nearly all the data available on the internet.

There’s a project called “Common Crawl” that has been working on this, collecting data since 2008. The volume of data collected by Common Crawl is measured in petabytes, and interestingly, this data is available for free, and you can access a portion of it at no cost.

👨 💻 Current Capabilities

I’ll list the key features related to coding and programming in bullet points:

✔️ It can assist in efficiently and quickly writing algorithms and complete functions.

✔️ It can provide quick code snippets for specific tasks while you are coding.

✔️ It can help troubleshoot your code, identify issues, remove redundant code, fix errors, and improve overall accuracy.

✔️ It can assist you in learning programming and coding by explaining code in detail and providing comments not only on each function but also on individual lines of code. You can ask it about any programming technique, and it will explain it with detailed examples.

✔️ It can generate comprehensive documentation for your code and assist in creating a README file.

✔️ It can help with code refactoring. However, there are limitations, and I will discuss them later.

✔️ It can write unit tests for the software you are working on, saving you a lot of time and effort.

✔️ It can learn from your code and provide code that aligns with your coding style and the nature of your project.

✔️ It can translate code from one programming language to another and from one framework to another.

✔️ One of the most valuable aspects of these tools, in my personal opinion, is the feeling that you have a partner in the programming process. You’re not working alone, and the tool can easily understand all parts of the code, identify issues, and tell you what libraries should be installed and what they depend on. All this is done smoothly without the need to constantly search on Google, Stack Overflow, or even GitHub. All the information you need is within these AI tools. Moreover, they understand your specific project and the errors you face, helping you write code that significantly saves you time.

👨 💻 Limitations and Errors

Just like this AI represents a significant advancement with many advantages, it also comes with several challenges that anyone relying on these tools for software development may encounter. I will discuss some of the most important issues, both those faced by most people and those I have personally experienced during my interaction with these tools.

❌ These tools do not work in real-time and require periodic updates with new data. This means that any new tool or algorithm introduced after the data these models were trained on may not be usable or modifiable because they simply have no knowledge of it.

❌ This issue is related to data filtering that feeds the model. After using these tools for some time, it became apparent that if a specific library is no longer in use, its installation method changes, or its dependencies are updated, the tools may continue to suggest outdated, ineffective solutions and use them in examples. There are two main cases for this problem after extensive testing. The first case is when the new information about the update has not yet been incorporated into the tool’s knowledge base, causing it not to recognize the information. This is similar to problem number 1. The second case is when the tool has both the old and new information but lacks a priority index to choose between them. As a result, it might suggest the old and the new methods interchangeably, causing confusion. Whether you ask it again or tell it that the old method didn’t work, it may switch to the new method, but it doesn’t take into account the historical data effectively.

❌ Limited ability to handle complex and lengthy code. Sometimes, when dealing with long and complex code, these tools may not handle it properly. Errors may appear, or they may not provide relevant responses, no matter how you rephrase your question. Their processing capabilities are limited, even if they are quite substantial.

❌ This issue is related to session storage. Ideally, ChatGPT should remember all the information you’ve entered within a session, just like natural human conversations. However, I’ve noticed that if the session continues for too long, it starts forgetting information entered at the beginning of the session. It only seems to focus on the most recent information. This may be related to session cache management, even though it doesn’t provide any warning or message indicating that the session has exceeded a certain limit.

❌ These tools do not work uniformly with all programming technologies or languages. They may provide solutions with errors or bugs that can, in some cases, slow down your work, contrary to what is expected.

❌ They do not execute code. Whether it’s GitHub Copilot or ChatGPT, neither of them acts as an integrated development environment (IDE) that can fully run and test your code like coding platforms such as Codeforces. They apply specific rules to your code, help you correct it, or make changes, essentially providing debugging assistance but not full code execution or testing capabilities.

❌ Requires extensive text input. Since ChatGPT only understands text, you often need to write detailed paragraphs to convey exactly what you want. This can be tedious, especially if it doesn’t grasp your request on the first try, requiring you to input the same information multiple times.

❌ Responses are limited to text and code. Sometimes, these tools don’t provide a sufficiently detailed response. ChatGPT, for example, is programmed to reward the machine when the user likes the answer. Sometimes, it provides detailed responses, while other times, it offers shorter ones. This can be problematic when a detailed answer is required but the tool offers a brief response, or vice versa.

❌ GitHub Copilot’s functioning is not effective when starting a project and relying on your own coding style, naming conventions, and styling. The model needs to see your code first to understand your coding style and approach, making it more of an aid to your natural coding process by suggesting effective solutions during code writing or responding to direct natural language comments.

❌ Research has indicated that the code generated by these tools can be accurate but may lack the necessary security measures. Security is not always a top priority for these tools, which can be a concern.

These are some of the limitations of AI tools of this type. It’s important to note that not all of these limitations apply to every tool in the same way, but they give an overview of how these tools are not perfect and come with various shortcomings despite their remarkable features and capabilities.

👨 💻 The Future

In this section, we’ll discuss the near future of this type of AI, which could unfold within a year or so.

First and foremost, these services will come at a financial cost. For example, GitHub Copilot is already a paid service, but it offers a two-month free trial, and if you’re a student with a registered educational email, you can use it for free. Similarly, the more advanced version of ChatGPT (ChatGPT Plus) is a paid service that offers full support, updates, and faster response times.

We are heading towards a future where software and internet services will mostly be offered through paid subscriptions. The era of free services in exchange for data is coming to an end. Tech giants like Google, Apple, and their investors will rely heavily on paid services. While this concept is better in principle, it may be less favorable in practice.

Now, let’s talk about advancements. ChatGPT receives regular updates, and you can check its version under the input section. With each update, it gains new information, corrections for issues reported by developers, and enhanced capabilities. GitHub Copilot, for instance, will add code refactoring capabilities in an upcoming update, addressing user demands. With every iteration and each use of these tools, they become smarter, more capable of providing assistance, understanding, and analysis.

To illustrate the scale of progress in the next major update of ChatGPT (GPT-4), consider that while GPT-3 was trained on 175 billion parameters, GPT-4 has been trained on about 100 trillion parameters. This represents a massive leap, approximately 570 times the capacity of the original model. By the way, you can currently access GPT-4’s capabilities through the paid ChatGPT Plus service.

Furthermore, tech giants like Google are entering the competition with services like Bard, which will compete with ChatGPT and others. Every company with the ability to develop this kind of AI will strive to claim its share of the market. This situation is reminiscent of the dot-com bubble that occurred with the advent of the internet in the early 1990s.

Due to fierce competition, Microsoft has announced the integration of its services, including the Edge browser, Office suite, and Bing search engine, with ChatGPT capabilities. This integration represents a significant advancement beyond ChatGPT itself.

Many new tools will emerge in the coming period, and some have already started to appear. They rely on the capabilities of powerful tools like ChatGPT and GitHub Copilot to offer specific services quickly, accurately, and creatively. This trend is especially noticeable after the availability of the OpenAI API, which allows companies to incorporate ChatGPT services into their offerings.

By the way, GitHub Copilot recently released an update called Copilot X, which incorporates the capabilities of GPT-4. You can see what it can do from there. ChatGPT has also recently added two impressive features. The first is the integrated IDE (Integrated Development Environment), although it currently supports only Python. Imagine how far its capabilities have come in such a short time. The second is the addition of information sources that ChatGPT suggests to you. You can explore these sources yourself. These two features are part of the paid subscription.

👨 💻 Does It Replace Programmers…!?

In short, no, it doesn’t replace them.

In reality, after my experience with these tools for a sufficient period, I’m convinced that only programmers can effectively use them to produce code. For instance, a programmer might need a specific data structure for a project and be unsure of the best-suited data structure for the task. AI can assist in selecting the optimal choice or suggest the most efficient algorithm to use based on the written code. However, this assistance won’t be effective if the person doesn’t even understand what a data structure is or the concepts of space and time complexity. In such cases, it means that the programmer’s skills in terms of speed, efficiency, and organization are not replaceable.

However, there are two key points to consider:

  1. In companies with a high concentration of programmers, it might lead to a reduced need for some programmers. When AI is present, even a small number of programmers can accomplish a lot in less time and with greater efficiency.
  2. The possibility of entirely replacing programmers arises when software development becomes self-contained. If AI can choose colors, design, UI/UX for the front-end, select database types, create schemes, choose APIs, and handle back-end aspects more efficiently and effectively according to all programming standards, it could minimize the programmer’s role to merely pushing buttons here and there.

But let’s not get ahead of ourselves; we’re not there yet. AI in this form and quality doesn’t exist currently, and a tool that combines all these functionalities doesn’t exist either. The demand for website design and development, especially for front-end development, remains strong, with new frameworks emerging regularly. This shows that human creativity and artistic sensibility are still highly valued.

In the near future, the demand for programmers who can use artificial intelligence tools proficiently will likely increase compared to those who have no knowledge of these tools. AI will become just another tool used by developers, but it will be a tool with more advanced capabilities and quality.

👨 💻 Intellectual Property Rights (Ownership)

About a year ago, when GitHub Copilot was first introduced as a premium service with a monthly fee, it sparked significant controversy among developers. Many developers who had open-sourced their code on GitHub, which was used in training this tool, were not compensated. There were boycott campaigns, legal issues, and other protests. However, no clear resolution has been reached in this matter so far, especially after the emergence of ChatGPT and other similar tools. Currently, there is no single company or tool being boycotted; rather, new tools that are making breakthroughs are challenging the status quo. It’s difficult to stand against such a tide, and no laws currently prohibit or regulate this practice.

👨 💻 Conclusion

We are currently in the era of technological breakthroughs happening in a very short span of time. It’s essential for us to form our own impressions about each aspect of it and strive to be a part of this evolution. If we don’t actively contribute to shaping the future, we may find ourselves living in a future that doesn’t meet our expectations, crafted by others, as is the case now. So, it’s important for everyone to get involved, try things firsthand, and develop their own insights.

Stay tuned for the upcoming article soon, God willing,

titled ‘When AI Gets Creative!’

#Back_In_My_Day_We_Had_Jobs

#When_Code_Writes_Code

#chatGPT #copilot #AI #NLP #Reinforcement_Learning

#machine_learning #supervised_learning #coding

--

--

Abdullah Afify

Passionate about science & tech, studied programming, bioinformatics in college. Experienced in data science (NLP, CV) & web dev (front-end, back-end)