DeepSeek-Coder: When the LLM Meets Programming — Better than GPT-3.5?

Aditya Raghuvanshi
7 min read · Mar 29, 2024


The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, the authors introduce the DeepSeek-Coder series, a range of open-source code models from 1.3B to 33B parameters, trained from scratch on 2 trillion tokens.

Their extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5.

Introduction

These models have the potential to automate and streamline many aspects of coding, from bug detection to code generation, thereby enhancing productivity and reducing the likelihood of human error.

Each model in the series has been trained from scratch on 2 trillion tokens sourced from 87 programming languages, ensuring a comprehensive understanding of programming languages and syntax.

Data Collection

The training dataset of DeepSeek-Coder is composed of 87% source code, 10% English code-related natural language corpus, and 3% code-unrelated Chinese natural language corpus. The English corpus consists of materials from GitHub's Markdown and StackExchange, which are used to enhance the model's understanding of code-related concepts and improve its ability to handle tasks like library usage and bug fixing. Meanwhile, the Chinese corpus consists of high-quality articles aimed at improving the model's proficiency in understanding the Chinese language.

This process involves:

  1. Data crawling
  2. Rule-based filtering
  3. Dependency parsing
  4. Repository-level deduplication
  5. Quality screening

The figure above visualizes the data creation process step by step; a minimal sketch of the filtering and deduplication steps is given below.
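As a rough illustration (not the authors' actual pipeline), the following sketch shows what rule-based filtering and exact repository-level deduplication might look like; the thresholds, the `.py`-only glob, and the hashing scheme are all assumptions made for brevity.

```python
import hashlib
from pathlib import Path

def passes_rules(text: str, max_line_len: int = 1000,
                 max_avg_line_len: int = 100, min_alpha_frac: float = 0.25) -> bool:
    # Reject files with very long lines or mostly non-alphabetic content
    # (illustrative thresholds, not the paper's exact rules).
    lines = text.splitlines() or [""]
    avg_len = sum(len(l) for l in lines) / len(lines)
    alpha_frac = sum(c.isalpha() for c in text) / max(1, len(text))
    return (max(len(l) for l in lines) <= max_line_len
            and avg_len <= max_avg_line_len
            and alpha_frac >= min_alpha_frac)

def repo_fingerprint(repo_dir: Path) -> str:
    # Hash the concatenated source files of a repository so that exact
    # duplicate repositories can be dropped (real near-duplicate detection
    # is more involved).
    digest = hashlib.sha256()
    for path in sorted(repo_dir.rglob("*.py")):  # one language for brevity
        digest.update(path.read_bytes())
    return digest.hexdigest()

def deduplicate(repos: list[Path]) -> list[Path]:
    seen, kept = set(), []
    for repo in repos:
        fingerprint = repo_fingerprint(repo)
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(repo)
    return kept
```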

Which languages did they use?

List of languages with their percentage proportions in the training corpus

Training Policy

Next Token Prediction

The first training objective for their model is known as next token prediction. In this process, various files are concatenated to form a fixed-length entry. Then, these entries are used to train the model, enabling it to predict the subsequent token based on the provided context.
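As a minimal sketch of this objective (illustrative, not the paper's implementation), the snippet below packs tokenized files into fixed-length entries and computes the standard shifted cross-entropy loss; the entry length, separator id, and the assumption that `model` returns raw logits are placeholders.

```python
import torch
import torch.nn.functional as F

def pack_into_entries(token_streams, entry_len=4096, sep_id=0):
    # Concatenate tokenized files (separated by an assumed end-of-file id)
    # and slice the resulting stream into fixed-length training entries.
    flat = []
    for stream in token_streams:
        flat.extend(stream + [sep_id])
    chunks = [flat[i:i + entry_len]
              for i in range(0, len(flat) - entry_len + 1, entry_len)]
    return torch.tensor(chunks, dtype=torch.long)    # (num_entries, entry_len)

def next_token_loss(model, entries):
    # Predict token t+1 from tokens up to t: shift inputs and labels by one.
    inputs, labels = entries[:, :-1], entries[:, 1:]
    logits = model(inputs)                           # (batch, entry_len - 1, vocab)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))
```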

Fill-in-the-Middle

The second training objective for their model is known as fill-in-the-middle. In the code pre-training scenario, it is often necessary to generate inserted content based on the preceding context and the text that follows. Due to specific dependencies in a programming language, relying solely on next token prediction is insufficient to learn this fill-in-the-middle capability. Therefore, several approaches propose the Fill-in-the-Middle (FIM) pretraining method.

This approach involves randomly dividing the text into three parts, then shuffling the order of these parts and connecting them with special characters. This method aims to incorporate a fill-in-the-blank pretraining task during the training process.
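A minimal sketch of this transformation is given below, using the common prefix-suffix-middle (PSM) arrangement; the sentinel strings are placeholders rather than the tokenizer's actual special tokens, and the 0.5 rate matches the FIM rate mentioned later in the article.

```python
import random

# Placeholder sentinel strings; the real special tokens are defined by the
# model's tokenizer, not by this sketch.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"

def to_fim_example(document: str, fim_rate: float = 0.5) -> str:
    # With probability fim_rate, split the document at two random cut points
    # and rearrange it as prefix-suffix-middle, so the model learns to
    # generate the middle conditioned on both sides.
    if random.random() >= fim_rate or len(document) < 3:
        return document  # keep as a plain next-token example
    i, j = sorted(random.sample(range(1, len(document)), 2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```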

The effectiveness of using the FIM objective

Tokenizer

For the tokenization process, they employ the HuggingFace Tokenizers library to train Byte Pair Encoding (BPE) tokenizers on a subset of the training corpus. Ultimately, they use a tokenizer configured with a vocabulary size of 32,000.
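A minimal sketch of training such a tokenizer with the HuggingFace `tokenizers` library is shown below; the corpus file paths are placeholders, and only the 32,000 vocabulary size and the <|EOT|> token come from the article.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE tokenizer with a 32,000-token vocabulary.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<|EOT|>"],   # end-of-turn delimiter used later for instruction tuning
)

# Placeholder list of text files sampled from the training corpus.
corpus_files = ["corpus_subset_00.txt", "corpus_subset_01.txt"]
tokenizer.train(files=corpus_files, trainer=trainer)
tokenizer.save("deepseek_coder_bpe.json")
```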

Model Architecture

They develop a range of models with varying parameters to cater to diverse applications, including models with 1.3B, 6.7B, and 33B parameters. These models are built upon the same framework as the DeepSeek Large Language Model (LLM) outlined by DeepSeek-AI (2024). Each model is a decoder-only Transformer, incorporating Rotary Position Embedding (RoPE).

Long Context

To enhance the capabilities of DeepSeek-Coder in handling extended contexts, particularly for scenarios like repository-level code processing, they have reconfigured the RoPE (Su et al., 2023) parameters to extend the default context window.

Theoretically, these modifications enable their model to process up to 64K tokens in context. However, empirical observations suggest that the model delivers its most reliable outputs within a 16K token range.
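As a sketch of the idea (with illustrative defaults, not the exact parameters reported in the paper), one common way to reconfigure RoPE for longer contexts is to scale the positions before computing the rotary angles:

```python
import torch

def rope_tables(head_dim: int, max_pos: int, base: float = 10000.0,
                scaling_factor: float = 1.0):
    # Inverse frequencies for each pair of head dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Linear position scaling: dividing positions by the factor stretches the
    # usable context window beyond the length seen during pre-training.
    positions = torch.arange(max_pos).float() / scaling_factor
    angles = torch.outer(positions, inv_freq)        # (max_pos, head_dim // 2)
    return angles.cos(), angles.sin()

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor):
    # x: (seq_len, head_dim); rotate each consecutive pair of dimensions.
    x1, x2 = x[..., ::2], x[..., 1::2]
    seq_len = x.shape[-2]
    cos, sin = cos[:seq_len], sin[:seq_len]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```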

Instruction Tuning

They develop DeepSeek-Coder-Instruct by enhancing DeepSeek-Coder-Base through instruction-based fine-tuning on high-quality data. This data comprises helpful and impartial human instructions, structured in the Alpaca instruction format (Taori et al., 2023). To demarcate each dialogue turn, they employ a unique delimiter token <|EOT|> to signify the conclusion of each segment. For training, they use a cosine schedule with 100 warm-up steps and an initial learning rate of 1e-5, with a batch size of 4M tokens and 2B tokens in total.
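A minimal sketch of that schedule follows; the 100 warm-up steps and the 1e-5 peak come from the text above, while the total step count and the zero floor are assumptions.

```python
import math

def lr_at_step(step: int, total_steps: int,
               warmup_steps: int = 100, peak_lr: float = 1e-5,
               min_lr: float = 0.0) -> float:
    # Linear warm-up to the peak learning rate, then cosine decay to min_lr.
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# e.g. lr_at_step(0, 10_000) == 1e-7 (warm-up), lr_at_step(99, 10_000) == 1e-5 (peak)
```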

Results

HumanEval and MBPP Benchmarks

To evaluate the model's multilingual capabilities, the authors expanded the Python problems of the HumanEval benchmark to seven additional commonly used programming languages, namely C++, Java, PHP, TypeScript (TS), C#, Bash, and JavaScript (JS).

DS-1000 Benchmark

HumanEval and MBPP have a significant drawback in that they rely heavily on straightforward programming tasks that may not accurately represent the kind of code most programmers typically write. In contrast, the DS-1000 benchmark, as introduced in the work by Lai et al. (2023), offers a comprehensive collection of 1,000 practical and realistic data science workflows across seven different libraries. This benchmark evaluates code generation by executing it against specific test cases.
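As a toy illustration of execution-based evaluation, a generated completion counts as correct only if it passes the benchmark's test cases when run; the candidate solution and tests below are invented for the example, and real harnesses also sandbox and time-limit the execution.

```python
def run_candidate(candidate_src: str, tests: list[str]) -> bool:
    # Execute the generated solution, then run each test assertion against it.
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True

# Hypothetical generated completion and its test cases.
candidate = "def add(a, b):\n    return a + b\n"
tests = ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]
print(run_candidate(candidate, tests))  # True
```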

LeetCode Contest Benchmark

To further validate the model's capability on real-world programming problems, the authors construct the LeetCode Contest benchmark. LeetCode presents competition-level problems, offering significant challenges that test the model's problem understanding and code generation skills.


Fill-in-the-Middle Code Completion

DeepSeek-Coder models are trained with a 0.5 FIM (Fill-in-the-Middle) rate during their pre-training phase. This specialized training strategy empowers the model to proficiently generate code by filling in blanks based on the surrounding context, both the prefix and the suffix of the given code snippet.
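Below is a rough usage sketch with the Hugging Face `transformers` library; the model id is assumed from the DeepSeek-Coder collection, and the sentinel strings are placeholders, so check the model card for the exact FIM tokens and prompt format.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "deepseek-ai/deepseek-coder-1.3b-base"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prefix = "def fibonacci(n):\n    if n < 2:\n        return n\n"
suffix = "\n\nprint(fibonacci(10))\n"
# Placeholder sentinels; the real FIM special tokens are listed on the model card.
prompt = f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
completion = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(completion)  # the model's proposed middle section
```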

Cross-File Code Completion

The authors also evaluate the performance of existing open-source models on cross-file code completion tasks. Unlike the code generation discussed in the previous section, cross-file code completion requires the model to access and understand repositories that span multiple files with numerous cross-file dependencies.

Program-based Math Reasoning

Program-based math reasoning involves evaluating a model’s ability to understand and solve mathematical problems through programming. This type of reasoning is critical in fields such as data analysis and scientific computing.
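As a toy illustration of the idea, the model emits a short program whose execution produces the numeric answer, instead of answering directly in text; the word problem and the "generated" program below are invented for the example.

```python
# Hypothetical word problem: "A crate holds 12 apples. After 5 apples are
# eaten, how many apples remain from 7 crates?" The model answers by
# writing a program rather than a number.
generated_program = """
apples_per_crate = 12
crates = 7
eaten = 5
answer = apples_per_crate * crates - eaten
"""

namespace: dict = {}
exec(generated_program, namespace)   # execute the model-written program
print(namespace["answer"])           # 79
```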

Conclusion

In this technical report, the authors introduce a series of specialized Large Language Models (LLMs) for coding, named DeepSeek-Coder, available in three distinct scales:

  1. 1.3B parameters
  2. 6.7B parameters
  3. 33B parameters

These models are uniquely trained on a meticulously curated project-level code corpus, utilizing a “fill-in-the-blank” pre-training objective to enhance code infilling capabilities. A significant advancement is the extension of the models’ context window to 16,384 tokens, thereby greatly improving their effectiveness in handling extensive code generation tasks. The evaluations reveal that the most advanced model in the series, DeepSeek-Coder-Base 33B, surpasses existing open-source code models across a variety of standard tests.

Impressively, the DeepSeek-Coder-Base 6.7B model, despite its smaller scale, delivers performance on par with the 34B-parameter CodeLlama, a testament to the high quality of the pretraining corpus.

To augment the zero-shot instruction capabilities of the DeepSeek-Coder-Base models, the authors fine-tuned them with high-quality instructional data. This has led to the DeepSeek-Coder-Instruct 33B model outperforming OpenAI’s GPT-3.5 Turbo in a range of coding-related tasks, showcasing its exceptional proficiency in code generation and understanding.

To further improve the natural language understanding capabilities of the DeepSeek-Coder-Base models, they conducted additional pretraining based on the DeepSeek-LLM 7B checkpoint. This additional training involved processing a diverse dataset comprising 2 billion tokens, including natural language, code, and mathematical data.

The result is the creation of a new and improved code model, DeepSeek-Coder-v1.5. The observations indicate that DeepSeek-Coder-v1.5 not only maintains its predecessor’s high-level coding performance but also exhibits enhanced natural language comprehension. This advancement underscores the authors’ belief that the most effective code-focused Large Language Models (LLMs) are those built upon robust general LLMs. The reason is evident: to effectively interpret and execute coding tasks, these models must also possess a deep understanding of human instructions, which often come in various forms of natural language.

Important links

  1. Huggingface
  2. Github
  3. Website
  4. Technical Report

I feel extremely happy sharing all this knowledge, so do let me know if this article has helped you.

Thank you for reading; I hope this article helped you.

Aditya Raghuvanshi ( IIIT Hyderabad, INDIA )

Connect with me on the following:

GitHub | LinkedIn | Medium | Gmail: tanalpha.aditya@gmail.com

