Fine-Tuning StarCoder2 on a Google Colab T4 GPU

Sayan Banerjee · Published in CodeX · 10 min read · May 1, 2024
StarCoder: May the Source be with You

StarCoder is a Code LLM (Large Language Model for code) from the BigCode project, an open scientific collaboration stewarded by ServiceNow and Hugging Face. It was trained on The Stack, a dataset covering more than 600 programming languages. StarCoder2 comes in 3B, 7B, and 15B parameter variants trained on 3.3 to 4.3 trillion tokens. StarCoder2-3B outperforms Code LLMs of similar size. The largest model, StarCoder2-15B, outperforms CodeLlama-34B, a model more than twice its size, and matches or beats DeepSeekCoder-33B, the best-performing model at code completion for high-resource languages, on low-resource languages and on math and code-reasoning benchmarks. In this blog, we will learn to fine-tune the smallest model in the family, StarCoder2-3B, exclusively on the SQL programming language.

Large Language Models for Code have rapidly emerged as powerful assistants for writing and editing code. As of January 30, 2024, GitHub Copilot has garnered over 1.3 million paying subscribers, with over 50,000 organizations opting for the enterprise version; it is estimated to increase developer productivity by up to 56% as well as developer satisfaction. ServiceNow recently disclosed that their “text-to-code” solution, built by fine-tuning StarCoderBase models, resulted in a 52% increase in developer productivity. Code LLMs exhibit the potential to enhance all phases of the software development cycle. The BigCode project was established in September 2022 as an open scientific collaboration focused on the open and responsible development of Code LLMs. BigCode is stewarded by ServiceNow and Hugging Face in the spirit of open governance. The community previously released The Stack v1, a 6.4 TB dataset of permissively licensed source code in 384 programming languages. In December 2022, the BigCode community released SantaCoder, a strong-performing 1.1B parameter model trained on Java, JavaScript, and Python code from The Stack v1. Building upon the success of StarCoderBase, the community further scaled up its effort and released StarCoder2 and The Stack v2 on February 29th, 2024.

● The StarCoder2–3B model outperforms other Code LLMs of similar size (StableCode-3B and DeepSeekCoder-1.3B) on most benchmarks.

● The StarCoder2–15B model significantly outperforms other models of comparable size (CodeLlama-13B), and matches or outperforms CodeLlama-34B. DeepSeekCoder-33B is the best model at code completion benchmarks for high-resource languages. However, StarCoder2–15B matches or outperforms DeepSeekCoder-33B on low-resource programming languages (e.g., D, Julia, Lua, and Perl). Moreover, when we consider benchmarks that require models to reason about code execution or mathematics, we find that StarCoder2–15B outperforms DeepSeekCoder-33B.

● The StarCoder2–7B model outperforms CodeLlama-7B but is behind DeepSeekCoder-6.7B. It is not clear why StarCoder2–7B does not perform as well as StarCoder2–3B and StarCoder2–15B for their size.

What is StarCoder2?

StarCoder2: The Next Generation

StarCoder2 is a family of open LLMs for code and comes in 3 different sizes with 3B, 7B and 15B parameters. The flagship StarCoder2-15B model is trained on over 4 trillion tokens and 600+ programming languages from The Stack v2. All models use Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and were trained using the Fill-in-the-Middle objective (a short FIM prompting sketch appears at the end of this section).

StarCoder2–3B was trained on 17 programming languages from The Stack v2 on 3+ trillion tokens.

StarCoder2–7B was trained on 17 programming languages from The Stack v2 on 3.5+ trillion tokens.

StarCoder2–15B was trained on 600+ programming languages from The Stack v2 on 4+ trillion tokens.

StarCoder2–15B is the best in its size class and matches 33B+ models on many evaluations. StarCoder2–3B matches the performance of StarCoder1–15B:

StarCoder2 being stronger and better
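Because all three models were trained with the Fill-in-the-Middle (FIM) objective, they can complete code given both a prefix and a suffix, not just a left-to-right prompt. Below is a minimal sketch of FIM-style prompting with the base 3B model; the <fim_prefix>/<fim_suffix>/<fim_middle> sentinel tokens follow the StarCoder family's FIM format, so verify them against the tokenizer's special tokens if in doubt.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, torch_dtype=torch.float16, device_map="auto"
)

# Ask the model to fill in the function body, given its signature (prefix)
# and its return statement (suffix).
prompt = (
    "<fim_prefix>def average(numbers):\n    <fim_suffix>\n"
    "    return total / len(numbers)<fim_middle>"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)

# Decode only the newly generated "middle" tokens.
middle = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(middle)  # e.g. "total = sum(numbers)"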

What is The Stack v2?

The Stack v2 is the largest open code dataset suitable for LLM pretraining. The Stack v2 is larger than The Stack v1, follows an improved language- and license-detection procedure, and uses better filtering heuristics. In addition, the training dataset is grouped by repository, allowing models to be trained with repository context.

Stack v2 being stronger and better

This dataset is derived from the Software Heritage archive, the largest public archive of software source code and accompanying development history. Software Heritage, launched by Inria in partnership with UNESCO, is an open, non-profit initiative to collect, preserve, and share the source code of all publicly available software.

Base Model Architecture

Model Architecture details

The architecture of the base model discussed in the paper “StarCoder 2 and The Stack v2: The Next Generation” is designed to enhance code completion tasks and is part of the BigCode project’s efforts to develop Large Language Models for Code (Code LLMs). Here is an overview of the architecture:

● Model Variants: The StarCoder2 models come in different sizes, including 3B, 7B, and 15B parameters, each trained on a vast amount of tokenized data ranging from 3.3 to 4.3 trillion tokens.

● Training Data: The models are trained on a comprehensive dataset roughly 4× larger than the original StarCoder training set. This dataset is carefully curated from various high-quality sources, such as Software Heritage repositories, GitHub pull requests, Kaggle notebooks, and code documentation.

● Performance: The small model, StarCoder2–3B, outperforms other models of similar size on most benchmarks and even surpasses the larger StarCoderBase-15B model. The large model, StarCoder2–15B, significantly outperforms models of comparable size and even matches or outperforms larger models like CodeLlama-34B.

● Evaluation: The models are thoroughly evaluated on a wide range of Code LLM benchmarks, showcasing their capabilities in code completion tasks, math, code reasoning, and performance across different languages, including low-resource languages.

● Transparency and Accessibility: The model weights are made available under an OpenRAIL license, ensuring transparency regarding the training data by releasing Software Heritage persistent Identifiers (SWHIDs) of the source code data.

Training Details of StarCoder2 Base Models
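These architecture details can also be read directly from the model configuration without downloading the weights. Here is a quick sketch using the transformers AutoConfig API; the attribute names follow the StarCoder2 configuration class.

from transformers import AutoConfig

config = AutoConfig.from_pretrained("bigcode/starcoder2-3b")

# Main architecture hyper-parameters of the 3B model
print("layers:          ", config.num_hidden_layers)
print("hidden size:     ", config.hidden_size)
print("attention heads: ", config.num_attention_heads)
print("key/value heads: ", config.num_key_value_heads)  # grouped-query attention
print("sliding window:  ", config.sliding_window)
print("context length:  ", config.max_position_embeddings)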

Purpose of Fine-Tuning

The purpose of fine-tuning StarCoder2 on the SQL language to build a Text2SQL model is to enhance the model's ability to generate SQL queries from natural language prompts. By adapting StarCoder2 through fine-tuning, it becomes specialized in code-completion tasks involving SQL. This allows the model to suggest relevant completions, facilitate data exploration and analysis, and bridge the gap between users and databases by letting non-programmers interact with databases through natural-language questions that are converted into SQL queries. The fine-tuned model aims to make databases more accessible to users without programming expertise and to speed up everyday data-analysis tasks.
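To make the task concrete, here is a hypothetical example of the kind of input/output pair the fine-tuned model should handle; the table and column names are made up purely for illustration.

# Natural-language question (model input)
question = "Which employees in the Sales department earn more than 50000?"

# SQL query the Text2SQL model should produce (model output)
expected_sql = (
    "SELECT name FROM employees "
    "WHERE department = 'Sales' AND salary > 50000;"
)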

Training Procedure

1) Loading the Base Model:

a) Choosing the right model: StarCoder2 offers a range of models to suit your needs:

i) 3 Billion Parameter Model: Lightweight and powerful, ideal for limited computational resources.
ii) 7 Billion Parameter Model: Provides enhanced capabilities for complex projects.
iii) 15 Billion Parameter Model: The ultimate in power for cutting-edge innovation.

Due to limited computational resources, I decided to proceed with the 3 Billion Parameter Model. It was trained on 17 programming languages from The Stack v2. The model uses Grouped Query Attention, a context window of 16,384 tokens with a sliding window attention of 4,096 tokens, and was trained with the Fill-in-the-Middle objective on 3+ trillion tokens. It was chosen for its small size and strong performance relative to the limited resources of a Colab T4 GPU.

from transformers import AutoModelForCausalLM
from accelerate import PartialState

# Load the 3B base model in 4-bit precision (bnb_config is defined in step b below)
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b",
    quantization_config=bnb_config,
    device_map={"": PartialState().process_index},
)

b) Using 4-bit quantization for a reduced memory footprint: at 4 bits, the 3B parameters occupy roughly 1.5-2 GB instead of about 6 GB in fp16, leaving plenty of the T4's 16 GB for activations and optimizer state.

import torch
from transformers import BitsAndBytesConfig

# NF4 quantization with bfloat16 compute for the quantized layers
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

2) Loading the Dataset:

The-stack-smol is a small subset (~0.1%) of The Stack dataset, comprising about 2.6 GB of code that is available under permissive licenses and spans 30 programming languages. This dataset was created as part of the BigCode Project, a collaborative initiative focused on the ethical advancement of Large Language Models for Code (Code LLMs). The Stack serves as a foundational dataset for Code LLMs. For fine-tuning our base model, only the SQL portion of this dataset was used. That split contains about 158 MB of data and around 50,000 examples of SQL code.

i) Load the bigcode/the-stack-smol dataset using the Hugging Face Datasets library.

ii) Filter for the specified subset (data/sql) and split (train).

parser.add_argument("--dataset_name", type=str, default="the-stack-smol")
parser.add_argument("--subset", type=str, default="data/sql")
parser.add_argument("--split", type=str, default="train")
parser.add_argument("--dataset_text_field", type=str, default="content")

3) Preprocess Data:

a) Tokenize the code text using the appropriate tokenizer for the chosen model.

b) Apply necessary cleaning or normalization (e.g., removing comments, handling indentation).

c) Create input examples suitable for the model’s architecture (e.g., fixed-length sequences for the causal language-modeling objective).
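In this setup the SFTTrainer used later tokenizes the text field internally, so no manual preprocessing code is strictly required. The sketch below simply illustrates what the tokenization step looks like, using the base model's tokenizer and a made-up SQL snippet.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")

# Tokenize one SQL snippet the way the trainer would before batching.
example = "SELECT name, salary FROM employees WHERE salary > 50000;"
tokens = tokenizer(example, truncation=True, max_length=512)
print(len(tokens["input_ids"]), "tokens")
print(tokenizer.convert_ids_to_tokens(tokens["input_ids"])[:10])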

4) Defining the Arguments

Set training arguments based on the provided args:

i. learning_rate: 0.0002

ii. train_batch_size: 1

iii. eval_batch_size: 8

iv. seed: 0

v. gradient_accumulation_steps: 4

vi. total_train_batch_size: 4

vii. optimizer: paged 8-bit AdamW (paged_adamw_8bit) with betas=(0.9,0.999) and epsilon=1e-08

viii. lr_scheduler_type: cosine

ix. lr_scheduler_warmup_steps: 100

x. training_steps: 1000

xi. mixed_precision_training: Native AMP

import argparse

def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="bigcode/starcoder2-3b")
    parser.add_argument("--dataset_name", type=str, default="the-stack-smol")
    parser.add_argument("--subset", type=str, default="data/sql")
    parser.add_argument("--split", type=str, default="train")
    parser.add_argument("--dataset_text_field", type=str, default="content")

    parser.add_argument("--max_seq_length", type=int, default=512)
    parser.add_argument("--max_steps", type=int, default=1000)
    parser.add_argument("--micro_batch_size", type=int, default=1)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=4)
    parser.add_argument("--weight_decay", type=float, default=0.01)
    parser.add_argument("--bf16", type=bool, default=True)

    parser.add_argument("--attention_dropout", type=float, default=0.1)
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--lr_scheduler_type", type=str, default="cosine")
    parser.add_argument("--warmup_steps", type=int, default=100)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--output_dir", type=str, default="finetune_starcoder2-3b")
    parser.add_argument("--num_proc", type=int, default=None)
    parser.add_argument("--push_to_hub", type=bool, default=True)

    # Parse an empty argument list so this also works inside a notebook
    args = parser.parse_args([])
    return args

5) Setting up LoRA:
LoRA (Low-Rank Adaptation) is a technique used to fine-tune large language models (LLMs) for specific tasks. It works by making small adjustments to key parts of the LLM, like the attention mechanism, instead of retraining the entire model. This allows for improved performance on a particular task while keeping the model efficient.
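To see why this is efficient, consider the number of parameters LoRA actually adds. For a d×d weight matrix, full fine-tuning updates d² parameters, while LoRA with rank r only trains two small factors of shapes d×r and r×d. A quick back-of-the-envelope calculation, assuming d = 3072 (roughly the hidden size of a 3B-class model):

# Parameters updated for one square projection of size d x d
d, r = 3072, 8

full_finetune_params = d * d       # update the whole weight matrix
lora_params = d * r + r * d        # only train the low-rank factors A and B

print(f"full: {full_finetune_params:,}")   # 9,437,184
print(f"lora: {lora_params:,}")            # 49,152
print(f"reduction: {full_finetune_params / lora_params:.0f}x")  # 192x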

from peft import LoraConfig

# Attach rank-8 LoRA adapters to the attention and feed-forward projections
lora_config = LoraConfig(
    r=8,
    target_modules=[
        "q_proj",
        "o_proj",
        "k_proj",
        "v_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    task_type="CAUSAL_LM",
)

i) lora_config = LoraConfig(…): This configuration object will hold information on how to apply LoRA to the fine-tuning process.

ii) r=8: This parameter sets the rank for LoRA. The rank controls the size and complexity of the additional parameters introduced by LoRA.
target_modules: the layers to which the LoRA adapters are attached:

ii.a) “q_proj”: Projection for query vectors in the attention layer.

ii.b) “o_proj”: Projection for output vectors in the attention layer.

ii.c) “k_proj”: Projection for key vectors in the attention layer.

ii.d) “v_proj”: Projection for value vectors in the attention layer.

ii.e) “gate_proj”: Projection for the gate in the feed-forward network.

ii.f) “up_proj”: Projection that expands the hidden state up to the feed-forward network’s larger intermediate dimension.

ii.g) “down_proj”: Projection that maps the intermediate representation back down to the model’s hidden dimension.

iii) These modules are all related to the attention mechanism and feed-forward network, which are crucial parts of the LLM for processing information and generating text.

iv) task_type="CAUSAL_LM": This parameter specifies the task type for which the LLM is being fine-tuned. Here, "CAUSAL_LM" indicates a Causal Language Modeling task, where the model predicts the next token in a sequence based on the previous tokens.
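As an optional sanity check (not part of the training script), you can wrap the quantized model with this configuration and confirm that only a tiny fraction of the parameters are trainable, using the standard peft API. Note that the SFTTrainer below applies the same configuration itself via peft_config, so this wrapped copy is only for inspection.

from peft import get_peft_model

peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: < 1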

6) Setting the Trainer using Supervised Fine-tuning Trainer (SFTTrainer):

import transformers
from trl import SFTTrainer

# setup the trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=data,
    max_seq_length=args.max_seq_length,
    args=transformers.TrainingArguments(
        per_device_train_batch_size=args.micro_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        warmup_steps=args.warmup_steps,
        max_steps=args.max_steps,
        learning_rate=args.learning_rate,
        lr_scheduler_type=args.lr_scheduler_type,
        weight_decay=args.weight_decay,
        bf16=args.bf16,
        logging_strategy="steps",
        logging_steps=10,
        output_dir=args.output_dir,
        optim="paged_adamw_8bit",
        seed=args.seed,
        run_name=f"train-{args.model_id.split('/')[-1]}",
        report_to="wandb",
    ),
    peft_config=lora_config,
    dataset_text_field=args.dataset_text_field,
)
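Because the trainer reports to Weights & Biases and push_to_hub is enabled, the Colab session needs to be authenticated with both services before training starts. A short sketch using the standard client libraries:

import wandb
from huggingface_hub import notebook_login

# Log in to Weights & Biases (prompts for an API key in Colab)
wandb.login()

# Log in to the Hugging Face Hub so the adapter can be pushed later
notebook_login()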

7) Start Training:

import os

# launch
print("Training...")
trainer.train()

print("Saving the last checkpoint of the model")
model.save_pretrained(os.path.join(args.output_dir, "final_checkpoint/"))

8) Pushing it to the Hugging Face Hub

if args.push_to_hub:
    trainer.push_to_hub("Upload model")

9) Building an Interface

After pushing our fine-tuned model, let’s build an interface.

a) Load the saved model from the Hugging Face Hub

import torch
from peft import PeftConfig, PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the 4-bit base model, then attach the fine-tuned LoRA adapter
config = PeftConfig.from_pretrained("Sayan18/finetune_starcoder2")
base_model = "bigcode/starcoder2-3b"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    load_in_4bit=True,
    torch_dtype=torch.float16,
    device_map="cuda",
)
model = PeftModel.from_pretrained(model, "Sayan18/finetune_starcoder2")
tokenizer = AutoTokenizer.from_pretrained("Sayan18/finetune_starcoder2")

Here, replace Sayan18/finetune_starcoder2 with the name of your own Hugging Face repository.

b) Design the Interface

For the end users of our model, we will build an interface around the saved, fine-tuned StarCoder2 model using Gradio.

import gradio as gr

def generate_sql(question):
    # Wrap the user's question in the instruction-style prompt used for fine-tuning
    eval_prompt = f"""You are a powerful text-to-SQL model. Your job is to answer questions about a database. You are given a question; convert it into a SQL query.

You must output the SQL query that answers the question.
### Input:
{question}

### Response:
"""
    model_input = tokenizer(eval_prompt, return_tensors="pt").to("cuda")

    model.eval()
    with torch.no_grad():
        output = model.generate(
            **model_input,
            max_length=300,
            eos_token_id=tokenizer.eos_token_id,
        )
    response = tokenizer.decode(output[0], skip_special_tokens=True)
    # Return only the text generated after the "### Response:" marker
    return response.split("### Response:")[-1].strip()

iface = gr.Interface(
    fn=generate_sql,
    inputs=[gr.Textbox(label="Enter a question about the database")],
    outputs="text",
    title="Text-to-SQL Generator",
    description="Ask a question in natural language, and I'll generate the corresponding SQL query.",
)

iface.launch(debug=True)
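Before launching the UI, you can also call the function directly to confirm that generation works end to end; the question below is just a hypothetical example.

print(generate_sql("List the names of employees who earn more than 50000."))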

Result Analysis

a) Training Results:

b) GPU Power Usage

For a more detailed result analysis, visit my Weights & Biases report.

Wrapping Up

The development and fine-tuning of this Text-to-SQL model represent a significant advancement in bridging the gap between users and databases. By taking the base model, bigcode/starcoder2-3b, and specializing it for SQL code-completion tasks using the SQL split of the bigcode/the-stack-smol dataset, this project has produced a tool that allows non-programmers to interact with databases through natural-language questions converted into SQL queries.

In conclusion, the fine-tuned Text-to-SQL model represents a significant step forward in democratizing database interactions and data-analysis tasks. The project’s success in adapting the base model to SQL-specific tasks underscores the potential of natural language processing models to simplify complex work and empower users with varying levels of technical expertise.

For more details please have a look at my Hugging Face Space: Sayan18/Finetune_StarCoder2.
