Monster API’s Instruction Synthesizer API: A New Frontier in Language Model Development

Raagulbharatwaj K
5 min read · Jul 11, 2024

The recent advancements in language models (LMs) have been largely driven by unsupervised multitask pre-training. However, the potential of supervised multitask learning, particularly in the post-training phase, cannot be overlooked. This method has shown promising results in enhancing model generalization. In a groundbreaking paper, researchers have introduced a novel framework called Instruction Pre-Training, which aims to elevate language model pre-training by integrating instruction-response pairs into the learning process.

The Instruction Synthesizer is a critical component in the process of Instruction Pre-Training, designed to generate diverse and high-quality instruction-response pairs from raw corpora. This synthesizer is built on open-source models, making it a cost-effective solution compared to large, closed-source models. By converting a wide range of existing datasets into the required format, the synthesizer can produce instruction-response pairs that are used to augment the raw text. This augmentation significantly enhances the pre-training process of language models, allowing them to achieve better generalization and performance.

One of the key features of the Instruction Synthesizer is its ability to generalize to unseen data, thanks to its fine-tuning on a diverse set of tasks. The tuning data includes contexts from various domains, such as encyclopedias, social media, and academic tests, covering tasks like commonsense reasoning and sentiment analysis. This high diversity ensures that the synthesizer can create relevant and accurate instruction-response pairs for any given raw text. The result is a scalable and efficient method for enhancing the pre-training of language models, which has been demonstrated to improve the performance of models like Llama3-8B to levels comparable to or even surpassing much larger models.

The Instruction Synthesizer offers a transformative approach for enhancing the pre-training of large language models (LLMs), making it an invaluable tool for researchers and developers. One of the most compelling reasons to use the Instruction Synthesizer is its demonstrated ability to improve model performance significantly. In experiments, synthesizing 200 million instruction-response pairs across more than 40 task categories resulted in substantial gains in both generalization and task-specific performance. Models pre-trained with these synthesized pairs not only showed enhanced capabilities from the outset but also responded better to further instruction tuning, proving the robustness and adaptability of the pre-training process.

Results obtained on pre-training from scratch

Moreover, the effectiveness of the Instruction Synthesizer in continual pre-training scenarios is particularly noteworthy. For instance, smaller models like Llama3-8B, when augmented with instruction-response pairs generated by the synthesizer, achieved performance levels comparable to or even exceeding those of much larger models like Llama3-70B. This implies that using the Instruction Synthesizer can lead to more efficient utilization of computational resources, enabling smaller models to achieve high performance without the need for extensive scaling. This efficiency is crucial for practical applications where computational resources and time are limited. By leveraging the Instruction Synthesizer, developers can create highly capable language models with improved generalization, adaptability, and performance, all while optimizing resource use.

Results obtained on continual pre-training

Now, with the deployment of Monster API’s Instruction Synthesizer API, users can effortlessly generate their own instruction-response datasets. This service leverages the instruction synthesizer model to create datasets suitable for both instruction pre-training and instruction fine-tuning, simplifying the process significantly. This no-code solution spares users from the laborious tasks of data scraping and running a language model locally. Additionally, it offers substantial cost savings, generating labeled datasets at a fraction of the cost compared to models like GPT and Claude.

MonsterAPI is an easy-to-use, cost-effective LLMOps platform designed for developers to quickly fine-tune, evaluate, and deploy Large Language Models for their business applications. With its robust LLMOps APIs, you can quickly launch workloads orchestrated on a low-cost GPU cloud with built-in optimizations. In this blog, we’ll walk you through a simple example of how to use Monster API’s Instruction Synthesizer API.

Let’s start by converting a PDF document into an instruction-response dataset. First, we install the dependencies needed to chunk the PDF and convert it into a Hugging Face dataset.

!pip install langchain langchain_community pypdf
!pip install monsterapi
!pip install huggingface-hub datasets

Once our dependencies are installed, we can convert our PDF into chunks.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF and split it into 1,000-character chunks with no overlap.
loader = PyPDFLoader("<PATH TO YOUR PDF>")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
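
Before building the dataset, it is worth a quick sanity check on the chunks. This snippet is just an optional inspection step, not part of the pipeline itself:

# Inspect how many chunks were produced and preview the first one.
print(f"Number of chunks: {len(texts)}")
print(texts[0].page_content[:300])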

Once we have split the PDF into chunks, we can use them to create our Hugging Face dataset. Remember to create a Hugging Face API key with write permission before proceeding.

import datasets
from huggingface_hub import notebook_login

# Build a Hugging Face dataset with a single "text" column holding the chunks.
chunks = [text.page_content for text in texts]
dataset_dict = {
    "text": chunks
}
hf_dataset = datasets.Dataset.from_dict(dataset_dict)

# Authenticate with your write-enabled Hugging Face token, then push the dataset.
notebook_login()
hf_dataset.push_to_hub("<YOUR INPUT DATASET NAME>")

Once this simple step is done, we can call Monster API’s Instruction Dataset creation API as follows:

import requests

url = "https://api.monsterapi.ai/v1/deploy/instruction-synthesizer"

# Job configuration: point the synthesizer at your input dataset and tell it
# where to write the generated instruction-response pairs.
payload = {
    "model_name": "instruction-pretrain/instruction-synthesizer",
    "temperature": 0,
    "max_tokens": 400,
    "batch_size": 2,
    "seed": 42,
    "input_dataset_name": "<INPUT DATASET PATH>(For example: RaagulQB/quantum-field-theory)",
    "output_dataset_name": "<OUTPUT DATASET PATH>(For example: RaagulQB/quantum-field-theory-instruct)",
    "hf_token": "<HF TOKEN>"
}
headers = {
    "accept": "application/json",
    "content-type": "application/json",
    "authorization": "Bearer <MONSTER API TOKEN>"
}

response = requests.post(url, json=payload, headers=headers)
print(response.text)

Once the job is deployed you will see a success message like the one below:

{
  "message": "Deployment Launched",
  "servingParams": {
    "model_name": "instruction-pretrain/instruction-synthesizer",
    "temperature": 0.0,
    "max_tokens": 400,
    "input_dataset_name": "RaagulQB/quantum-field-theory",
    "output_dataset_name": "RaagulQB/quantum-field-theory-instruct",
    "batch_size": 2,
    "seed": 42,
    "deployment_id": "4f42fcd1-b8a2-41a6-bdbb-cb4f210a0f85"
  },
  "deployment_id": "4f42fcd1-b8a2-41a6-bdbb-cb4f210a0f85"
}
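
Once the job completes, the synthesized instruction-response pairs will be available in the output dataset you specified. As a quick check, you can load it back with the datasets library. The dataset path below is the example output name from above; the exact splits and column names depend on the synthesizer’s output schema:

from datasets import load_dataset

# Load the generated instruction dataset from the Hugging Face Hub.
instruct_dataset = load_dataset("RaagulQB/quantum-field-theory-instruct")
print(instruct_dataset)  # inspect the splits and column names it contains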

Get started with the Instruction Synthesizer API by signing up on MonsterAPI and receive 5,000 free credits to explore the LLMOps pipeline, from dataset creation to fine-tuning and deployment of LLMs.

Important points to note:

  1. Make sure you have enough credits in your account before launching the job.
  2. Create an empty dataset under your output dataset’s name on Hugging Face before starting the job.
  3. You can skip the chunking step entirely and generate a supervised dataset from an existing dataset; just make sure the column containing your text chunks is named “text” (see the sketch below).
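
For example, if your source material already lives on the Hugging Face Hub under a different column name, renaming that column is all the preparation you need. Here is a minimal sketch, where the dataset path and column name are placeholders you would replace with your own:

from datasets import load_dataset

# Load an existing dataset and rename its content column to "text",
# the column name the Instruction Synthesizer API expects.
existing = load_dataset("<YOUR EXISTING DATASET>", split="train")
existing = existing.rename_column("<YOUR TEXT COLUMN>", "text")

# Push the prepared dataset back to the Hub for use as input_dataset_name.
existing.push_to_hub("<YOUR INPUT DATASET NAME>")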
