How I created an instruction dataset using GPT 3.5 to fine-tune Llama 2 for news classification

Using an LLM to fine-tune an LLM

Kshitiz Sahay
15 min read · Jul 30, 2023
Source: seo.ai

News articles play a pivotal role in machine learning research for several reasons. They contain a wealth of information, covering a wide range of topics like politics, economics, technology, and more. Moreover, they often contain complex language constructs, including metaphors, analogies, and domain-specific terminology. This diverse and rich textual data serves as an excellent resource for training and evaluating machine learning models in both research and industry, helping advance natural language understanding and related domains.

With the diverse applications of news articles in machine learning research, from sentiment analysis to text summarization, it becomes crucial to systematically classify them into distinct categories. Not only does it help organize and structure this vast amount of data, but it also allows users to quickly access relevant news based on their research or business use case. Whether building sentiment analysis models for cryptocurrency or stock market news or conducting research in any other domain, having a well-categorized dataset is fundamental for building accurate and effective machine learning models.

However, curating such a dataset manually or through keyword searches can be laborious and imprecise. In this blog, we will explore an innovative solution to this challenge: easily and efficiently creating a labeled dataset, specifically an instruction dataset, which we can then use to fine-tune or instruction-tune Meta's recently launched Llama 2, a powerful open-source Large Language Model (LLM), for the news classification task.

An instruction dataset could be created in one of the following ways:

  1. Use an existing dataset and convert it into an instruction dataset.
  2. Use existing LLMs to create an instruction dataset.
  3. Manually create an instruction dataset.

Given our requirements for a high-quality dataset within a limited time and budget, we will use OpenAI’s GPT 3.5, an existing LLM that powers ChatGPT, to create an instruction dataset (covered in this blog) and instruction-tune Llama 2 (to be covered in an upcoming blog) to categorize news articles into 18 pre-defined categories, such as business, technology, and sports.

Check out my Google Colab Notebook.

Let’s get started.

Installing Required Libraries

The first step is to install the latest version of the openai library to access the OpenAI API and build our news classification instruction dataset. We will also install datasets from Hugging Face to view a sample instruction dataset.

!pip install --upgrade openai --progress-bar off
!pip install -Uqqq datasets --progress-bar off

Loading Required Libraries

The next step is to import all the required libraries.

import pandas as pd
import numpy as np
import openai
import time
import random
from random import randrange
from tenacity import retry, stop_after_attempt, wait_random_exponential, retry_if_exception_type
from datasets import load_dataset
import warnings
warnings.filterwarnings('ignore')

# To read and write data files in Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount = True)

tenacity is a general-purpose retrying library, written in Python, to simplify the task of adding retry behavior to just about anything.

In this notebook, I have used tenacity to implement exponential back-off to bypass RateLimitError. This error message comes from exceeding the API’s rate limits.

You can read more about RateLimitError and tenacity usage over here.
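To illustrate, here is a minimal, standalone sketch of how a tenacity retry decorator with random exponential back-off looks (a toy example, separate from the actual functions we define later in this notebook):

from tenacity import retry, stop_after_attempt, wait_random_exponential

# Toy example: retry a flaky function with random exponential back-off
# (waits are capped at 60 seconds; gives up after 6 attempts)
@retry(wait = wait_random_exponential(multiplier = 1, max = 60), stop = stop_after_attempt(6))
def flaky_api_call():
    # Replace with any call that may raise a transient error
    ...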

A Sample Instruction Dataset

Before creating an instruction dataset for the news classification task, let’s look at a popular open instruction dataset, Databricks Dolly 15K. It contains 15,000 high-quality human-generated prompt / response pairs specifically designed for instruction-tuning large language models. Read more about this dataset here.

# Sample instruction dataset
instruction_dataset_name = "databricks/databricks-dolly-15k"

# Loading Databricks Dolly 15K from Hugging Face Datasets
dataset = load_dataset(instruction_dataset_name, split = "train")

print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

# Displaying a random prompt / response pair from the dataset
print(dataset[randrange(len(dataset))])
Number of prompts: 15011
Column names are: ['instruction', 'context', 'response', 'category']

{'instruction': 'What is AWS ECS?', 'context': '', 'response': 'Amazon Elastic Container Service (ECS) is a highly scalable, high performance container management service that supports Docker containers and allows you to easily run applications on a managed cluster of Amazon Elastic Compute Cloud (Amazon EC2) instances.', 'category': 'open_qa'}

Each prompt is a dictionary composed of four keys or fields.

instruction: A question or instruction entered by the user.

context: Text entered by the user to help interpret the instructions.

response: Response to the instruction.

category: Category of the instruction, such as Open Q&A, Closed Q&A, Creative writing, etc.

We will generate an instruction dataset for news classification modeled after the Databricks Dolly 15K structure. Each record will contain the instruction asking the model to classify a news article (instruction), the news article itself (input), and its category (output).
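For clarity, a single record in our target dataset would look roughly like this (an illustrative, hypothetical example; the actual values come from the steps below):

# Illustrative record structure (hypothetical values)
sample_record = {
    "instruction": "Categorize the news article into one of the 18 categories: ...",
    "input": "Full text of the news article.",
    "output": "BUSINESS"
}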

Variable Definitions

Let’s move on to creating the instruction dataset for the news classification task. In the following cell, we will define all the static variables, including file path, file names, API key, and OpenAI model name.

You can create your secret OpenAI API key at OpenAI’s official website.

You can select one of the many models offered by OpenAI for prompting, such as gpt-4, gpt-3.5-turbo, text-davinci-003, etc. Check out the complete list over here. I have used gpt-3.5-turbo, which powers the widely popular ChatGPT, to create my instruction dataset for news classification.

#### Input and output data file names ####
path = "/content/drive/MyDrive/"
input_data_filename = "signalmedia-1m.jsonl.gz"
preprocessed_data_filename = "signalmedia_news_dataset_sample.csv"
processed_data_filename = "signalmedia_news_dataset_sample_classified.csv"
output_data_json_filename = "news_classification.json"
output_data_csv_filename = "news_classification.csv"

#### OpenAI API Key ####
openai.api_key = "Your OpenAI API Key"

#### OpenAI model ####
model_name = "gpt-3.5-turbo"

Preprocessing Raw Data

To create a news classification dataset for instruction-tuning Llama 2, we can use an open-source dataset named Signal 1 Million News Articles Dataset by Signal AI. This dataset, available as a zipped JSONL file, contains 1 million news articles and blogs from a variety of data sources covering a one-month period (September 2015). There are approximately 735K news articles and 265K blog articles. We will select only 1,000 news articles for tuning Llama 2, as research suggests that a small, high-quality dataset (~1,000 samples) can match the performance of much larger, lower-quality datasets.

Data description:

id: a unique identifier for the article

title: the title of the article

content: the textual content of the article (which may occasionally contain HTML and JavaScript content)

source: the name of the article source (e.g., Reuters)

published: the publication date of the article

media-type: either “News” or “Blog”

# Reading zipped JSONL data as a Pandas DataFrame
raw_news_df = pd.read_json(f"{path}{input_data_filename}", lines = True)

# Selecting "News" records
raw_news_df2 = raw_news_df[raw_news_df['media-type'] == "News"]

# Shuffling the dataset
raw_news_df3 = raw_news_df2.sample(frac = 1)

# Selecting top 1000 records/news articles
raw_news_df4 = raw_news_df3.head(1000)

# Saving the preprocessed data as a CSV file
raw_news_df4.to_csv(f"{path}{preprocessed_data_filename}", index = False)

Viewing our preprocessed news articles dataset.

# Loading the preprocessed data as a Pandas DataFrame
prep_news_df = pd.read_csv(f"{path}{preprocessed_data_filename}")

display(prep_news_df)
Signal 1 Million News Articles Dataset (Preprocessed)

Although we can combine title and content together, we will only use the content column in subsequent cells to create the instruction dataset for the purpose of this tutorial.
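If you did want to combine the two columns, a minimal sketch (assuming the prep_news_df DataFrame loaded above) could look like this:

# Optional: combine title and content into a single text field (not used in this tutorial)
prep_news_df['text'] = prep_news_df['title'].fillna('') + "\n\n" + prep_news_df['content'].fillna('')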

Creating a Custom Prompt Template

In the following cell, we will create a custom prompt template to interact with GPT 3.5. It defines the bot's behavior and instructs it to categorize news articles provided by the user into one of 43 pre-defined categories, which I took from the News Category Dataset on Kaggle. This dataset contains 210K news headlines and their categories extracted from HuffPost between 2012 and 2021.

We will also use Few Shot Prompting to guide the model to respond in a specific way by providing two news articles and their expected output as examples as part of the prompt template.

# Defining bot behavior and the classification task
SYSTEM_PROMPT = """You are ChatGPT, an intelligent bot. I will give you a news article. You have to classify the news into one of the 43 categories."""

USER_PROMPT_1 = """Are you clear about your role?"""

ASSISTANT_PROMPT_1 = """Sure, I'm ready to help you with your news classification task. Please provide me with the necessary information to get started."""

# Few Shot Prompting
PROMPT = (
"""
Categories:

U.S. NEWS
COMEDY
PARENTING
WORLD NEWS
CULTURE & ARTS
TECH
SPORTS
ENTERTAINMENT
POLITICS
WEIRD NEWS
ENVIRONMENT
EDUCATION
CRIME
SCIENCE
WELLNESS
BUSINESS
STYLE & BEAUTY
FOOD & DRINK
MEDIA
QUEER VOICES
HOME & LIVING
WOMEN
BLACK VOICES
TRAVEL
MONEY
RELIGION
LATINO VOICES
IMPACT
WEDDINGS
COLLEGE
PARENTS
ARTS & CULTURE
STYLE
GREEN
TASTE
HEALTHY LIVING
THE WORLDPOST
GOOD NEWS
WORLDPOST
FIFTY
ARTS
DIVORCE
ESG

If you don't know the category, respond with "OTHERS".

Output Format:
Category name

Examples:
1. News: New Product Gives Marketers Access to Real Keywords, Conversions and Results Along With 13 Months of Historical Data

SAN FRANCISCO, CA -- (Marketwired) -- 09/17/15 -- Jumpshot, a marketing analytics company that uses distinctive data sources to paint a complete picture of the online customer journey, today announced the launch of Jumpshot Elite, giving marketers insight into what their customers are doing the 99% of the time they're not on your site. For years, marketers have been unable to see what organic and paid search terms users were entering, much less tie those searches to purchases. Jumpshot not only injects that user search visibility back into the market, but also makes it possible to tie those keywords to conversions -- for any web site.

"Ever since search engines encrypted search results, marketers have been in the dark about keywords, impacting not only the insight into their own search investments, but also their ability to unearth high converting keywords for their competitors," said Deren Baker, CEO of Jumpshot. "Our platform eliminates the hacks, assumptions, and guesswork that marketers are doing now and provides real data: actual searches tied to actual conversions conducted by real people with nothing inferred."

Unlike other keyword research tools that receive data through the Adwords API or send bots to cobble together various data inputs and implied metrics, Jumpshot leverages its panel of over 115 million global consumers to analyze real search activity. As a result, Jumpshot is able to provide companies with actionable data to improve the ROI of their search marketing campaigns, SEO tactics and content marketing initiatives.

Available today, Jumpshot Elite provides 13 months of backward-looking data as well as:

Access to real queries used by searchers

Paid and organic results for any website

Visibility into organic keywords, eliminating the "not provided" outcome in web analytics

Real user queries, clicks and transactions instead of machine-generated clicks with inferred results

Ability to tie keywords to real transactions on any website

Variable attribution models and lookback windows

Launched in January, 2015, Jumpshot grew out of the ambitions of a group of smart marketers and data scientists who were frustrated about the limitations of the data they had access to, and excited about the opportunity to provide new insights into online behavior.

The company uses distinctive data sources to paint a complete picture of the online world for businesses, from where customers spend time online to what they do there and how they get from place to place. By tracking the online customer journey down to each click, Jumpshot reveals how and why customers arrive at purchase decisions. The company tracks more data in more detail than other services, tracking 160 billion monthly clicks generated by its extensive data panel.

About Jumpshot

Jumpshot is a marketing analytics platform that reveals the entire customer journey -- from the key sources of traffic to a site, to browsing and buying behavior on any domain. With a panel of 115 million users, Jumpshot provides marketers with the insight to understand what their customers are doing the 99% of the time they're not on their own site -- a scope of information never before attainable. Jumpshot was founded in 2015 and is headquartered in San Francisco.

For more information, please visit www.jumpshot.com.

Image Available: http://www2.marketwire.com/mw/frame_mw?attachid=2889222

Kelly Mayes

The Bulleit Group

615-200-8845

Published Sep. 17, 2015

Copyright © 2015 SYS-CON Media, Inc. — All Rights Reserved.

Syndicated stories and blog feeds, all rights reserved by the author.

Output: TECHNOLOGY

2. News: SOURCE Harwood Feffer LLP

NEW YORK

On July 21, 2015

On this news, VASCO stock nearly 33% and has not recovered.

Our investigation concerns whether the Company board of directors has breached its fiduciary duties to shareholders, grossly mismanaged the Company, and/or committed abuses of control in connection with the foregoing.

If you own VASCO shares and wish to discuss this matter with us, or have any questions concerning your rights and interests with regard to this matter, please contact:

Robert I. Harwood, Esq.

Harwood Feffer

The law firm responsible for this advertisement is Harwood Feffer LLP (www.hfesq.com). Prior results do not guarantee or predict a similar outcome with respect to any future matter.

Logo - http://photos.prnewswire.com/prnh/20120215/MM54604LOGO

To view the original version on PR Newswire, visit:http://www.prnewswire.com/news-releases/harwood-feffer-llp-announces-investigation-of-vasco-data-security-international-inc-300149371.html

©2015 PR Newswire. All Rights Reserved.

Output: BUSINESS

3. {}
Output:
"""
)

Generating Model Inference

In the following cells, we will define the chat_completion_with_backoff and openai_chat_completion_response functions to send user prompts and receive responses using OpenAI's Chat Completions API.

We will also add the tenacity retry decorator to implement automatic retries with random exponential back-off and avoid rate limit errors. Retrying with exponential back-off means performing a short sleep when a rate limit error is hit, then retrying the unsuccessful request. If the request is still unsuccessful, the sleep length is increased and the process is repeated, until the request succeeds or a maximum number of retries is reached.

# Function to invoke OpenAI's Chat Completions API, decorated with automatic
# retry requests using random exponential back-off
@retry(
    retry = retry_if_exception_type((
        openai.error.APIError,
        openai.error.APIConnectionError,
        openai.error.RateLimitError,
        openai.error.ServiceUnavailableError,
        openai.error.Timeout
    )),
    wait = wait_random_exponential(multiplier = 1, max = 60),
    stop = stop_after_attempt(10)
)
def chat_completion_with_backoff(**kwargs):
    return openai.ChatCompletion.create(**kwargs)

# Function to pass model name and user prompts and receive response
def openai_chat_completion_response(USER_PROMPT_2):
    response = chat_completion_with_backoff(
        model = model_name,
        messages = [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT_1},
            {"role": "assistant", "content": ASSISTANT_PROMPT_1},
            {"role": "user", "content": USER_PROMPT_2}
        ]
    )

    return response['choices'][0]['message']['content'].strip(" \n")

Next, we will define the predict_news_category function, which accepts a news article from the preprocessed dataset, inserts it into the user prompt, and sends the prompt to the openai_chat_completion_response function for classification. The output will be one of the 43 pre-defined news categories if the request goes through successfully; otherwise, it will be “NA”. Apart from rate limits, another reason an API call might fail is exceeding the model's token limit. In such cases, we could trim news articles with a high token count to get a valid response (see the sketch after the function below).

predict_news_category will be called through a lambda function applied to every row of the content column in the preprocessed dataset.

# Function to classify news articles
def predict_news_category(news_body):
    # Add news article to the prompt
    NEWS = news_body
    FINAL_PROMPT = PROMPT.format(NEWS)
    # Send prompt for inference
    try:
        classify_news = openai_chat_completion_response(FINAL_PROMPT)
    except:
        # Output "NA" if the request fails and pause before the next call
        classify_news = "NA"
        time.sleep(20)
    return classify_news
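As mentioned above, one way to handle token-limit failures is to trim long articles before building the prompt. Here is a rough sketch using the tiktoken library (an assumption on my part; the original notebook does not trim articles):

import tiktoken

# Count tokens with the gpt-3.5-turbo encoding and truncate overly long articles
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def trim_to_token_limit(text, max_tokens = 3000):
    # max_tokens is a hypothetical budget; adjust for your prompt size and model limit
    tokens = encoding.encode(text)
    if len(tokens) <= max_tokens:
        return text
    return encoding.decode(tokens[:max_tokens])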

For the purpose of this tutorial, we will select the first 100 records for generating news categories and view the results.

# Selecting 100 records at a time for inference
prep_news_df2 = prep_news_df.iloc[0:100,:].copy()

# Lambda function to iterate over news articles and save response as a new column
prep_news_df2['predicted_category'] = prep_news_df2['content'].apply(lambda x: predict_news_category(x))

display(prep_news_df2[['content', 'predicted_category']].head())
Predicted news categories by GPT 3.5

Looking at the results, GPT 3.5 accurately classified most of the news articles into one of the 43 categories. On inspection, the predicted categories look spot on!

Saving the results as a CSV file.

# Saving output file
prep_news_df2.to_csv(f"{path}{processed_data_filename}", index = False)

Creating an Instruction Dataset

Now that we have a small sample of ground truth, we will move on to analyzing it further, resolving any issues, and converting it into an instruction dataset with a structure similar to the Databricks Dolly 15K dataset for instruction-tuning Llama 2.

As a first step, we can take a quick look at the frequency distribution of the predicted news category.

# Loading processed data as a Pandas DataFrame
prep_news_df2 = pd.read_csv(f"{path}{processed_data_filename}")

# Frequency distribution of predicted news categories
pred_cat_freq_dist = prep_news_df2['predicted_category'].value_counts(dropna = False).sort_values(ascending = False).reset_index()
pred_cat_freq_dist = pred_cat_freq_dist.rename(columns = {"index": "predicted_category", "predicted_category": "count"})
display(pred_cat_freq_dist)
Frequency distribution of predicted news categories (Top 10)

Due to the small sample size, we could only capture 23 news categories out of 43. As per the frequency distribution, BUSINESS, POLITICS, SPORTS, and ENTERTAINMENT are the top four news categories.

Looking at the results carefully, we can also notice that the model generated a few categories outside the provided list, such as TECHNOLOGY, SPACE, MARKETING & ADVERTISING, and FINANCE. We will resolve this by merging them into the closest existing categories; for example, TECHNOLOGY will be merged with TECH.

# Merging new news categories with existing ones
prep_news_df2['predicted_category'] = np.where(prep_news_df2['predicted_category'] == "TECHNOLOGY", "TECH", prep_news_df2['predicted_category'])
prep_news_df2['predicted_category'] = np.where(prep_news_df2['predicted_category'] == "SPACE", "SCIENCE", prep_news_df2['predicted_category'])
prep_news_df2['predicted_category'] = np.where(prep_news_df2['predicted_category'] == "FINANCE", "MONEY", prep_news_df2['predicted_category'])
prep_news_df2['predicted_category'] = np.where(prep_news_df2['predicted_category'] == "MARKETING & ADVERTISING", "OTHERS", prep_news_df2['predicted_category'])
prep_news_df2['predicted_category'] = np.where(prep_news_df2['predicted_category'] == "ARTS & CULTURE", "CULTURE & ARTS", prep_news_df2['predicted_category'])
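An equivalent, more compact way to do the same merging (a stylistic alternative, not what the notebook runs) is to use a mapping dictionary with pandas' replace:

# Alternative: merge categories using a mapping dictionary
category_mapping = {
    "TECHNOLOGY": "TECH",
    "SPACE": "SCIENCE",
    "FINANCE": "MONEY",
    "MARKETING & ADVERTISING": "OTHERS",
    "ARTS & CULTURE": "CULTURE & ARTS"
}
prep_news_df2['predicted_category'] = prep_news_df2['predicted_category'].replace(category_mapping)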

Looking at the frequency distribution once again.

Updated predicted news categories (Top 10)

Excluding the null category, there are now 18 news categories in the dataset that we will use to instruction-tune Llama 2.

We will now create a constant column named instruction, akin to the instruction column in the Databricks Dolly 15K dataset, that contains the instruction to classify the news article into one of the 18 categories. Then, we will filter out records with the “NA” news category, if any, and rename content to input (the equivalent of context in Databricks Dolly 15K) and predicted_category to output (the equivalent of response in Databricks Dolly 15K) before saving the DataFrame as a JSON file and a CSV file.

# Creating instruction against each news article / news category pairs
prep_news_df2['instruction'] = """Categorize the news article into one of the 18 categories:

WORLD NEWS
COMEDY
POLITICS
TECH
SPORTS
BUSINESS
OTHERS
ENTERTAINMENT
CULTURE & ARTS
FOOD & DRINK
MEDIA
RELIGION
MONEY
HEALTHY LIVING
SCIENCE
EDUCATION
CRIME
ENVIRONMENT

"""
# Removing null news category records
prep_news_df3 = prep_news_df2[~prep_news_df2['predicted_category'].isna()]

# Renaming and selecting relevant columns
prep_news_df4 = prep_news_df3.rename(columns = {'content': 'input', 'predicted_category': 'output'})
output_news_df = prep_news_df4[['instruction', 'input', 'output']]

display(output_news_df)
Instruction dataset for news classification

Converting the output dataset to a list of dictionaries.

# Converting to list of dictionaries
news_json = output_news_df.to_json(orient = 'records', lines = True).splitlines()

print(news_json[0])
{"instruction":"Categorize the news article into one of the 18 categories:\n\nWORLD NEWS\nCOMEDY\nPOLITICS\nTECH\nSPORTS\nBUSINESS\nOTHERS\nENTERTAINMENT\nCULTURE & ARTS\nFOOD & DRINK\nMEDIA\nRELIGION\nMONEY\nHEALTHY LIVING\nSCIENCE\nEDUCATION\nCRIME\nENVIRONMENT\n\n","input":"SANTIAGO DE CUBA, Cuba - Pope Francis wraps up his visit to Cuba on Tuesday and heads to the United States, figuratively connecting the two longtime Cold War adversaries who have reached detente with the help of his mediation. \n\nThe 78-year-old Argentine pope will celebrate Mass at the sanctuary of the Virgin of Charity of El Cobre, the country's holiest shrine and one also venerated by non-believers and practitioners of Afro-Cuban religions infused with varying degrees of Catholicism. \n  \nAt El Cobre on Monday, Francis prayed for reconciliation among all Cubans, both at home and around the world. \n\nAn estimated 2 million Cubans have left the island since the 1959 revolution with some 1.3 million currently living abroad, most of them in the United States, where many exiles remain bitterly estranged from their homeland. \n\nThere is great anticipation for what Francis will say in the United States, where he will meet with U.S. President Barack Obama, deliver the first address by a pope before Congress, and speak at the United Nations. \n\nThe pope avoided making overt political statements in Cuba, as dissidents had hoped he would, but used his homilies to send messages laced in spirituality about the need for change in the one-party Communist country. \n\nHe urged Cubans to think out of the box and be tolerant of other people's ideas. At a Mass on Monday for tens of thousands of people in the eastern city of Holguin, he urged his listeners \"not to be satisfied with appearances or with what is politically correct.\" \n\nThe gentler approach, a contrast to the tack taken by his two immediate predecessors when they visited, seems driven by a desire to quietly encourage Cubans at a delicate time following the resumption of diplomatic ties with the United States. Meanwhile the Cuban Church is discreetly negotiating greater space for its mission. \n\n\"He has spoken with clarity, discretion and restraint,\" Vatican spokesman Federico Lombardi told reporters, when asked why the pope had not spoken out directly about issues such as Cuba's human rights record and the U.S. trade embargo, which the Vatican opposes. \n\n\"The pope wants to make a contribution but the responsibility lies with the leaders of nations. He does not want to exaggerate his role, he just wants to contribute by making suggestions, promoting dialogue, justice and the common good of people,\" he said. REUTERS","output":"WORLD NEWS"}

Finally, we save the dataset as a JSON file and a CSV file, which we can later load back using Hugging Face Datasets to fine-tune any LLM (a loading sketch follows the code below).

# Saving as a JSON file
with open(f"{path}{output_data_json_filename}", 'w') as f:
    for line in news_json:
        f.write(f"{line}\n")

# Saving as a CSV file
output_news_df.to_csv(f"{path}{output_data_csv_filename}", index = False)
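For reference, the saved files can be loaded back with Hugging Face Datasets roughly as follows (a sketch; the fine-tuning itself is covered in the next blog):

# Loading the instruction dataset back with Hugging Face Datasets
news_dataset = load_dataset("json", data_files = f"{path}{output_data_json_filename}", split = "train")
# or: load_dataset("csv", data_files = f"{path}{output_data_csv_filename}", split = "train")
print(news_dataset)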

Conclusion

In summary, news articles play a crucial role in machine learning research, providing valuable information and intricate language constructs that aid in training and evaluating natural language understanding models. Categorizing news articles is essential for efficient access to pertinent information, though doing so can be challenging. However, a game-changing solution is to create an instruction dataset using an existing Large Language Model (LLM) and then fine-tune an open-source model on that dataset for news categorization.

In this blog, we looked at how to leverage GPT 3.5, a powerful LLM, to create an instruction dataset for news categorization. This dataset consists of approximately 100 high-quality records and was produced with minimal human intervention.

In an upcoming Google Colab notebook, I will demonstrate how to build a custom news classifier by fine-tuning or instruction-tuning Meta’s Llama 2 on this dataset to categorize news articles into one of the 18 categories.

Stay tuned!

References: Philipp Schmid


Kshitiz Sahay

Kshitiz Sahay is a senior data scientist at Dun & Bradstreet Inc.