Fine-Tune EleutherAI GPT-Neo to Generate Netflix Movie Descriptions in Only 47 Lines of Code

Recently, EleutherAI released their GPT-3-like model GPT-Neo, and a few days ago it became available as part of the Hugging Face Transformers framework. At the time of writing, this model is available only on the master branch of the repository, so you need to install it like this:

pip install git+https://github.com/huggingface/transformers@master
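One quick way to sanity-check the install is to try importing the GPT-Neo classes; the import only succeeds on builds that already include GPT-Neo support:

# Fails with an ImportError on transformers releases that predate GPT-Neo.
from transformers import GPTNeoForCausalLM, GPT2Tokenizer
print("GPT-Neo support is available")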

The main goal is to show you the simplest way to fine-tune the GPT-Neo model to generate new movie descriptions using a dataset of Netflix movies and TV shows.

A CUDA device was used for this project; please note that GPT-Neo is a very VRAM-demanding model!
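Before starting, you can check which GPU is available and how much memory it has, using standard PyTorch calls:

import torch

# Fine-tuning GPT-Neo without a CUDA device is not practical.
assert torch.cuda.is_available(), "GPT-Neo fine-tuning requires a CUDA device"
props = torch.cuda.get_device_properties(0)
print(props.name, round(props.total_memory / 1024 ** 3, 1), "GiB VRAM")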

First, we need to download and prepare the GPT-Neo model:

import torch
import pandas as pd
from transformers import GPT2Tokenizer, GPTNeoForCausalLM

torch.manual_seed(42)
# The checkpoint name and special tokens below are assumptions (the article's
# inline values were lost); "EleutherAI/gpt-neo-1.3B" is the standard 1.3B
# checkpoint, and the markers follow the usual GPT-2 fine-tuning convention.
tokenizer = GPT2Tokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B",
                                          bos_token="<|startoftext|>",
                                          eos_token="<|endoftext|>",
                                          pad_token="<|pad|>")
model = GPTNeoForCausalLM.from_pretrained("EleutherAI/gpt-neo-1.3B").cuda()

# We added new special tokens, so the embedding matrix must be resized.
model.resize_token_embeddings(len(tokenizer))

The next step is to read the Netflix dataset and calculate the maximum length of a movie description in it:

# "netflix_titles.csv" is the file name from the public Kaggle Netflix
# dataset (assumed here; the original inline link was lost).
descriptions = pd.read_csv("netflix_titles.csv")["description"]
max_length = max([len(tokenizer.encode(description)) for description in descriptions])

This custom PyTorch Dataset class is handy for fine-tuning with the Trainer tool:


from torch.utils.data import Dataset

class NetflixDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for txt in txt_list:
            # Wrap each description in the start/end markers and pad
            # everything to the same length.
            encodings_dict = tokenizer("<|startoftext|>" + txt + "<|endoftext|>",
                                       truncation=True,
                                       max_length=max_length,
                                       padding="max_length")
            input_ids = torch.tensor(encodings_dict["input_ids"])
            self.input_ids.append(input_ids)
            mask = torch.tensor(encodings_dict["attention_mask"])
            self.attn_masks.append(mask)

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attn_masks[idx]

Now initialize the dataset:

dataset = NetflixDataset(descriptions, tokenizer, max_length)
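Each item of the dataset is a pair of fixed-length tensors, which you can quickly verify:

# Peek at one example to sanity-check the shapes.
ids, mask = dataset[0]
print(ids.shape, mask.shape)  # both torch.Size([max_length])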

Next, you need to split the whole dataset into the training and validation sets:

from torch.utils.data import random_split

# The 90/10 split ratio is an assumption; the original value was lost.
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset,
                                          [train_size, len(dataset) - train_size])

The Hugging Face toolkit provides a useful Trainer tool that helps users fine-tune pre-trained models in most standard use cases. All the training parameters should be configured using TrainingArguments:


from transformers import TrainingArguments

# The numeric values below are reasonable placeholders, not the article's
# originals (those were lost); tune them for your hardware.
training_args = TrainingArguments(output_dir="./results",
                                  num_train_epochs=5,
                                  logging_steps=500,
                                  save_steps=500,
                                  per_device_train_batch_size=2,
                                  per_device_eval_batch_size=2,
                                  warmup_steps=100,
                                  weight_decay=0.01,
                                  logging_dir="./logs")

Finally, all that’s left is to fine-tune our model and check the results!

from transformers import Trainer

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=train_dataset,
                  eval_dataset=val_dataset,
                  # Stack the tensors from NetflixDataset into batches; for
                  # causal LM fine-tuning the labels are the input_ids themselves.
                  data_collator=lambda data:
                      {"input_ids": torch.stack([f[0] for f in data]),
                       "attention_mask": torch.stack([f[1] for f in data]),
                       "labels": torch.stack([f[0] for f in data])})
trainer.train()
Training process
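If you also want to keep the fine-tuned weights for later, the standard save calls work here (the output path is just an example):

# Persist the fine-tuned weights and tokenizer for later reuse.
trainer.save_model("./finetuned-gpt-neo")
tokenizer.save_pretrained("./finetuned-gpt-neo")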

After training, we can evaluate the results using the model's built-in generate() function:


# The prompt is the same start-of-text marker used during training
# (an assumption, since the original inline value was lost).
generated = tokenizer("<|startoftext|>",
                      return_tensors="pt").input_ids.cuda()
sample_outputs = model.generate(generated,
                                do_sample=True,
                                top_k=50,
                                max_length=300,
                                top_p=0.95,
                                temperature=1.9,
                                num_return_sequences=20)
for i, sample_output in enumerate(sample_outputs):
    print("{}: {}".format(i, tokenizer.decode(sample_output,
                                              skip_special_tokens=True)))

Here are a few generated samples:

As you can see, the Hugging Face framework provides an incredibly friendly API for various NLP tasks and allows us to work with many powerful pre-trained models.

This code is also available on my GitHub.

Also, it's possible to fine-tune the larger GPT-Neo-2.7B model using DeepSpeed, which makes fine-tuning this quite large model with batch size 15 feasible on a single GPU!
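For reference, here is a minimal sketch of how DeepSpeed plugs into the same Trainer setup, assuming you pass a ZeRO config file through TrainingArguments (the file name and values are illustrative, not the exact ones from the 2.7B run):

training_args = TrainingArguments(output_dir="./results-2.7b",
                                  num_train_epochs=5,
                                  per_device_train_batch_size=15,
                                  fp16=True,
                                  deepspeed="ds_config.json")  # path to your ZeRO config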

Some samples generated by the GPT-Neo-2.7B model
