Now Convert Your Free Text to Python Code: Poor Man’s AI Programmer

Suraj Kumar Gorai
7 min readJul 21, 2021


I am sure most of you have heard about GitHub Copilot. If you have not, go check out its website and try to register for it — but not all of us are lucky, and you may not get access.

Don’t worry, we can build something similar ourselves. But how, when we have neither a pile of GPUs nor access to such large amounts of data?

Here comes our savior: the T5 transformer. Yes, you heard that right.

I am sure most of you have heard about OpenAI’s GPT-3 and its insane text-generation capabilities. But GPT-3 is not open source, and using its API would be quite costly. Being aware of the text-to-text capabilities of Google’s T5 transformer, I decided to fine-tune T5 on a text-to-Python dataset and see the results. You can think of it as a mini GitHub Copilot, one that is only capable of generating Python code from your free text.

Every day we hear about cutting-edge deep learning language models like BERT, XLNet, GPT, and others. With the advancement of transfer learning and deep learning, natural language processing in particular has achieved wonders with the rise of Transformers.

In transfer learning, we basically take a pretrained architecture and weights that have been trained on huge amounts of data (such as Wikipedia and other open sources containing text, numbers, etc.), and the rest is fine-tuning those models on specific tasks such as sentiment classification, named entity recognition, text summarization, next sentence prediction, and many others.

Here are a few examples of input and output from the final model, which I trained on a text-to-Python dataset using the T5 transformer:

[Screenshots: Input to AI Programmer and the corresponding Output of AI Programmer]

Now that you have seen the examples, let’s try to create this AI Programmer from scratch. We will use PyTorch Lightning to build it and Google Colab to train it. Please do not forget to enable the GPU runtime.

1.1 Installing Dependencies

!pip install --quiet transformers==4.5.0
!pip install --quiet pytorch-lightning==1.2.7
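Depending on your environment, the T5 tokenizer may additionally need the sentencepiece package. If loading the tokenizer later fails with a sentencepiece-related error, installing it should fix the issue:

!pip install --quiet sentencepiece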

1.2 Import all Necessary Libraries

import json
import pandas as pd
import numpy as np
import torch

# dataset and dataloader for functions
from torch.utils.data import Dataset, DataLoader
# lightning for data class
import pytorch_lightning as pl
# leveraging the model checkpoints
from pytorch_lightning.callbacks import ModelCheckpoint
# we can visualize performance of model
from pytorch_lightning.loggers import TensorBoardLogger
# splitting the data
from sklearn.model_selection import train_test_split
# color formatting (ANSI codes) for terminal output
from termcolor import colored
# wrapping long strings of text for display
import textwrap
# importing model, tokenizer and optimizer utilities
# from the Hugging Face transformers library
from transformers import (
    AdamW,
    T5ForConditionalGeneration,
    T5TokenizerFast as T5Tokenizer
)
# showing bars for processes in notebook
from tqdm.auto import tqdm
# seaborn for visualizing
import seaborn as sns
# procedural import to matplotlib
from pylab import rcParams
# graphs
import matplotlib.pyplot as plt
# rcParams for setting default values to all plots
from matplotlib import rc
pl.seed_everything(42)
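The seaborn and matplotlib imports above are only needed if you want to visualize anything (for example, token-length distributions or loss curves). A typical optional style setup, which is my own addition and not required for training, looks like this:

# optional: default plot style and figure size
sns.set(style="whitegrid", palette="muted", font_scale=1.2)
rcParams["figure.figsize"] = (12, 6)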

1.3 Load Your own data for Training

One important thing to note here is that you can create your own dataset of this type with minimal effort. You can use web scraping to build a text-to-code dataset for any language. In this blog I am using an open-source dataset, but you can also extend it yourself to improve your model.

df = pd.read_csv("dataset_text_python.csv")
df.head()
Custom dataset for Training
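For reference, everything that follows assumes the CSV has a text column (the natural-language description) and a code column (the target Python snippet). If you want to extend the dataset with your own pairs, a minimal sketch could look like this (the two example rows are made up for illustration):

extra_pairs = pd.DataFrame({
    "text": ["print hello world", "write a function to add two numbers"],
    "code": ['print("hello world")', "def add(a, b):\n    return a + b"],
})
df = pd.concat([df, extra_pairs], ignore_index=True)
df.to_csv("dataset_text_python.csv", index=False)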

1.4 Split your data and make your Custom Dataset Class

First, we split the dataframe into training and test sets (a minimal sketch is shown right after this paragraph). Then we define a custom dataset class that holds the data, the tokenizer, and the maximum lengths of the input and output sequences; it takes care of encoding the data and adding padding and special tokens for the PyTorch Lightning model.
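The train_df and test_df dataframes used in step 1.6 are not defined anywhere else in the post, so here is a minimal split; the 90/10 ratio and the fixed random_state are my own choices:

train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)
train_df.shape, test_df.shape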

class CodeDataset(Dataset):
    def __init__(
        self,
        data: pd.DataFrame,
        tokenizer: T5Tokenizer,
        text_max_token_len: int = 100,
        code_max_token_len: int = 128
    ):
        self.tokenizer = tokenizer
        self.data = data
        self.text_max_token_len = text_max_token_len
        self.code_max_token_len = code_max_token_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, index: int):
        data_row = self.data.iloc[index]
        text = data_row["text"]
        # encode the free-text prompt with the tokenizer passed to the constructor
        text_encoding = self.tokenizer(
            text,
            max_length=self.text_max_token_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt"
        )
        # encode the target python code
        code_encoding = self.tokenizer(
            data_row["code"],
            max_length=self.code_max_token_len,
            padding="max_length",
            truncation=True,
            return_attention_mask=True,
            add_special_tokens=True,
            return_tensors="pt"
        )
        labels = code_encoding["input_ids"]
        # replace padding token ids (0 for T5) with -100 so they are ignored by the loss
        labels[labels == 0] = -100
        return dict(
            text=text,
            code=data_row["code"],
            text_input_ids=text_encoding["input_ids"].flatten(),
            text_attention_mask=text_encoding["attention_mask"].flatten(),
            labels=labels.flatten(),
            labels_attention_mask=code_encoding["attention_mask"].flatten()
        )

1.5 Create Your Data Module

This PyTorch Lightning data module wraps our custom dataset class and feeds the data to the core model, handling the training, validation, and test dataloaders for us.

class CodeDataModule(pl.LightningDataModule):
    def __init__(
        self,
        train_df: pd.DataFrame,
        test_df: pd.DataFrame,
        tokenizer: T5Tokenizer,
        batch_size: int = 8,
        text_max_token_len: int = 100,
        code_max_token_len: int = 128
    ):
        super().__init__()
        self.train_df = train_df
        self.test_df = test_df
        self.batch_size = batch_size
        self.tokenizer = tokenizer
        self.text_max_token_len = text_max_token_len
        self.code_max_token_len = code_max_token_len

    def setup(self, stage=None):
        self.train_dataset = CodeDataset(
            self.train_df,
            self.tokenizer,
            self.text_max_token_len,
            self.code_max_token_len
        )
        self.test_dataset = CodeDataset(
            self.test_df,
            self.tokenizer,
            self.text_max_token_len,
            self.code_max_token_len
        )

    def train_dataloader(self):
        return DataLoader(
            self.train_dataset,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=2
        )

    def val_dataloader(self):
        return DataLoader(
            self.test_dataset,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=2
        )

    def test_dataloader(self):
        return DataLoader(
            self.test_dataset,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=2
        )

1.6 Loading the Base T5 model and Tokenizer

In this step we load the base T5 model and its tokenizer, which let us use transfer learning and fine-tune the model on this specific text-to-code task. We also instantiate our data module to prepare the data.

MODEL_NAME = "t5-base"
tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
N_EPOCHS = 20
BATCH_SIZE = 8
data_module = CodeDataModule(train_df, test_df, tokenizer, batch_size=BATCH_SIZE)
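Before training, it is worth confirming that the batches look sensible. This quick check is my own addition (it assumes the train/test split from step 1.4 has been run); it calls setup() manually and inspects one training batch:

data_module.setup()
batch = next(iter(data_module.train_dataloader()))
print(batch["text_input_ids"].shape)   # expected: torch.Size([8, 100])
print(batch["labels"].shape)           # expected: torch.Size([8, 128])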

1.7 Create Our PyTorch Lightning Module to Fine-Tune the T5 Model

class TextCodeModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME, return_dict=True)

    def forward(self, input_ids, attention_mask, decoder_attention_mask, labels=None):
        output = self.model(
            input_ids,
            attention_mask=attention_mask,
            labels=labels,
            decoder_attention_mask=decoder_attention_mask
        )
        return output.loss, output.logits

    def training_step(self, batch, batch_idx):
        input_ids = batch["text_input_ids"]
        attention_mask = batch["text_attention_mask"]
        labels = batch["labels"]
        labels_attention_mask = batch["labels_attention_mask"]
        loss, outputs = self(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_attention_mask=labels_attention_mask,
            labels=labels
        )
        self.log("train_loss", loss, prog_bar=True, logger=True)
        return loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch["text_input_ids"]
        attention_mask = batch["text_attention_mask"]
        labels = batch["labels"]
        labels_attention_mask = batch["labels_attention_mask"]
        loss, outputs = self(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_attention_mask=labels_attention_mask,
            labels=labels
        )
        self.log("val_loss", loss, prog_bar=True, logger=True)
        return loss

    def test_step(self, batch, batch_idx):
        input_ids = batch["text_input_ids"]
        attention_mask = batch["text_attention_mask"]
        labels = batch["labels"]
        labels_attention_mask = batch["labels_attention_mask"]
        loss, outputs = self(
            input_ids=input_ids,
            attention_mask=attention_mask,
            decoder_attention_mask=labels_attention_mask,
            labels=labels
        )
        self.log("test_loss", loss, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        return AdamW(self.parameters(), lr=0.0001)

1.8 Training and Logging model logs in TensorBoard

## Create an instance of the model class
model = TextCodeModel()
##clear up some unused memory
import gc
gc.collect()
##Logging model training into Tensor Board
%load_ext tensorboard
%tensorboard --logdir ./lightning_logs
## saving model checkpoints in a directory
checkpoint_callback = ModelCheckpoint(
    dirpath="checkpoints",
    filename="best-checkpoint",
    save_top_k=1,
    verbose=True,
    monitor="val_loss",
    mode="min"
)
logger = TensorBoardLogger("lightning_logs", name="text-code")
trainer = pl.Trainer(
    logger=logger,
    checkpoint_callback=checkpoint_callback,
    max_epochs=N_EPOCHS,
    gpus=1,
    progress_bar_refresh_rate=30
)
############ Training the model
trainer.fit(model, data_module)
## Loading the trained model from checkpoint
trained_model = TextCodeModel.load_from_checkpoint(
trainer.checkpoint_callback.best_model_path
)
## Freezing the model
trained_model.freeze()

1.9 Create a Function That Uses the Trained Model to Generate Python Code from Text

def text_to_code(text):
    # encode the free-text prompt
    text_encoding = tokenizer(
        text,
        max_length=100,
        padding="max_length",
        truncation=True,
        return_attention_mask=True,
        add_special_tokens=True,
        return_tensors="pt"
    )
    # generate python code token ids with beam search
    generated_ids = trained_model.model.generate(
        input_ids=text_encoding["input_ids"],
        attention_mask=text_encoding["attention_mask"],
        max_length=100,
        num_beams=2,
        repetition_penalty=2.5,
        length_penalty=1.0,
        early_stopping=True
    )
    # decode the generated ids back into text
    preds = [
        tokenizer.decode(gen_id, skip_special_tokens=True, clean_up_tokenization_spaces=True)
        for gen_id in generated_ids
    ]
    return "".join(preds)
Use the function to create python code from free text
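For example (the prompt below is just an illustration; the exact code the model produces will depend on your dataset and training run):

print(text_to_code("write a python program to add two numbers"))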

1.10 Saving and Loading the trained model for future use

import pickle
# saving the trained model to disk
with open('text_python_model.pkl', 'wb') as f:
    pickle.dump(trained_model.model, f)
# loading the model back for future use
model = pickle.load(open('text_python_model.pkl', 'rb'))
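Note that pickling works here, but Hugging Face models also come with their own save/load methods, which keep the model configuration and can store the tokenizer alongside the weights; the directory name below is arbitrary:

trained_model.model.save_pretrained("text_python_t5")
tokenizer.save_pretrained("text_python_t5")
reloaded_model = T5ForConditionalGeneration.from_pretrained("text_python_t5")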

End Notes:

I got the idea of doing text-to-Python conversion from GitHub Copilot and from the Medium blog “Poor man’s GPT-3: Few shot text generation with T5 Transformer”. I also want to draw your attention to a site full of interesting AI blogs by Shivanand Roy, which will help you learn a lot.

Last but not least, get ready for my future blogs. Lots of MLOps and other AI blogs are on their way. If you liked this article, follow me on LinkedIn and share your thoughts and feedback.

Happy learning! See you in the next blog.
