End-to-end text prediction — zero local footprint
Skills needed to survive in the real world
Many in the data science field have only ever worked with ‘toy datasets’ that are small enough to be downloaded onto and processed on a personal laptop.
On their first day of work, many data scientists are surprised to find that the real world operates differently. Companies typically have dedicated GPUs, and models are trained on servers. This process is managed programmatically via the terminal, with no download buttons to click or user-friendly interfaces to navigate.
For those who have yet to encounter this environment, there’s good news — you can prepare yourself well in advance! You don’t have to wait until you secure a full-time job or internship to experience working with large datasets on remote servers. Today, anyone with internet access can do so!
Objectives
In this article, I will provide a detailed guide showing how you can:
- Source useful datasets with known benchmarks
- Download data programmatically with an API token
- Train a decent NLP model without taking up any of your local disk space
Cloud computing is not new. However, I know that many people have no idea how the above can be done, or even how to perform CRUD (Create, Read, Update, Delete) on files and folders without any mouse or touchpad.
Using company servers will require you to learn bash scripting. However, you can practice many of the basic skills using Colab.
With this article, newcomers to the field can get the kind of help and support I wished I had many years ago.
Disclaimer: I will be recommending Google Colab and Google Drive. I do not own any shares in Alphabet, and do not benefit in any way from you using Google.
[1] Source For Datasets
You can find useful datasets by visiting Papers with Code; the datasets listed there can easily be downloaded, and the site’s content is licensed under the CC BY-SA licence.
For example, suppose you want to explore developing models on medical images. By searching something as straightforward as ‘Xray’, you will see many different datasets, such as the following on Chest X-ray:
This is a fairly popular dataset and has a moderate size of 40+ GB, which is far larger than the typical toy examples. To store it, you would need at least the 100 GB plan on Google Drive (the most basic paid tier, at $1.99 per month).
For the purpose of this article, we will work with something within the free tier, so that anyone can replicate it without incurring any costs.
Let’s go with the Yahoo! Answers dataset, which has been used in a number of papers. It has a total of 1.4 million samples and 10 possible classes, making it more interesting to work with than those containing just binary labels like spam-or-not-spam.
This dataset is available on Kaggle (more on this in section 2), under the Open Data Commons Public Domain Dedication and License (PDDL) which allows even commercial use.
[2] Download And Process Data
We will perform the entire process on Google Colab, which offers GPU usage even for free-tier users. For this article, a CPU would suffice. Either way, doing everything on the cloud is beneficial, as it lets your computer stay cool and quiet.
Apart from training deep neural networks with PyTorch/TensorFlow, GPUs can also be used to accelerate training of classical approaches like Random Forests and even Logistic Regression via RAPIDS cuML.
RAPIDS, part of NVIDIA CUDA-X, is an open-source suite for executing data science pipelines on GPUs. But let’s leave that for another day; there is already more than enough content to cover in this article.
[2a] Set up Google Colab
We want to download the data into Google Drive directly. You can actually open a terminal on Colab by clicking on the icon on the bottom-left.
In your Colab notebook, first connect to Google Drive, so that you can write to and read from it.
from google.colab import drive
drive.mount('/content/drive')
Spend some time navigating your Google Drive using commands like cd to change directory and ls to list your files and folders. Simply do cd .. to return to the parent directory. There is much more that can be done, but knowing these is sufficient for now.
As many beginners prefer to work exclusively in notebooks, let’s do just that. It is effectively the same: terminal commands can be executed in notebook cells simply by prefixing them with !.
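For instance, the following cell is a small sketch of how this looks in practice (the paths are simply the standard mount point from the previous step):
# Terminal commands run in a notebook cell when prefixed with '!'
!ls /content/drive/MyDrive
# Note: '!cd some_folder' runs in a throwaway subshell and does not persist across cells;
# use the notebook magic '%cd' if you want the working directory to actually change
%cd /content/drive/MyDrive
!ls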
If you do not want to repeat the installation of all the libraries each time a new Colab notebook is opened, you can save everything to your personal Google Drive. First, create or choose a folder (e.g. Colab/library, or any name you like). Doing so via the UI is fine, though using mkdir is nicer. Next, append the folder to the Python path so that the interpreter will also search there when importing libraries.
import sys
sys.path.append('/content/drive/MyDrive/Colab/library')
When you perform pip install, remember to save the libraries there. For example, if you need openai, do the following:
!pip install -t /content/drive/MyDrive/Colab/library openai
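In a later session, a library installed this way can then be imported without reinstalling, as long as the path has been appended first. A sketch, assuming the same Colab/library folder as above:
import sys
sys.path.append('/content/drive/MyDrive/Colab/library')

import openai  # resolved from the Drive folder rather than a fresh pip install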
[2b] Download data without any buttons
If you have a direct link to the data file, wget does the job.
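For example, you could download straight into a folder on your Drive (the URL below is only a placeholder for an actual direct file link):
# -P sets the destination folder; replace the URL with a real direct link to a file
!wget -P /content/drive/MyDrive/Data https://example.com/some_dataset.zip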
For learning purposes, I will demonstrate how to programmatically download the data from Kaggle. Many datasets are available there, and in some cases (like Chest X-ray14), the original source (https://nihcc.app.box.com) points to a web page rather than a file, rendering wget unusable.
Instead of downloading via the big, enticing ‘Download’ button, let’s do so with code.
To obtain the dataset from Kaggle, you need to be able to authenticate yourself. Sign in to your Kaggle account. Under your user icon, go to Settings (alright, it is okay to use your mouse/touchpad here) and click ‘Create New Token’.
Your credentials are stored in the downloaded kaggle.json file. These should be set in os.environ. The kaggle package should already be installed on Colab, but either way you can install it with !pip install kaggle.
import os
os.environ['KAGGLE_USERNAME'] = 'your_username'
os.environ['KAGGLE_KEY'] = 'your_key'
!kaggle datasets download -d yacharki/yahoo-answers-10-categories-for-nlp-csv
With this, you will download the entire dataset as a zip file. How do we know what to type after download -d? You can infer it from the dataset’s URL: https://www.kaggle.com/datasets/yacharki/yahoo-answers-10-categories-for-nlp-csv.
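As an alternative to setting environment variables, the Kaggle CLI can also read credentials from ~/.kaggle/kaggle.json. A sketch, assuming you have uploaded kaggle.json somewhere in your Drive (the Colab folder here is just an example):
# Copy the credentials file to where the Kaggle CLI expects it, and restrict its permissions
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/Colab/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json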
[2c] Unzip data
The data can be extracted using unzip [source] -d [destination], replacing the square brackets with the appropriate paths. For example,
!unzip yahoo-answers-10-categories-for-nlp-csv.zip -d /content/drive/MyDrive/Data/yahooanswers
Note that if you do not already have a Data folder, create it first using mkdir /content/drive/MyDrive/Data.
You’re on the right track when you see something like this:
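You can also confirm the extracted files programmatically by listing the destination folder used above:
!ls /content/drive/MyDrive/Data/yahooanswers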
[2d] Read data
Obtain the train and test data from train.csv and test.csv. Remember to indicate the full path correctly.
import pandas as pd
df_train = pd.read_csv('/content/drive/MyDrive/Data/yahooanswers/10_categories_of_yahoo_answers_for_nlp_tasks_csv/train.csv')
df_test = pd.read_csv('/content/drive/MyDrive/Data/yahooanswers/10_categories_of_yahoo_answers_for_nlp_tasks_csv/test.csv')
You should already know where the files reside from the output displayed after unzipping. Nonetheless, here’s a useful command for finding any file in your directories. Suppose you are looking for a train.csv somewhere within MyDrive:
!find /content/drive/MyDrive -name 'train.csv'
If you do not like to see a long file path spanning multiple folders, you can rename and move the file using mv. For housekeeping, you can also use rm to delete files, or rm -r to delete entire folders and their contents (use with caution!).
!mv /content/drive/MyDrive/Data/yahooanswers/10_categories_of_yahoo_answers_for_nlp_tasks_csv/train.csv train.csv
The data looks like the following:
The labels are represented as integers from 1 to 10 inclusive. You should never just assume this, though; verify it using:
df_train['class_index'].unique()
You can create a dictionary to map each integer to the corresponding class label using classes.txt.
with open('/content/drive/MyDrive/Data/yahooanswers/10_categories_of_yahoo_answers_for_nlp_tasks_csv/classes.txt', 'r') as file:
    content = file.read()

mapper = dict(
    x.split(' = ') for x in content.split('\n')[:-1]
)
# {'1': 'Society & Culture',
# '2': 'Science & Mathematics',
# '3': 'Health',
# '4': 'Education & Reference',
# '5': 'Computers & Internet',
# '6': 'Sports',
# '7': 'Business & Finance',
# '8': 'Entertainment & Music',
# '9': 'Family & Relationships',
# '10': 'Politics & Government'}
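As a quick usage example, you can attach the human-readable labels to the DataFrame with this mapper (the class_name column is just an arbitrary new name):
# Map the integer class index (as a string key) to its category name
df_train['class_name'] = df_train['class_index'].astype(str).map(mapper)
df_train[['class_index', 'class_name']].head()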
[3] Build and train model
We will create a model to predict the correct category given a pair of question_title and question_content. We will use a pre-trained feature extractor followed by a simple classifier trained on the extracted embeddings.
[3a] Get feature vector from pre-trained BERT
BERT takes in sentences of varying length, and we can extract a feature vector that has been trained to represent the overall sentence semantics. It is straightforward to use, with tokenization and other preprocessing handled automatically. I choose the RoBERTa (Robustly Optimized BERT Approach) variant, and will use the CLS embedding together with the mean embedding of all tokens.
from transformers import RobertaTokenizer, RobertaModel
import torch
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base').to(device)

def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Concatenate the CLS token embedding with the mean of all token embeddings (768 + 768 = 1536 dims)
    cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    mean_embedding = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
    return np.concatenate((cls_embedding, mean_embedding), axis=1).flatten()
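As a quick sanity check (the example sentence is arbitrary), the combined CLS-plus-mean embedding should have 768 + 768 = 1536 dimensions:
vec = get_bert_embeddings('Why is the sky blue?')
print(vec.shape)  # (1536,)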
In the interest of time (and so that you can replicate everything), I will train on less than 10% of the 1.4 million training samples. Naturally, performance will be lower, but we can at least get a conservative estimate of what to expect.
from tqdm.notebook import tqdm

df_train['x_raw'] = df_train['question_title'] + ' ' + df_train['question_content']
df_test['x_raw'] = df_test['question_title'] + ' ' + df_test['question_content']

tqdm.pandas()
# Take every 11th row (~9% of the data); use df_train_sub['class_index'].value_counts() to check the labels
df_train_sub = df_train.iloc[::11].copy()
df_train_sub['x'] = df_train_sub['x_raw'].astype(str).progress_apply(get_bert_embeddings)
df_test['x'] = df_test['x_raw'].astype(str).progress_apply(get_bert_embeddings)
For the purpose of experiments, I performed the feature extraction using both CPU and GPU (NVIDIA L4, which is less powerful than the A100).
Performing feature extraction via CPU allows us to obtain ~24 embeddings per second on average. The L4 GPU does the same thing at 4 times the speed, giving us ~96 embeddings per second.
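Since the extraction takes a while even on a GPU, you may want to cache the embeddings to Drive so that they survive a Colab session reset. A minimal sketch (the file names and folder are arbitrary):
import numpy as np

# Save the extracted embeddings; load them back later with np.load
np.save('/content/drive/MyDrive/Data/yahooanswers/x_train_sub.npy', np.stack(df_train_sub['x'].values))
np.save('/content/drive/MyDrive/Data/yahooanswers/x_test.npy', np.stack(df_test['x'].values))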
[3b] Train on extracted embeddings
Moving on, we will extract the relevant columns from the pandas DataFrame, namely x and class_index.
import numpy as np

X_train = np.stack(df_train_sub['x'].values.tolist())  # (127273, 1536)
y_train = df_train_sub['class_index']                  # (127273,)
X_test = np.stack(df_test['x'].values.tolist())        # (59999, 1536)
y_test = df_test['class_index']                        # (59999,)
Finally, a simple fully-connected neural network with a single hidden layer will be used. We will keep things as simple as possible: no dropout, learning-rate scheduler, early stopping or hyperparameter tuning.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import classification_report

class FullyConnected(nn.Module):
    def __init__(self, input_size, output_size):
        super(FullyConnected, self).__init__()
        self.fc1 = nn.Linear(input_size, 256)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(256, output_size)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu1(out)
        out = self.fc2(out)
        return out

train_loader = DataLoader(
    TensorDataset(
        torch.tensor(X_train, dtype=torch.float32),
        torch.tensor(y_train.values - 1, dtype=torch.long)  # change labels from [1, 10] to [0, 9]
    ), batch_size=64, shuffle=True
)
X_test = torch.tensor(X_test, dtype=torch.float32).to(device)

# Note: this reuses the name 'model', replacing the RoBERTa feature extractor defined earlier
model = FullyConnected(input_size=1536, output_size=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

for epoch in tqdm(range(10)):
    model.train()
    for i, (inputs, labels) in enumerate(train_loader):
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

model.eval()  # set the model to evaluation mode
with torch.no_grad():
    outputs = model(X_test)
    _, y_hat = torch.max(outputs.data, 1)

print(classification_report(y_test - 1, y_hat.cpu().numpy()))
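If you only want the headline number rather than the full per-class report, the overall accuracy can also be computed directly:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test - 1, y_hat.cpu().numpy())
print(f'Test accuracy: {acc:.3f}')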
[3c] Results
Using the L4 GPU, the 10 training epochs can be completed in about one minute.
The performance here is notably lower than the benchmark, which is above 75% accuracy on the 10 categories. There are many things that could still be done: hyperparameter tuning, a more complex neural network, an ensemble of classical supervised learning methods, and of course training on the entire set of 1.4 million samples.
However, competitive accuracy was never the objective of this article. Instead, it is meant to show you the overall end-to-end process.
Conclusion
From this article, you have learnt:
- That Papers with Code is a good source of datasets that come with established benchmarks.
- That libraries installed in Colab can be retained (using pip install -t [library/path/] [library_name] and sys.path.append).
- To download data programmatically via an API.
- To use simple bash commands like mkdir, wget, unzip and find. Of course, there’s ls, cd and rm, which you may already be using frequently.
- To obtain pre-trained embeddings from sentences (and see that a GPU significantly speeds up the extraction process).
- To train a simple classifier on top of those embeddings in just a couple of lines.
Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
Editor’s note: Find out more about the Singapore Management University’s Master of IT in Business (MITB) programme at https://smu.edu.sg/mitb