End-to-end text prediction — zero local footprint
Skills needed to survive in the real world
Many in the data science field have only ever worked with ‘toy datasets’ that are small enough to be downloaded onto and processed on a personal laptop.
On their first day of work, many data scientists are surprised to find that the real world operates differently. Companies typically have dedicated GPUs, and models are trained on servers. This process is managed programmatically via the terminal, with no download buttons to click or user-friendly interfaces to navigate.
For those who have yet to encounter this environment, there’s good news — you can prepare yourself well in advance! You don’t have to wait until you secure a full-time job or internship to experience working with large datasets on remote servers. Today, anyone with internet access can do so!
Objectives
In this article, I will provide a detailed guide showing how you can:
- Source useful datasets with known benchmarks
- Download data programmatically with an API token
- Train a decent NLP model without taking up any of your local disk space
Cloud computing is not new. However, I know that many people have no idea how the above can be done, or even how to perform CRUD (Create, Read, Update, Delete) on files and folders without any mouse or touchpad.
Using company servers will require you to learn bash scripting. However, you can practice many of the basic skills using Colab.
With this article, newcomers to the field can get the kind of help and support I wished I had many years ago.
Disclaimer: I will be recommending Google Colab and Google Drive. I do not own any shares in Alphabet, and do not benefit in any way from you using Google.
[1] Source For Datasets
You can find useful datasets by visiting Papers with Code; the datasets listed there can easily be downloaded, and the site’s content is licensed under the CC BY-SA licence.
For example, suppose you want to explore developing models on medical images. By searching something as straightforward as ‘Xray’, you will see many different datasets, such as the following on Chest X-ray:
This is a fairly popular dataset and has a moderate size of 40+ GB, which is far larger than the typical toy examples. To store it, you would need at least the 100 GB plan on Google Drive (the most basic paid tier, at $1.99 per month).
For the purpose of this article, we will work with something within the free tier, so that anyone can replicate it without incurring any costs.
Let’s go with the Yahoo! Answers dataset, which has been used in a number of papers. It has a total of 1.4 million samples and 10 possible classes, making it more interesting to work with than those containing just binary labels like spam-or-not-spam.
This dataset is available on Kaggle (more on this in section 2), under the Open Data Commons Public Domain Dedication and License (PDDL) which allows even commercial use.
[2] Download And Process Data
We will perform the entire process on Google Colab, which offers GPU usage even for free-tier users. For this article, a CPU would suffice. Either way, doing everything on the cloud is beneficial, as it lets your computer stay cool and quiet.
Apart from training deep neural networks with PyTorch/TensorFlow, GPUs can also be used to accelerate training of classical approaches like Random Forests and even Logistic Regression via RAPIDS cuML.
RAPIDS, part of NVIDIA CUDA-X, is an open-source suite for executing data science pipelines on GPUs. But let’s leave that for another day; there is already more than enough content to cover in this article.
[2a] Set up Google Colab
We want to download the data into Google Drive directly. You can actually open a terminal on Colab by clicking on the icon on the bottom-left.
In your Colab notebook, first connect to Google Drive, so that you can write to and read from it.
from google.colab import drive
drive.mount('/content/drive')
Spend some time navigating your Google Drive using commands like cd to change directory and ls to list your files and folders. Simply do cd .. to return to the parent directory. There is much more that can be done, but knowing these is sufficient for now.
As many beginners prefer to work exclusively in notebooks, let’s do just that. It is effectively the same: terminal commands can be executed in notebook cells simply by prefixing them with !.
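For instance, the following cell is a small sketch of how this looks in practice (the paths are simply the standard mount point from the previous step):
# Terminal commands run in a notebook cell when prefixed with '!'
!ls /content/drive/MyDrive
# Note: '!cd some_folder' runs in a throwaway subshell and does not persist across cells;
# use the notebook magic '%cd' if you want the working directory to actually change
%cd /content/drive/MyDrive
!ls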
If you do not want to repeat the installation of all the libraries each time a new Colab notebook is opened, you can save everything to your personal Google Drive. First, create or choose a folder (e.g. Colab/library, or any name you like). Doing so via the UI is fine, though using mkdir is nicer. Next, append the folder to the Python path so that the interpreter will also search there when importing libraries.
import sys
sys.path.append('/content/drive/MyDrive/Colab/library')
When you perform pip install, remember to save the libraries there. For example, if you need openai, do the following:
!pip install -t /content/drive/MyDrive/Colab/library openai
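In a later session, a library installed this way can then be imported without reinstalling, as long as the path has been appended first. A sketch, assuming the same Colab/library folder as above:
import sys
sys.path.append('/content/drive/MyDrive/Colab/library')

import openai  # resolved from the Drive folder rather than a fresh pip install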
[2b] Download data without any buttons
If you have a direct link to the data file, wget does the job.
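For example, you could download straight into a folder on your Drive (the URL below is only a placeholder for an actual direct file link):
# -P sets the destination folder; replace the URL with a real direct link to a file
!wget -P /content/drive/MyDrive/Data https://example.com/some_dataset.zip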
For learning purposes, I will demonstrate how to programmatically download the data from Kaggle. Many datasets are available there, and in some cases (like Chest X-ray14), the original source (https://nihcc.app.box.com) points to a web page rather than a file, rendering wget unusable.
Instead of downloading via the big, enticing ‘Download’ button, let’s do so with code.
To obtain the dataset from Kaggle, you need to be able to authenticate yourself. Sign in to your Kaggle account. Under your user icon, go to Settings (alright, it is okay to use your mouse/touchpad here) and click ‘Create New Token’.
Your credentials are stored in the downloaded kaggle.json file. These should be set in os.environ. The kaggle package should already be installed on Colab, but either way you can install it with !pip install kaggle.
import os
os.environ['KAGGLE_USERNAME'] = 'your_username'
os.environ['KAGGLE_KEY'] = 'your_key'
!kaggle datasets download -d yacharki/yahoo-answers-10-categories-for-nlp-csv
With this, you will download the entire dataset as a zip file. How do we know what to type after download -d? You can infer it from the dataset’s URL: https://www.kaggle.com/datasets/yacharki/yahoo-answers-10-categories-for-nlp-csv.
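As an alternative to setting environment variables, the Kaggle CLI can also read credentials from ~/.kaggle/kaggle.json. A sketch, assuming you have uploaded kaggle.json somewhere in your Drive (the Colab folder here is just an example):
# Copy the credentials file to where the Kaggle CLI expects it, and restrict its permissions
!mkdir -p ~/.kaggle
!cp /content/drive/MyDrive/Colab/kaggle.json ~/.kaggle/kaggle.json
!chmod 600 ~/.kaggle/kaggle.json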
[2c] Unzip data
The data can be extracted using unzip [source] -d [destination], replacing the square brackets with the appropriate paths. For example,
!unzip yahoo-answers-10-categories-for-nlp-csv.zip -d /content/drive/MyDrive/Data/yahooanswers
Note that if you do not already have a Data folder, create it first using mkdir /content/drive/MyDrive/Data.
You’re on the right track when you see something like this:
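You can also confirm the extracted files programmatically by listing the destination folder used above:
!ls /content/drive/MyDrive/Data/yahooanswers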
[2d] Read data
Obtain the train and test data from train.csv and test.csv. Remember to indicate the full path correctly.
import pandas as pd
df_train = pd.read_csv('/content/drive/MyDrive/Data/yahooanswers/10_categories_of_yahoo_answers_for_nlp_tasks_csv/train.csv')
df_test = pd.read_csv('/content/drive/MyDrive/Data/yahooanswers/10_categories_of_yahoo_answers_for_nlp_tasks_csv/test.csv')
You should already know where the files reside from the output displayed after unzipping. Nonetheless, here’s a useful command for finding any file in your directories. Suppose you are looking for a train.csv somewhere within MyDrive:
!find /content/drive/MyDrive -name 'train.csv'
If you do not like to see a long file path spanning multiple folders, you can rename and move the file using mv. For housekeeping, you can also use rm to delete files, or rm -r to delete entire folders and their contents (use with caution!).
!mv /content/drive/MyDrive/Data/yahooanswers/10_categories_of_yahoo_answers_for_nlp_tasks_csv/train.csv train.csv
The data looks like the following:
The labels are represented as integers from 1 to 10 inclusive. You should never just assume this, though; verify it using:
df_train['class_index'].unique()
You can create a dictionary to map each integer to the corresponding class label using classes.txt.
with open('/content/drive/MyDrive/Data/yahooanswers/10_categories_of_yahoo_answers_for_nlp_tasks_csv/classes.txt', 'r') as file:
    content = file.read()

mapper = dict(
    x.split(' = ') for x in content.split('\n')[:-1]
)
# {'1': 'Society & Culture',
# '2': 'Science & Mathematics',
# '3': 'Health',
# '4': 'Education & Reference',
# '5': 'Computers & Internet',
# '6': 'Sports',
# '7': 'Business & Finance',
# '8': 'Entertainment & Music',
# '9': 'Family & Relationships',
# '10': 'Politics & Government'}
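As a quick usage example, you can attach the human-readable labels to the DataFrame with this mapper (the class_name column is just an arbitrary new name):
# Map the integer class index (as a string key) to its category name
df_train['class_name'] = df_train['class_index'].astype(str).map(mapper)
df_train[['class_index', 'class_name']].head()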
[3] Build and train model
We will create a model to predict the correct category given a pair of question_title and question_content. We will use a pre-trained feature extractor followed by a simple classifier trained on the extracted embeddings.
[3a] Get feature vector from pre-trained BERT
BERT takes in sentences of varying length, and we can extract a feature vector that has been trained to represent the overall sentence semantics. It is straightforward to use, with tokenization and other preprocessing handled automatically. I choose the RoBERTa (Robustly Optimized BERT Approach) variant, and will use the CLS embedding together with the mean embedding of all tokens.
from transformers import RobertaTokenizer, RobertaModel
import torch
import numpy as np

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base').to(device)

def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding=True).to(device)
    with torch.no_grad():
        outputs = model(**inputs)
    # Concatenate the CLS token embedding with the mean of all token embeddings (768 + 768 = 1536 dims)
    cls_embedding = outputs.last_hidden_state[:, 0, :].cpu().numpy()
    mean_embedding = outputs.last_hidden_state.mean(dim=1).cpu().numpy()
    return np.concatenate((cls_embedding, mean_embedding), axis=1).flatten()
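As a quick sanity check (the example sentence is arbitrary), the combined CLS-plus-mean embedding should have 768 + 768 = 1536 dimensions:
vec = get_bert_embeddings('Why is the sky blue?')
print(vec.shape)  # (1536,)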
In the interest of time (and so that you can replicate everything), I will train on less than 10% of the 1.4 million training samples. Naturally, performance will be lower, but we can at least get a conservative estimate of what to expect.
from tqdm.notebook import tqdm

df_train['x_raw'] = df_train['question_title'] + ' ' + df_train['question_content']
df_test['x_raw'] = df_test['question_title'] + ' ' + df_test['question_content']

tqdm.pandas()
# Take every 11th row (~9% of the data); use df_train_sub['class_index'].value_counts() to check the labels
df_train_sub = df_train.iloc[::11].copy()
df_train_sub['x'] = df_train_sub['x_raw'].astype(str).progress_apply(get_bert_embeddings)
df_test['x'] = df_test['x_raw'].astype(str).progress_apply(get_bert_embeddings)
For the purpose of experiments, I performed the feature extraction using both CPU and GPU (NVIDIA L4, which is less powerful than the A100).
Performing feature extraction via CPU allows us to obtain ~24 embeddings per second on average. The L4 GPU does the same thing at 4 times the speed, giving us ~96 embeddings per second.
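Since the extraction takes a while even on a GPU, you may want to cache the embeddings to Drive so that they survive a Colab session reset. A minimal sketch (the file names and folder are arbitrary):
import numpy as np

# Save the extracted embeddings; load them back later with np.load
np.save('/content/drive/MyDrive/Data/yahooanswers/x_train_sub.npy', np.stack(df_train_sub['x'].values))
np.save('/content/drive/MyDrive/Data/yahooanswers/x_test.npy', np.stack(df_test['x'].values))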
[3b] Train on extracted embeddings
Moving on, we will extract the relevant columns from the pandas DataFrame, namely x and class_index.
import numpy as np

X_train = np.stack(df_train_sub['x'].values.tolist())  # (127273, 1536)
y_train = df_train_sub['class_index']                  # (127273,)
X_test = np.stack(df_test['x'].values.tolist())        # (59999, 1536)
y_test = df_test['class_index']                        # (59999,)
Finally, a simple fully-connected neural network with a single hidden layer will be used. We will keep things as simple as possible: no dropout, learning-rate scheduler, early stopping or hyperparameter tuning.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from sklearn.metrics import classification_report

class FullyConnected(nn.Module):
    def __init__(self, input_size, output_size):
        super(FullyConnected, self).__init__()
        self.fc1 = nn.Linear(input_size, 256)
        self.relu1 = nn.ReLU()
        self.fc2 = nn.Linear(256, output_size)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu1(out)
        out = self.fc2(out)
        return out

train_loader = DataLoader(
    TensorDataset(
        torch.tensor(X_train, dtype=torch.float32),
        torch.tensor(y_train.values - 1, dtype=torch.long)  # change labels from [1, 10] to [0, 9]
    ), batch_size=64, shuffle=True
)
X_test = torch.tensor(X_test, dtype=torch.float32).to(device)

# Note: this reuses the name 'model', replacing the RoBERTa feature extractor defined earlier
model = FullyConnected(input_size=1536, output_size=10).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

for epoch in tqdm(range(10)):
    model.train()
    for i, (inputs, labels) in enumerate(train_loader):
        inputs = inputs.to(device)
        labels = labels.to(device)
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

model.eval()  # set the model to evaluation mode
with torch.no_grad():
    outputs = model(X_test)
    _, y_hat = torch.max(outputs.data, 1)

print(classification_report(y_test - 1, y_hat.cpu().numpy()))
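If you only want the headline number rather than the full per-class report, the overall accuracy can also be computed directly:
from sklearn.metrics import accuracy_score

acc = accuracy_score(y_test - 1, y_hat.cpu().numpy())
print(f'Test accuracy: {acc:.3f}')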
[3c] Results
Using the L4 GPU, the 10 training epochs can be completed in about one minute.
The performance here is notably lower than the benchmark, which is above 75% accuracy on the 10 categories. There are many things that could still be done: hyperparameter tuning, a more complex neural network, an ensemble of classical supervised learning methods, and of course training on the entire set of 1.4 million samples.
However, competitive accuracy was never the objective of this article. Instead, it is meant to show you the overall end-to-end process.
Conclusion
From this article, you have learnt:
- That Papers with Code is a good source of datasets that come with established benchmarks.
- That libraries installed in Colab can be retained (using pip install -t [library/path/] [library_name] and sys.path.append).
- To download data programmatically via an API.
- To use simple bash commands like mkdir, wget, unzip and find. Of course, there’s ls, cd and rm, which you may already be using frequently.
- To obtain pre-trained embeddings from sentences (and see that a GPU significantly speeds up the extraction process).
- To train a simple classifier on top of those embeddings in just a couple of lines.
Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.
Editor’s note: Find out more about the Singapore Management University’s Master of IT in Business (MITB) programme at https://smu.edu.sg/mitb