Hands-On NLP Tutorial: BERT Sentiment Analysis (Part 2)

13 min read · Oct 7, 2023

This part picks up from the data processing in the previous part and moves on to something NLP can't do without: transformers.

First, install transformers.

!pip install transformers  # install the transformers package; plain pip works fine 👍

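Remember how we said computers can't read text? They only understand numbers, so we need to convert the text into something the machine can work with. That's the job of a tokenizer.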
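
To make "text in, numbers out" concrete, here is a tiny pure-Python sketch. It is only a toy: the vocabulary, `toy_tokenize`, and `toy_convert_tokens_to_ids` are made up for illustration and are much simpler than what a real tokenizer does.

```python
import re

# Hypothetical toy vocabulary; a real tokenizer ships with tens of thousands of entries.
TOY_VOCAB = {"[UNK]": 0, "how": 1, "are": 2, "you": 3, "?": 4}

def toy_tokenize(text):
    """Lowercase the text, then split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

def toy_convert_tokens_to_ids(tokens, vocab=TOY_VOCAB):
    """Look each token up in the vocab; unknown tokens map to [UNK]."""
    return [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]

tokens = toy_tokenize("How are you, friend?")
ids = toy_convert_tokens_to_ids(tokens)
print(tokens)  # ['how', 'are', 'you', ',', 'friend', '?']
print(ids)     # [1, 2, 3, 0, 0, 4]
```

Words that are missing from the vocab all collapse into the same `[UNK]` id, which is one reason real tokenizers split rare words into smaller known pieces instead.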

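On Hugging Face, you'll notice that many models ship with their own tokenizer class, such as BertTokenizer, RobertaTokenizer, and so on. You might ask: how am I supposed to know what a given model's tokenizer is called? (Sure, you can always look it up yourself.)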

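What I recommend instead is the Auto Classes family: as long as you know the model's config name (its checkpoint name on the Hugging Face Hub), you can load that model's tokenizer, provided the author uploaded everything it needs.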
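
One way to picture what an Auto class does for you is a lookup from the checkpoint's config to the right tokenizer class. The sketch below only illustrates that dispatch pattern; the classes, `TOY_CONFIGS`, `TOY_TOKENIZER_REGISTRY`, and `toy_auto_tokenizer` are all hypothetical, and this is not a claim about how transformers implements it internally.

```python
class ToyBertTokenizer:
    """Hypothetical stand-in for a BERT-specific tokenizer class."""

class ToyRobertaTokenizer:
    """Hypothetical stand-in for a RoBERTa-specific tokenizer class."""

# Hypothetical stand-in for the config stored alongside each checkpoint.
TOY_CONFIGS = {
    "bert-base-uncased": {"model_type": "bert"},
    "roberta-base": {"model_type": "roberta"},
}

# Hypothetical registry: model type -> tokenizer class.
TOY_TOKENIZER_REGISTRY = {
    "bert": ToyBertTokenizer,
    "roberta": ToyRobertaTokenizer,
}

def toy_auto_tokenizer(config_name):
    """Pick the tokenizer class from the checkpoint's config and instantiate it."""
    if config_name not in TOY_CONFIGS:
        raise ValueError(f"Unknown checkpoint: {config_name}")
    model_type = TOY_CONFIGS[config_name]["model_type"]
    return TOY_TOKENIZER_REGISTRY[model_type]()

tok = toy_auto_tokenizer("bert-base-uncased")
print(type(tok).__name__)  # ToyBertTokenizer
```

The point is simply that the caller only needs the name; the mapping to a concrete class happens behind the scenes.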

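from transformers import AutoTokenizer
import transformers

# If a long stream of warnings keeps popping up, this line silences them
# transformers.logging.set_verbosity_error()  # suppress warning messages

# We use BERT here ("base" is the smaller model with fewer layers;
# "uncased" means the text is lowercased, so upper and lower case are not distinguished)
config_name = 'bert-base-uncased'
# .from_pretrained() loads the existing pretrained tokenizer for that checkpoint
tokenizer = AutoTokenizer.from_pretrained(config_name)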

The tokenizer you get back from AutoTokenizer comes with a whole bunch of functions, and some more are tucked away in the base class it inherits from:

tokenize
encode
decode
convert_ids_to_tokens
convert_tokens_to_ids
convert_tokens_to_string
encode_plus
……

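Taking "How's everything going?" as an example, here is the basic usage of several of these functions. tokenize, encode, convert_ids_to_tokens, and convert_tokens_to_ids return a list, while decode and convert_tokens_to_string return a string. For the full list of parameters, please check the official documentation.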

sample_s = "How's everything going?"

# tokenize: splits the text into tokens, typically on whitespace/punctuation
# (you can also split the text yourself and convert afterwards; that needs different parameter settings)
token = tokenizer.tokenize(sample_s)
print(token)
'''[Output]
['how', "'", 's', 'everything', 'going', '?']
'''

# encode: turns text into numbers (by looking tokens up in the tokenizer's vocab)
# With default parameters it adds [CLS] and [SEP] for you (in BERT's case)
ids = tokenizer.encode(sample_s)
print(ids)
'''[Output]
[101, 2129, 1005, 1055, 2673, 2183, 1029, 102]
'''

# decode: turns the numbers back into text
print(tokenizer.decode(ids))
'''[Output]
[CLS] how's everything going? [SEP]
'''

# Plain vocab lookup: ids -> tokens
print(tokenizer.convert_ids_to_tokens(ids))
'''[Output]
['[CLS]', 'how', "'", 's', 'everything', 'going', '?', '[SEP]']
'''

# Plain vocab lookup: tokens -> ids
print(tokenizer.convert_tokens_to_ids(token))
'''[Output]
[2129, 1005, 1055, 2673, 2183, 1029]
'''

# Join the tokens in the list back into one string
# (essentially a space join; note the output has no space before '?')
print(tokenizer.convert_tokens_to_string(token))
'''[Output]
how ' s everything going?
'''
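
To see how these calls relate to each other, here is a small pure-Python sketch that rebuilds the encode result from the tokenize result. The vocabulary entries are copied from the outputs above; `toy_encode` and `toy_convert_ids_to_tokens` are hypothetical helpers, not the library's code.

```python
# Vocabulary entries copied from the outputs above; everything else is omitted.
VOCAB = {"[CLS]": 101, "[SEP]": 102, "how": 2129, "'": 1005, "s": 1055,
         "everything": 2673, "going": 2183, "?": 1029}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def toy_encode(tokens):
    """Mimic encode's default behavior for BERT: wrap the ids in [CLS] ... [SEP]."""
    return [VOCAB["[CLS]"]] + [VOCAB[t] for t in tokens] + [VOCAB["[SEP]"]]

def toy_convert_ids_to_tokens(ids):
    """Plain reverse lookup, special tokens included."""
    return [ID_TO_TOKEN[i] for i in ids]

tokens = ["how", "'", "s", "everything", "going", "?"]
ids = toy_encode(tokens)
print(ids)  # [101, 2129, 1005, 1055, 2673, 2183, 1029, 102]
print(toy_convert_ids_to_tokens(ids))
```

In other words, encode is roughly tokenize, then a vocab lookup, then the two special tokens on either end, which is why convert_tokens_to_ids on its own returns two ids fewer.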

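One more function worth covering is tokenizer.encode_plus(), because it produces the three inputs BERT needs: input_ids, token_type_ids, and attention_mask. PyTorch models usually expect torch.tensor inputs, so if you don't plan to convert the types later, consider passing return_tensors = 'pt' at encoding time. Note that this function returns a dict (a dict-like BatchEncoding, to be exact).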
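
For a single sentence, token_type_ids are all 0, so it is easy to miss what they are for. They mark which segment a token belongs to when BERT is given a sentence pair. Here is a pure-Python sketch of that layout; `toy_pair_inputs` is a hypothetical helper that works on already-split tokens, not a library function.

```python
def toy_pair_inputs(tokens_a, tokens_b=None):
    """Build BERT-style tokens, token_type_ids and attention_mask for one or two segments."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    token_type_ids = [0] * len(tokens)               # first segment -> 0
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        token_type_ids += [1] * (len(tokens_b) + 1)  # second segment and its [SEP] -> 1
    attention_mask = [1] * len(tokens)               # no padding yet, so attend to everything
    return tokens, token_type_ids, attention_mask

tokens, type_ids, mask = toy_pair_inputs(["how", "are", "you", "?"], ["fine", "!"])
print(tokens)
print(type_ids)  # [0, 0, 0, 0, 0, 0, 1, 1, 1]
print(mask)      # [1, 1, 1, 1, 1, 1, 1, 1, 1]
```

Sentiment analysis feeds one sentence at a time, so in this tutorial token_type_ids stay at 0 throughout.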

During training, every input in a batch has to have the same length, so this is usually where we pad and truncate.

# Default parameters
sample_s = "How's everything going?"
es = tokenizer.encode_plus(sample_s)
print(es)
'''[Output]
{
 'input_ids': [101, 2129, 1005, 1055, 2673, 2183, 1029, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]
}
'''

# Fixed sequence length
## truncation (the input is longer than max_length)
sample_s = "How's everything going?"
es = tokenizer.encode_plus(
    sample_s,
    max_length = 7,
    truncation = True,
    padding = 'max_length'
)
print(es)
'''[Output]
{
 'input_ids': [101, 2129, 1005, 1055, 2673, 2183, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1]
}
'''

## padding (the input is shorter than max_length)
sample_os = "How are you?"
es_other = tokenizer.encode_plus(  # renamed from `os` so it does not shadow the standard-library module name
    sample_os,
    max_length = 7,
    truncation = True,
    padding = 'max_length'
)
print(es_other)
'''[Output]
{
 'input_ids': [101, 2129, 2024, 2017, 1029, 102, 0], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 0]
}
'''

# Return tensors (note the extra leading dimension: the shape is [1, max_length])
es = tokenizer.encode_plus(
    sample_s,
    max_length = 7,
    truncation = True,
    padding = 'max_length',
    return_tensors = 'pt'
)
print(es)
'''[Output]
{
 'input_ids': tensor([[ 101, 2129, 1005, 1055, 2673, 2183,  102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])
}
'''
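
If you want to see exactly what max_length, truncation, and padding = 'max_length' did in the outputs above, the logic can be sketched in a few lines of pure Python. `toy_pad_or_truncate` is a hypothetical helper that assumes the input already has the form [CLS] ... [SEP]; the real tokenizer supports more strategies than this.

```python
PAD_ID, SEP_ID = 0, 102  # ids taken from the outputs above

def toy_pad_or_truncate(input_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    if len(input_ids) > max_length:
        # Drop tokens from the end of the text but keep the closing [SEP].
        input_ids = input_ids[:max_length - 1] + [SEP_ID]
    attention_mask = [1] * len(input_ids)
    n_pad = max_length - len(input_ids)
    # Padding positions get the pad id and a 0 in the mask so the model ignores them.
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

long_ids = [101, 2129, 1005, 1055, 2673, 2183, 1029, 102]   # "How's everything going?"
short_ids = [101, 2129, 2024, 2017, 1029, 102]              # "How are you?"
print(toy_pad_or_truncate(long_ids, 7))
print(toy_pad_or_truncate(short_ids, 7))
```

Both calls reproduce the input_ids and attention_mask shown above: the long sentence loses its "?" (id 1029) but keeps [SEP], and the short one gains a single 0 with a matching 0 in the mask.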

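That covers the basics of using a tokenizer. Feel free to explore the rest on your own.

Now for the main event: how do we package the data? We can't feed the model one example at a time; that would take far too long. So let's learn how to bundle the data with PyTorch's Dataset and DataLoader.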

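First, our class inherits from the Dataset class. There are three functions to override: __init__(), __len__(), and __getitem__().

If you've written a class before, you know what __init__ is for: it defines which arguments are passed in when the object is created and which variables the object holds. (Remember that to use a variable in the other methods, you have to assign it to a name prefixed with self.)

__len__ is simply the total size of the dataset, i.e. the number of examples you're packaging. __getitem__ is the key to packaging data with Dataset + DataLoader!

It defines how a single example is returned, in a type that matches the model's input. For instance, if our input is raw text, we can do the tokenization here and return tensors. Keep an eye on the tensor dimensions!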
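
Stripped of everything PyTorch-specific, the contract is just Python's `__len__` and `__getitem__` protocol. Here is a minimal sketch with no torch dependency; `ToyDataset` and its character-count "feature" are made up purely for illustration.

```python
class ToyDataset:
    """Minimal object following the same three-method contract, without torch."""

    def __init__(self, texts, labels):
        # Store whatever the other methods need on self.
        self.texts = texts
        self.labels = labels

    def __len__(self):
        # Total number of examples.
        return len(self.texts)

    def __getitem__(self, index):
        # Return ONE example, already converted to what the model should receive.
        # Here a hypothetical "feature" is just the character count of the text.
        return len(self.texts[index]), self.labels[index]

ds = ToyDataset(["good movie", "bad"], [1, 0])
print(len(ds))  # 2
print(ds[0])    # (10, 1)
```

Because indexing with `ds[i]` and `len(ds)` are all that is needed to walk through the data, anything that later batches the examples never has to know how each one was produced.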

import torch
import torch.nn.functional as Fun
from torch.utils.data import Dataset, DataLoader
from transformers import AutoTokenizer

# Use a Dataset to build the DataLoader
class CustomDataset(Dataset):
    def __init__(self, mode, df, specify, args):
        assert mode in ["train", "val", "test"]  # data is usually split three ways
        self.mode = mode
        self.df = df
        self.specify = specify  # name of the column that holds the text used for prediction
        if self.mode != 'test':
            self.label = df['label']
        self.tokenizer = AutoTokenizer.from_pretrained(args["config"])
        self.max_len = args["max_len"]
        self.num_class = args["num_class"]
        
    def __len__(self):
        return len(self.df)

    # convert a label to a one-hot vector (used when num_class > 2)
    def one_hot_label(self, label):
        return Fun.one_hot(torch.tensor(label), num_classes = self.num_class)
    
    # convert text to token ids
    def tokenize(self, input_text):
        inputs = self.tokenizer.encode_plus(
            input_text,
            max_length = self.max_len,
            truncation = True,
            padding = 'max_length'
        )
        ids = inputs['input_ids']  # list of length max_len
        mask = inputs['attention_mask']  # list of length max_len
        token_type_ids = inputs["token_type_ids"]  # list of length max_len
        
        return ids, mask, token_type_ids

    # get a single example
    def __getitem__(self, index):
        # label-based lookup: assumes the DataFrame index was reset to 0..n-1
        sentence = str(self.df[self.specify][index])
        ids, mask, token_type_ids = self.tokenize(sentence)

        # Everything is returned as tensors of shape torch.Size([self.max_len])
        ids = torch.tensor(ids, dtype=torch.long)
        mask = torch.tensor(mask, dtype=torch.long)
        token_type_ids = torch.tensor(token_type_ids, dtype=torch.long)

        if self.mode == "test":
            return ids, mask, token_type_ids

        # Returns input_ids, attention_mask, token_type_ids, labels
        if self.num_class > 2:
            label = self.one_hot_label(self.label[index])
        else:
            label = torch.tensor(self.label[index], dtype=torch.long)
        return ids, mask, token_type_ids, label
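
In case the one-hot step in the class above is unfamiliar, here is what it boils down to, written as a pure-Python sketch. `toy_one_hot` is a hypothetical stand-in that returns a plain list, whereas `Fun.one_hot` returns a tensor.

```python
def toy_one_hot(label, num_classes):
    """Return a list with a 1 at position `label` and 0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError(f"label {label} is outside [0, {num_classes})")
    return [1 if i == label else 0 for i in range(num_classes)]

print(toy_one_hot(2, 4))  # [0, 0, 1, 0]
print(toy_one_hot(0, 3))  # [1, 0, 0]
```

One thing to keep in mind with the class above: when num_class > 2 the label has num_class entries, while in the binary case it stays a single number, so the label's shape depends on which branch is taken.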

Once your own Dataset is defined, hand it to a DataLoader, set the batch_size, and it will take care of the batching for you!

import transformers
import pandas as pd

# `parameters` is the settings dict; it must be defined before this cell with the keys
# 'seed', 'batch_size', 'config', 'max_len', and 'num_class'

# load training data
# Sampling part of the data first helps you iterate on the model architecture quickly; more data means longer runs
train_df = pd.read_csv('./train.tsv', sep = '\t').sample(4000, random_state=parameters['seed']).reset_index(drop=True)
train_dataset = CustomDataset('train', train_df, 'text', parameters)
train_loader = DataLoader(train_dataset, batch_size=parameters['batch_size'], shuffle=True)

# load validation data
val_df = pd.read_csv('./val.tsv', sep = '\t').sample(500, random_state=parameters['seed']).reset_index(drop=True)
val_dataset = CustomDataset('val', val_df, 'text', parameters)
val_loader = DataLoader(val_dataset, batch_size=parameters['batch_size'], shuffle=True)
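
Conceptually, batching means picking a group of indices, fetching each item, and grouping the results field by field. The pure-Python sketch below is a simplification of that idea using nested lists; `toy_batches` and `toy_data` are hypothetical, and the real DataLoader does much more (for example stacking tensors and loading in parallel).

```python
import random

def toy_batches(dataset, batch_size, shuffle=False, seed=0):
    """Yield batches; each batch groups the items field by field."""
    indices = list(range(len(dataset)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        items = [dataset[i] for i in indices[start:start + batch_size]]
        # items: [(ids, label), ...] -> ([ids, ids, ...], [label, label, ...])
        yield tuple(list(field) for field in zip(*items))

# Hypothetical dataset: 5 examples, each (input_ids of length 3, label).
toy_data = [([101, i, 102], i % 2) for i in range(5)]

for ids_batch, label_batch in toy_batches(toy_data, batch_size=2):
    # number of examples in the batch, length of each example, labels
    print(len(ids_batch), len(ids_batch[0]), label_batch)
```

Each item of length max_len turns into a batch of shape [batch_size, max_len], and the last batch can be smaller when the dataset size is not a multiple of batch_size (DataLoader has a drop_last option for that case).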

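That wraps up the data processing. Let's take a break; the next part continues with the model!

If you enjoy this series, don't forget to leave some claps 👏, and follow the author to catch the rest of the series!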


Written by HsiuChun