TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial…


The Ultimate Guide to Training BERT from Scratch: The Tokenizer

Dimitris Poulopoulos
Published in TDS Archive
13 min read · Sep 6, 2023


Photo by Glen Carrie on Unsplash

Part I, Part III, and Part IV of this story are now live.

Did you know that the way you tokenize text can make or break your language model? Have you ever wanted to tokenize documents in a rare language or a specialized domain? Splitting text into tokens is not a chore; it is a gateway to transforming language into actionable intelligence. This story will teach you everything you need to know about tokenization, not only for BERT but for any LLM out there.

In my last story, we talked about BERT, explored its theoretical foundations and training mechanisms, and discussed how to fine-tune it and create a question-answering system. Now, as we go further into the intricacies of this groundbreaking model, it’s time to spotlight one of the unsung heroes: tokenization.



Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Written by Dimitris Poulopoulos

Machine Learning Engineer. I talk about AI, MLOps, and Python programming. More about me: www.dimpo.me
