NLP Intro: Tokenizing your text
Texts (from famous fiction to archival data, court decisions and social media conversations) offer exciting opportunities for (cross-disciplinary) research and data analysis. Yet, they also pose a major challenge.
Verbal information is what we call unstructured data: unlike structured numerical data, it cannot easily be fitted into a pre-defined model. That does not mean it is impossible to analyse data from text sources; it just means there are some additional pre-processing steps you need to take first.
What is tokenization in NLP?
A key initial step in pre-processing unstructured data is tokenization: the technical term for the process of breaking your text down into smaller chunks called tokens. Extracting the building blocks of your unstructured data is an essential step toward introducing structure and organising your data in a way that can be modelled and analysed.
- Tokens can be full sentences or individual words, but also numbers or dates, depending on your goals and level of analysis. A tokenizer can use spaces between words or punctuation marks as indicators of where one token ends and the next begins.
Example:
I love rock music: it's fun.
['I', 'love', 'rock', 'music', ':', 'it', "'s", 'fun', '.']
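If you want to reproduce this yourself, here is a minimal sketch in Python using NLTK's word_tokenize (an assumption on my part: the nltk package is installed and its tokenizer data has been downloaded; NLTK is one popular option, not the only way to tokenize):

import nltk
from nltk.tokenize import word_tokenize

# One-off download of the tokenizer model (newer NLTK versions
# may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

text = "I love rock music: it's fun."
print(word_tokenize(text))
# ['I', 'love', 'rock', 'music', ':', 'it', "'s", 'fun', '.']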
- Common tokenization types include word tokenization, character tokenization and even sub-word tokenization.
- Typically, during tokenization, you also define stop words: commonly used words, such as the, an and in, that can be removed because they contribute little to a meaningful analysis. If there are specific words you prefer to exclude, you can create your own list or use an existing stop-word corpus for that particular language (see the sketch after this list).
- For proper text comprehension, it is important to understand not just individual tokens but also their relationship to other tokens and underlying structures. For example, the token slate has a different meaning and part-of-speech tag in the phrase blank slate (noun) than in the sentence We decided to slate the meeting for next month (verb); the sketch after this list shows how a tagger makes this distinction.
- The context provided by other tokens in the surrounding text is crucial for proper understanding and later stages of more complex text processing like sentiment analysis or topic modelling.
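To make the last two points concrete, here is a minimal Python sketch of stop-word removal and part-of-speech tagging with NLTK (again assuming nltk is installed; exact resource names can vary slightly between NLTK versions):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-off downloads of the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

tokens = word_tokenize("We decided to slate the meeting for next month")

# Stop-word removal: drop common function words like 'we', 'to', 'the', 'for'
stop_words = set(stopwords.words('english'))
print([t for t in tokens if t.lower() not in stop_words])
# expected: ['decided', 'slate', 'meeting', 'next', 'month']

# Part-of-speech tagging: in this sentence 'slate' should come out as a verb (VB)
print(nltk.pos_tag(tokens))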
Token Classification
We humans can intuitively use context to grasp complex meaning. Machines need a lot more guidance.
- Once a text is broken down into tokens, each token can be given various labels and assigned to a class based on its core characteristics: for example, which part of speech it is, or whether it is associated with positive or negative sentiment.
- Named Entity Recognizers (NERs) can identify that one or more tokens refer to a named entity, based on a combination of words or on the context surrounding a single word/token. For example, Apple with a capital A in a business context will be classified as the company Apple, and the separate words Steve and Jobs can be recognised as the name of a single entity (see the first sketch after this list).
- LLMs, or Large Language Models, such as the ones behind ChatGPT, use tokenization and vast amounts of text data to learn the statistical relationships between tokens. Simply put, they learn to recognise patterns associated with tokens based on the likelihood of encountering them in different types of text; the second sketch below shows how such a tokenizer splits words into sub-word pieces.
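As an illustration of named entity recognition, here is a minimal sketch using spaCy (an assumption: the spacy package and its small English model en_core_web_sm are installed; spaCy is one popular choice among several):

import spacy

# Load spaCy's small English pipeline
# (pip install spacy && python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Steve Jobs founded Apple in California.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected output along these lines:
#   Steve Jobs PERSON
#   Apple ORG
#   California GPE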
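And to see the sub-word tokenization behind LLMs in action, here is a short sketch using OpenAI's tiktoken library (assuming tiktoken is installed; cl100k_base is the encoding used by ChatGPT-era models, and the exact sub-word splits shown are illustrative):

import tiktoken

# Load the byte-pair-encoding tokenizer used by ChatGPT-era models
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("Tokenization is fun!")
print(ids)                             # a list of integer token ids
print([enc.decode([i]) for i in ids])  # the sub-word pieces,
                                       # e.g. ['Token', 'ization', ' is', ' fun', '!']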
Tokenizing in R & Python
Depending on your weapon of choice for NLP analysis, here are some useful resources: