MSc Data Science in the UK — Applied NLP — Week 4

Matt Chang
6 min read · Jan 12, 2024

--

Hi everyone, welcome to my week 4 Master’s in the UK session.

For the past week, we’ve looked at some background knowledge and common challenges in modern NLP applications. If you haven’t read it yet, refer to the link.

This week, we will look into some techniques that will help us piece an NLP project together.

Hope you guys enjoy and let’s dive in.

Photo by Alvaro Reyes on Unsplash

Data Preparation:

At the beginning of an NLP project, we need to decide where our data comes from. This could be public datasets, web scraping, APIs, internal databases, or even corpora bundled with Python libraries like NLTK.

Data cleaning and preprocessing:

Text Normalization:

This involves converting text into a more uniform format. Steps include lowercasing, removing punctuation, correcting misspellings, and noise removal (e.g., stripping HTML tags from web-scraped data).

Also, handling contractions like “don’t” to “do not” is important for text analysis since it can help reduce variability and enhance model interpretation.
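As a rough illustration, here is a minimal normalization sketch using only the Python standard library. The contraction map is a toy example; a real project would use a fuller list or a dedicated package.

```python
import re

# Toy contraction map -- extend it (or use a dedicated package) for real corpora.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def normalize(text: str) -> str:
    text = text.lower()                              # lowercasing
    text = re.sub(r"<[^>]+>", " ", text)             # strip HTML tags from scraped pages
    for contraction, expanded in CONTRACTIONS.items():
        text = text.replace(contraction, expanded)   # expand contractions
    text = re.sub(r"[^\w\s]", " ", text)             # remove punctuation
    return re.sub(r"\s+", " ", text).strip()         # collapse whitespace

print(normalize("<p>It's raining, but we don't mind!</p>"))
# -> "it is raining but we do not mind"
```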

Tokenization:

  • Word Tokenization: Breaking text down into individual elements (tokens), typically words or phrases. Split text into words using nltk, spaCy, or custom regular expressions.
  • Sentence Tokenization: Dividing the text into sentences, which is important for preserving context in tasks like sentiment analysis.
  • Stop Word Removal: Removing common words that add little meaning to the text (e.g., “the”, “is”).
  • Stemming: Reduces words to their base form, though the result might not be an actual word (e.g., ‘running’ to ‘run’, ‘studies’ to ‘studi’).
  • Lemmatization: More sophisticated than stemming, it reduces words to their dictionary form, considering the context (e.g., ‘better’ to ‘good’). A short nltk sketch of these steps follows below.
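A minimal nltk sketch of these steps (the exact resources to download can vary with your nltk version):

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# One-time downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

text = "The runners were running better races. NLP is fun!"

sentences = sent_tokenize(text)                # sentence tokenization
tokens = word_tokenize(text.lower())           # word tokenization

stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]  # stop word removal

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(t) for t in filtered])                   # e.g. 'running' -> 'run'
print([lemmatizer.lemmatize(t, pos="v") for t in filtered])  # verb lemmas, e.g. 'running' -> 'run'
```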

Advanced Preprocessing Techniques

  • Part-of-Speech (POS) Tagging: Using nltk or spaCy to tag words with their grammatical roles. This is useful for disambiguating words based on their usage in sentences.
  • Named Entity Recognition (NER): Employ spaCy or specialized models to identify and classify named entities (people, organizations, locations) in the text. NER is crucial for extracting specific information from large corpora and for tasks like information extraction and content classification.
  • Syntactic Parsing: Analyzing sentence structure, for example with dependency parsing in spaCy or nltk. Parsing reveals the grammatical structure of a sentence: which components (nouns, verbs, adjectives, and so on) are present and how they relate to each other.
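All three techniques above are available in spaCy. A minimal sketch, assuming the small English model has been installed with `python -m spacy download en_core_web_sm`:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is opening a new office in London next year.")

# POS tags and dependency relations for each token
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# Named entities
for ent in doc.ents:
    print(ent.text, ent.label_)   # e.g. Apple ORG, London GPE, next year DATE
```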

Data Transformation:

Vectorization

  • Bag-of-Words (BoW): Represent text as a frequency count of words using libraries like scikit-learn.
  • TF-IDF (Term Frequency-Inverse Document Frequency): Use scikit-learn to weight terms based on their frequency and inverse document frequency, highlighting the importance of certain words within a corpus.

Vectorization methods do not capture the meaning or semantic relationships between words and often have high dimensionality.
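A short scikit-learn sketch of both representations (the feature-name accessor assumes a recent scikit-learn version):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Bag-of-Words: raw term counts
bow = CountVectorizer()
X_bow = bow.fit_transform(corpus)
print(bow.get_feature_names_out())
print(X_bow.toarray())

# TF-IDF: counts reweighted by inverse document frequency
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(corpus)
print(X_tfidf.toarray().round(2))
```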

Word Embeddings

  • Word2Vec: Use gensim for context-based dense vector representations.
  • GloVe (Global Vectors for Word Representation): Utilize pre-trained GloVe vectors for word embeddings.
  • BERT (Bidirectional Encoder Representations from Transformers): Leverage transformers library for state-of-the-art contextual embeddings.

Word embeddings capture semantic relationships and meanings. They usually provide dense, lower-dimensional representations, while vectorization methods, especially BoW, often result in high-dimensional sparse matrices.
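A minimal Word2Vec sketch with gensim (assuming gensim 4.x, where the parameter is `vector_size` rather than the older `size`; a toy corpus like this is far too small to learn useful vectors):

```python
from gensim.models import Word2Vec

# Each sentence is a list of tokens; real training needs a much larger corpus.
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "log"],
    ["cats", "and", "dogs", "are", "pets"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # sg=1: skip-gram
vector = model.wv["cat"]                     # 50-dimensional dense vector
print(model.wv.most_similar("cat", topn=3))  # nearest neighbours in embedding space
```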

Additional Transformation Techniques

  • Positive Pointwise Mutual Information (PPMI): A method for computing word associations based on their co-occurrence probabilities. PPMI can be used to create word vectors that capture semantic relationships more effectively than simple frequency counts. It’s commonly used in semantic similarity tasks and for building co-occurrence matrices.
  • n-grams: Create n-grams (pairs or triplets of consecutive words) to capture context beyond individual words. Useful in BoW or TF-IDF vectorization to incorporate local word order.
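PPMI is straightforward to compute once you have a word-word co-occurrence matrix. A small numpy sketch, using the usual definition PPMI(w, c) = max(0, log2 P(w, c) / (P(w) P(c))) and a toy count matrix:

```python
import numpy as np

# Toy co-occurrence counts: rows are target words, columns are context words.
cooc = np.array([
    [0, 4, 1],
    [4, 0, 2],
    [1, 2, 0],
], dtype=float)

p_wc = cooc / cooc.sum()                  # joint probabilities P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)     # marginals P(w)
p_c = p_wc.sum(axis=0, keepdims=True)     # marginals P(c)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))     # pointwise mutual information
ppmi = np.nan_to_num(np.maximum(pmi, 0))  # clip negatives, zero out undefined cells
print(ppmi.round(2))
```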

Data Augmentation (Optional)

The primary goal of data augmentation is to increase the size and diversity of the dataset. It creates modified versions of existing data or generates new data based on the characteristics of the existing data.

In NLP, this could mean creating variants of text data through methods like synonym replacement, back translation, or using advanced models for text generation.

This process is particularly beneficial in scenarios where the amount of available data is limited or when the model needs to be robust against various forms of input. It helps in reducing overfitting and improving the generalization of models.

  • Synonym Replacement: Use WordNet (through nltk) to replace words with synonyms.
  • Back Translation: Translate text to another language and back to augment the dataset.
  • Text Generation: Use models like GPT (from transformers library) to generate synthetic text data.
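A minimal synonym-replacement sketch with WordNet through nltk (the helper below is just an illustration; production augmentation pipelines usually add POS filtering so the chosen synonym fits the context):

```python
import random
import nltk
from nltk.corpus import wordnet

# nltk.download("wordnet")  # one-time download

def synonym_replace(tokens, n=1):
    """Randomly replace up to n tokens with a WordNet synonym."""
    new_tokens = tokens.copy()
    candidates = [i for i, t in enumerate(tokens) if wordnet.synsets(t)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        synonyms = {lemma.name().replace("_", " ")
                    for syn in wordnet.synsets(tokens[i])
                    for lemma in syn.lemmas()}
        synonyms.discard(tokens[i])
        if synonyms:
            new_tokens[i] = random.choice(sorted(synonyms))
    return new_tokens

print(synonym_replace(["the", "movie", "was", "great"], n=2))
```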

Feature Engineering

Feature engineering is focused on extracting and selecting the most relevant information from the raw data. It involves transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy and performance.
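As a toy illustration, a few hand-crafted surface features that could sit alongside vectorized text; the specific features here are only examples and would be tailored to the task:

```python
import numpy as np

def handcrafted_features(text: str) -> np.ndarray:
    tokens = text.split()
    return np.array([
        len(tokens),                                         # document length in words
        np.mean([len(t) for t in tokens]) if tokens else 0,  # average word length
        sum(t.isupper() for t in tokens),                    # all-caps words
        text.count("!"),                                     # exclamation marks
    ])

print(handcrafted_features("GREAT movie, loved it !!!"))  # [5, 4.2, 1, 3]
```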

Data Splitting

  • Train-Test Split: Segregating data into training and testing sets, and possibly a validation set, to ensure that the model can be trained and evaluated effectively.
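A typical split with scikit-learn, holding out a test set first and then carving a validation set out of the remaining data (the 60/20/20 ratio is just an example):

```python
from sklearn.model_selection import train_test_split

texts = ["great film", "terrible plot", "loved it", "waste of time"] * 25
labels = [1, 0, 1, 0] * 25

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```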

Data Exploration (Exploratory Data Analysis)

After splitting the data into training, validation, and testing sets, it’s essential to ensure that the split is representative of the entire dataset. Data exploration at this stage can help verify that the distribution of key features (like word frequencies, sentence lengths, etc.) is consistent across all sets.

  • Statistical Analysis: Understanding distributions, frequency of words, sentence lengths, etc.
  • Visualization: Using tools like word clouds, frequency histograms, etc., to get a visual sense of the data characteristics.
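A quick sketch of both ideas on the training split, using Counter for word frequencies and matplotlib for a sentence-length histogram:

```python
from collections import Counter
import matplotlib.pyplot as plt

train_texts = ["the cat sat on the mat", "dogs are great pets", "the mat was red"]

word_counts = Counter(word for text in train_texts for word in text.split())
lengths = [len(text.split()) for text in train_texts]

print(word_counts.most_common(5))   # most frequent words

plt.hist(lengths, bins=10)          # distribution of sentence lengths
plt.xlabel("sentence length (words)")
plt.ylabel("count")
plt.show()
```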

Machine Learning and Deep Learning Techniques for Model Training

After data splitting, you’d proceed with training your NLP model. Depending on the nature of your NLP task (like text classification, sentiment analysis, or language translation), different machine learning and deep learning techniques can be employed:

Machine Learning Techniques

  1. Naive Bayes Classifier: Often used for text classification tasks due to its simplicity and efficiency.
  2. Support Vector Machines (SVM): Effective for high-dimensional data, which is common in NLP.
  3. Random Forest: A robust ensemble classifier that works well for various NLP tasks.
  4. Gradient Boosting: Gradient Boosting works by sequentially adding weak learners (usually decision trees) to the model, where each new learner corrects the errors of the previous ones.
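A common baseline is to combine TF-IDF features with one of these classifiers in a scikit-learn pipeline. The toy data below is only there to make the snippet runnable, and the Naive Bayes step can be swapped for an SVM or tree-based model:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["loved this film", "awful acting", "brilliant story", "boring and slow"]
labels = [1, 0, 1, 0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", MultinomialNB()),   # swap for sklearn.svm.LinearSVC, RandomForestClassifier, ...
])
clf.fit(texts, labels)
print(clf.predict(["what a brilliant film", "slow and boring"]))
```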

Deep Learning Techniques

  1. Recurrent Neural Networks (RNN): Suitable for tasks involving sequential data, like language modeling and text generation.
  2. Long Short-Term Memory (LSTM): A type of RNN that is particularly effective at capturing long-term dependencies in text data, useful for tasks like sentiment analysis.
  3. Convolutional Neural Networks (CNN): Though primarily known for image processing, CNNs can also be effective for NLP, especially in text classification.
  4. Transformers: Models like BERT, GPT, and T5 have revolutionized NLP with their ability to handle a wide range of tasks with state-of-the-art performance.
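For transformers, the quickest way to get started is the transformers pipeline API, which downloads a pre-trained model on first use (the default sentiment model can change between library versions):

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier(["I loved this module!", "The deadline stress is unbearable."]))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}, {'label': 'NEGATIVE', 'score': 0.99...}]
```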

Training Process

  1. Feature Selection: Based on the exploration, select the most relevant features for your model.
  2. Model Selection: Choose a model based on the task, data size, and complexity.
  3. Hyperparameter Tuning: Adjust the model’s hyperparameters to find the best performance. (Grid Search: exhaustively evaluate a manually specified subset of the hyperparameter space of the chosen model.)
  4. Cross-Validation: Use techniques like k-fold cross-validation on the training set to validate the model.
  5. Performance Evaluation: Assess the model on the validation set and adjust your approach as needed.
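Steps 3 and 4 are often combined with scikit-learn’s GridSearchCV, which runs k-fold cross-validation for every combination in the grid. The pipeline and grid below are a minimal sketch, not a recommended search space:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["loved this film", "awful acting", "brilliant story", "boring and slow"] * 10
labels = [1, 0, 1, 0] * 10

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("model", MultinomialNB())])

param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "model__alpha": [0.1, 1.0],               # Naive Bayes smoothing
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)
print(search.best_params_, search.best_score_)
```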

Data Quality Assessment

  • Consistency and Reliability Checks: Ensuring the data accurately represents the problem domain and is reliable.
  • Bias Assessment: Evaluating and mitigating biases in the dataset to prevent them from propagating through the model.

Feel free to drop me a question or comment below.

Cheers, happy learning. I will see you in chapter 5.

The data journey is not a sprint but a marathon.

Medium: MattYuChang

LinkedIn: matt-chang

Facebook: Taichung English Meetup

(I created this group four years ago for people who want to hone their English skills. Events are held regularly by our awesome hosts every week. Follow the FB group link for more information!)
