Natural Language Processing

Text augmentation techniques for phishing email detection

Photo by James Wheeler on Unsplash

Information security is very important for any organization. Lost money is a minor problem; the serious one is that the enterprise system itself may be compromised. However, fraud and phishing emails make up only a small portion of the data compared to normal emails. Augmenting fraud and phishing emails is one way to tackle this imbalance.

Example of CEO fraud email (Regina et al., 2020)

Therefore, Regina et al. proposed three different approaches to generating synthetic data for model training. As synthetic data is a kind of “fake” data, low-quality examples may hurt model performance, so validation is needed to keep the synthetic data quality high. The approaches also rest on some assumptions:

  • Synthetic data should share the same label as the original text. For example, synthetic data should not flip the label from positive to negative (for a binary classifier). …
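The label-preservation assumption above can be enforced with a validation step: generate a candidate, then discard it if a classifier predicts a different label than the original. Below is a minimal toy sketch (the synonym table and classifier interface are made-up illustrations, not the authors' actual method):

```python
import random

# Toy synonym table for illustration only; a real augmenter would use
# embeddings or a lexical resource.
SYNONYMS = {"urgent": ["pressing", "immediate"], "transfer": ["send", "wire"]}

def synonym_augment(text, seed=0):
    """Replace known words with a random synonym to create synthetic text."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in text.split()
    )

def keep_if_label_preserved(text, label, classifier):
    """Validation step: drop synthetic text whose predicted label drifts."""
    synthetic = synonym_augment(text)
    return synthetic if classifier(synthetic) == label else None
```

The validation callback can be any trained classifier; the key idea is that synthetic examples are only kept when the predicted label matches the source label.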


Natural Language Processing

What is the difference between ELECTRA and BERT?

Photo by Edward Ma on Unsplash

BERT (Devlin et al., 2018) has recently been the baseline for NLP tasks. Many new models based on the BERT architecture have been released, such as RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019). Clark et al. released ELECTRA (Clark et al., 2020), which targets reducing computation time and resources while maintaining high-quality performance. The trick is introducing a generator for Masked Language Model (MLM) prediction and forwarding the generator's output to a discriminator.

MLM is one of the training objectives in BERT (Devlin et al., 2018). However, it has been criticized for the misalignment between the training phase and the fine-tuning phase. In short, MLM masks tokens with [MASK] and the model predicts the real word in order to learn word representations. ELECTRA (Clark et al., 2020), on the other hand, contains two models: a generator and a discriminator. Masked tokens are sent to the generator, which generates alternative inputs for the discriminator (i.e. the ELECTRA model). After the training phase, the generator is thrown away and only the discriminator is kept for fine-tuning and inference. …
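The generator-then-discriminator data flow can be illustrated with a toy sketch of replaced-token detection: mask some positions, let a stand-in "generator" fill them, and label each token as original or replaced. The random-sampler generator and tiny vocabulary below are illustrative stand-ins, not ELECTRA's actual trained generator:

```python
import random

def make_rtd_example(tokens, mask_prob=0.3, seed=0):
    """Build a replaced-token-detection training pair: the corrupted
    sequence the discriminator sees, and per-token labels
    (1 = replaced by the generator, 0 = original)."""
    rng = random.Random(seed)
    vocab = ["the", "cat", "dog", "sat", "ran", "mat"]
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            fill = rng.choice(vocab)          # toy generator's guess for [MASK]
            corrupted.append(fill)
            labels.append(int(fill != tok))   # generator may guess correctly
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels
```

Note that every input position gets a label, which is part of why ELECTRA is more sample-efficient than MLM, where only the ~15% masked positions produce a training signal.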


Natural Language Processing

Data augmentation for NLP — generate synthetic data by back-translation in 4 lines of code

Photo by Edward Ma on Unsplash

English is one of the languages with lots of training data for translation, while some languages may not have enough data to train a machine translation model. Sennrich et al. used the back-translation method to generate more training data to improve translation model performance.

Suppose we want to train a model to translate English (source language) → Cantonese (target language) but there is not enough training data for Cantonese. Back-translation translates the target language back to the source language, then mixes the original source sentences with the back-translated sentences to train a model. …
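The mixing step above can be sketched as follows. The two translation dictionaries are made-up stand-ins for real machine translation models, used only to show how synthetic (source, target) pairs are produced and merged with the original parallel data:

```python
# Toy stand-ins for trained MT systems (illustrative example data only).
YUE_TO_EN = {"哈囉世界": "hi world"}

def back_translate(target_sentence):
    """Translate target → source to create a synthetic (source, target) pair."""
    synthetic_source = YUE_TO_EN[target_sentence]
    return synthetic_source, target_sentence

def build_training_data(parallel_pairs, monolingual_target):
    """Mix original parallel pairs with back-translated synthetic pairs."""
    synthetic = [back_translate(t) for t in monolingual_target]
    return parallel_pairs + synthetic
```

The key point is that only monolingual target-language text is needed to grow the training set; the target side of each synthetic pair is real, human-written text, while only the source side is machine-generated.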


Natural Language Processing

An Introduction to Retrieval-Augmented Language Model Pre-Training

Photo by Edward Ma on Unsplash

Since 2018, transformer-based language models have been proven to achieve good performance in many NLP downstream tasks such as open-domain question answering (Open-QA). To achieve better results, models tend to increase parameters (e.g. more heads, larger dimensions) in order to store world knowledge in the neural network.

Guu et al. (2020) from Google Research released a state-of-the-art model (Retrieval-Augmented Language Model Pre-Training, aka REALM) which leverages a knowledge retriever to augment the input with data from other large corpora such as Wikipedia. Given this extra signal, the model delivers better results. …
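The retrieve-then-read idea can be shown with a minimal sketch: score corpus documents against the question and hand the best match to the reader. REALM learns the retriever end-to-end with a neural scorer; the word-overlap score and tiny corpus below are made-up simplifications for illustration:

```python
# Illustrative mini-corpus (made-up example sentences).
CORPUS = [
    "Paris is the capital of France.",
    "The Nile is a river in Africa.",
]

def retrieve(question, corpus=CORPUS):
    """Return the corpus document with the highest word overlap
    with the question (a crude stand-in for a learned retriever)."""
    q_words = set(question.lower().rstrip("?").split())

    def overlap(doc):
        return len(q_words & set(doc.lower().rstrip(".").split()))

    return max(corpus, key=overlap)
```

In REALM the retrieved passage is concatenated with the question and fed to the language model, and gradients flow back into the retriever so that retrieval improves during pre-training.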


Machine Learning

Photo by Edward Ma on Unsplash

Anyone can easily fit data into a model with machine learning or deep learning frameworks. Following best practices may help you distinguish yourself from others, and you may also consider the following tricks. Here are some methods I applied during my journey as a data scientist.

Table of Contents

Data Preparation

  • Process Your Own Data
  • Use Tensor
  • Data Augmentation
  • Sampling Same Data

Model Training

  • Saving Intermediate Checkpoint
  • Virtual Epoch
  • Simple is Beauty
  • Simplifying Problem

Debugging

  • Simplifying Problem
  • Using Eval Mode for Training
  • Data Shifting
  • Addressing Underfitting
  • Addressing Overfitting

Production

  • Meta Data Association
  • Switch to Inference Mode
  • Scaling Cost
  • Stateless
  • Batch Process
  • Use C++

Data Preparation

Process Your Own Data

Photo by Oliver Hale on Unsplash

It is suggested to handle data processing within the model (or within the prediction service). The reason is that a consumer may not know how to do it, and this keeps feature engineering transparent to them. …
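One way to sketch this practice: the prediction service owns the preprocessing, so callers only ever send raw input. The class, word list, and scoring rule below are hypothetical placeholders, not a real production service:

```python
class SentimentService:
    """Sketch of a prediction service that keeps feature engineering
    internal, so consumers submit raw text only."""

    POSITIVE = {"good", "great", "love"}  # toy placeholder "model"

    def _preprocess(self, text):
        # Consumers never need to replicate this step themselves.
        return text.lower().strip().split()

    def predict(self, raw_text):
        tokens = self._preprocess(raw_text)
        score = sum(t in self.POSITIVE for t in tokens)
        return "positive" if score > 0 else "negative"
```

Because normalization lives next to the model, training-time and serving-time preprocessing cannot drift apart, which is a common source of silent production bugs.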


Data Science

Why metrics need to be defined at the very beginning

Photo by Edward Ma on Unsplash

If you do not know how to judge whether a model is good or not, it is like wanting to get something without knowing what it is. After working as a Data Scientist for a few years, I strongly believe that metrics are among the most important things to define at an early stage.

This story will cover several textual metrics. You may also check out the following stories to understand other evaluation metrics.

Textual Evaluation Metrics

In the natural language processing (NLP) field, we have lots of downstream tasks such as translation and text recognition. …


Data Science

Why metrics need to be defined at the very beginning

Photo by Edward Ma on Unsplash

If you do not know how to judge whether a model is good or not, it is like wanting to get something without knowing what it is. After working as a Data Scientist for a few years, I strongly believe that metrics are among the most important things to define at an early stage.

This story will cover several regression metrics. You may also check out the following stories to understand other evaluation metrics.

Regression Metrics

One of the differences in regression is that it involves continuous values. Other than the confusion matrix, you can use another set of calculations to understand your model. …
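For continuous targets, the standard error-based metrics can be computed from scratch in a few lines. This is a minimal sketch of the usual definitions (MSE, RMSE, MAE):

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, RMSE and MAE for continuous predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n   # penalizes large errors more
    mae = sum(abs(e) for e in errors) / n  # robust to outliers
    return {"mse": mse, "rmse": mse ** 0.5, "mae": mae}
```

RMSE is often preferred for reporting because it is in the same units as the target, while MAE is less sensitive to a few large mistakes.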


Data Science

Why metrics need to be defined at the very beginning

Photo by Edward Ma on Unsplash

If you do not know how to judge whether a model is good or not, it is like wanting to get something without knowing what it is. After working as a Data Scientist for a few years, I strongly believe that metrics are among the most important things to define at an early stage.

This story will cover several classification metrics. You may also check out the following stories to understand other evaluation metrics.

Classification Metrics

Introduction to Confusion Matrix

The confusion matrix has to be mentioned when introducing classification metrics. True positives (TP), true negatives (TN), false positives (FP) and false negatives (FN) are its basic elements. …
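These four elements can be counted directly from paired labels and predictions. A minimal sketch for the binary case:

```python
def confusion_elements(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for a binary classifier."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn
```

From these counts the familiar metrics follow directly, e.g. precision = TP / (TP + FP) and recall = TP / (TP + FN).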


Statistics

Why metrics need to be defined at the very beginning

Photo by Edward Ma on Unsplash

If you do not know how to judge whether a model is good or not, it is like wanting to get something without knowing what it is. After working as a Data Scientist for a few years, I strongly believe that metrics are among the most important things to define at an early stage.

In this series of stories, I will cover some common metrics we should use to measure how good a model is. Precision and Mean Squared Error (MSE) may come to mind immediately when talking about metrics. However, I want to highlight that our audience may not understand what they are. …


Photo by Edward Ma on Unsplash

Developing general-purpose multilingual representations has been a trend in recent years. Most earlier models were developed based on English, while there are several thousand languages all over the world. Previous studies include mBERT and XLM. Although those wonderful models are designed for general purposes, their evaluations are often limited to translation, classification and similar languages.

XTREME

XTREME (Hu et al., 2020) was introduced to overcome the aforementioned limitations. The full name of XTREME is Cross-lingual TRansfer Evaluation of Multilingual Encoders. It covers 40 languages and supports 9 tasks. Also, XTREME focuses on the zero-shot cross-lingual transfer scenario. …

About

Edward Ma

Focused on Natural Language Processing and Data Science Platform Architecture. https://makcedward.github.io/
