Unleashing the Power of Deep Neural Networks for Language Proficiency Assessment — Using DNNs for Regression

Rohaank
6 min read · Jun 9, 2023

Are you curious about how deep neural networks can revolutionize the assessment of language proficiency? In this article, we will explore an exciting Kaggle competition that aims to evaluate the language skills of English Language Learners (ELLs). We will walk through a code implementation that uses a deep neural network (DNN) built on BERT embeddings, an LSTM encoder, and attention to predict essay scores. Buckle up as we embark on a fascinating journey to develop proficiency models that will better support ELLs and expedite the grading cycle for teachers.

Understanding the Competition

The objective of this Kaggle competition is to assess the language proficiency of 8th-12th grade ELLs. By utilizing a dataset of essays written by ELLs, participants like you are challenged to develop proficiency models that provide more accurate feedback on language development. These models have the potential to revolutionize the education system by enabling ELLs to receive more appropriate learning tasks, ultimately enhancing their English language proficiency.

The Power of Deep Neural Networks

Deep neural networks (DNNs) have emerged as a powerful tool in natural language processing (NLP) tasks. They have demonstrated remarkable performance in tasks such as text classification, sentiment analysis, and machine translation. Leveraging the strength of DNNs, we will tackle the language proficiency assessment task in this Kaggle competition.

The Code Implementation

Let’s dive into the code implementation, step by step, to unravel the magic behind our DNN-based language proficiency model.

Importing the Required Libraries and Loading the Dataset

We begin by importing the necessary libraries, such as spacy, numpy, pandas, and nltk. Additionally, we load the dataset, which consists of essays written by ELLs, from the provided CSV files.

import spacy
import numpy as np
import random
import math
import time
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

train_csv = '/kaggle/input/feedback-prize-english-language-learning/train.csv'
test_csv = '/kaggle/input/feedback-prize-english-language-learning/test.csv'

train_data = pd.read_csv(train_csv)
test_data = pd.read_csv(test_csv)

train_data = train_data.drop(['text_id'], axis=1)
train_data.info()
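
Before preprocessing, it is worth glancing at what was actually loaded. Here is a minimal check; the score columns other than cohesion are assumed from the competition's data description and are not used elsewhere in this walkthrough.

print(train_data.shape)
print(train_data.head())

# The training file pairs each essay's full_text with six analytic scores;
# this walkthrough predicts the cohesion score.
score_columns = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']
print(train_data[score_columns].describe())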

Data Preprocessing

Effective data preprocessing is essential for building robust and accurate models. In this step, we perform various preprocessing techniques to prepare our text data for training the DNN. Let’s take a closer look at the preprocessing steps.

Tokenization and Text Cleaning

To tokenize English text and clean it for further processing, we utilize the powerful spacy library. We also download necessary resources from the nltk library, such as stopwords and WordNet lemmatizer.

import nltk

spacy_en = spacy.load('en_core_web_sm')
nltk.download('stopwords')
nltk.download('wordnet')
# On Kaggle, the downloaded WordNet corpus may arrive zipped; unzip it so the lemmatizer can find it.
!unzip /usr/share/nltk_data/corpora/wordnet.zip -d /usr/share/nltk_data/corpora/

from collections import Counter
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
vocab = Counter()

Tokenization Function

We define a tokenization function tokenize_en that takes a text string as input and performs the following steps:

  1. Converts the text to lowercase.
  2. Removes any tabs from the text.
  3. Removes non-alphanumeric characters from the text.
  4. Tokenizes the text using the spacy_en tokenizer.
  5. Removes stopwords from the tokens.
  6. Lemmatizes the tokens to their base form.
  7. Replaces the standalone token "u" with "you" to handle a common abbreviation.
  8. Wraps the token list in <sos> and <eos> markers.

def tokenize_en(text):
    text = text.lower()
    text = re.sub(r'\t', ' ', text)
    text = re.sub(r'\W+', ' ', text)
    text = [tok.text for tok in spacy_en.tokenizer(text)]
    text = [word for word in text if word not in stop_words]
    text = [lemmatizer.lemmatize(word, 'v') for word in text]
    text = ['you' if word == 'u' else word for word in text]  # expand the "u" abbreviation
    text = ['<sos>'] + text + ['<eos>']
    return text
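
A quick call on a short sample sentence shows the function in action. The exact tokens depend on the installed spaCy and NLTK models, but stopwords disappear, verbs are reduced to their base form, and the result is wrapped in the <sos>/<eos> markers:

# Hypothetical sample text, purely for illustration.
sample_text = "The students are writing their essays."
print(tokenize_en(sample_text))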

Preprocessing the Essays

To prepare the essays for further processing, we apply the tokenize_en function to each essay's full text. The resulting token lists are stored in the paragraphs list.

paragraphs = []
for paragraph in train_data['full_text']:
    paragraphs.append(tokenize_en(paragraph))

Building the Vocabulary

Next, we build a word-to-index vocabulary. We iterate through each essay in the dataset, split it into tokens, and assign a unique integer index to every token we have not seen before, reserving index 0 for the padding token.

word_index = {'<pad>': 0}  # Add padding token
for paragraph in train_data['full_text']:
    tokens = paragraph.lower().split()
    for token in tokens:
        if token not in word_index:
            word_index[token] = len(word_index)
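
Note that the model below feeds text through the BERT tokenizer, so this word_index mainly serves as a reference vocabulary. If you did want to encode essays with it directly, a minimal sketch could look like the following; the helper name and the fallback to the padding index for unseen tokens are choices made here for illustration, not part of the original pipeline (and since word_index was built from a raw split while paragraphs holds cleaned tokens, some tokens will fall back to padding).

# Illustrative only: map tokens to word_index ids and pad/truncate to a fixed length.
def encode_with_word_index(tokens, word_index, max_len=100):
    ids = [word_index.get(tok, word_index['<pad>']) for tok in tokens]  # unseen tokens fall back to <pad>
    ids = ids[:max_len]                                                 # truncate long essays
    ids += [word_index['<pad>']] * (max_len - len(ids))                 # pad short essays
    return ids

print(encode_with_word_index(paragraphs[0], word_index)[:10])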

Defining the DNN Architecture

Now, let’s define the architecture of our DNN for language proficiency assessment. We will leverage the power of BERT, a pre-trained transformer-based model, in our architecture.

Importing the Required Libraries

We import the necessary libraries for building our DNN architecture, including tensorflow.keras and transformers.

import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense, Attention, Concatenate
from transformers import TFBertModel, BertTokenizer
from tensorflow.keras.layers import GlobalAveragePooling1D
from tensorflow.keras.optimizers import Adam

BERT Model Initialization

We initialize the BERT model (bert_model) and the BERT tokenizer (tokenizer) using the bert-base-uncased variant.

bert_model = TFBertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
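
To see exactly what the tokenizer will hand to the model, you can encode a short sentence. encode_plus returns the wordpiece ids and the attention mask that the two model inputs defined below expect; the sentence and max_length here are arbitrary examples.

# Illustrative only: encode a short sentence the same way the data generator will.
encoded = tokenizer.encode_plus(
    'This is a sample essay sentence.',
    max_length=16,
    padding='max_length',
    truncation=True,
)
print(encoded['input_ids'])       # wordpiece ids, padded with zeros up to max_length
print(encoded['attention_mask'])  # 1 for real tokens, 0 for padding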

Model Parameters

We define several important parameters for our model, such as batch_size, output_dim, and max_seq_length. These parameters will influence the training process and the model's performance.

batch_size = 32
output_dim = 1  # Output dimensionality for cohesion scores
max_seq_length = 100  # Maximum sequence length for padding
y = train_data['cohesion']
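
The competition scores each essay on six analytic measures, not just cohesion. A hedged sketch of a multi-target variant (column names assumed from the competition data, and not used in the rest of this walkthrough) would simply widen the output and select all score columns:

# Hypothetical multi-target setup (kept commented out so it does not override
# the single-target variables used below):
# score_columns = ['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']
# output_dim = len(score_columns)       # 6 scores per essay
# y = train_data[score_columns].values  # shape: (num_essays, 6)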

Model Architecture

We proceed to define the architecture of our DNN. The architecture consists of several essential components, including input sequences, input masks, BERT embeddings, an LSTM layer, multi-head attention, and a dense layer for prediction.

# Create model architecture
input_seq = Input(shape=(max_seq_length,), dtype='int32')
input_mask = Input(shape=(max_seq_length,), dtype='int32')

bert_output = bert_model(input_seq, attention_mask=input_mask)[0]
encoder_lstm = LSTM(256, return_sequences=True, dropout=0.5)(bert_output)

# Multi-Head Attention
num_heads = 7
attention_heads = []
for _ in range(num_heads):
    attention_head = Attention()([encoder_lstm, encoder_lstm])
    attention_heads.append(attention_head)

merged_attention_heads = Concatenate()(attention_heads)
global_avg_pool = GlobalAveragePooling1D()(merged_attention_heads)
decoder_dense = Dense(output_dim)(global_avg_pool)

model = Model([input_seq, input_mask], decoder_dense)
model.compile(optimizer=Adam(learning_rate=1e-4), loss='mean_squared_error')

model.summary()
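
One caveat worth noting: the Keras Attention layer used above has no trainable projection weights by default, so the seven heads all perform the same computation on the same inputs. If you want heads that genuinely learn different attention patterns, a minimal alternative (not the architecture used in this article) is the built-in MultiHeadAttention layer; key_dim below is an assumed, untuned value.

from tensorflow.keras.layers import MultiHeadAttention

# Hypothetical alternative: learned multi-head self-attention over the LSTM outputs.
mha_output = MultiHeadAttention(num_heads=7, key_dim=64)(encoder_lstm, encoder_lstm)
alt_pool = GlobalAveragePooling1D()(mha_output)
alt_output = Dense(output_dim)(alt_pool)
alt_model = Model([input_seq, input_mask], alt_output)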

Data Generation

To efficiently train our DNN, we implement a data generator (a Keras Sequence) that produces batches of encoded inputs and the corresponding cohesion targets. This lets us train the model on-the-fly without loading the entire encoded dataset into memory.

from transformers import BertTokenizer
from tensorflow.keras.utils import Sequence

class DataGenerator(Sequence):
    def __init__(self, paragraph, cohesion, batch_size, output_dim, max_seq_length, word_index, shuffle=True):
        # Initialization
        self.output_dim = output_dim
        self.batch_size = batch_size
        self.paragraph = paragraph
        self.cohesion = cohesion
        self.max_seq_length = max_seq_length
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
        self.word_index = word_index
        self.shuffle = shuffle
        self.on_epoch_end()

    def __len__(self):
        # Denotes the number of batches per epoch
        return int(np.floor(len(self.paragraph) / self.batch_size))

    def __getitem__(self, index):
        # Generate indexes of the batch
        indexes = self.indexes[index * self.batch_size: (index + 1) * self.batch_size]

        paragraph_temp = np.array(self.paragraph, dtype=object)[indexes]
        cohesion_temp = np.array(self.cohesion)[indexes]

        # Generate data
        input_seqs, input_masks, y = self.generate_data(paragraph_temp, cohesion_temp)

        return [input_seqs, input_masks], y

    def on_epoch_end(self):
        # Updates indexes after each epoch
        self.indexes = np.arange(len(self.paragraph))
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def tokenizer_text(self, paragraph, max_seq_length):
        # Convert the pre-tokenized paragraph into BERT input ids and an attention mask
        # (tokens missing from the BERT vocabulary are mapped to [UNK])
        tokens = self.tokenizer.encode_plus(paragraph, max_length=max_seq_length, padding='max_length', truncation=True)
        input_ids = tokens['input_ids']
        attention_mask = tokens['attention_mask']

        return input_ids, attention_mask

    def generate_data(self, paragraph_temp, cohesion_temp):
        input_seqs = np.zeros((self.batch_size, self.max_seq_length), dtype=np.int32)
        input_masks = np.zeros((self.batch_size, self.max_seq_length), dtype=np.int32)
        y = np.zeros((self.batch_size,), dtype=float)

        for i, paragraph in enumerate(paragraph_temp):
            tokens = self.tokenizer_text(paragraph, self.max_seq_length)
            input_seqs[i, :] = np.array(tokens[0])
            input_masks[i, :] = np.array(tokens[1])
            y[i] = cohesion_temp[i]

        return input_seqs, input_masks, y
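
Before training, it is worth pulling a single batch from the generator to confirm that the shapes line up with the model's two inputs. A quick sanity check using the objects defined above:

# Fetch one batch and inspect its shapes.
check_generator = DataGenerator(paragraphs, y, batch_size, output_dim,
                                max_seq_length, word_index, shuffle=False)
(batch_seqs, batch_masks), batch_y = check_generator[0]
print(batch_seqs.shape, batch_masks.shape, batch_y.shape)
# Expected: (32, 100) (32, 100) (32,)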

Training the Model

We are now ready to train our DNN model! We hold out 20% of the essays for validation, define the number of epochs and other training parameters, and set up early stopping and model checkpointing callbacks that monitor the validation loss and save the best weights during training.

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from sklearn.model_selection import train_test_split

early_stopping = EarlyStopping(monitor='val_loss', patience=5)
# Saving weights only tends to be more reliable than a full HDF5 model when a custom BERT layer is involved.
model_checkpoint = ModelCheckpoint('best_model.h5', monitor='val_loss', save_best_only=True, save_weights_only=True)

# Keras generators do not support validation_split, so we split the data explicitly.
train_paragraphs, val_paragraphs, y_train, y_val = train_test_split(paragraphs, list(y), test_size=0.2, random_state=42)

train_generator = DataGenerator(train_paragraphs, y_train, batch_size, output_dim, max_seq_length, word_index)
val_generator = DataGenerator(val_paragraphs, y_val, batch_size, output_dim, max_seq_length, word_index, shuffle=False)
history = model.fit(train_generator, validation_data=val_generator, epochs=50, callbacks=[early_stopping, model_checkpoint])
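
Once training finishes, the history object lets us check how the loss evolved and whether the model started overfitting. A small illustrative plot:

import matplotlib.pyplot as plt

# Visualize training vs. validation loss across epochs.
plt.plot(history.history['loss'], label='train loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('Epoch')
plt.ylabel('Mean squared error')
plt.legend()
plt.show()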

Conclusion

In this article, we explored how deep neural networks can revolutionize the assessment of language proficiency for English Language Learners (ELLs). We walked through a code implementation using a DNN architecture that combines BERT embeddings, an LSTM layer, multi-head attention, and a dense prediction head. By leveraging the power of DNNs, we can develop more accurate models to evaluate the language skills of ELLs, providing them with better learning tasks and supporting their English language proficiency journey.
