Week 5 — This is the way!

Baran Orhan
AIN311 Fall 2022 Projects
2 min read · Dec 20, 2022

by Erdem Korhan Erdem and Baran Orhan

Hey, wake up!

This is the week we scraped data, tagged entities, and got an output from the model! This is not the week to shut off.

As mentioned last week, we already have the course-outcomes dataset. With some magic (a few basic code cells), we split each outcome into words and write them to an .xlsx file, adding Sentence and Course columns.
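The tokenize-and-export step can be sketched roughly like this. The column names (Sentence, Word, Course) follow the post; the sample outcomes and file name are made up for illustration.

```python
import pandas as pd

# Hypothetical course outcomes: (course, outcome sentence) pairs.
outcomes = [
    ("Backend", "Design RESTful APIs with Flask"),
    ("Backend", "Deploy services with Docker"),
]

# Split each outcome into words, keeping track of which sentence
# and course each word came from.
rows = []
for sentence_id, (course, text) in enumerate(outcomes, start=1):
    for word in text.split():
        rows.append({"Sentence": sentence_id, "Word": word, "Course": course})

df = pd.DataFrame(rows)
try:
    df.to_excel("tagged_outcomes.xlsx", index=False)  # needs openpyxl
except ModuleNotFoundError:
    pass  # openpyxl not installed; the DataFrame itself is the point
```

Each row of the resulting sheet is one word, ready to receive an entity tag by hand.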

Tagged example from Backend course dataset

We have tagged the Machine Learning and Backend courses so far. The ML corpus contains 651 unique words, and the Backend corpus contains 406.

Model

The model was introduced in Week 4. Today, we want to show the output of the bi-directional LSTM model.

A bidirectional LSTM combines two LSTMs: one processes the sequence forwards (left to right), the other backwards (right to left). This gives the network context from both directions, which often speeds up learning and improves performance on sequence classification problems. We will use a bidirectional LSTM in Keras to extract skill entities from the course outcomes.
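A minimal sketch of such a tagging model in Keras follows, assuming TensorFlow is installed. The sizes (max_len, vocabulary, tag count) are placeholders, not the project's actual values.

```python
from tensorflow.keras.layers import (
    Input, Embedding, Bidirectional, LSTM, TimeDistributed, Dense,
)
from tensorflow.keras.models import Model

max_len, n_words, n_tags = 50, 651, 5  # illustrative sizes only

inp = Input(shape=(max_len,))
x = Embedding(input_dim=n_words + 1, output_dim=64)(inp)
# One LSTM reads the sentence left to right, the other right to left;
# their hidden states are concatenated at every time step.
x = Bidirectional(LSTM(units=100, return_sequences=True))(x)
# One softmax over the tag set per token.
out = TimeDistributed(Dense(n_tags, activation="softmax"))(x)

model = Model(inp, out)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The `return_sequences=True` flag is what turns this into a per-token tagger rather than a whole-sentence classifier.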

Bi-directional LSTM. Source: Modelling Radiological Language with Bidirectional Long Short-Term Memory Networks
The output of our model. In progress…

'ENDPAD' is a dummy tag used to pad all the phrases in the dataset to the same length.
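The padding idea in one snippet (the sentences and max_len are made up):

```python
# Every tokenized phrase is extended with the dummy word 'ENDPAD'
# until it reaches max_len, so all model inputs share one shape.
sentences = [
    ["design", "REST", "APIs"],
    ["use", "Docker"],
]
max_len = 5

padded = [s + ["ENDPAD"] * (max_len - len(s)) for s in sentences]
print(padded[1])  # ['use', 'Docker', 'ENDPAD', 'ENDPAD', 'ENDPAD']
```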

We are getting high validation accuracies partly because most sentences in the dataset contain only 3–4 skill entities at most, and the rest of the tokens are O-tagged.
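A toy example of why an O-heavy tag distribution inflates accuracy (the gold tags here are invented):

```python
# A trivial model that predicts 'O' for every token already scores
# high on a sentence where only 2 of 10 tokens are skill entities.
gold = ["O", "O", "B-SKILL", "I-SKILL", "O", "O", "O", "O", "O", "O"]
pred = ["O"] * len(gold)  # majority-class baseline

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)  # 0.8, despite finding zero skill entities
```

This is why entity-level metrics (precision/recall/F1 on the skill spans) are a better check than raw token accuracy.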

Right now, we are looking for extra datasets to improve the NER performance. Once we can show that the model works well with a large enough corpus, we will move on to the next step: calculating the similarity with engineers' skills.
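One simple way that next step could look, sketched with made-up skill sets. Jaccard similarity is just one easy choice here; the project may well use a different metric.

```python
# Skills extracted from course outcomes vs. an engineer's skill list
# (both sets are hypothetical).
course_skills = {"python", "machine learning", "docker"}
engineer_skills = {"python", "docker", "kubernetes", "sql"}

# Jaccard similarity: overlap divided by the combined skill set.
overlap = course_skills & engineer_skills
jaccard = len(overlap) / len(course_skills | engineer_skills)
print(sorted(overlap), round(jaccard, 2))  # ['docker', 'python'] 0.4
```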

Until then, don't shut off, because we are near the end.

Hope so.
