Stories by Bruce Yang on Medium

NLP and Deep Learning All-in-One Part III: Transformer, BERT, and XLNet

Bruce Yang — Sat, 22 Feb 2020 07:15:12 GMT

Attention Based Models Interview Questions

1. Why choose Attention based models over Recurrent based ones?

2. What is Attention? What’s wrong with the seq2seq model?

3. What’s Self-Attention?

4. How to implement Self-Attention?

5. What is Transformer? What is Multi-head self-attention?

6. What is BERT? Why choose BERT over Embedding models?

7. What’s the difference between BERT and other traditional language models? or Why is Masked Language Modeling more effective than Sequential Language Modeling?

8. What’s the flaw of Transformer? How does BERT solve that problem?

9. How does BERT do classification?

10. What type of classification tasks can BERT do?

11. How to do BERT fine-tuning? What hyper-parameters does BERT use?

12. What’s wrong with BERT? What are its limitations?

13. What is XLNet? Why Choose XLNet over BERT?

14. What’s the difference between BERT and XLNet [CLS] and [SEP] pattern?

NLP and Deep Learning All-in-One Part II: Word2vec, GloVe, and fastText

Bruce Yang — Fri, 21 Feb 2020 04:42:28 GMT

Embedding Based Models Interview Questions

1. What are some of the traditional ways to represent words in numeric vectors?

2. What is Word Embedding or Word2vec?

3. What are the 2 architectures of Word2vec?

4. How to train Word2vec (Skip-gram)?

5. What are the pros and cons for Word2vec?

6. What is GloVe? How is GloVe different from Word2vec?

7. What is fastText? How is fastText different from Word2vec?

8. Why choose fastText over Word2vec?

9. How to handle Out-of-Vocabulary words?

NLP and Deep Learning All-in-one Part I: RNN and LSTM

Bruce Yang — Thu, 20 Feb 2020 07:23:34 GMT

Recurrent Based Models Interview Questions

1. What is Recurrent Neural Network (RNN)?

2. How to train RNN?

3. What are the Advantages of RNN?

4. What are the Disadvantages of RNN?

5. What is Gradient Vanishing and Exploding Problem?

6. What are some of the general ways to handle the vanishing gradient problem?

7. What is Long Short-Term Memory (LSTM)?

8. What is Cell State?

9. What is Gating Mechanism?

10. How to update cell state?

11. How does the LSTM’s forget gate structure avoid vanishing gradients? How does the forget gate manipulate these internal vectors?

12. What is a Batch? What is an Epoch? What’s the difference?

13. How to choose the Number of Hidden Layers?

14. How to choose the Number of Neurons in the Hidden Layers?

15. What is Dropout?

Crack the Machine Learning Phone Interview Guide

Bruce Yang — Tue, 16 Jul 2019 19:47:31 GMT

Machine Learning Interview Questions

According to Medium’s policies, I can only put the interview questions here, not the answers. For the answers to these interview questions, you have to refer to my personal blog:

Bruce Yang's Blog

Machine learning is a crucial part of a data scientist’s interview, because this skill is like the Force for a Jedi master.

Different types of machine learning problems can be asked during an interview. In this post I focus only on how to crack the phone interview. For an on-site interview, you will probably need to hand-write some machine learning algorithms in Python, for example, implementing Logistic Regression.

Machine Learning Interview Questions

What is Bias-Variance Tradeoff?
What is the curse of dimensionality?
What is Multi-Collinearity? Why is it an issue?
How do you detect multicollinearity in Model Data?
What are the remedies for the problem of Multi-collinearity?
What are Outliers? Leverage & Influential points?
How can we deal with Outliers?
If you have a set of data and do regression, what happens if you duplicate all the data and do regression on the new data set?
What is the primary difference between R square and adjusted R square?
What’s Gradient Descent?
How to implement Gradient Descent?
What’s the difference between a confidence interval and a prediction interval?
Why do you need Cross-Validation?
How to detect overfitting?
How do you handle overfitting problem?
How to handle missing data?
You have a train-set, a dev-set and a test-set. How do you manage the distribution of those sets?
How to handle imbalanced data?
What are the methods for over-sampling?
What is Confusion Matrix? What’s TP, TN, FP, FN, Precision and Recall?
What is ROC and AUC?
How to handle categorical variables?
What is the difference between Generative models and Discriminative models?
What is linear regression? What are the steps of the algorithm?
What are the assumptions of linear regression?
What plot is best suited to test the linear relationship of independent and dependent continuous variables?
Let’s assume that the errors break the normality assumption, it’s not a Gaussian distribution, can you still use linear regression?
What is the cost function of linear regression?
How do you interpret a Q-Q plot in a linear regression model?
What is the importance of the F-test in a linear model?
What are the Advantages/Disadvantages of linear regression?
What is regularization?
What’s the difference between Lasso and Ridge? What’s the cost function?
Why can Lasso shrink coefficients to 0 but Ridge can’t?
What is lambda λ?
What are the Advantages/Disadvantages of Lasso and Ridge?
What’s Logistic Regression? How to implement Logistic Regression?
What’s the cost function of logistic regression?
What does Cross-Entropy measure?
What’s an Entropy?
How to tune hyperparameters of logistic regression? What does the parameter mean?
What are the Advantages/Disadvantages of Logistic Regression?
What are the basic concepts of Naïve Bayes? What problem does it solve?
What are the assumptions of Naïve Bayes?
What is Bayes’ Theorem?
How to implement Naïve Bayes?
What are the Advantages/Disadvantages of Naïve Bayes?
What are the variations of Naïve Bayes?
What are the applications of Naïve Bayes?
What’s KNN? How to implement KNN?
What are the Advantages/Disadvantages of KNN?
What’s K-means?
How to implement K-means?
How to choose K?
How to evaluate clusters?
When to stop the iteration?
How to do categorical data clustering?
What about a data set considering both numerical and categorical values?
What are the Advantages/Disadvantages of K-means?
What’s Decision Tree?
How to implement Classification Tree? What if the features are categorical values? What if the features are numerical values?
How to implement Regression Tree?
How to decide which feature is more important? What should be at the top of our decision tree? The root node?
When to stop splitting?
What is tree pruning?
What metrics to use to evaluate a classification tree?
What are the Advantages/Disadvantages of Decision Tree?
What’s Random Forest? How to implement Random Forest?
Why randomly restrict the features in each split? When building a random forest, at each split the algorithm is not allowed to consider a majority of the available predictors. Why?
What’s the difference between bagging and boosting?
How to tune hyperparameters of Random Forest?
What are the Advantages/Disadvantages of Random Forest?
What’s Ada-boost? How to implement Ada-boost?
Why use exponential function as loss function?
What are the Advantages/Disadvantages of Ada-boost?
What’s Gradient Boosting? How to implement Gradient Boosting?
What are the Advantages/Disadvantages of Gradient Boosting?
What’s the difference between XGBoost and GBM?
What’s Support Vector Machine (SVM)? How to implement SVM?
What’s Kernel Trick?
What’s the cost function of SVM?
What are the Advantages/Disadvantages of SVM?
What’s PCA? How to implement PCA?
What is the truncation of PCA?
How would you choose the value of K, that is, the number of eigenvectors you take to the next stage?
What are the Advantages/Disadvantages of PCA?
What are the basic concepts of Neural Network?
How to implement Neural Network? What’s feedforward? What’s backpropagation?
What’s the cost function of Neural Network?
What are the activation functions of Neural Network?
What are the Advantages/Disadvantages of Neural Network?

Disclaimer: Views and ideas expressed in this post are my personal, individual and unique perspectives, and not those of my employer.

End to End Guide for Setting up AWS SageMaker Ground Truth Public Data Labeling Jobs

Bruce Yang — Thu, 11 Jul 2019 02:59:09 GMT

AWS Data Labeling

According to AWS: “Amazon SageMaker Ground Truth helps you build training datasets for machine learning.”

Machine Learning Labeling - Amazon SageMaker Ground Truth - AWS

Basically, you put your unlabeled datasets into AWS, and AWS will hire someone to do the labeling for you.

This service is valuable to data scientists because it saves a lot of labeling time.

AWS claims 3 Benefits of this service:

Reduce data labeling costs by up to 70%.
Work with public and private human labelers.
Achieve accurate results quickly.

Let us walk through this data labeling process step by step by showing a real-world example.

Data Sources: I am using the training data from Kaggle’s Quora Insincere Questions Classification competition as our example data to label. It already has a label column, “target”, so we can test the quality of SageMaker’s data labeling.

Quora Insincere Questions Classification

Input Data Preparation:

SageMaker’s default Text Classification mode requires CSV data with each record on a single line, without embedded newline characters such as “\n”, “\r\n”, or “\n\n”.
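A minimal sketch of that cleanup, assuming the questions are already loaded as Python strings. The function names and the single-column, no-header CSV layout are illustrative assumptions, not requirements stated by AWS:

```python
import csv
import re


def flatten_text(text):
    """Collapse every run of whitespace (including \\r and \\n) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def write_single_line_csv(rows, path):
    """Write one cleaned text per CSV row (single column, no header: an assumption)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for text in rows:
            writer.writerow([flatten_text(text)])
```

Calling `write_single_line_csv(questions, "quora_sample.csv")` on a list of strings would produce a file where no record spans more than one line; the file name here is a placeholder.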

I attached a Jupyter notebook script that does the data preparation & preprocessing here:

BruceYanghy/AWS-SageMaker-Data-Labeling

Steps to Deploy an AWS SageMaker Data Labeling job:

I assume you have already created an AWS Account and know how to create an S3 bucket.

Step 1: Create an S3 bucket and make sure to turn off “Block all public access”

Step 2: Create a folder with a well-defined name and put your dataset into this folder

Here I use “test-quora-200-36c”, because I randomly selected 200 samples and set the price at $0.36 per item.

Step 3: Navigate to the AWS SageMaker dashboard, click Labeling jobs, and hit ‘Create labeling job’

Step 4: Job name, Input & Output dataset location, IAM Role and Task type

Job name: come up with a unique and meaningful name

Input dataset location: this is the most important one

Hit ‘Create manifest file’, select ‘Text’, and paste the S3 path, which in this case is: s3://test-label/test-quora-200-36c/

Click ‘Create’ and wait about 1–2 minutes. It will automatically create a manifest file in the same S3 folder, which you can download and inspect.
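For orientation, a text manifest is a JSON Lines file with one JSON object per line. The sketch below builds such lines by hand, assuming the inline-text form in which each object has a single `source` key. The helper name `to_manifest_lines` is hypothetical, and the file generated by the console remains the authoritative version:

```python
import json


def to_manifest_lines(texts):
    """Return one serialized JSON object per input text, one per line."""
    return [json.dumps({"source": t}) for t in texts]


lines = to_manifest_lines(["Is this question sincere?", "Another question"])
print("\n".join(lines))
```

Comparing a line of the downloaded manifest against this shape is a quick way to check that the input data was picked up as expected.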

Select ‘Use this manifest’

Output dataset location: s3://test-label/test-quora-200-36c/

I use the same S3 folder. You can create another folder, but it’s easy to forget where it is.

IAM Role: Make sure to create an IAM role

Task type: Select ‘Text classification’ and click Next

Step 5: Worker types, Price per task, Automated data labeling, and Number of workers per dataset object

Price per task:

Low complexity tasks: $0.012

Medium complexity tasks: $0.024 to $0.24

High complexity tasks: $0.36 to $1.20

If you are doing image labeling like ‘Cat or Dog’, then the low-complexity price would be fine, but for text classification you probably want to choose a medium or even high-complexity price in order to get high-quality labeled data.

Number of workers per dataset object:

More workers means higher cost but more consistent labels:

Here I choose $0.36 and 3 workers, so the cost is $1.08 per data object.

In the end, the final label is decided by majority vote.
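The arithmetic and the vote can be sketched as follows. This is a plain majority vote written for illustration; the service’s actual consolidation logic may differ, and the cost covers worker payments only. Both function names are made up for this example:

```python
from collections import Counter


def cost_per_object(price_per_task, workers_per_object):
    """Worker cost for one data object, rounded to cents."""
    return round(price_per_task * workers_per_object, 2)


def majority_vote(labels):
    """Return the most common label among the workers' answers."""
    return Counter(labels).most_common(1)[0][0]


print(cost_per_object(0.36, 3))  # 1.08
print(majority_vote([1, 0, 1]))  # 1
```

Using an odd number of workers, such as 3, avoids ties when there are only two classes.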

Step 6: Instructions

You want to write as much detail as possible. Here I copy the description from Kaggle.

Quora Insincere Questions Classification

Click Preview to see what it looks like for your labeler.

Step 7: Collect the outputs

The output file is always located at /manifests/output/output.manifest

You can download it from S3 and change the extension to .json, because its content is JSON.
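A minimal sketch for reading the labels back, assuming the output manifest is JSON Lines (one object per line) and that each record stores the consolidated label under a key named after the labeling job. The key `my-job` below is a made-up placeholder, so inspect one line of your own file to confirm the real key names:

```python
import json


def read_labels(manifest_text, label_attribute):
    """Parse JSON Lines text into (source_text, label) pairs."""
    pairs = []
    for line in manifest_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        pairs.append((record.get("source"), record.get(label_attribute)))
    return pairs


# Synthetic two-record example; real files come from the S3 output folder.
sample = "\n".join([
    json.dumps({"source": "Question A", "my-job": 0}),
    json.dumps({"source": "Question B", "my-job": 1}),
])
print(read_labels(sample, "my-job"))
```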

It took AWS Mechanical Turk 15 minutes to finish labeling the 200 questions, which is pretty fast.

Let’s evaluate the data quality by comparing the labels with the true labels.

Output Data Preprocessing & Quality Test:

I also attached a Jupyter notebook script that does the data preprocessing & quality test here:

BruceYanghy/AWS-SageMaker-Data-Labeling

Here are the evaluation metrics:

We have 10 false positives and 13 false negatives out of 200 questions. Not bad! If we increase the price per label, I expect the quality would improve.
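As a rough check on those numbers, the sketch below counts confusion-matrix cells from two label lists and derives accuracy from the error counts. It assumes binary 0/1 labels and that all 200 questions received a label; the function names are illustrative and not taken from the linked notebook:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            counts["tp"] += 1
        elif t == 0 and p == 0:
            counts["tn"] += 1
        elif t == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy_from_errors(total, fp, fn):
    """Share of items labeled correctly, given the two error counts."""
    return (total - fp - fn) / total


print(accuracy_from_errors(200, 10, 13))  # 0.885
```

With 23 errors out of 200, accuracy comes to 88.5%. Precision and recall cannot be derived from these two counts alone, because they also need the number of true positives.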

Thanks!

AWS EC2 Launch Jupyter Notebook Server/Jupyter Lab with Screen

Bruce Yang — Sun, 09 Jun 2019 02:54:06 GMT

AWS EC2 Launch Jupyter Notebook/Lab Server with linux screen

Part 1: Create AWS EC2 Instance

AWS EC2: Create EC2 Instance (Linux)

Part 2: SSH to your EC2 instance (Ubuntu)

In your terminal with keypem.pem in your current folder:

sudo ssh -i "keypem.pem" ubuntu@ec2-ip-address.region.compute.amazonaws.com

Jupyter Notebooks on AWS EC2 in 12 (mostly easy) steps [updated April 2019]

Part 3: Install Anaconda on Ubuntu

How to Install Anaconda on Ubuntu 18.04

Part 4: Start a Jupyter Notebook/Jupyter Lab Server

AWS EC2: Start a Jupyter (IPython) Notebook Server

Step 1: In the terminal of your EC2 Ubuntu environment

ipython

Step 2: In ipython

from IPython.lib import passwd

# set up your password and copy the hashed string it returns
passwd()

quit()

Step 3: In the terminal

jupyter notebook --generate-config

mkdir certs

cd certs

sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout mycert.pem -out mycert.pem

cd ~/.jupyter/

nano jupyter_notebook_config.py

Step 4: Copy and Paste into jupyter_notebook_config.py

c = get_config()

c.IPKernelApp.pylab = 'inline'

c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'

c.NotebookApp.ip = '*'

# or try this
#c.NotebookApp.ip = '0.0.0.0'

c.NotebookApp.open_browser = False

# Paste the hashed password string you copied from passwd() earlier
c.NotebookApp.password = u'your ipython password'

c.NotebookApp.port = 8888

Step 5: In the terminal

screen -mdS jupyter_lab_serving bash -c 'jupyter lab'

#kill all screens: pkill screen

This keeps the Jupyter server running even if you close the terminal.

Step 6: In the web browser:

https://public-ip-address:port/

If the browser shows “Your connection is not private”, click Advanced and proceed.

Enter your password.

You should now have JupyterLab running!

Tip 1: If you get a permission error

PermissionError: [Errno 13] Permission denied: Cannot open Jupyter on Browser despite running correctly on AWS EC2 instance

sudo chown $USER:$USER /home/ubuntu/certs/mycert.pem

Tip 2: If you have a security group problem

EC2: How to add port 8080 in security group?

Just add ‘8888’, or whatever port you chose, as an inbound rule in your security group.
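To check whether the port is reachable from your own machine, a small TCP probe can help. This is a sketch using only the standard library; the function name is made up, and a `False` result can also mean the Jupyter server is not running rather than that the security group is blocking the port:

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Replace the placeholder with your instance's public IP address or DNS name:
# print(port_is_open("public-ip-address", 8888))
```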

Tip 3: Make sure to use https://, not http://

SSL: WRONG_VERSION_NUMBER when setting up public Juypter server