This article is a short review of the paper "Spatio-temporal analysis of rail station ridership determinants in the built environment" by Yadi Zhu et al.

It is quite hard to figure out a model in the transportation domain: arrival-time estimation, ridership prediction, delivery route optimization, and so on. This is largely because of the spatio-temporal properties embedded in the problem.

The first thing that comes to mind is a statistical model with covariates; here, those covariates are temporal factors and spatial factors.

Base Model

The basic model is a simple linear regression with covariates designed to capture the spatial properties: land use, …
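
To make that setup concrete, here is a minimal sketch of this kind of base model in Python; the land-use and accessibility column names are hypothetical placeholders, not the paper's actual variables.

import pandas as pd
import statsmodels.api as sm

# Hypothetical station-level data: ridership plus spatial covariates
df = pd.DataFrame({
    "ridership":        [12000, 8500, 4300, 15200, 6700],
    "commercial_area":  [0.42, 0.18, 0.05, 0.55, 0.12],   # land-use share near the station
    "residential_area": [0.30, 0.55, 0.70, 0.25, 0.60],
    "bus_connections":  [14, 6, 3, 20, 5],                 # accessibility proxy
})

# Ordinary least squares: ridership regressed on the spatial covariates
X = sm.add_constant(df[["commercial_area", "residential_area", "bus_connections"]])
model = sm.OLS(df["ridership"], X).fit()
print(model.summary())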


This time, the interview was for a data scientist position at a mobility platform, where the role involves building a spatio-temporal regression model combined with ML practice.

  1. First Short Question

Why do you think the logistic regression model uses the sigmoid as its link function, from an ML perspective?


My Answer:

First, in logistic regression, a function like the sigmoid is called a link function (strictly, the sigmoid is the inverse of the logit link) because it lets the model keep working with a linear predictor, much as ordinary linear regression works under its linearity and normal-error assumptions.

I also think the sigmoid's shape resembles a CDF, which implies that most of the sample mass resides in the middle area. …
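
For reference, the standard formulation behind this answer (my notation) is

$$
\operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x
\quad\Longleftrightarrow\quad
p = \sigma(\beta_0 + \beta_1 x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}},
$$

so the sigmoid maps the linear predictor onto the (0, 1) probability scale while the model stays linear in its parameters.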


An interview is another path to progress. Just as I thought I was the right person for the company, they hit me with a problem I was not ready for.

It is a humbling moment, especially when you think of yourself as an advanced engineer. One way to get through the experience is to look back on it and keep practicing.

Recommendation System Position

  1. Tell me about your RecSys experience

I built my first recommendation system using Amazon Personalize. It is easy to get going with, but I realized that my data did not fit the algorithms Amazon Personalize is based on.

2. Why did you choose Amazon Personalize for your engine, and how did you come to realize that your data didn't fit? …


It is an interesting algorithm in that it is a deep learning model that incorporates a likelihood function.

The full article is here: https://www.sciencedirect.com/science/article/pii/S0169207019301888

  1. Setup

- RNN representation as a function; h is the hidden state (a generic form is sketched after Fig. 2)

  • Encoder-decoder system (x and y are both sequences, for time series prediction)
Fig. 2. Left: an RNN without an output layer. Right: a partially unrolled RNN with an output layer and multiple hidden units.
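
As a reference for the two bullets above, a generic form of the recurrence (my notation, not the paper's exact equations) is

$$
h_t = f(h_{t-1}, x_t), \qquad \hat{y}_t = g(h_t),
$$

where $f$ is the recurrent cell and $g$ is the output layer; the encoder runs this recurrence over the observed inputs and the decoder continues it over the prediction range.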

2. Model

Our job is to find the probability distribution of the time series by maximizing the likelihood function.

Here, t0 is the current point at which prediction starts and T is the end of the future time window we want to see.

Once we know the probability distribution, we can get statistics such as the mean and standard deviation, along with samples drawn according to the shape of the distribution. …
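
As a sketch of what "maximizing the likelihood" means for this kind of model (my notation, not copied from the paper), the predictive distribution over the future window factorizes one step at a time:

$$
P\left(y_{t_0:T} \mid y_{1:t_0-1}, x_{1:T}\right) = \prod_{t=t_0}^{T} p\left(y_t \mid \theta(h_t)\right),
$$

where $h_t$ is the RNN hidden state and $\theta(h_t)$ are the parameters (for example, a mean and a standard deviation) of the chosen output distribution; training maximizes the log-likelihood $\sum_t \log p(y_t \mid \theta(h_t))$ over the observed history.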


AWS file saving API playbook

Whatever is produced in AWS ends up in S3 (Simple Storage Service).

S3 is a triumphant data lake system that takes after the Hadoop file system. Whatever and wherever you want to start, it should live in S3, not in SageMaker EBS or your local storage, even though those feel closer to you, because anything other than S3 has a slightly awkward linkage to the other provisioned services on AWS.

Let's dive into several canonical ways to save an object from a SageMaker notebook.

  1. Easiest way (novice).

You just save a file object (a CSV file, for example) to the EBS volume on SageMaker, download it to your local machine, and then upload it again to an S3 bucket through the console or the s3 cp API. …
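
For comparison, here is a minimal sketch of skipping the local round trip and uploading straight from the notebook with boto3; the bucket name and key are hypothetical placeholders.

import boto3
import pandas as pd

# Hypothetical DataFrame produced inside the SageMaker notebook
df = pd.DataFrame({"stop_id": [1, 2, 3], "ridership": [120, 85, 43]})

# Write to the notebook's EBS volume first, then push the file to S3
local_path = "ridership.csv"
df.to_csv(local_path, index=False)

s3 = boto3.client("s3")
s3.upload_file(local_path, "my-example-bucket", "exports/ridership.csv")  # bucket and key are placeholders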


AWS's new AI service, Rekognition Custom Labels, is quite amazing. There are GCP AutoML Vision and Azure Custom Vision as comparable public cloud SaaS offerings, but here I focus on Rekognition Custom Labels with image augmentation.

Summary

  1. You need only a very small amount of data (yet you need augmentation for a more accurate model; a sketch follows this list)
  2. Training speed is amazing (in my case, 7 GB of 6,000 images took roughly an hour)
  3. Training cost is cheap ($1 per hour, with a 10-hour free tier)
  4. One-click deployment
  5. A pre-architected image (an AWS CloudFormation YAML file) is provided by AWS
  6. However, there are some downsides (expensive hosting at $4/hr. …
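
As a sketch of what I mean by augmentation (the filenames and transforms are illustrative, not a prescribed recipe), simple flips and rotations with Pillow already multiply a small dataset:

from PIL import Image, ImageOps

src = "defect_001.jpg"  # hypothetical training image
img = Image.open(src)

# Generate a few extra labeled variants from one original
img.rotate(10, expand=True).save("defect_001_rot10.jpg")    # slight rotation
ImageOps.mirror(img).save("defect_001_flip.jpg")            # horizontal flip
ImageOps.autocontrast(img).save("defect_001_contrast.jpg")  # contrast normalization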

A few days ago, I got an email from a recruiter based in Hong Kong. He sent me an invitation to take a coding test at Quanthub.com as part of my application for a McKinsey Data Scientist position.

I use Python almost every day, since my position is cloud data scientist, but not every day is data science coding (some of it is development work).

It consists of four parts, R, Python, Statistics, and Modeling, taken in that order (R → Python → Statistics → Modeling). Overall, Statistics is relatively easy. Modeling is a mixture of ML modeling and statistical testing and is also moderate. Python is moderately hard, with some unfamiliar pandas functions. Finally, R is pretty hard. It is also an adaptive test (but you can select your skill level from 1 to 5 at the beginning). I chose 3 for R, 4 for Python, 5 for Statistics, and 4 for Modeling, and R was still the most difficult one for me (I do not code in R at my workplace). 100 minutes are given, but the time is definitely not enough, so never spend too long on the hard questions in the first two categories. …


Econometrics is really the backbone of data science, along with probability theory (rest assured, ML is also a big name).

However, everyone who has ever tried to understand the convergence theorems (the laws of large numbers) has experienced how far they are from fully understanding them.

I truly believe that, in order to be a good data scientist, this is really the criterion I would use to filter future candidates among the applicants.

So, why don’t we go through a very important proof?

Sample Variance as a Consistent Estimator for the Variance.

Refer to this for a full proof.

Sample variance decomposed

The above is just an expansion of the sample variance. The cross-product term, −2·(1/n)ΣWᵢW̄ₙ, equals −2W̄ₙ², so it cancels one W̄ₙ² at the end and the last line follows. …
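
To make that decomposition explicit (my transcription of the step being described, writing $W_i = X_i - \mu$ and $\bar{W}_n = \bar{X}_n - \mu$):

$$
\frac{1}{n}\sum_{i=1}^{n}\left(X_i - \bar{X}_n\right)^2
= \frac{1}{n}\sum_{i=1}^{n}\left(W_i - \bar{W}_n\right)^2
= \frac{1}{n}\sum_{i=1}^{n}W_i^2 - 2\,\bar{W}_n\,\frac{1}{n}\sum_{i=1}^{n}W_i + \bar{W}_n^2
= \frac{1}{n}\sum_{i=1}^{n}W_i^2 - \bar{W}_n^2 .
$$

By the law of large numbers, $\frac{1}{n}\sum_i W_i^2 \to \sigma^2$ and $\bar{W}_n \to 0$, which gives the consistency claimed in the heading.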


  • Acceptance-rejection method
library(rbenchmark)
alpha <- 4
beta <- 3

# Draw n samples from a density proportional to f via rejection sampling:
# propose y ~ g (using rg), accept with probability f(y) / (M * g(y)).
rejection <- function(f, M, g, rg, n) {
  naccepts <- 0
  result.sample <- rep(NA, n)

  while (naccepts < n) {
    y <- rg(1)
    u <- runif(1)

    if (u <= f(y) / (M * g(y))) {
      naccepts <- naccepts + 1
      result.sample[naccepts] <- y
    }
  }

  result.sample
}

# Target: unnormalized Beta(4, 3) density; proposal: Uniform(0, 1)
f <- function(x) 100 * (x^(alpha - 1)) * (1 - x)^(beta - 1)
g <- function(x) 1
rg <- runif
# M must dominate f/g everywhere; evaluate f at the Beta mode (alpha - 1) / (alpha + beta - 2)
M <- f((alpha - 1) / (alpha + beta - 2))

result <- rejection(f, M, g, rg, 1e5)
hist(result, freq = FALSE)
points(seq(0, 1, 0.01), dbeta(seq(0, 1, 0.01), alpha, beta), type = "l")
  • Inverse transformation
# Inverse transform: push uniform draws through the Beta quantile function
U <- runif(1e5)
alpha <- 4
beta <- 3
b_rand <- qbeta(U, alpha, beta)
hist(b_rand, col = "skyblue", main = "Inverse U", freq = FALSE)
points(seq(0, 1, 0.01), dbeta(seq(0, 1, 0.01), alpha, beta), type = "l")

This story deals with transportation data from Austin, Texas.

Goal: we would like to model ridership prediction for the city of Austin, given that we know some characteristic data for each bus stop (e.g., distance to the CBA, traffic conditions, population, time zone, weather, oil price).

GTFS (General Transit Feed Specification) formatted Automatic Passenger Counter (APC) data provides the raw data for the analysis.

We need to engineer the data: first, keep only the variables that are necessary.

The engineered data looks like the sample below. We have the bus stop_id, trip time, and ridership (alighting/off-board counts), along with latitude and longitude. …
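
As a rough sketch of that engineering step (the column names here are hypothetical placeholders, not the actual GTFS/APC field names), the idea is to keep only the needed fields and aggregate counts per stop and time:

import pandas as pd

# Hypothetical raw APC records: one row per stop event on a trip
raw = pd.DataFrame({
    "stop_id":   [101, 101, 102, 102, 103],
    "trip_time": pd.to_datetime(["2020-01-06 07:15", "2020-01-06 08:15",
                                 "2020-01-06 07:20", "2020-01-06 08:20",
                                 "2020-01-06 07:25"]),
    "off_board": [12, 18, 5, 9, 3],
    "lat":       [30.27, 30.27, 30.28, 30.28, 30.29],
    "lon":       [-97.74, -97.74, -97.75, -97.75, -97.76],
})

# Keep the necessary variables and aggregate ridership per stop and hour
engineered = (
    raw.assign(hour=raw["trip_time"].dt.hour)
       .groupby(["stop_id", "hour", "lat", "lon"], as_index=False)["off_board"]
       .sum()
       .rename(columns={"off_board": "ridership"})
)
print(engineered)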

About

Yong Rhee

Data Scientist
