Regression Service in Ruby on Rails (Part 1)

Isaac Chio
10 min read · May 30, 2017


Introduction

This is a series of articles about how to implement a regression service that provides linear regression and logistic regression. I divide this topic into three chapters. Through this series, you will learn about linear regression and logistic regression, and how to create your own Docker image. The web service is implemented with the Ruby on Rails framework, and all of the test data comes from a Coursera course.

The project's git repo: https://github.com/Isaac234517/regression_service

Hope you enjoy it.

Series Article Background

In the past few months, I studied machine learning on Coursera. I learned some basic analysis models and found that I am enthusiastic about machine learning. Through the course programming assignments, I implemented linear regression and logistic regression in Octave. Octave is an open-source application that lets you do mathematical calculations using the MATLAB language. As you may know, Octave provides matrix operations, so the regression algorithms can be implemented in a vectorized way. In order to better understand those regression algorithms, I decided to implement them again in Ruby. I know most people would use Python for machine learning, but my purpose is to write the non-vectorized version, and I am familiar with Ruby. Thus, I use Ruby to implement them again. Furthermore, the last article is about how to create a Docker image. I use the regression service as an entry point to show how to build your own Docker image, since in the past few months I have participated in a dockerized project and studied some Docker commands that every beginner should master.

Initialize the Ruby on Rails Project

I suggest you use a Unix-based system to write Ruby, since Windows is not very friendly to Ruby.

Before installing Ruby on Rails, you should install Ruby first 🙃

If you are a Linux or Mac user, I suggest you install Ruby through rvm. rvm is a tool that manages multiple Ruby versions and gem versions on your PC.

Download rvm and install it

Install Guideline: https://rvm.io/rvm/install

Install Ruby with the command: rvm install 2.3.3

Create a Gemset and Install Bundler

A gemset isolates gems, so you can manage them well if you have multiple Ruby projects on your PC.

Create a gemset for this project: rvm gemset create rails5

Change the current gemset: rvm gemset use rails5

Bundler is a tool that installs the gems you need in order to run your project.

Install bundler: gem install bundler

After completing all the steps, you can type the following command to create a Rails project in API mode.

create rails project
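The command in the screenshot is most likely the standard Rails 5 API-mode generator, something along the lines of:

    rails new regression_service --api

The --api flag skips the view and asset layers, which a JSON-only service like this one does not need.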

Then you should see that a folder named regression_service has been created.

Please use the command cd regression_service to change the current directory to regression_service. Inside the folder there is a Gemfile, which defines the gem dependencies for this project. I use the multi_json and rspec-rails gems in this project, so please add them to the Gemfile and run the command bundle install again.

Add Gem dependency
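The screenshot above shows the Gemfile additions. Based on the gems mentioned, they would look roughly like this (a sketch without version constraints; the repo's Gemfile is the source of truth):

    # Gemfile (excerpt)
    gem 'multi_json'            # flexible JSON parsing and encoding

    group :development, :test do
      gem 'rspec-rails'         # RSpec testing framework for Rails
    end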

Now the regression_service project has been created. In this section, I will show you how to create a controller to handle the HTTP requests, parse the given HTTP data, define the HTTP data structure, and set up routing. But first, please create a controller named LinearRegression with the following command.

create linear regression controller
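The command in the screenshot is presumably the standard controller generator, something like:

    rails generate controller LinearRegression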

Linear Regression Controller

This controller handles requests about linear regression. It provides three APIs: find best thetas, cost function, and cost and gradient. All three APIs receive a JSON-formatted string and respond with a JSON-formatted string. Before moving on to the algorithm implementation, I would like to implement the parsing function, which parses the received data into the expected format, and the respond function first. The parsing function is invoked through the before_action method provided by the Ruby on Rails framework, which means it is called before each request is handled. The code is shown below.

extract data from json string
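The screenshot above shows the author's parsing code. A minimal sketch of the same idea (the key names and default values here are my assumptions, not necessarily what the repo uses) is:

    def parse_params
      data = MultiJson.load(request.raw_post)     # parse the raw JSON request body

      # split each row: the last element is the output y, the rest are inputs x
      @x = data['dataset'].map { |row| row[0...-1] }
      @y = data['dataset'].map { |row| row[-1] }

      # fall back to defaults when a field is missing from the request
      @thetas        = data['thetas']         || Array.new(@x.first.size + 1, 0)
      @iterations    = data['iterations']     || 1000
      @learning_rate = data['learning_rate']  || 0.01
      @standardize   = data['standardization'] || false
    end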

The above function parses the JSON string received from the request. It extracts the input values into the variable x and the output values into the variable y. If thetas are not defined in the dataset, it initializes them with zero values. In addition, the number of iterations, the learning rate, and the flag that indicates whether to do standardization or not are also taken from the request data. If they are not defined in the received data, I assign default values to these variables. I will explain the meaning of each variable further in the Linear Regression Model section. The test request data is under spec/data. The JSON string structure is shown below.

json string structure

The last element of each row in the dataset is the output y, and the others are the inputs x. For instance, in the first row, 399900 is the output (y), and 2104 and 3 are the inputs (x).
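Based on that description, the request body looks roughly like the following (the key names and the extra rows are my assumption; thetas is optional and defaults to zeros when omitted, and spec/data/linear_regression_test.json in the repo shows the exact structure):

    {
      "dataset": [
        [2104, 3, 399900],
        [1600, 3, 329900],
        [2400, 3, 369000]
      ],
      "iterations": 1500,
      "learning_rate": 0.01,
      "standardization": true
    }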

invoke parse_params before handling each request
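In code this is a single line at the top of the controller (a sketch; the screenshot above shows the author's version):

    class LinearRegressionController < ApplicationController
      # run parse_params before every action in this controller
      before_action :parse_params
    end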

After completing the parse_params function, I create three actions in LinearRegressionController: find_best_thetas, calculate_cost, and cost_and_gradient_descent. Then I add routes to match these actions.

routes.rb
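The routes file appears as an image in the original post; a plausible sketch (the exact paths and HTTP verbs are my assumption) is:

    # config/routes.rb
    Rails.application.routes.draw do
      post 'linear_regression/find_best_thetas',          to: 'linear_regression#find_best_thetas'
      post 'linear_regression/calculate_cost',            to: 'linear_regression#calculate_cost'
      post 'linear_regression/cost_and_gradient_descent', to: 'linear_regression#cost_and_gradient_descent'
    end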

In order to separate the detailed logic from the controller, I implement the algorithm in a linear regression class. I create this class under the lib folder, because Ruby on Rails adds this folder to the Ruby load path automatically.

Before diving into the code implementation, I have to show you the linear regression concept.

Linear Regression Model

A linear regression model is a mathematical model. It assumes that the input and output have a linear relation. It is a very common and simple way to predict an output. Assume you want to predict the rental price of a flat, and you think the rental price is determined by the size of the flat. The price is $700 and the size of the flat is 600 square meters. Then the linear regression model is

700 = theta0 + theta1 * 600.

This is a one-variable linear regression case. Now suppose you consider that the rental price is also determined by the floor of your flat, and the floor is the 5th. Then you have two inputs (x) and one output (y), and the relation between the inputs and the output becomes 700 = theta0 + theta1 * 600 + theta2 * 5.

To conclude, the general linear regression model formula is y = theta(0) + theta(1) * x(1) + … + theta(n) * x(n).

Hypothesis function

linear model function

In the machine learning world, this function is usually called the hypothesis function. If you are familiar with matrix operations, this formula can be simplified to y = θᵀ * X, where θ and X are the matrix of thetas and the matrix of x values (θᵀ is the transpose of θ). From the above formula, you can see that the prediction depends heavily on the values of the thetas. Therefore, if you want to predict the output more accurately, you have to find good theta values. Finding good thetas is where the cost function comes in.

Cost function

The cost function formula is shown in the following picture.

Cost Function
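For reference, written in the same plain-text notation as the other formulas in this article (m is the number of rows in the data set), the standard cost function is:

    J(thetas) = (1 / (2 * m)) * sum from i = 1 to m of ( h(x(i)) - y(i) )^2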

h(x) is the hypothesis function I mentioned before, that is, h(x) = theta(0) + theta(1) * x(1) + … + theta(n) * x(n), and x is a known number. Thus, when you put it into the cost function formula, the cost function finally becomes a function in terms of the thetas.

Gradient

If you are familiar with calculus, you must know the term gradient. Simply put, it is the rate of change of the output when the input value changes. The gradient formula is shown here.

Gradient for theta(0) and for theta(j), j ≥ 1
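The picture most likely shows the standard update rules, which in the same plain-text notation (m is the number of rows, alpha is the learning rate) are:

    theta(0) := theta(0) - alpha * (1 / m) * sum from i = 1 to m of ( h(x(i)) - y(i) )
    theta(j) := theta(j) - alpha * (1 / m) * sum from i = 1 to m of ( h(x(i)) - y(i) ) * x(j)(i),  for j ≥ 1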

Alpha is the learning rate. It is quite critical in the gradient descent formula, because if you pick a large learning rate, the cost function may fail to converge mathematically, and if you choose a small learning rate, the gradient descent calculation will take more time.

Gradient descent

Gradient descent is a calculation process that finds the best thetas, which help you predict the output value more accurately. The logical steps of gradient descent are:

1. Calculate the cost value at the initial thetas: usually the initial thetas are zero, or, in matrix terms, a zero matrix.

2. Find the gradient values at the initial thetas.

3. Update the theta values. For instance, theta(0) = theta(0) - learning rate * gradient(0), …, theta(n) = theta(n) - learning rate * gradient(n).

4. Use the new thetas to calculate the new cost again.

5. Compare the new cost with the original cost: if the new cost minus the original cost is greater than an acceptable threshold, it means the process is starting to fail to converge. Otherwise, use the new cost as the original cost and repeat steps 2 to 5 until the new cost is greater than the old cost.

Finally, the last thetas are the best thetas found in this run of iterations.
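As a rough Ruby sketch of these five steps (this is my own illustration, not the code from the repo; cost_for and gradients_for stand in for the cost and gradient calculations described below):

    def gradient_descent(x, y, thetas, learning_rate, iterations)
      cost = cost_for(x, y, thetas)                       # step 1: cost at the initial thetas

      iterations.times do
        gradients = gradients_for(x, y, thetas)           # step 2: gradients at the current thetas

        # step 3: update every theta using its gradient and the learning rate
        thetas = thetas.each_with_index.map { |theta, j| theta - learning_rate * gradients[j] }

        new_cost = cost_for(x, y, thetas)                 # step 4: cost at the updated thetas
        break if new_cost > cost                          # step 5: stop when the cost starts to rise
        cost = new_cost
      end

      thetas
    end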

Linear Regression Class

LinearRegression is a class that calculates the hypothesis, the cost, and the gradient at particular thetas for a given data set. This class has three variables: x, y, and thetas. In addition, I put all the other mathematical operations, such as transpose, variance, mean, etc., into a class called MathUtil.

MathUtil: https://github.com/Isaac234517/regression_service/blob/master/lib/utils/math_util.rb

Let's take a look at the hypothesis function first.

Hypothesis function

This function simply iterates over the input variable x, which contains the whole input data set. For each row of data, it sums the dot product of the row with the thetas. It is like a simple matrix operation: a 1 × n matrix times an n × 1 matrix. After every iteration has completed, you get a collection containing the dot-product sum of each row. Now you can use this collection to calculate the cost function value and the gradient.
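The original shows this method as a screenshot; a non-vectorized sketch of the same idea (my own version, with an explicit bias term of 1 prepended to each row) is:

    # returns a collection with h(x) for every row of the data set
    def hypothesis(x, thetas)
      x.map do |row|
        features = [1] + row                # prepend 1 for the bias term theta(0)
        features.each_with_index.inject(0) { |sum, (feature, j)| sum + feature * thetas[j] }
      end
    end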

Cost function

According to the cost function formula, I have to find the squared error of the hypothesis value minus the actual value. Therefore, I write code that iterates over the collection of hypothesis values and subtracts the corresponding actual value. That gives me a collection of squared errors, on which I do the sigma (summation) operation. Finally, I divide the sum by 2 * the number of rows in the data set, and I get the cost function value.
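A sketch of that calculation, assuming the hypothesis method above (again my own illustration, not the exact code from the repo):

    # J(thetas) = (1 / (2 * m)) * sum of (h(x) - y)^2
    def cost(x, y, thetas)
      m = x.size
      predictions = hypothesis(x, thetas)
      squared_errors = predictions.each_with_index.map { |h, i| (h - y[i])**2 }
      squared_errors.inject(0) { |sum, e| sum + e } / (2.0 * m)
    end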

Gradient

In order to calculate the gradient easily, I transpose the original input data. As you know, each theta with index greater than or equal to 1 needs to be multiplied by its corresponding feature collection. For instance, if the input data has the form [[x1, x2, x3], [1, 2, 3], [4, 5, 6]] and the thetas are [1, 2, 3], then thetas[1]'s corresponding feature is x1, and the x1 collection is [1, 4]. Therefore, transposing benefits the upcoming calculation. After transposing the input data and getting the collection of differences between the hypothesis and the actual values, I calculate the gradient by iterating over the transposed input data. You should be careful when calculating the first sum: according to the gradient formula, the first sum should not be multiplied by the first collection of features, since it corresponds to the bias term. Thus, I put an if/else statement inside the second loop.
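A sketch of the gradient calculation under the same assumptions (I call MathUtil.transpose here, which is an assumption about the MathUtil interface, and I handle the bias term before the loop instead of with an if/else inside it):

    # returns [gradient(0), gradient(1), ..., gradient(n)]
    def gradients(x, y, thetas)
      m = x.size.to_f
      differences = hypothesis(x, thetas).each_with_index.map { |h, i| h - y[i] }
      columns = MathUtil.transpose(x)       # each column is one feature collection

      result = []
      # gradient(0): bias term, the differences are not multiplied by any feature
      result << differences.inject(0) { |sum, d| sum + d } / m

      columns.each do |feature_column|
        sum = differences.each_with_index.inject(0) { |acc, (d, i)| acc + d * feature_column[i] }
        result << sum / m
      end

      result
    end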

Now you have implemented the critical functions of linear regression. Let's write some test cases to verify that they work.

I advocate that logic tests should be automated, because if you work on a big code base without writing automated tests, you will be in pain. Although this service is just for tutorial purposes, good practice should be kept. For writing Ruby test cases I usually use RSpec, and because I am implementing this service with the Ruby on Rails framework, I use rspec-rails.

  1. Install rspec-rails.
  2. Run the command rails generate rspec:install to generate the files .rspec, spec/rails_helper.rb, and spec/spec_helper.rb.
  3. Then add require File.expand_path(File.dirname(__FILE__) + "/../lib/algorithm.rb") to spec_helper.rb. Because every file under spec requires spec_helper, adding this line there lets all your defined classes be loaded and used in every test case.
  4. Write your test cases now.

My test cases: https://github.com/Isaac234517/regression_service/tree/master/spec
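A minimal spec, assuming a LinearRegression class initialized with x, y, and thetas and exposing a cost method like the sketch above (the constructor and method names are my assumption; the real specs are linked above and may be structured differently):

    require 'spec_helper'

    describe LinearRegression do
      it 'returns zero cost when the thetas fit the data exactly' do
        x = [[1], [2], [3]]
        y = [3, 5, 7]             # y = 1 + 2 * x, so thetas [1, 2] fit perfectly
        thetas = [1, 2]

        expect(LinearRegression.new(x, y, thetas).cost).to eq(0)
      end
    end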

APIs

As I mentioned, I implement three APIs in the linear regression controller: find best thetas, calculate cost, and cost and gradient.

https://github.com/Isaac234517/regression_service/blob/master/app/controllers/linear_regression_controller.rb

The core logic is implemented in the LinearRegression class, so there is little logic in these APIs except for find best thetas. To find the best thetas, you have to update the theta values again and again and compare each new cost function value with the last one, just as I described earlier. However, I consider this process not to be part of the nature of the linear regression algorithm itself, so I treat it as part of the find best thetas API.

find best theta

This API has three sub-processes: the first is to do or not do feature scaling, the second is to run gradient descent, and the last is to respond with the result.
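A rough sketch of that flow (my own illustration; standardize and gradient_descent are stand-ins for the helpers the repo actually uses, and the instance variables come from the parse_params sketch earlier):

    def find_best_thetas
      # sub-process 1: optional feature scaling
      x = @standardize ? standardize(@x) : @x

      # sub-process 2: run gradient descent for the requested number of iterations
      best_thetas = gradient_descent(x, @y, @thetas, @learning_rate, @iterations)

      # sub-process 3: respond with the result as JSON
      render json: { thetas: best_thetas }
    end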

Interaction

After a long time coding, in this section I demonstrate these three APIs. I use the Firefox plugin HttpRequester to post HTTP requests to this service.

  1. find_best_thetas: use the test data under the spec/data folder: linear_regression_test.json
  2. calculate_cost: use the test data under the spec/data folder: linear_regression_test2.json
  3. cost_and_gradient: use the test data under the spec/data folder: linear_regression_test2.json
find_best_thetas
calculate_cost
cost_and_gradient

Final

Linear regression is a simple model and is useful for making predictions. The most important thing is that it is easy for beginners to learn. I hope you now have a better understanding of machine learning. In the upcoming article, I will talk about logistic regression, which is useful for classification.
