An Introduction to Data Science via ScalaTion Part 2 — HelloWorld of Data Science

Source: https://blog.rntech.co.uk/

In Part 1 of this series we introduced ScalaTion, a robust Scala-based modeling and analytics library that stays so close to the fundamentals of linear algebra, calculus, statistics, and machine learning that working with it has been integral to my own development as a data scientist. We also downloaded the required software and automated the build process for starting new projects. In Part 2, we execute our first ScalaTion script — HelloWorldOfDS.scala .


Runnin' > Walking… Gimme the results!

If you have followed Part 1 of this tutorial, then your project folder should be set up and equipped with the necessary build.sbt and .jar files. Additionally, you should have a local clone of The Quarks' GitHub repository, which contains the HelloWorldOfDS.scala file we will be running and exploring. You can transfer this file into the project folder you built in Part 1, or simply run the script from the local clone itself.

Either way, to run the script, open Terminal (or another bash shell) and cd into the main project folder, the one containing the build.sbt file and the data, lib, project, src, and target folders. In The Quarks' GitHub repository, this folder is the one titled HelloWorldOfDS inside the scalation folder.

Once inside this folder, typing sbt run should launch sbt, then compile and execute the code within the project's tree structure. If all goes smoothly, you will end up with output resembling this:

(1)

|====================|
|   CANCER  DATASET  |
|    10  FOLD  CV    |
|    NAIVE  BAYES    |
|====================|
>meanCV
              accuracy      recall   precision     f-score
NaiveBayes 0.719285714 0.825150405 0.771436896 0.792888583
==========================================================
>stdCV
              accuracy      recall   precision     f-score
NaiveBayes 0.132583597 0.134881091 0.134661226 0.118966704
==========================================================

… and if you scroll up a bit, a sequence of outputs resembling

(2)

backwardElim: cols_j = Set(0)
backward model: remove x_j = 11 with b = VectorD(5.87791)
fit = VectorD(3840.99, 3840.99, 0.784196, 0.885548, 0.670793, 3.78858e-15, 3.77476e-15, Infinity, -1188.69, -1182.19)

… and further up

(3)

|====================|
|  WineDrop DATASET  |
|    MULT LIN REG    |
|====================|

Coefficients:
   | Estimate   |   StdErr   |  t value | Pr(>|t|)
x₀ | 140.098884 |  18.529223 |   7.5610 |   0.00000
x₁ |   0.060412 |   0.020444 |   2.9550 |   0.00313
x₂ |  -1.904478 |   0.112399 | -16.9440 |   0.00000
x₃ |   0.076795 |   0.007549 |  10.1731 |   0.00000
x₄ |  -0.276671 |   0.542720 |  -0.5098 |   0.61020
x₅ |   0.003777 |   0.000842 |   4.4848 |   0.00001
x₆ |  -0.000311 |   0.000377 |  -0.8235 |   0.41020
x₇ | -138.989068 |  18.906592 |  -7.3514 |   0.00000
x₈ |   0.660673 |   0.103723 |   6.3696 |   0.00000
x₉ |   0.617767 |   0.100076 |   6.1730 |   0.00000
x₁₀ |   0.009603 |   0.001095 |   8.7670 |   0.00000
SSE:             2751.6229
Residual stdErr: 0.7504 on 10 degrees of freedom
R-Squared:       0.2836, Adjusted rSquared:  0.2822
F-Statistic:     193.4762 on 10 and 4887 DF
AIC:             -2802.3636
BIC:             -2730.9012
--------------------------------------------------------------------------------
VIF:
VectorD(0.340531, 0.000321888, 0.939356, 1.39361e-05, 0.250047, 0.0932211, 3.71570e-07, 0.00305491, 0.000444961, 0.0593856)

Great, I’m a Data Scientist! But… uh… What have I done?

In short, you read in two different datasets: real-valued data from a CSV file and categorical data from an ARFF file.

You performed multiple linear regression (MLR) on the real-valued data in its raw form (WineQual), after a simple transformation (Wine-tf), and again after removing an attribute (WineDrop), guided by forward selection, backward elimination, and VIF calculation (more on these in Part 3). Each regression fit returned the derived least-squares coefficients and their associated statistics, as well as several quality-of-fit measures (see outputs (2) and (3) above).
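To make the least-squares idea concrete: a regression fit chooses the coefficient vector b that minimizes the sum of squared errors ‖y − Xb‖². The sketch below is plain Scala, not ScalaTion code, and the data is made up; it solves the one-predictor case y = b₀ + b₁x in closed form, whereas ScalaTion's Regression class handles the general multi-variable case via matrix methods.

```scala
// Minimal ordinary least squares for y = b0 + b1*x (illustration only).
object OLSSketch:
  def fit(x: Array[Double], y: Array[Double]): (Double, Double) =
    val n  = x.length.toDouble
    val mx = x.sum / n                            // mean of x
    val my = y.sum / n                            // mean of y
    // slope = covariance(x, y) / variance(x)
    val b1 = x.zip(y).map { case (xi, yi) => (xi - mx) * (yi - my) }.sum /
             x.map(xi => (xi - mx) * (xi - mx)).sum
    val b0 = my - b1 * mx                         // intercept
    (b0, b1)

  def main(args: Array[String]): Unit =
    val x = Array(1.0, 2.0, 3.0, 4.0)
    val y = Array(2.1, 4.0, 6.1, 7.9)             // roughly y = 2x
    val (b0, b1) = fit(x, y)
    println(f"b0 = $b0%.4f, b1 = $b1%.4f")        // prints: b0 = 0.1500, b1 = 1.9500
```

The fitted coefficients are what appear in the Estimate column of output (3); the surrounding statistics (StdErr, t value, R², F, AIC, BIC) all derive from the residuals y − Xb.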

You then applied the naive Bayes classifier to the categorical data, reporting the mean and standard deviation of the accuracy, recall, precision, and F1 score obtained from 10-fold cross-validation (CV) (see output (1) above).
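The meanCV and stdCV rows in output (1) are simply the mean and standard deviation of each metric across the 10 folds. Here is a minimal plain-Scala sketch; the per-fold accuracies are hypothetical, not taken from the run above, and we use the sample (n − 1) standard deviation.

```scala
// Mean and sample standard deviation of a metric across CV folds.
object CVStats:
  def mean(xs: Seq[Double]): Double = xs.sum / xs.size

  def std(xs: Seq[Double]): Double =
    val m = mean(xs)
    math.sqrt(xs.map(x => (x - m) * (x - m)).sum / (xs.size - 1))

  def main(args: Array[String]): Unit =
    val foldAccuracy = Seq(0.70, 0.75, 0.68, 0.72, 0.74,
                           0.71, 0.69, 0.73, 0.76, 0.70)   // hypothetical fold scores
    println(f"meanCV accuracy = ${mean(foldAccuracy)}%.4f")
    println(f"stdCV  accuracy = ${std(foldAccuracy)}%.4f")
```

The same computation is repeated for recall, precision, and f-score to fill in the two rows of output (1).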

Note: The output above the naive Bayes CV results, resembling repetitions of

vc = Array(6, 3, 11, 7, 3, 3, 2, 6, 2)
distinct value count vc = Array(6, 3, 11, 7, 3, 3, 2, 6, 2)

is built-in output from ScalaTion's NaiveBayes class. It shows the number of distinct values for each attribute, and is printed at two points of execution during each CV fold.
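That value-count computation is easy to reproduce in plain Scala. The sketch below uses a made-up 3×3 categorical matrix (ScalaTion itself operates on its own matrix types, not nested arrays):

```scala
// Distinct value count per column of a categorical data matrix,
// analogous to the vc array printed by ScalaTion's NaiveBayes.
object ValueCounts:
  def vc(data: Array[Array[Int]]): Array[Int] =
    data.transpose.map(col => col.distinct.length)

  def main(args: Array[String]): Unit =
    val data = Array(                  // rows = instances, cols = attributes
      Array(0, 1, 2),
      Array(1, 1, 0),
      Array(2, 0, 2))
    println(vc(data).mkString("vc = Array(", ", ", ")"))   // prints: vc = Array(3, 2, 2)
```

Naive Bayes needs these counts to size its conditional frequency tables, which is why they are computed (and printed) during every fold.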


Congratulations! You've officially run your first machine learning models using the ScalaTion library. If you're a beginner, as I was when I first used the library, just getting to this point can be daunting, so we'll stop here for now. Stay tuned for Part 3, where we take a deeper look at the code we've just run.