# An Introduction to Data Science via ScalaTion Part 2 — HelloWorld of Data Science

In Part 1 of this two-part series we introduced ScalaTion, a robust Scala-based modeling and analytics library built so closely around the fundamentals of linear algebra, calculus, statistics, and machine learning that it has been integral to my own development as a data scientist. We also downloaded the required software and automated the build process for starting new projects. In Part 2, we execute our first ScalaTion script, `HelloWorldOfDS.scala`.

#### Runnin’ > Walkin’… Gimme the results!

If you have followed Part 1 of this tutorial, then your project folder should be set up and equipped with the necessary `build.sbt` and `.jar` files. Additionally, you should have a local clone of The Quarks’ GitHub repository, which contains the `HelloWorldOfDS.scala` file we will be running and exploring. You can transfer this file into the project folder you built in Part 1, or you can simply run the script from the local clone itself.

Either way, to run the script, open Terminal or another bash shell and `cd` into the *main project folder* containing the `build.sbt` file and the `data`, `lib`, `project`, `src`, and `target` folders. In the `quarks` GitHub repository, this is the folder titled `HelloWorldOfDS` inside the `scalation` folder.

Once inside this folder, typing `sbt run` should open an sbt shell, compile the project, and execute the code within its tree structure. If all goes smoothly, you will end up with output something like this:

(1)

```
|====================|
|   CANCER DATASET   |
|     10 FOLD CV     |
|    NAIVE BAYES     |
|====================|

>meanCV
            accuracy     recall       precision    f-score
NaiveBayes  0.719285714  0.825150405  0.771436896  0.792888583
==========================================================

>stdCV
            accuracy     recall       precision    f-score
NaiveBayes  0.132583597  0.134881091  0.134661226  0.118966704
==========================================================
```

… and if you scroll up a bit, a sequence of outputs resembling

(2)

```
backwardElim: cols_j = Set(0)
backward model: remove x_j = 11 with b = VectorD(5.87791)
fit = VectorD(3840.99, 3840.99, 0.784196, 0.885548, 0.670793, 3.78858e-15, 3.77476e-15, Infinity, -1188.69, -1182.19)
```

… and further up

(3)

```
|====================|
|  WineDrop DATASET  |
|    MULT LIN REG    |
|====================|

Coefficients:
     | Estimate    | StdErr    | t value  | Pr(>|t|)
x₀   |  140.098884 | 18.529223 |   7.5610 | 0.00000
x₁   |    0.060412 |  0.020444 |   2.9550 | 0.00313
x₂   |   -1.904478 |  0.112399 | -16.9440 | 0.00000
x₃   |    0.076795 |  0.007549 |  10.1731 | 0.00000
x₄   |   -0.276671 |  0.542720 |  -0.5098 | 0.61020
x₅   |    0.003777 |  0.000842 |   4.4848 | 0.00001
x₆   |   -0.000311 |  0.000377 |  -0.8235 | 0.41020
x₇   | -138.989068 | 18.906592 |  -7.3514 | 0.00000
x₈   |    0.660673 |  0.103723 |   6.3696 | 0.00000
x₉   |    0.617767 |  0.100076 |   6.1730 | 0.00000
x₁₀  |    0.009603 |  0.001095 |   8.7670 | 0.00000

SSE:             2751.6229
Residual stdErr: 0.7504 on 10 degrees of freedom
R-Squared:       0.2836, Adjusted rSquared: 0.2822
F-Statistic:     193.4762 on 10 and 4887 DF
AIC:             -2802.3636
BIC:             -2730.9012
--------------------------------------------------------------------------------
VIF:
VectorD(0.340531, 0.000321888, 0.939356, 1.39361e-05, 0.250047, 0.0932211, 3.71570e-07, 0.00305491, 0.000444961, 0.0593856)
```

#### Great, I’m a Data Scientist! But… uh… What have I done?

In short, you read in two different datasets, one from a **CSV** file and one from an **ARFF** file, containing real-valued data and categorical data, respectively.

You performed **multiple linear regression** (MLR) on the real-valued data in its raw form (`WineQual`), after performing a simple transformation (`Wine-tf`), and again after removing an attribute (`WineDrop`) guided by forward selection, backward elimination, and VIF calculation (more on that in Part 3). Each regression fit returned the derived least squares coefficients and associated statistics, as well as several quality-of-fit measures (see outputs (2) and (3) above).
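To make the mechanics concrete, here is a minimal sketch of the least squares fit that MLR generalizes, written in plain Scala (not the ScalaTion API) for the one-predictor case, on made-up numbers. For a single predictor the normal equations collapse to closed-form expressions for the slope and intercept:

```scala
// Minimal sketch: ordinary least squares for y ≈ b0 + b1·x,
// solved via the (collapsed) normal equations. Plain Scala, toy data.
object OLSSketch {
  def fitLine(x: Array[Double], y: Array[Double]): (Double, Double) = {
    val n   = x.length.toDouble
    val sx  = x.sum
    val sy  = y.sum
    val sxx = x.map(v => v * v).sum
    val sxy = x.zip(y).map { case (a, b) => a * b }.sum
    val b1  = (n * sxy - sx * sy) / (n * sxx - sx * sx)  // slope
    val b0  = (sy - b1 * sx) / n                         // intercept
    (b0, b1)
  }

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, 2.0, 3.0, 4.0)
    val y = Array(3.0, 5.0, 7.0, 9.0)       // exactly y = 1 + 2x
    val (b0, b1) = fitLine(x, y)
    println(f"b0 = $b0%.4f, b1 = $b1%.4f")  // prints b0 = 1.0000, b1 = 2.0000
  }
}
```

ScalaTion's `Regression` class does the multi-predictor version of this in matrix form, which is why its output reports one coefficient `xᵢ` per attribute.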

You then applied the **naive Bayes classifier** to the categorical data, reporting the mean and standard deviation for accuracy, recall, precision, and F1 score from 10-fold cross validation (CV); see output (1) above.
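As a rough illustration of the four measures in that report, here is a plain-Scala sketch (not the ScalaTion API) that computes accuracy, recall, precision, and F1 score from binary predictions, using made-up labels:

```scala
// Minimal sketch: classification quality measures from a confusion matrix.
// Plain Scala on toy binary labels; 1 = positive class, 0 = negative class.
object MetricsSketch {
  def metrics(pred: Array[Int], actual: Array[Int]): (Double, Double, Double, Double) = {
    val pairs = pred.zip(actual)
    val tp = pairs.count { case (p, a) => p == 1 && a == 1 }.toDouble  // true positives
    val tn = pairs.count { case (p, a) => p == 0 && a == 0 }.toDouble  // true negatives
    val fp = pairs.count { case (p, a) => p == 1 && a == 0 }.toDouble  // false positives
    val fn = pairs.count { case (p, a) => p == 0 && a == 1 }.toDouble  // false negatives
    val accuracy  = (tp + tn) / pairs.length
    val recall    = tp / (tp + fn)
    val precision = tp / (tp + fp)
    val f1        = 2 * precision * recall / (precision + recall)
    (accuracy, recall, precision, f1)
  }

  def main(args: Array[String]): Unit = {
    val pred   = Array(1, 1, 0, 1, 0, 0, 1, 0)
    val actual = Array(1, 0, 0, 1, 1, 0, 1, 0)
    val (acc, rec, prec, f1) = metrics(pred, actual)
    println(f"accuracy=$acc%.3f recall=$rec%.3f precision=$prec%.3f f1=$f1%.3f")
  }
}
```

In 10-fold CV, measures like these are computed once per held-out fold, and the mean and standard deviation across the ten folds are what appear under `>meanCV` and `>stdCV`.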

*Note*: The output above the naive Bayes CV results, resembling repetitions of

```
vc = Array(6, 3, 11, 7, 3, 3, 2, 6, 2)
distinct value count vc = Array(6, 3, 11, 7, 3, 3, 2, 6, 2)
```

is built-in output from the ScalaTion `NaiveBayes` class. It shows the number of unique values for each attribute, and is printed at two points of execution during each CV fold.
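As a rough illustration of what those counts mean, here is a plain-Scala sketch (not the ScalaTion implementation) that computes the distinct value count for each attribute column of a tiny hypothetical dataset:

```scala
// Minimal sketch: per-attribute distinct value counts, as in the
// "vc = Array(...)" output. Plain Scala on a tiny made-up dataset.
object ValueCountSketch {
  def valueCounts(rows: Array[Array[Int]]): Array[Int] =
    // transpose rows into columns, then count unique values per column
    rows.transpose.map(col => col.distinct.length)

  def main(args: Array[String]): Unit = {
    val data = Array(
      Array(0, 1, 2),
      Array(1, 1, 0),
      Array(0, 2, 2),
      Array(2, 1, 1)
    )
    val vc = valueCounts(data)
    println("vc = " + vc.mkString("Array(", ", ", ")"))  // vc = Array(3, 2, 3)
  }
}
```

These counts matter to naive Bayes because the classifier builds a frequency table over each attribute's possible values.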

**Congratulations!** You’ve officially run your first machine learning models using the ScalaTion library. If you’re a beginner like I was when I first used the library, just getting to this point can be daunting, so we’ll stop here for now. Stay tuned for **Part 3**, where we will take a deeper look into the code we’ve just run.