An Introduction to Data Science via ScalaTion Part 2 — HelloWorld of Data Science
In Part 1 of this series we introduced ScalaTion, a robust Scala-based modeling and analytics library built so closely on the fundamentals of linear algebra, calculus, statistics, and machine learning that it has been integral to my own development as a data scientist. We also downloaded the required software and automated the build process for starting new projects. In Part 2, we execute our first ScalaTion script, HelloWorldOfDS.scala.
Runnin > Walking… Gimme the results!
If you have followed Part 1 of this tutorial, then your project folder should be set up and equipped with the necessary build.sbt and .jar files. Additionally, you should have a local clone of The Quarks’ GitHub repository, which contains the HelloWorldOfDS.scala file we will be running and exploring. You can transfer this file into the project folder you built in Part 1, or you can simply run the script from the local clone itself.
Either way, to run the script, open Terminal or another bash shell and cd into the main project folder, the one containing the build.sbt file and the data, lib, project, src, and target folders. In the quarks GitHub repository, this folder is the one titled HelloWorldOfDS inside the scalation folder.
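Condensed into commands, the navigation step looks like this (the path below is illustrative; adjust it to wherever you cloned the quarks repository):

```shell
# path is illustrative -- adjust to wherever you cloned the quarks repo
cd ~/quarks/scalation/HelloWorldOfDS

# sanity check: should list build.sbt plus the data, lib, project, src and target folders
ls
```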
Once inside this folder, typing sbt run should open an sbt shell, then compile and execute the code within the project’s tree structure. If all goes smoothly, you will end up with output something like this:
(1)
|====================|| CANCER DATASET || 10 FOLD CV || NAIVE BAYES ||====================|
> meanCV
             accuracy      recall        precision     f-score
NaiveBayes   0.719285714   0.825150405   0.771436896   0.792888583
==========================================================
> stdCV
             accuracy      recall        precision     f-score
NaiveBayes   0.132583597   0.134881091   0.134661226   0.118966704
==========================================================
… and if you scroll up a bit, a sequence of outputs resembling
(2)
backwardElim: cols_j = Set(0)
backward model: remove x_j = 11 with b = VectorD(5.87791)
fit = VectorD(3840.99, 3840.99, 0.784196, 0.885548, 0.670793, 3.78858e-15, 3.77476e-15, Infinity, -1188.69, -1182.19)
… and further up
(3)
|====================|| WineDrop DATASET || MULT LIN REG ||====================|
Coefficients:
      |    Estimate |    StdErr |  t value | Pr(>|t|)
x₀    |  140.098884 | 18.529223 |   7.5610 |  0.00000
x₁    |    0.060412 |  0.020444 |   2.9550 |  0.00313
x₂    |   -1.904478 |  0.112399 | -16.9440 |  0.00000
x₃    |    0.076795 |  0.007549 |  10.1731 |  0.00000
x₄    |   -0.276671 |  0.542720 |  -0.5098 |  0.61020
x₅    |    0.003777 |  0.000842 |   4.4848 |  0.00001
x₆    |   -0.000311 |  0.000377 |  -0.8235 |  0.41020
x₇    | -138.989068 | 18.906592 |  -7.3514 |  0.00000
x₈    |    0.660673 |  0.103723 |   6.3696 |  0.00000
x₉    |    0.617767 |  0.100076 |   6.1730 |  0.00000
x₁₀   |    0.009603 |  0.001095 |   8.7670 |  0.00000
SSE: 2751.6229
Residual stdErr: 0.7504 on 10 degrees of freedom
R-Squared: 0.2836, Adjusted rSquared: 0.2822
F-Statistic: 193.4762 on 10 and 4887 DF
AIC: -2802.3636
BIC: -2730.9012
--------------------------------------------------------------------------------
VIF: VectorD(0.340531, 0.000321888, 0.939356, 1.39361e-05, 0.250047, 0.0932211, 3.71570e-07, 0.00305491, 0.000444961, 0.0593856)
Great, I’m a Data Scientist! But… uh… What have I done?
In short, you read in two different datasets, one from a CSV file and one from an ARFF file, containing real-valued data and categorical data respectively.
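ScalaTion ships its own loaders for these formats; purely to illustrate what a numeric CSV read amounts to, here is a plain-Scala sketch (the file name, comma separator, and header handling are assumptions for illustration, not the ScalaTion loader itself):

```scala
import scala.io.Source

object CsvSketch {
  // read a numeric CSV into rows of doubles, skipping the header line
  // (comma separator and single header row are assumptions)
  def readCsv (fname: String): Array[Array[Double]] = {
    val src = Source.fromFile (fname)
    try src.getLines ().drop (1)
           .map (_.split (',').map (_.trim.toDouble))
           .toArray
    finally src.close ()
  }
}
```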
You performed multiple linear regression (MLR) on the real-valued data in its raw form (WineQual), after performing a simple transformation (Wine-tf), and again after removing an attribute (WineDrop) guided by forward selection, backward elimination, and VIF calculation (more on that in Part 3). Each regression fit returned the derived least-squares coefficients and associated statistics, as well as several quality-of-fit measures (see outputs (2) and (3) above).
You then applied the naive Bayes classifier to the categorical data, reporting the mean and standard deviation for accuracy, recall, precision, and F1 score from 10-fold cross validation (CV) (See output (1) above).
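The 10-fold split itself can be sketched as follows. The fold assignment below is sequential for simplicity (a real run would shuffle the row indices first), and the object name is illustrative:

```scala
// Sketch of k-fold cross validation over row indices (illustrative, not
// ScalaTion's internals). Each fold yields a (test, train) pair of index sets;
// the mean and std dev of the per-fold metrics give rows like those in output (1).
object CvSketch {
  def kFolds (n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] =
    (0 until k).map { f =>
      val idx   = 0 until n
      val test  = idx.filter (_ % k == f)       // every k-th row lands in fold f
      val train = idx.filter (_ % k != f)       // the rest form the training set
      (test, train)
    }
}
```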
Note: The output above the naive Bayes CV results, resembling repetitions of

vc = Array(6, 3, 11, 7, 3, 3, 2, 6, 2)
distinct value count vc = Array(6, 3, 11, 7, 3, 3, 2, 6, 2)

is built-in output from the ScalaTion NaiveBayes class. It shows the number of unique values for each attribute and is printed at two points of execution during each CV fold.
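The quantity itself is simple: for each column of the categorical data, count the distinct values. A minimal sketch (names are illustrative, not ScalaTion’s internals):

```scala
// Sketch of the per-attribute distinct value count, the quantity printed as "vc".
// data holds rows of categorical codes; names are illustrative.
object VcSketch {
  def valueCounts (data: Array[Array[Int]]): Array[Int] = {
    val nCols = data(0).length
    Array.tabulate (nCols) (j => data.map (_(j)).distinct.length)
  }
}
```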
Congratulations! You’ve officially run your first machine learning models using the ScalaTion library. If you’re a beginner like I was when I first used the library, just getting to this point can be daunting, so we’ll stop here for now. Stay tuned for Part 3, where we will take a deeper look into the code we’ve just run.