An Introduction to Data Science via ScalaTion Part 3 — A Deeper Look
If you’ve made it this far, congratulations! You’ve officially set up your environment (Part 1) for running the powerful data modeling and analytics library ScalaTion, and have run your first program (Part 2), implementing two common machine learning models on two different datasets in the process. Now, let’s dive a bit deeper to better understand the ScalaTion library, the code we’ve run, and the output we received.
Stage 1. Data Prep
If we look at the main object of our HelloWorldOfDS.scala script, i.e., the part enclosed within the curly braces belonging to object HelloWorldOfDS extends App, we first see two methods, prepReg() and prepBayes().
prepReg() uses the MatrixD class in ScalaTion to read in a real-valued dataset, winequality-white.csv (provided by ScalaTion, located in the data/analytics directory within the library’s main folder). The winequality-white dataset contains nearly 5000 examples of white wine batches, each of which maps eleven physico-chemical properties of a wine to a quality rating assigned by an oenologist.
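To make that step concrete, here is a minimal sketch in plain Scala (not ScalaTion’s own reader) of what loading a numeric CSV file into a row-by-column matrix amounts to; the file path, the “;” separator, and the presence of a header row are assumptions made purely for illustration:

import scala.io.Source

object LoadCsvSketch extends App {
  // Hypothetical path and separator -- adjust to wherever your copy of the file lives.
  val fileName  = "data/analytics/winequality-white.csv"
  val separator = ";"

  // Read every line, drop the header row, and parse each field as a Double.
  val source = Source.fromFile (fileName)
  val rows: Array[Array[Double]] =
    try source.getLines ().drop (1).map (_.split (separator).map (_.trim.toDouble)).toArray
    finally source.close ()

  // The last column is the quality rating (y); the first eleven are the predictors (X).
  val x = rows.map (_.init)
  val y = rows.map (_.last)
  println (s"loaded ${rows.length} rows with ${rows.headOption.map (_.length).getOrElse (0)} columns")
}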
Likewise, the prepBayes() method uses the Relation class to read in the categorical breast-cancer.arff dataset within ScalaTion’s data/analytics/classifier directory. The dataset, which maps various patient data points to a binary diagnosis of breast cancer, is converted to the MatriI data structure, similar to the MatrixD class except that the values are Integers (i.e., categorical) instead of Decimals. The missing x means the class allows various types of matrices (e.g., dense, bidiagonal, sparse, symmetric triangular, etc.).
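As a rough illustration of what such a conversion involves (a sketch in plain Scala, not the Relation API itself), categorical string values can be mapped, column by column, to small integer codes to produce an integer matrix; the tiny table below is made up:

object EncodeCategoricalSketch extends App {
  // A tiny, invented categorical table standing in for the breast-cancer data.
  val table: Array[Array[String]] = Array (
    Array ("premeno", "left",  "no"),
    Array ("ge40",    "right", "yes"),
    Array ("premeno", "right", "no"))

  // For each column, assign every distinct value a code 0, 1, 2, ... in order of first appearance.
  val nCols = table.head.length
  val codebooks: Array[Map[String, Int]] =
    Array.tabulate (nCols) { j => table.map (_(j)).distinct.zipWithIndex.toMap }

  // Re-express the table as an integer matrix using the per-column codebooks.
  val encoded: Array[Array[Int]] = table.map (row => row.indices.toArray.map (j => codebooks(j)(row(j))))

  encoded.foreach (r => println (r.mkString (" ")))
}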
Stage 2. Data Crunch
After being prepped, the data are processed using either multiple linear regression (MLR, for the real-valued data) or Naive Bayes (for the categorical data) via the runReg() and crossValidateAlgos() methods, respectively.
Regression:
runReg() uses the Regression class in ScalaTion. Regression chooses from several factorization techniques to solve for the parameter vector b that is fit to the data. b gives us the weights for each predictor as a way to model the overall sample space and thus map a given data point to a predicted outcome. We find b by minimizing the squared error between our predicted and actual target values, and thus call b the least squares coefficients.
Given the equation for MLR, y = Xb + e, where e represents the error that the model cannot eliminate (e.g., due to noise in the data or bias of the model), MLR approximates the optimal values for b by factoring the known components (X and y) in the Normal Equation: XᵀX b = Xᵀy, where Xᵀ is the transpose of X. ScalaTion offers five such factorization techniques: QR factorization, Cholesky factorization, singular value decomposition, LU factorization, and Inverse/Gaussian elimination. The default is Inverse/Gaussian elimination, the classical textbook technique.
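To see the Normal Equation at work without any library calls, here is a minimal, self-contained sketch in plain Scala (not ScalaTion’s Regression class) that forms XᵀX and Xᵀy and solves for b with textbook Gaussian elimination; the tiny dataset at the bottom is made up purely for illustration:

object NormalEquationSketch extends App {
  type Vec = Array[Double]
  type Mat = Array[Array[Double]]

  // Xᵀ X : an n-by-n matrix, where n is the number of predictors (columns of X).
  def xtX (x: Mat): Mat = {
    val n = x.head.length
    Array.tabulate (n, n) ((i, j) => x.map (row => row(i) * row(j)).sum)
  }

  // Xᵀ y : an n-vector.
  def xtY (x: Mat, y: Vec): Vec = {
    val n = x.head.length
    Array.tabulate (n) (i => x.indices.map (r => x(r)(i) * y(r)).sum)
  }

  // Solve A b = c by Gaussian elimination with partial pivoting, then back-substitution.
  def solve (a0: Mat, c0: Vec): Vec = {
    val n = c0.length
    val a = a0.map (_.clone); val c = c0.clone
    for (k <- 0 until n) {
      val p = (k until n).maxBy (i => math.abs (a(i)(k)))          // pivot row
      val tA = a(k); a(k) = a(p); a(p) = tA
      val tC = c(k); c(k) = c(p); c(p) = tC
      for (i <- k + 1 until n) {
        val f = a(i)(k) / a(k)(k)
        for (j <- k until n) a(i)(j) -= f * a(k)(j)
        c(i) -= f * c(k)
      }
    }
    val b = new Array[Double](n)
    for (i <- n - 1 to 0 by -1)
      b(i) = (c(i) - (i + 1 until n).map (j => a(i)(j) * b(j)).sum) / a(i)(i)
    b
  }

  // Made-up example: first column of 1s gives the intercept, second column is a single predictor.
  val x: Mat = Array (Array (1.0, 1.0), Array (1.0, 2.0), Array (1.0, 3.0), Array (1.0, 4.0))
  val y: Vec = Array (2.1, 2.9, 4.2, 4.8)
  val b      = solve (xtX (x), xtY (x, y))
  println (b.mkString ("least squares coefficients b = [", ", ", "]"))
}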
As stated in Part 2, MLR is performed on the raw data (WineQual), after performing a simple transformation (Wine-tf), and again after some pruning. The transformation step transforms one of the attributes, the one representing “Alcohol content”, by squaring its values. Since the Alcohol content attribute is among the most correlated attributes w.r.t. the overall wine quality (¯\_(ツ)_/¯), squaring its values should provide even greater separation between classes, since the change is disproportionately larger for large values than for small ones. I.e., the transformation is performed to help the data help us.
After a second MLR on the transformed data, a pruning step is implemented. Pruning, which consists of removing entire attributes from the data, is useful as a way to reduce the complexity and size of your dataset. However, care must be taken to ensure that only the less informative attributes are removed. Thus, we have used forward selection and backward elimination in the prunePreds() method, along with VIF, an optional output from the Regression class, to guide our pruning decisions.
Forward selection iteratively rebuilds the dataset by selecting the most predictive attribute at each iteration, stopping once a certain number of attributes have been selected or once a threshold of informativeness is reached. Similarly, backward elimination iteratively removes the least informative attribute from the dataset by finding the one which, when hidden from the model, has the least effect on prediction accuracy. VIF stands for variance inflation factor and is a method for detecting multicollinearity among the attributes; i.e., VIF finds attributes which contribute the same information to the overall prediction. All but one of such a group of attributes can be removed with negligible effect on prediction accuracy.
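Below is a generic sketch of the forward-selection loop in plain Scala (not the prunePreds() implementation); the score function is a stand-in for whatever criterion you use, e.g., the R-squared of a model refit on the candidate columns, and the toy version at the bottom is invented just to make the example runnable:

object ForwardSelectionSketch extends App {
  // Greedily grow a set of column indices, at each step adding the column that most
  // improves the score, and stopping when no candidate improves it by at least minGain.
  def forwardSelect (nCols: Int, score: Set[Int] => Double,
                     maxCols: Int, minGain: Double): Set[Int] = {
    var chosen = Set.empty[Int]
    var best   = Double.NegativeInfinity
    var done   = false
    while (!done && chosen.size < maxCols) {
      val candidates = (0 until nCols).filterNot (chosen.contains)
      val (nextCol, nextScore) =
        candidates.map (j => j -> score (chosen + j)).maxBy (_._2)
      if (nextScore - best >= minGain || chosen.isEmpty) {
        chosen = chosen + nextCol; best = nextScore
      } else done = true
    }
    chosen
  }

  // Toy scoring function: pretend columns 1 and 3 are informative, the rest are not.
  val toyScore: Set[Int] => Double = cols => cols.intersect (Set (1, 3)).size.toDouble
  println (forwardSelect (5, toyScore, maxCols = 3, minGain = 0.5))      // expect Set(1, 3)
}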
When ScalaTion’s Regression class is run, the coefficients are returned as well as the model’s performance after fitting. Ideally, we should see an increase in performance after each preprocessing step. Refer to the output snippets in Part 2, sections (2) and (3), for example outputs from running this section of the code.
Bayes:
Next we run our second algorithm, Naive Bayes. Naive Bayes uses Bayes’ Theorem to predict a target class for a given input data vector x; i.e., it aims to predict “y given x,” or y|x, by finding the probability of each target class c given the data instance x, p(y=c|x), and choosing the class with the highest probability. Bayes’ Theorem expresses this unknown probability in terms of known (or easily estimated) probabilities: the prior probability and the conditional probability. The prior probability of each class c, p(y=c), is simply the proportion of training data points with class c as the target value. The conditional probability looks at each predictor variable j and computes the probability that variable j takes value h given that the target y is class c, i.e., p(xj = h | y = c); the “naive” assumption is that the predictors are conditionally independent given the class, so p(x|y) is simply the product of these per-variable probabilities. Putting it all together, Bayes’ Theorem states p(y|x) = p(x|y) p(y) / p(x). We typically drop the term p(x), the probability of the data instance itself, since it is the same for every candidate class and acts only as a normalizing factor. We then choose the class c that maximizes this probability for the given input x.
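For intuition, here is a compact sketch of those two probability tables and the arg-max decision in plain Scala (not ScalaTion’s classifier API); the tiny integer-coded dataset is invented, and add-one (Laplace) smoothing is included as an extra so unseen value/class pairs don’t zero out the product:

object NaiveBayesSketch extends App {
  // xData: rows of integer-coded predictor values; yData: the class label (0 or 1) for each row.
  val xData = Array (Array (0, 1), Array (0, 0), Array (1, 1), Array (1, 0), Array (0, 1))
  val yData = Array (1, 1, 0, 0, 1)

  val classes = yData.distinct.sorted
  val m       = yData.length.toDouble

  // Prior p(y = c): proportion of training rows with class c.
  def prior (c: Int): Double = yData.count (_ == c) / m

  // Conditional p(x_j = h | y = c) with add-one (Laplace) smoothing.
  def cond (j: Int, h: Int, c: Int): Double = {
    val rowsOfC  = xData.indices.filter (i => yData(i) == c)
    val matches  = rowsOfC.count (i => xData(i)(j) == h)
    val distinct = xData.map (_(j)).distinct.length
    (matches + 1.0) / (rowsOfC.length + distinct)
  }

  // Naive assumption: p(x | y = c) is the product of the per-variable conditionals.
  def classify (x: Array[Int]): Int =
    classes.maxBy (c => prior (c) * x.indices.map (j => cond (j, x(j), c)).product)

  println (classify (Array (0, 1)))   // leans toward class 1 given the toy data
}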
Naive Bayes is called in our script within the crossValidateAlgos() method. Cross validation is a technique that adds robustness to your results by repeating the train-and-predict cycle multiple times. The dataset is first divided into k equal subsets, and for each of k iterations, or folds, one subset is held out during training and used as the testing set.
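A sketch of the fold bookkeeping in plain Scala (crossValidateAlgos() itself is more involved): shuffle the row indices once, cut them into k roughly equal parts, and for each fold use one part as the test set and the rest for training; the fitting and scoring steps are left as a comment:

import scala.util.Random

object KFoldSketch extends App {
  // Split 0 until m into k roughly equal, shuffled index groups.
  def kFolds (m: Int, k: Int, seed: Long = 42L): Array[Array[Int]] = {
    val shuffled = new Random (seed).shuffle ((0 until m).toVector)
    shuffled.grouped (math.ceil (m.toDouble / k).toInt).map (_.toArray).toArray
  }

  val m = 10; val k = 5
  for ((testIdx, fold) <- kFolds (m, k).zipWithIndex) {
    val trainIdx = (0 until m).filterNot (testIdx.contains).toArray
    // In the real script: fit the classifier on trainIdx, predict on testIdx, then score.
    println (s"fold $fold  train = ${trainIdx.mkString (",")}  test = ${testIdx.mkString (",")}")
  }
}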
During each fold of the cross validation, after the predictions are made on the held-out subset, the checkClass() method is called to check the predicted class against the ground-truth class for each prediction and accumulate the number of true positives (tp), false positives (fp), true negatives (tn), and false negatives (fn). Since the classification is binary, this process is quite straightforward. Next, the checkClass() method calls getStats() to use the tp, fp, tn, and fn counts to calculate the precision, recall, accuracy, and F-score. Each of these is a slightly different performance measure from the others, and looking at all of them together again adds robustness to your results and can also provide insights into the model’s performance.
Following cross validation, the mean and standard deviation of each performance measure are calculated across all folds and the results are reported (see output snippet (1) in Part 2 for an example). We ran both 10-fold cross validation and 20-fold cross validation for comparison.
Conclusion
We have now completed our first Data Science project using ScalaTion! After working our way through the initial download and build procedure, we implemented a script that loaded in two types of data and ran two different machine learning algorithms along with some data preprocessing and techniques to strengthen our results. With all that behind us, I hope you feel comfortable using this incredible resource for your own projects, and I wish you the best of luck in your Data Science endeavors!