An Introduction to Data Science via ScalaTion Part 1 — Setting Up

Christian McDaniel
Published in The Quarks
Jun 9, 2018

As I iterate through the high-dimensional hilly ascent toward becoming a data scientist — clambering over statistical overhangs, deep computer science crevices, and technical stretches of mathy rock walls — I have realized both the importance and the difficulty of keeping one foot in theory and the other in implementation. All too often I have mastered a concept but remained clueless about applying it to a messy real-world problem. Conversely, my busy schedule as a Master’s student and the availability of robust single-line function calls make it easy to neglect Step Two in the “implement first, learn-what-you-just-did second” strategy.

Enter ScalaTion, a data modeling and analytics library written straight out of your linear algebra and data science textbooks. This SCALAble simulaTION and analytics library is written in Scala, an object-oriented, functional programming language that runs on the Java Virtual Machine (JVM). Thus, ScalaTion combines the interpretability and conciseness of a high-level programming language with the by-the-books technicality of a “low-level” translation of concepts in linear algebra, statistics, machine learning, and more. Using this library for projects in my Data Science II course was instrumental in bridging the gap between the statistical fundamentals and model implementation involved in learning data science.

Once you are familiar with setting up and running projects using ScalaTion, you begin to understand the robustness and interpretability of its classes and functions. However, ScalaTion can be a bit tricky when you’re getting started, and this blog post is intended to guide you through just that. We will first set up our ScalaTion environment. In Part Two we’ll execute our first data science model using ScalaTion, and in Part Three we’ll dig into the code we will have just run. Keep in mind that this tutorial, although written for complete beginners, was created on a 2017 MacBook Pro, and some of the steps may not generalize seamlessly to other makes and models.

The ScalaTion Environment

I’ve constructed a bash script on The Quarks’ GitHub repository to download all the necessary software, set up your first project directory, and transfer in the necessary files. Clone the repository to your local machine, enter the scalation/ folder, and execute the .sh file via

sh <path/to/scalation_setup.sh>

where <path/to/scalation_setup.sh> is the path to the saved bash script. This site is helpful for creating a GitHub account and getting set up to use the git command. To clone our repository, follow the link and hit the green Clone or Download button to copy the link, then in Terminal on your Mac, cd into a folder in which you’d like to install the repo and enter git clone https://github.com/quinngroup/quarks.

Now let’s discuss what this script does so we can be informed data scientists!

Downloading your software

  • BEFORE running the bash script, ensure your computer’s Java is up to date by typing java -version in your terminal. Java version 1.8 or higher is sufficient; otherwise, you can download/update Java here.
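The version check above can be scripted. This is a minimal sketch, not part of the original setup script: java_major and check_java are hypothetical helper names, and the parsing assumes the usual quoted version string that java -version prints (for example "1.8.0_171" or "11.0.2").

```shell
#!/bin/sh
# Hypothetical helper: reduce a Java version string to its major number.
# "1.8.0_171" -> 8 (legacy 1.x scheme), "11.0.2" -> 11 (newer scheme).
java_major() {
  v="${1#1.}"            # drop a leading "1." if present
  echo "${v%%[._-]*}"    # keep what precedes the first ".", "_" or "-"
}

# Hypothetical helper: succeed only if the installed Java is 1.8 or newer.
check_java() {
  if ! command -v java >/dev/null 2>&1; then
    echo "java not found"; return 1
  fi
  # java -version writes to stderr, e.g.: java version "1.8.0_171"
  raw=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  major=$(java_major "$raw")
  case "$major" in
    ''|*[!0-9]*) echo "could not parse Java version"; return 1 ;;
  esac
  if [ "$major" -ge 8 ]; then
    echo "Java $raw is new enough"
  else
    echo "Java $raw is too old; 1.8 or higher is needed"; return 1
  fi
}
```

Calling check_java before launching the setup script gives an early, readable failure instead of a confusing sbt error later on.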
#!/bin/bash

This line, called a shebang, tells the system to run the script with the bash interpreter.

cd /Applications

Change the current working directory to /Applications. This is where the software will be stored.

sudo port install sbt

Install the simple build tool (sbt) through MacPorts (the port command). There is no separate Scala download: sbt fetches the Scala version named in a project's build.sbt the first time it builds that project. You’ll need admin privileges on your device and will have to enter your password for the script to continue (pro tip: when typing a password at the command line, no indication of typing is given as a security measure; just type in your password and hit Enter).
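If you are not sure whether MacPorts is on your machine, a small check can tell you which install command applies. This is only a sketch: sbt_install_hint is a hypothetical helper, and the Homebrew branch is an assumption on my part, since the original script uses MacPorts only.

```shell
#!/bin/sh
# Hypothetical helper: print the command that would install sbt here.
sbt_install_hint() {
  if command -v sbt >/dev/null 2>&1; then
    echo "sbt already installed"
  elif command -v port >/dev/null 2>&1; then
    echo "sudo port install sbt"     # MacPorts, as in the setup script
  elif command -v brew >/dev/null 2>&1; then
    echo "brew install sbt"          # Homebrew alternative (not in the original script)
  else
    echo "install MacPorts or Homebrew first"
  fi
}

sbt_install_hint
```

The function only prints a suggestion and never installs anything, so it is safe to run before deciding how to proceed.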

wget http://cobweb.cs.uga.edu/~jam/scalation_1.4.tar.gz
tar xvfz scalation_1.4.tar.gz

ScalaTion is under ongoing development in Dr. John Miller’s lab in the computer science department at the University of Georgia. The wget command downloads ScalaTion from its homepage hosted by the UGA CS server. ScalaTion 1.4 is currently the most stable version, so this is what we will download; however, newer versions are currently being tested and perfected!

Troubleshooting tips:

  • The wget command is not installed on macOS by default. You can install it using Homebrew via brew install wget or from source using this helpful guide. The subsequent tar command uncompresses the library into a usable format.
  • If the download has issues, make sure your version of Scala is compatible (Scala 2.12 for ScalaTion 1.4). The download links on the homepage tell what version of Scala is needed for each version of ScalaTion.
  • If you are having issues with the .tar compressed file, a .zip version is also available. Download the file with the .zip extension and run unzip scalation_1.4.zip to unzip.
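The first troubleshooting tip can also be handled inside a script by falling back to curl, which ships with macOS, whenever wget is missing. A minimal sketch under that assumption; pick_downloader and fetch are hypothetical helper names, and the URL is the one used in the setup script.

```shell
#!/bin/sh
URL="http://cobweb.cs.uga.edu/~jam/scalation_1.4.tar.gz"   # from the setup script

# Hypothetical helper: print the first available download tool, or fail.
pick_downloader() {
  if command -v wget >/dev/null 2>&1; then
    echo wget
  elif command -v curl >/dev/null 2>&1; then
    echo curl
  else
    return 1
  fi
}

# Hypothetical helper: download the URL in $1 into the current directory.
fetch() {
  case "$(pick_downloader)" in
    wget) wget "$1" ;;
    # -f: fail on HTTP errors, -L: follow redirects, -O: keep the remote file name
    curl) curl -fLO "$1" ;;
    *)    echo "install wget or curl first" >&2; return 1 ;;
  esac
}
```

With these defined, fetch "$URL" followed by tar xvfz scalation_1.4.tar.gz reproduces the two commands from the script without requiring wget.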
mkdir ~/Documents/HelloWorldOfDS
cd ~/Documents/HelloWorldOfDS

The project directory is created in the default ~/Documents folder; feel free to change the location of this directory.

Automate the build structure

# source for automated build: Scala Cookbook Ch. 18.1
# generate the file structure
mkdir -p src/{main,test}/{java,resources,scala}
mkdir lib project target
# create an initial build.sbt file
echo 'name := "HelloWorldOfDS"
version := "1.0"
scalaVersion := "2.12.4"' > build.sbt
  • Sbt compiles and runs programs using a structured filepath system. Manually setting this up from the command line can seem tedious, but we have just automated the process. I suggest creating a separate bash script that performs this automated build for later projects:
#!/bin/bash
# source: Scala Cookbook Ch. 18.1
mkdir -p src/{main,test}/{java,resources,scala}
mkdir lib project target
# create an initial build.sbt file
echo 'name := "MyProject"
version := "1.0"
scalaVersion := "2.12.4"' > build.sbt
  • Title the script something like mkdirs4sbt and save with a .sh extension.
  • Now when you create your project’s main folder, you will simply enter that folder (cd) and run sh <filepath/to/mkdirs4sbt> and the file structure will be built automatically.
  • More information can be found in the Scala Cookbook Chapter 18.1.
  • Side note for beginners (like me): As someone with very little computer science background before entering the field of data science one year ago, this build process seemed quite cumbersome given the ease of writing and running Python scripts. However, I have since used similar build processes with Java (Maven) and C++ (Make), and using ScalaTion from the command line helped familiarize me with these common procedures in lieu of formal training. So don’t give in to apprehension; get your hands dirty!
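For reuse across projects, the build script can take the project name as an argument. The version below is a sketch rather than the Scala Cookbook recipe itself: make_sbt_layout is a hypothetical function name, the nested loops replace the brace expansion so it also runs under a plain sh that lacks that bash feature, and it leaves an existing build.sbt untouched.

```shell
#!/bin/sh
# Hypothetical variant of mkdirs4sbt: the project name comes from $1.
make_sbt_layout() {
  name="${1:-MyProject}"            # default name if no argument is given
  # Same six folders as src/{main,test}/{java,resources,scala}
  for scope in main test; do
    for kind in java resources scala; do
      mkdir -p "src/$scope/$kind"
    done
  done
  mkdir -p lib project target       # -p: no error if they already exist
  # Write build.sbt only if it is absent, so re-running never clobbers edits.
  if [ ! -f build.sbt ]; then
    printf 'name := "%s"\nversion := "1.0"\nscalaVersion := "2.12.4"\n' "$name" > build.sbt
  fi
}
```

Inside a fresh project folder, make_sbt_layout HelloWorldOfDS builds the same structure as the commands above, and running it a second time changes nothing.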
scp -r /Applications/scalation_1.4/scalation_models/lib/ ./lib/

We need to copy the .jar files in the lib folder of the scalation_modeling package into the lib folder of our project. The .jar extension stands for Java ARchive, and these files typically hold a number of packages and class files together for easy distribution. This step is performed using the secure copy command scp.
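Because both folders are on the same machine, a plain cp does the same job as scp here. The sketch below is an alternative, not the original script's method: copy_jars is a hypothetical helper, and it assumes the .jar files sit directly inside the source lib folder.

```shell
#!/bin/sh
# Hypothetical helper: copy every .jar from folder $1 into folder $2
# and print how many .jar files the destination then holds.
copy_jars() {
  src="$1"; dst="$2"
  mkdir -p "$dst" || return 1
  cp "$src"/*.jar "$dst"/ || return 1   # fails if the source has no .jar files
  set -- "$dst"/*.jar                   # count the jars now in the destination
  echo "$#"
}
```

From the project folder, copy_jars /Applications/scalation_1.4/scalation_models/lib ./lib uses the same source path as the scp command above, and the printed count is a quick check that the jars arrived.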

Learning from ScalaTion

As you begin to explore ScalaTion, note that Dr. Miller has included on the library homepage a number of textbooks/manuals to accompany the library. The most useful one for me so far has been Introduction to Data Science Using ScalaTion, which explains the linear algebra, statistics, calculus, and machine learning concepts that inspire many of the library’s classes and methods. Additionally, many of the classes include thorough documentation and example Test objects which demonstrate the use of the class and its methods. The data folder within the main scalation_1.4 folder contains toy datasets for this purpose and for your own usage.

Conclusion

You have now successfully set up an environment for building powerful machine learning models using the ScalaTion library. In Part Two of this tutorial we will run our first standalone program using ScalaTion, which will read in some data and run a simple model. Then, in Part Three we’ll explore the code, and you’ll be ready to start your own machine learning projects :)
