An Introduction to Machine Learning with Java

The Software Guild
The Software Guild Blog
13 min read · Mar 29, 2019

This article is the fourth in our new series of blog posts for 2019; perspectives from our expert instructors. Keep your eye on this space to get the latest in what’s happening in code education, straight from our instructors. This month, Dave Smelser, Software Guild Instructor, introduces you to Machine Learning with Java.

Machine learning! It’s the hottest trend in software development. According to a recent survey by indeed.com (the highest-traffic job search website on the internet), Machine Learning Engineer was the “best job” of 2019 in terms of both the growth rate of job postings over the prior three years (344%) and average base salary ($146,085): http://blog.indeed.com/2019/03/14/best-jobs-2019/

Machine learning is an expansive topic, but the current state-of-the-art techniques typically rely on Artificial Neural Networks. In this blog post, we’ll explore the foundations of neural networks and develop a working example of a simple feed-forward neural network, using Java and the Neuroph neural network framework to quickly get our code up and running. This post assumes familiarity with object-oriented programming and the Java programming language, but we’ll walk through all the machine learning details.

Environment Setup Guide

To begin with, we’ll be using the Java 8 JDK which can be downloaded here: https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

Next, we’ll be installing the Netbeans IDE, which can be downloaded here:

https://netbeans.apache.org/download/index.html

Training Data Setup

For our example, we’re going to try to predict the next morning’s opening stock price, given the open, low, high, close, and trading volume of a given stock over the prior five days. FAIR WARNING: this will not really work. If stock market prediction were this easy, there would be quite a lot more millionaires in the world. Stock market prediction is an application of machine learning that attracts millions (if not billions) of dollars every year; the techniques we’ll be applying are not state-of-the-art, and we won’t have the large datasets such techniques rely upon. However, this guide will give us insight into the foundations of those techniques.

In this guide, we’ll be developing a feed-forward neural network and training it using backpropagation. Neural networks are collections of nodes (which can hold values) and connections between them, which have “weights” (simply numeric values) that are applied (by multiplication) to the node values. Each node (other than the input nodes) sums its weighted inputs and passes that sum through a non-linear “activation function” (https://en.wikipedia.org/wiki/Activation_function). Typically, nodes are arranged into layers, where every node in a given layer is connected to every node in the prior layer, going back from the output layer all the way to the input layer.
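To make that concrete, here’s a minimal sketch of what a single node computes: a weighted sum of its inputs passed through a sigmoid activation function. This is illustrative only; Neuroph will do all of this for us internally.

//illustrative only -- Neuroph handles this internally for every node
public class NeuronSketch {
    //the classic sigmoid activation "squashes" any sum into the range (0, 1)
    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }
    //a node's output: the weighted sum of its inputs, passed through the activation
    static double nodeOutput(double[] inputs, double[] weights) {
        double sum = 0.0;
        for (int i = 0; i < inputs.length; i++) {
            sum += inputs[i] * weights[i];
        }
        return sigmoid(sum);
    }
    public static void main(String[] args) {
        double[] inputs = {0.5, 0.1, 0.9};
        double[] weights = {0.4, -0.6, 0.2};
        System.out.println(nodeOutput(inputs, weights)); //prints a value in (0, 1)
    }
}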

Fortunately for us, most of the technical details will be handled by the Neuroph library, so our biggest hurdle will be getting data fed into our network correctly. To begin, we’ll download a dataset to work with. We can pull historical stock data from nasdaq.com in CSV (comma-separated values) format, which we’ll be able to manipulate for our purposes. First, we’ll go to https://www.nasdaq.com/symbol/ibm/historical to pull our daily stock data.

We’ll want to have lots of data to work with for our training, so we’ll adjust the timeframe from the default of 3 Months to the maximum of 10 Years:

Once the data loads, we’ll scroll (very far) to the bottom of the page and select Download this file in Excel Format.

This will produce a file called Historical Quotes.csv in our Downloads folder. While the link mentions Excel specifically, the file is actually just a plain text file that can be viewed in any plain text editor.

Just to give us an idea of the kind of file we’ll be working with, let’s take a look in a plain-text editor:
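The exact rows will depend on when you download the file, but the layout looks something like this (the quote values below are made up for illustration, not actual IBM prices):

date, close, volume, open, high, low
2019/03/15, 139.43, 7730130.0000, 139.30, 140.33, 139.04
2019/03/14, 139.00, 3547340.0000, 138.92, 139.35, 138.47
2019/03/13, 138.79, 3753460.0000, 138.45, 139.12, 137.87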

As you can see, the data comprises six columns: date, close, volume, open, high, and low. The date column uses the format “yyyy/MM/dd” (more on that later), and the rest are floating-point numbers.

NOTE: For the purposes of this exercise, I’ve moved my data file to my desktop; however, you can leave the file wherever it lives, as long as you point your code at it.

To begin with, let’s create a Maven Java Application project in NetBeans:

Click Next > and then we’ll set the Project Name (I used NeuralStocks) and the Project Location (I used my desktop again, but feel free to locate it wherever makes sense for you).

Once you click Finish, you should see the new project open up in the Projects pane.

Next, we’re going to be adding our dependencies and building the project in order to download the relevant jar files. Expand Program Files and double-click your pom.xml file. Within the file, we’ll need to add the neuroph repository so that we can pull down our requirements. Inside of the project node add the following:

<repositories>
    <repository>
        <id>neuroph.sourceforge.net</id>
        <url>http://neuroph.sourceforge.net/maven2/</url>
    </repository>
</repositories>

Now, we’ll add the actual dependencies we’ll be using for this project:

<dependencies>
    <dependency>
        <groupId>org.neuroph</groupId>
        <artifactId>neuroph-core</artifactId>
        <version>2.8</version>
    </dependency>
    <dependency>
        <groupId>org.apache.commons</groupId>
        <artifactId>commons-csv</artifactId>
        <version>1.6</version>
    </dependency>
</dependencies>

The neuroph-core dependency will provide the neural network classes we’ll be using, and the commons-csv library will provide a convenient way to read our data file.

First things first, let’s add an App class by right-clicking our main package under Source Packages (for me, it’s com.sg.neuralstocks), and then add a main method to it. When we’re done, the file should look something like this:

/* 
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package com.sg.neuralstocks;
/**
*
* @author dsmelser
*/
public class App {
    public static void main(String[] args) {
    }
}

At this point, we should build our application (using the hammer button) and see (hopefully) that NetBeans downloads the relevant jar files.

Now, we’ll add a new class file to our package that will represent a single day’s worth of data. This will simplify the process of building our training rows (as we’ll see shortly). Just like we added the App class, we’ll be adding a NasdaqRecord class with date, close, volume, open, high, and low fields as well as getter setter methods for each of them:

/* 
* To change this license header, choose License Headers in Project Properties.
* To change this template file, choose Tools | Templates
* and open the template in the editor.
*/
package com.sg.neuralstocks;
import java.time.LocalDate;
/**
*
* @author dsmelser
*/
public class NasdaqRecord {
private LocalDate date;
private double close;
private long volume;
private double open;
private double high;
private double low;
/**
* @return the date
*/
public LocalDate getDate() {
return date;
}
/**
* @param date the date to set
*/
public void setDate(LocalDate date) {
this.date = date;
}
/**
* @return the close
*/
public double getClose() {
return close;
}
/**
* @param close the close to set
*/
public void setClose(double close) {
this.close = close;
}
/**
* @return the volume
*/
public long getVolume() {
return volume;
}
/**
* @param volume the volume to set
*/
public void setVolume(long volume) {
this.volume = volume;
}
/**
* @return the open
*/
public double getOpen() {
return open;
}
/**
* @param open the open to set
*/
public void setOpen(double open) {
this.open = open;
}
/**
* @return the high
*/
public double getHigh() {
return high;
}
/**
* @param high the high to set
*/
public void setHigh(double high) {
this.high = high;
}
/**
* @return the low
*/
public double getLow() {
return low;
}
/**
* @param low the low to set
*/
public void setLow(double low) {
this.low = low;
}
}

With that class created we are ready to begin writing a method in our App class to read our CSV file and build a list of NasdaqRecord objects. We’ll start by defining a field variable at the top of our App class that will hold our file path. Normally we wouldn’t want to hard-code this value, but the process of getting a filepath from the user is beyond the scope of this walkthrough. For now, we’ll add a field like so:

final static String csvPath = "C:\\Users\\dsmelser\\Desktop\\HistoricalQuotes.csv";

If we decide to move or rename our files, we’ll be able to update this variable and our code should still work the same. Again, this path points at MY desktop because I moved my file there. You’ll need to adjust this path accordingly for your own system.

With the file path defined, we’re ready to write methods in our App class that will extract our data:

//these methods need imports for: java.io.File, java.io.IOException, java.nio.charset.Charset,
//java.time.LocalDate, java.time.format.DateTimeFormatter, java.time.format.DateTimeParseException,
//java.util.ArrayList, java.util.List, and org.apache.commons.csv.{CSVFormat, CSVParser, CSVRecord}
private static List<NasdaqRecord> readRecords() throws IOException {
    List<NasdaqRecord> allRecords = new ArrayList<>();
    File csvData = new File(csvPath);
    CSVParser parser = CSVParser.parse(csvData, Charset.defaultCharset(), CSVFormat.RFC4180);
    for (CSVRecord record : parser) {
        NasdaqRecord toAdd = convertRecord(record);
        if (toAdd != null) {
            allRecords.add(toAdd);
        }
    }
    return allRecords;
}

private static NasdaqRecord convertRecord(CSVRecord record) {
    NasdaqRecord toReturn = new NasdaqRecord();
    try {
        toReturn.setDate(
                LocalDate.parse(record.get(0), DateTimeFormatter.ofPattern("yyyy/MM/dd")));
        toReturn.setClose(Double.parseDouble(record.get(1)));
        toReturn.setVolume((long) Double.parseDouble(record.get(2)));
        toReturn.setOpen(Double.parseDouble(record.get(3)));
        toReturn.setHigh(Double.parseDouble(record.get(4)));
        toReturn.setLow(Double.parseDouble(record.get(5)));
    } catch (DateTimeParseException | NumberFormatException ex) {
        //row failed to parse (e.g. the header row), just return a null
        toReturn = null;
    }
    return toReturn;
}

With this code, we’ll be able to parse through our data, convert every valid row into a NasdaqRecord object, and store it in a list to be used later for building our training set. As you can see, each record’s date is parsed using the “yyyy/MM/dd” format; the volume is originally read as a double but then converted to a long; and the rest are left as doubles. This isn’t terribly relevant here, since we’ll end up using doubles for all the training data, but if we ever wanted to use NasdaqRecord for anything else (reporting, etc.), it could be useful to have the numeric data in its “true” form.

With a list of NasdaqRecord objects, we should be able to build our training and validation datasets. As part of creating our DataSets, we’ll need to “normalize” our data so as not to overwhelm our network. If our data values are very large, our network weights will have to become very small to compensate, and our training algorithm will have trouble adjusting the weights accordingly. To handle this, we’ll define a normalization constant that applies to all of our data. In the App class, underneath where we defined the path to our CSV file, we’ll define a double constant:

final static double NORMALIZATION_FACTOR = 10000.0;
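To make that concrete: with this factor, a closing price of, say, 145.32 enters the network as 145.32 / 10000.0 = 0.014532, so every input stays comfortably below 1.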

In machine learning, we’re often concerned about “overfitting” (though that shouldn’t be an issue for us in this case). To check for it, we set aside some of our data and do NOT show it to the network during training. We can then verify that our model has not simply memorized the correct answers for the training data by showing it the data we set aside and seeing whether it responds appropriately to new inputs.

Now, in order to create our two DataSet objects, we’ll define a generation method with three parameters: the list of NasdaqRecord objects, an inclusive minimum index, and an exclusive maximum index:

private static DataSet generateSet(List<NasdaqRecord> allRecords, int incStart, int excEnd) {
    //dataset has 25 inputs (close, volume, open, high, low) x 5 days
    //and 1 output (tomorrow's open)
    DataSet toReturn = new DataSet(25, 1);
    //now that we've loaded all the data, we need to sort by date
    allRecords.sort((a, b) -> a.getDate().compareTo(b.getDate()));
    //now we need to build training rows
    for (int i = incStart; i < excEnd; i++) {
        //tomorrow's open is the desired output
        double[] output = new double[]{allRecords.get(i + 1).getOpen() / NORMALIZATION_FACTOR};
        double[] input = new double[25];
        //we're gonna gather the last 5 days of data
        for (int lookbehind = 0; lookbehind < 5; lookbehind++) {
            //each day's worth of data takes up 5 columns
            //all data is normalized
            input[0 + lookbehind * 5] = allRecords.get(i - lookbehind).getClose() / NORMALIZATION_FACTOR;
            //we take the natural log of the volume to compress the range into something
            //more reasonable and to make it so that only large shifts seriously impact
            //our analysis
            input[1 + lookbehind * 5] = Math.log(allRecords.get(i - lookbehind).getVolume()) / NORMALIZATION_FACTOR;
            input[2 + lookbehind * 5] = allRecords.get(i - lookbehind).getOpen() / NORMALIZATION_FACTOR;
            input[3 + lookbehind * 5] = allRecords.get(i - lookbehind).getHigh() / NORMALIZATION_FACTOR;
            input[4 + lookbehind * 5] = allRecords.get(i - lookbehind).getLow() / NORMALIZATION_FACTOR;
        }
        toReturn.addRow(input, output);
    }
    return toReturn;
}

This code starts by sorting the list by date (our lookbehind indexing assumes the records are in chronological order, which is why this will come into play later) and then builds two arrays for every trading day in our range: an input and an output. Our input arrays hold 25 elements (the close, the log-volume, the open, the high, and the low for each of five days), and the output array holds a single element (the next day’s open).

With this in place, we’ll write a couple of utility methods to calculate the boundaries of the training and validation data sets. To do this, we’ll first define one last constant field, which determines the proportion of data to use for training. I set mine to 0.99 (99%), but you can try other proportions as well. Underneath our NORMALIZATION_FACTOR constant, we’ll define our double:

final static double TRAINING_PERCENT = 0.99;
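For a sense of scale: ten years of data is roughly 2,520 trading days (about 252 per year), so a 0.99 split gives us roughly 2,490 rows to train with and only about 25 to validate with.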

With this constant in place, we can compute the start and end points of each DataSet in our list for our two helper methods:

private static DataSet generateTrainingSet(List<NasdaqRecord> allRecords) throws IOException {
    int incStart = 4;
    int excEnd = (int) (TRAINING_PERCENT * allRecords.size() - 5);
    return generateSet(allRecords, incStart, excEnd);
}

private static DataSet generateValidationSet(List<NasdaqRecord> allRecords) throws IOException {
    int incStart = (int) (TRAINING_PERCENT * allRecords.size());
    int excEnd = allRecords.size() - 1;
    return generateSet(allRecords, incStart, excEnd);
}

These two methods together will let us quickly and easily create our data sets. The remainder of our steps (creating the neural network, training it, and then testing the validation set) will be fairly trivial because of the hard work done by the Neuroph team. Let’s finish our application by updating main and writing a small helper method to test our validation data:

//NetBeans can generate the remaining imports (DataSet, MultiLayerPerceptron,
//MomentumBackpropagation, Logger, Level) via its Fix Imports action
public static void main(String[] args) {
    try {
        List<NasdaqRecord> allRecords = readRecords();
        //grab the first 99% of the data to train with
        DataSet trainingSet = generateTrainingSet(allRecords);
        //grab the last 1% of the data to validate with
        DataSet validationSet = generateValidationSet(allRecords);
        //close, log-volume, open, high, low x 5 days = 25 inputs
        //3 hidden layers with 1000, 100, and 10 nodes per layer
        //1 output layer with 1 node
        MultiLayerPerceptron neuralNet = new MultiLayerPerceptron(25, 1000, 100, 10, 1);
        MomentumBackpropagation learningRule = (MomentumBackpropagation) neuralNet.getLearningRule();
        learningRule.setLearningRate(0.1);
        learningRule.setMaxIterations(100000);

        System.out.println("BEFORE LEARNING");
        printValidationCheck(neuralNet, validationSet);
        System.out.println("==========================");

        //perform backpropagation
        neuralNet.learn(trainingSet);

        System.out.println("AFTER LEARNING");
        System.out.println("Error: " + neuralNet.getLearningRule().getTotalNetworkError());
        printValidationCheck(neuralNet, validationSet);
    } catch (IOException ex) {
        Logger.getLogger(App.class.getName()).log(Level.SEVERE, "Could not read file: " + csvPath, ex);
    }
}
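The printValidationCheck helper is the last piece. A minimal version might look like the sketch below (this assumes the standard Neuroph DataSet/DataSetRow accessors and needs an import for org.neuroph.core.data.DataSetRow): it runs each validation row through the network and prints the de-normalized prediction next to the de-normalized desired value.

//a minimal sketch of the validation helper
//(assumes org.neuroph.core.data.DataSetRow is imported)
private static void printValidationCheck(MultiLayerPerceptron neuralNet, DataSet validationSet) {
    for (DataSetRow row : validationSet.getRows()) {
        //feed the 25 normalized inputs through the network
        neuralNet.setInput(row.getInput());
        neuralNet.calculate();
        //inputs and outputs were divided by NORMALIZATION_FACTOR,
        //so multiply to get back to dollar values
        double predicted = neuralNet.getOutput()[0] * NORMALIZATION_FACTOR;
        double desired = row.getDesiredOutput()[0] * NORMALIZATION_FACTOR;
        System.out.println("predicted: " + predicted + "  desired: " + desired);
    }
}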

To review: in main() we create a MultiLayerPerceptron (https://en.wikipedia.org/wiki/Multilayer_perceptron) with 25 nodes in the input layer (again, five columns of data per day multiplied by five days of data), 1000 nodes in the first hidden layer, 100 nodes in the second hidden layer, 10 nodes in the third hidden layer, and finally a single output node.

Next, we pull out and alter the parameters of the default LearningRule that a MultiLayerPerceptron is built with in the Neuroph framework. By default, this class uses backpropagation with momentum. Backpropagation (https://en.wikipedia.org/wiki/Backpropagation) is a technique whereby the marginal change in error (the error derivative) with respect to each weight is calculated, allowing us to nudge each weight slightly in the opposite direction in order to reduce the error. These derivatives only tell us the direction toward less error, not how far to move to minimize it. The way this is typically handled is by setting a learning rate (in our case, 0.1): the learning rate is multiplied by each error derivative, and the resulting changes are accumulated across all the training samples.
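Schematically, the update applied to each weight looks something like the sketch below (Neuroph performs this for us internally; the names here are just for illustration). The momentum term re-applies a fraction of the previous update, which helps the weights keep moving in a consistent direction.

//schematic weight update for backpropagation with momentum
//(illustrative only -- Neuroph does this for every weight on every iteration)
static double updateWeight(double weight, double errorDerivative,
        double learningRate, double momentum, double previousDelta) {
    double delta = -learningRate * errorDerivative //step against the gradient...
            + momentum * previousDelta;            //...plus a fraction of the previous step
    return weight + delta;
}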

With our learning rule defined, we first print the de-normalized output of the neural network for our validation dataset. At this point, we should run our project and take a look at the output window. As expected, the network’s initial random weights produce values that are an order of magnitude off from the desired values.

However, after training, we see much more reasonable numbers.

As you can see, the output is largely (but not wholly) independent of the inputs. This occurs because our model is simply not complex enough to really capture this problem; the best the training algorithm can do is reduce the impact of the inputs and move toward simply outputting values near the mean open. If you want to make the model a little more responsive to the inputs, try adding more hidden layers or more nodes to the existing hidden layers (or both) by adjusting this line:

MultiLayerPerceptron neuralNet = new MultiLayerPerceptron(25, 1000, 100, 10, 1);

NOTE: increasing the number of layers or the size will increase the training time for the model.
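For example, a wider and deeper variant (the layer sizes here are just an illustration to experiment with, not a tuned recommendation) would look like:

MultiLayerPerceptron neuralNet = new MultiLayerPerceptron(25, 2000, 500, 100, 10, 1);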

Tweaking these sorts of “hyperparameters” (learning rates, network structure, etc.) is part of the art of machine learning. Fortunately, with the extensive frameworks that exist in Java, we can spend our time dealing with high-level concerns rather than having to tinker at a low level.
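For instance, we could experiment with a smaller learning rate or adjust the momentum term on our learning rule (the values below are just starting points to try, not tuned recommendations):

learningRule.setLearningRate(0.05); //smaller steps: slower but steadier convergence
learningRule.setMomentum(0.7);      //how much of the previous update carries over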

If you’re interested in learning to program in Java or C# or in improving your existing skills, take a look at our courses and get started toward your career today:

Learn Java: https://www.thesoftwareguild.com/coding-bootcamps/java-training/

Learn C# .Net: https://www.thesoftwareguild.com/coding-bootcamps/asp-net-c-sharp-training/

Originally published at www.thesoftwareguild.com on March 29, 2019.
