Basic Machine Learning in Ruby

Ryan Flach
5 min read · Jul 19, 2016


Although machine learning is increasingly popular, both in practice and as something of a buzzword, it is often unnecessarily difficult for beginners to approach. In this post we’ll look at how Ruby, one of the most ‘readable’ languages, can be used to tackle some simple Bayesian classification, a type of machine learning.

First, what is machine learning? If you have an interest in the topic, there’s a three-part post on Medium that is beginner-friendly and highly informative. It summarizes the topic well:

Machine learning is the idea that there are generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data.

The most popular and commonly used generic algorithms typically focus on classifying data, which, with the help of a Ruby gem, is what we’ll be doing.

Getting Started

The classifier-reborn gem has support for both Bayesian classification and Latent Semantic Indexing (LSI). It can be added to your project’s Gemfile with:

gem 'classifier-reborn'

You will also want to add the gem for its only dependency, fast-stemmer.

gem 'fast-stemmer'

Bayes and LSI

Although LSI is more flexible and arguably more powerful for simulating human learning, it’s beyond the scope of this discussion. Instead, we’ll take a look at Bayesian classification. Bayesian classification has been studied since the 1950s and is commonly used for text categorization (e.g., identifying spam e-mail messages). It works by comparing novel data to learned categories and assigning each category a score (a float) that reflects how likely the data is to belong to it. The category with the highest score is returned as the probable match.
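To make the idea concrete, here is a minimal sketch of a naive Bayes text classifier in plain Ruby. It is not how classifier-reborn is implemented: the TinyBayes class, its tokenizer, the add-one smoothing, and the assumption that every category is equally likely up front are all my own simplifications for illustration. It sums the logarithms of word probabilities, which is one common way to end up with negative scores where the value closest to zero wins.

```ruby
# A toy naive Bayes text classifier (hypothetical; not the classifier-reborn code).
class TinyBayes
  def initialize(*categories)
    # Per-category word counts, e.g. { "Interesting" => { "love" => 2 } }
    @word_counts = categories.to_h { |category| [category, Hash.new(0)] }
    @vocabulary = {}
  end

  def train(category, text)
    tokenize(text).each do |word|
      @word_counts.fetch(category)[word] += 1
      @vocabulary[word] = true
    end
  end

  # Returns { category => log score }; the score closest to zero is the best match.
  def scores(text)
    words = tokenize(text)
    @word_counts.to_h do |category, counts|
      total = counts.values.sum
      score = words.sum do |word|
        # Add-one (Laplace) smoothing so an unseen word does not zero out the score.
        Math.log((counts[word] + 1.0) / (total + @vocabulary.size))
      end
      [category, score]
    end
  end

  def classify(text)
    scores(text).max_by { |_category, score| score }.first
  end

  private

  # Assumed tokenizer: lowercase words only, no stemming or stop-word removal.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

bayes = TinyBayes.new("Interesting", "Uninteresting")
bayes.train("Interesting", "here are some good words. I hope you love them")
bayes.train("Uninteresting", "here are some bad words, I hate you")
puts bayes.classify("I hate bad words and you") # prints "Uninteresting"
```

A real library layers more on top of this (stemming, stop words, category priors), so the exact scores will differ from what a gem reports.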

Basic Usage

Usage is relatively simple. After requiring the gem in the Ruby file, the class is instantiated with any desired categories (defined by the user), trained, and called upon for a result. Let’s look at some code from the gem’s documentation for an example (comments added):

require 'classifier-reborn'
# Initialize and supply two categories: 'Interesting' and 'Uninteresting'
classifier = ClassifierReborn::Bayes.new('Interesting', 'Uninteresting')
# Supply text to train the classifier for 'Interesting'
classifier.train_interesting("here are some good words. I hope you love them")
# Supply text to train the classifier for 'Uninteresting'
classifier.train_uninteresting("here are some bad words, I hate you")
# Supply novel text and ask for classification
classifier.classify("I hate bad words and you")
=> 'Uninteresting'

In the above code, the method call to `classify` will return ‘Uninteresting’ as a string. It is also possible to return an array of the classification and its score by using the method `classify_with_score`:

classifier.classify_with_score("I hate bad words and you")
=> ["Uninteresting", -4.852030263919617]

You can find more information on the mathematics of how the result is determined here.

It’s important to note here that our results will be limited by the small scope of training data we supplied. This can easily be mitigated by supplying data either from a file or via DATA/__END__ (see this blog post for more information). Here’s another example from the classifier-reborn documentation (comments added):

require 'classifier-reborn'
# Create an array of training_set data
training_set = DATA.read.split("\n")
# Remove the first item from training_set and create an array where each item is stripped of whitespace
categories = training_set.shift.split(',').map { |c| c.strip }
# Initialize and supply our categories as an array
classifier = ClassifierReborn::Bayes.new(categories)
# Iterate through each remaining item in training_set
training_set.each do |a_line|
  next if a_line.empty? || '#' == a_line.strip[0]
  parts = a_line.strip.split(':')
  # Train based on 'category':'text'
  classifier.train(parts.first, parts.last)
end
# DATA evaluates to lines below __END__
__END__
Interesting, Uninteresting
interesting: here are some good words. I hope you love them
interesting: all you need is love
interesting: the love boat, soon we will be taking another ride
interesting: ruby don't take your love to town

uninteresting: here are some bad words, I hate you
uninteresting: bad bad leroy brown badest man in the darn town
uninteresting: the good the bad and the ugly
uninteresting: java, javascript, css front-end html

With this small amount of additional information trained, we can make more interesting calls to `classify`:

classifier.classify "Ruby is a wild ride"
=> 'Interesting'
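If the training data lives in a separate file rather than below __END__, the same "category: text" format can be parsed with a small helper. The sketch below is my own illustration, not code from the gem's docs: parse_training_lines is a hypothetical name, and the inline array stands in for something like File.readlines on a file of your choosing. It also limits the split to two parts, so a colon inside the training text is not lost.

```ruby
# Hypothetical helper: turns "category: text" lines into [category, text] pairs.
def parse_training_lines(lines)
  pairs = lines.map do |raw_line|
    line = raw_line.strip
    # Skip blank lines and comment lines starting with '#'.
    next if line.empty? || line.start_with?("#")

    # Limit the split to 2 parts so colons inside the text are preserved.
    category, text = line.split(":", 2)
    next if text.nil? # no colon on this line, so nothing to train on

    [category.strip, text.strip]
  end
  pairs.compact
end

# Stand-in for lines read from a file.
lines = [
  "# training data",
  "interesting: ruby: a language worth a wild ride",
  "",
  "uninteresting: java, javascript, css front-end html"
]

parse_training_lines(lines).each do |category, text|
  puts "#{category} -> #{text}"
end
```

Each pair can then be handed to `classifier.train(category, text)` exactly as in the loop above.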

At this point we’ve touched on the basic functionality of the classifier-reborn gem and used it to perform some very basic machine learning via Bayesian classification. If you feel like you’ve sated your curiosity, please feel free to stop reading now. Otherwise, we’ll move on to some of the additional functionality offered in classifier-reborn.

Additional Functionality

The above examples were static in the sense that our training data is limited to the current file. We can take training data with us by utilizing Ruby’s built-in Marshal module. At a high level, Marshal converts Ruby objects into bytes. It’s not important in this context to know how it does that, so just pay attention to the syntax in the example below.

# Continued from previous example with DATA/__END__
...
# Pass the classifier to the dump method in Marshal to store bytes
classifier_data = Marshal.dump(classifier)
# Save a file called 'classifier.dat' in your current path that stores classifier_data
File.open("classifier.dat", "wb") { |f| f.write(classifier_data) } # "wb": Marshal output is binary

The saved classifier can now be used in other files by reading the .dat file and passing its contents back to Marshal.

# Read the .dat file and save to a local variable
data = File.binread("classifier.dat")
# Create a new classifier that contains our previously learned categories
new_classifier = Marshal.load(data)
# Classify to your heart's content
new_classifier.classify("I love boats")
=> 'Interesting'
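The same round trip can be tried without the gem installed. The sketch below uses a plain Struct as a stand-in for a trained classifier; SavedModel, save_model, and load_model are names I made up for illustration. It reads and writes in binary mode because Marshal output is raw bytes rather than text.

```ruby
require "tempfile"

# Stand-in for a trained classifier: any Ruby object made of plain data.
SavedModel = Struct.new(:categories, :word_counts)

def save_model(model, path)
  # Marshal.dump returns a binary string, so write it in binary mode.
  File.binwrite(path, Marshal.dump(model))
end

def load_model(path)
  # Only load Marshal data you trust: loading can instantiate arbitrary objects.
  Marshal.load(File.binread(path))
end

model = SavedModel.new(["Interesting", "Uninteresting"], { "love" => 3, "hate" => 1 })

# Tempfile.create removes the temporary file when the block finishes.
Tempfile.create("classifier") do |file|
  save_model(model, file.path)
  restored = load_model(file.path)
  puts restored == model # prints "true" (structs compare by value)
end
```

One caveat worth knowing: Marshal cannot serialize procs, so an object that holds a block (for example, a Hash created with a default block) will fail to dump.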

We can also set a threshold for all categories, effectively controlling the limit at which a category is considered the best choice. Thresholding is disabled by default; when it is enabled without an explicit value, the threshold defaults to 0.0. Again, some examples from the classifier-reborn docs:

b = ClassifierReborn::Bayes.new(
  'good',                 # one or more categories
  enable_threshold: true, # default: false
  threshold: -10.0        # default: 0.0
)

These settings can also be changed after initialization. This includes enabling or disabling thresholding on an existing instance.

b.threshold           # get the current threshold
b.threshold = -10.0   # set the threshold
b.threshold_enabled?  # Boolean: is the threshold enabled?
b.threshold_disabled? # Boolean: is the threshold disabled?
b.enable_threshold    # enables threshold processing
b.disable_threshold   # disables threshold processing
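Conceptually, a threshold acts as a cutoff on the winning score. The sketch below shows that idea in plain Ruby; it assumes a best score below the threshold means "no category", which is my reading of the feature rather than something this post verifies, so check the gem's docs for the exact behavior. The best_category helper and the score values are hypothetical.

```ruby
# Hypothetical helper (not part of classifier-reborn): pick the top-scoring
# category, or nil when a threshold is given and the best score falls below it.
def best_category(scores, threshold: nil)
  category, score = scores.max_by { |_name, value| value }
  return nil if category.nil? # no categories were scored

  # Assumption: a best score under the threshold counts as "no match".
  return nil if threshold && score < threshold

  category
end

# Made-up scores in the same negative range as the earlier example.
scores = { "Interesting" => -12.4, "Uninteresting" => -4.85 }

p best_category(scores)                   # prints "Uninteresting"
p best_category(scores, threshold: -10.0) # prints "Uninteresting"
p best_category(scores, threshold: -3.0)  # prints nil
```

With negative scores like these, a default threshold of 0.0 would reject every result in this sketch, which is why the docs example above sets it to something like -10.0.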

Conclusion

Machine learning is a broad topic, and we’ve barely scratched the surface. However, I hope this has been at least an introduction to its possibilities within Ruby. If you’re looking for more, I highly recommend the Machine Learning is Fun posts mentioned earlier. For more information on classifier-reborn, including how to use LSI and caveats on thresholding, please see its docs.

Ryan Flach

Software Developer and Returned Peace Corps Volunteer (Philippines 2010–2012) based in Denver, CO.