Integrating H2O with Feedzai OpenML Engine
Up until November 2018, the Feedzai platform allowed users to perform the full data science loop within the platform, using only its own ML models and data processing tools. While this was one of our advantages, it was also one of our biggest pains: the platform didn’t support external models, so when adopting our platform a client usually had to start from scratch, throwing away tools that took years to develop.
Adapting to the evolution of the business, Feedzai continues to work on new solutions to make our products more friendly to data science teams, allowing them to decide whether they want to use our solutions or other technologies that they are already familiar with.
One of the solutions developed at Feedzai was the OpenML Engine. It allows you to integrate ML models created outside of our platform, so you don’t need to start from scratch when adopting it. Instead, you can continue to work with the models that you created on other platforms, for example in H2O.
Why H2O?
H2O is an open-source machine learning and artificial intelligence platform popular among data scientists.
H2O offers a web user interface that lets users easily analyse a dataset, create ML models, and explore their performance, among other features. Moreover, H2O’s API allows the full potential of H2O to be used from external programs.
H2O OpenML provider
The OpenML Engine was designed to support two types of providers.
- Loading providers — which load external ML models into the Feedzai platform so that they can be used to classify events;
- Training providers — which are similar to the previous type but add the capability of training new ML models inside the Feedzai platform (a hypothetical sketch of both shapes follows this list).
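To make the distinction concrete, here is a minimal, hypothetical sketch of what the two provider shapes could look like. The names used below (ModelLoadingProvider, ModelTrainingProvider, ClassificationModel, DatasetSchema) are illustrative assumptions only and do not mirror the actual interfaces of the Feedzai OpenML API.
import java.nio.file.Path;
import java.util.Map;

// Illustrative shapes only; not the real Feedzai OpenML API.
interface ClassificationModel {
    // Returns the class probabilities for a single event.
    double[] score(Object[] event);
}

interface DatasetSchema {
    // Describes the fields of the training data, including the target field.
}

// Loading provider: imports a model that was trained outside the platform.
interface ModelLoadingProvider {
    ClassificationModel loadModel(Path modelPath, DatasetSchema schema);
}

// Training provider: adds the ability to train new models inside the platform.
interface ModelTrainingProvider extends ModelLoadingProvider {
    ClassificationModel trainModel(DatasetSchema schema,
                                   Iterable<Object[]> trainingData,
                                   Map<String, String> algorithmParams);
}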
The H2O provider is a training provider, which means it can be used both to import existing models and to train new models directly from the Feedzai platform. To load an existing model, it must be stored in one of the formats supported by the provider:
- Plain Old Java Objects (POJO) — supported by almost all H2O models. POJOs are Java files/classes that must be compiled by the provider on loading. Source files larger than 1 GB are not supported.
- Model Object, Optimized (MOJO) — H2O’s alternative to POJOs, supported by the majority of the classifier models and easier to use. It has no size restriction, and MOJO models are smaller on disk and faster during scoring than POJO models. A sketch of what scoring a MOJO looks like in plain Java follows this list.
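For reference, this is roughly what scoring a MOJO looks like in plain Java using H2O’s h2o-genmodel library, outside of the Feedzai platform. The model file name and the field names are placeholders, and the snippet is only a sketch of MOJO scoring in general, not the provider’s actual loading code.
import hex.genmodel.MojoModel;
import hex.genmodel.easy.EasyPredictModelWrapper;
import hex.genmodel.easy.RowData;
import hex.genmodel.easy.prediction.BinomialModelPrediction;

import java.util.Arrays;

public class ScoreMojo {

    public static void main(String[] args) throws Exception {
        // Load a MOJO exported from H2O (placeholder file name).
        MojoModel mojo = MojoModel.load("model.zip");
        EasyPredictModelWrapper model = new EasyPredictModelWrapper(mojo);

        // Build one event; the column names must match the training schema.
        RowData event = new RowData();
        event.put("amount", "42.0");
        event.put("merchant", "grocery");

        // Score the event and read the predicted label and class probabilities.
        BinomialModelPrediction prediction = model.predictBinomial(event);
        System.out.println(prediction.label);
        System.out.println(Arrays.toString(prediction.classProbabilities));
    }
}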
When loading an existing model, in addition to the model, you must provide the schema of the dataset used to train the model as well as the target field. To do this, you can use one of the existing schemas on the Feedzai platform or you can provide a JSON file that describes the schema of the training data.
If you prefer to train a new model directly from the platform, you can use the same process as you would to train Feedzai native models. H2O supports three kinds of models: classifiers, regressors, and unsupervised models. At the moment, all classifier models other than Cox Proportional Hazards and Stacked Ensembles are supported by the H2O OpenML provider. A survey of product requirements on the H2O models was conducted before we started working on this feature, and these two classifier models were not identified as a requirement because they do not seem to fit the use cases of the Feedzai platform.
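To give a rough idea of what training looks like through H2O’s embedded Java API (the same API the provider builds on, as described in the provider history below), here is a minimal sketch that trains a gradient boosting classifier from a CSV file. It assumes an embedded H2O instance has already been started, as shown in the initialization code later in this post; the file path, frame name, and response column are placeholders.
import hex.tree.gbm.GBM;
import hex.tree.gbm.GBMModel;
import water.Key;
import water.fvec.Frame;
import water.fvec.NFSFileVec;
import water.parser.ParseDataset;

import java.io.File;

public class TrainGbm {

    public static Frame loadFrame(String csvPath) {
        // Import and parse a CSV file into an H2O Frame (placeholder path).
        NFSFileVec raw = NFSFileVec.make(new File(csvPath));
        return ParseDataset.parse(Key.make("training_data"), raw._key);
    }

    public static GBMModel train(Frame trainingFrame, String targetField) {
        // Point the GBM at the training frame and the target column.
        GBMModel.GBMParameters params = new GBMModel.GBMParameters();
        params._train = trainingFrame._key;
        params._response_column = targetField;
        params._ntrees = 50;

        // Launch the training job and block until the model is ready.
        return new GBM(params).trainModel().get();
    }
}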
Whether you are importing or training an H2O model, you can later use that model inside the Feedzai platform to score events in real time.
Want to try it out? Here is where you can find the H2O OpenML provider.
- Github repository: https://github.com/feedzai/feedzai-openml-java/tree/master/openml-h2o
- Artifacts of the released versions: https://mvnrepository.com/artifact/com.feedzai/openml-h2o
Provider history
The first versions of this provider were developed using H2O’s REST API. Although it offered all the capabilities of H2O without the concerns of managing the lifecycle of H2O’s process, it had the downside that, to use it, you needed to install and launch H2O on all the machines that interact with H2O models. In other words, you would need to install H2O on all the machines running the Feedzai platform, and if the environment contains a Spark cluster you would also need to install H2O on all the worker nodes. To address this problem, starting from version 0.3.0 we no longer depend on the REST API; instead, we use the Java API, which hooks directly into an embedded instance of H2O. The embedded instance is initialized by the provider, so you no longer need to worry about installing H2O.
In version 0.5.0, the provider started to use H2O functionalities beyond importing and loading models: the feature importance functionalities of the Feedzai platform were enriched through the gradient boosting machine algorithm. At the moment, it’s only possible to use it in an AutoML process.
Before the 1.0.0 version, it was only possible to load and train supervised models. Starting with that release, it’s also possible to load and train models using the Isolation Forest algorithm, which detects anomalies in transactions.
Challenges faced during development
While developing the H2O OpenML provider, we faced a couple of interesting challenges. To give examples of some of them, let’s look at the simplest way to initialize an H2O instance. You can find how the instance is actually initialized by the H2O provider in the Feedzai H2OApp object.
import water.H2OApp;

public class InitH2O {

    public InitH2O() {
        // Start an embedded H2O instance with the default settings.
        H2OApp.main(new String[0]);
    }
}
Lack of documentation
One of the biggest challenges was caused by the scarce documentation of H2O’s public API. It’s possible to find documentation and examples for the R and Python APIs, but little regarding the Java API. The REST API is also documented, but the REST API binding objects, which allow a Java program to interact with the REST API, lack documentation.
Fortunately, H2O was developed in Java and its source code is publicly available. The best way to understand how to use the Java API is to dive into the source code, which was also challenging due to the lack of comments. Luckily, their test suite allowed us to better understand the code and put us on the right track to find the objects that we needed to take into account.
Singleton H2O instance
Once an H2O instance is created, it’s not possible to start another one from the same OS process. This problem can be avoided by resorting to the singleton pattern, which ensures that only a single instance is ever created.
Besides that, when H2O is initialized with the default values, it will start several services, such as the Web UI, that are not used by the provider. This causes errors if we try to initialize multiple H2O instances on the same machine, even if they are initialized from different OS processes: the first instance binds the machine’s ports, causing the following instances to fail. A solution for this problem is to initialize only the services used by the provider.
import water.ExtensionManager;
import water.H2O;

public class InitH2O {

    public InitH2O() {
        // Start the H2O core without the full set of services that H2OApp starts.
        H2O.main(new String[0]);
        // Register only the REST API extensions used by the provider.
        ExtensionManager.getInstance().registerRestApiExtensions();
    }
}
Unable to close an H2O instance
Once an H2O instance is initialized, it’s not possible to shut it down, at least without stopping the process that initialized it.
The only way to shut down H2O is to call “H2O.shutdown(0)”, which calls “System.exit(int)”, causing the process responsible for the creation of the H2O instance to terminate.
In the Feedzai platform, that would mean stopping the JVM of the whole application.
Default values cause unexpected behaviors
Initializing the instance with the default values might cause unwanted behaviors, especially if you are launching several instances under the same user. When an H2O instance is created, it will search for and try to connect to a cluster of H2O instances with the same cluster name, which by default is the name of the user that started the process.
If you want to isolate your H2O instances, you should give them unique names and set their addresses to localhost. This way they won’t search for other instances outside of the local machine, and locally each will only look for the unique name used by that instance.
Static variables
An H2O instance is initialized using static variables, and the initialization doesn’t create any local variable that can be accessed in the Java program. The interaction with the instance follows the same pattern. For example, the following line is used to define the log level of the H2O instance.
water.util.Log.setLogLevel("WARN", true);
At first sight, it’s not clear that this method interacts with an H2O instance, because none of the parameters refers to the instance. If you look at the source code of the Log object, you will see that it once more uses static variables, making it difficult to understand the side effects of changing those variables.
Conclusion
After addressing the problems described above, the startup of an H2O instance evolved into the following code.
import water.ExtensionManager;
import water.H2O;

import java.util.UUID;

public class InitH2O {

    private static final Object instanceLock = new Object();
    private static volatile InitH2O instance;

    // Double-checked locking ensures a single H2O instance per OS process.
    public static InitH2O getInstance() {
        if (instance == null) {
            synchronized (instanceLock) {
                if (instance == null) {
                    instance = new InitH2O();
                }
            }
        }
        return instance;
    }

    private InitH2O() {
        // A unique cluster name and the localhost address isolate this instance.
        H2O.main(new String[] {"-name", UUID.randomUUID().toString(), "-ip", "localhost"});
        // Register only the REST API extensions used by the provider.
        ExtensionManager.getInstance().registerRestApiExtensions();
        // Reduce H2O's logging verbosity.
        water.util.Log.setLogLevel("WARN", true);
    }
}
Summing up, in this section we presented a simple example of some of the challenges that we faced during the development of the provider. As we delved deeper into the code, we discovered other, more challenging problems, such as the training of H2O models, with the XGBoost algorithm being especially tricky. Truth be told, better documentation would have made our lives easier, but there was nothing we could not overcome by exploring the source code.
Feel free to participate in the community and to explore the source code of the project. If you are interested in joining us to fight fraud globally, check the open job positions to see what roles are available.