WEKA: Machine Learning Solutions Embedded in Java Applications
If you want to take advantage of Artificial Intelligence and add predictive features to your Java-based application, you have two options:
- The first is to implement the learning algorithms yourself, starting from scratch. Choose this option when minimizing dependencies is an important criterion for you. Keep in mind, however, that it is a very expensive solution, both because of the large number of algorithms to implement and because of the difficulty of finding the algorithm that best fits your business data.
- The second option is to use an open-source machine learning library for Java. Many libraries provide Java implementations of machine learning algorithms. Using one of them saves you a lot of time and boilerplate code, and lets the data scientist focus on the important parts of the Knowledge Discovery process (data cleansing, choice of algorithm, performance of the predictive model on the dataset, and so on) instead of spending time implementing algorithms. Among the widely used Java libraries, you will find WEKA near the top. WEKA (Waikato Environment for Knowledge Analysis) is a machine learning library for Java. According to the official website of the University of Waikato, where WEKA was developed, it has been downloaded millions of times and is used all over the world.
WEKA brings to the table Java implementations of most machine learning algorithms. It also provides many data preprocessing filters and data visualization features. You can interact with those algorithms in several ways:
- WEKA GUI
WEKA GUI is the graphical user interface of WEKA. It gives access to the different WEKA functions through four tools: the Explorer, the Experimenter, the KnowledgeFlow and the CLI. Each of these tools is designed for a specific use case. Knowing what each one offers lets you use WEKA to its full power to extract features and patterns from your business data.
- WEKA Explorer:
With the Explorer, you can load a dataset, visualize it, apply filters to it, train a model on it by choosing a machine learning algorithm compatible with the nature of your dataset, visualize the trained model, test the model and measure its performance. The Explorer is therefore the tool to use when you need to train a single model with a specific learning algorithm that you have already chosen.
- WEKA Experimenter:
The Experimenter is used to compare the performance of algorithms on datasets. You can compare two or more algorithms on one dataset, one algorithm on two or more datasets, or several algorithms on several datasets. A good practice after preprocessing your dataset is therefore to use the Experimenter to find the algorithm that best fits your data, then use the Explorer to train and apply the predictive model based on the chosen algorithm.
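The kind of comparison the Experimenter automates can be sketched in a few lines of plain Java. The sketch below does not use WEKA: the Model interface, the two toy learners and the tiny dataset are all hypothetical and only illustrate the principle of scoring several algorithms on the same training and test split.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

final class AlgorithmComparison {

    /** Hypothetical minimal contract for this sketch; not WEKA's own Classifier interface. */
    interface Model {
        void train(double[][] rows, int[] labels);
        int predict(double[] row);
    }

    /** Baseline: always predicts the most frequent training label. */
    static final class MajorityClass implements Model {
        private int majority;

        @Override
        public void train(double[][] rows, int[] labels) {
            Map<Integer, Integer> counts = new HashMap<>();
            int bestCount = 0;
            for (int label : labels) {
                int count = counts.merge(label, 1, Integer::sum);
                if (count > bestCount) {
                    bestCount = count;
                    majority = label;
                }
            }
        }

        @Override
        public int predict(double[] row) {
            return majority;
        }
    }

    /** 1-nearest-neighbour using squared Euclidean distance. */
    static final class NearestNeighbour implements Model {
        private double[][] rows;
        private int[] labels;

        @Override
        public void train(double[][] rows, int[] labels) {
            this.rows = rows;
            this.labels = labels;
        }

        @Override
        public int predict(double[] row) {
            int bestIndex = 0;
            double bestDistance = Double.POSITIVE_INFINITY;
            for (int i = 0; i < rows.length; i++) {
                double distance = 0;
                for (int j = 0; j < row.length; j++) {
                    double diff = rows[i][j] - row[j];
                    distance += diff * diff;
                }
                if (distance < bestDistance) {
                    bestDistance = distance;
                    bestIndex = i;
                }
            }
            return labels[bestIndex];
        }
    }

    /** Trains the model on the training split and returns its accuracy on the test split. */
    static double accuracy(Model model, double[][] trainX, int[] trainY,
                           double[][] testX, int[] testY) {
        model.train(trainX, trainY);
        int correct = 0;
        for (int i = 0; i < testX.length; i++) {
            if (model.predict(testX[i]) == testY[i]) {
                correct++;
            }
        }
        return (double) correct / testX.length;
    }

    public static void main(String[] args) {
        // Toy data, invented for illustration only.
        double[][] trainX = {{0, 0}, {0, 1}, {1, 0}, {5, 5}, {5, 6}, {6, 5}, {6, 6}};
        int[] trainY = {0, 0, 0, 1, 1, 1, 1};
        double[][] testX = {{0.5, 0.5}, {5.5, 5.5}, {1, 1}, {6, 7}};
        int[] testY = {0, 1, 0, 1};

        Map<String, Model> candidates = new LinkedHashMap<>();
        candidates.put("MajorityClass", new MajorityClass());
        candidates.put("NearestNeighbour", new NearestNeighbour());

        // Score every candidate on the same split so the numbers are comparable.
        for (Map.Entry<String, Model> candidate : candidates.entrySet()) {
            double score = accuracy(candidate.getValue(), trainX, trainY, testX, testY);
            System.out.println(candidate.getKey() + ": " + score);
        }
    }
}
```

The important design point is that every candidate is scored on exactly the same split, so the accuracies can be compared directly. The Experimenter applies the same principle to real algorithms and datasets, with proper statistical testing on top.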
- WEKA KnowledgeFlow:
The KnowledgeFlow GUI is the visual programming tool of WEKA. You can set up the entire data mining workflow for training and testing a predictive model by dragging and dropping nodes into the workflow and then configuring them. This gives a comprehensive view of the whole Knowledge Discovery process.
- WEKA CLI:
The WEKA GUI tools above are limited by memory: they run inside the Java Virtual Machine and load the whole dataset into its heap. In a big data context, for example a dataset with more than 1 million rows and 25 attributes, these tools can freeze and raise an out-of-memory error. The WEKA command-line interface overcomes this limitation and makes it possible to work with big data. WEKA provides a Java interface called UpdateableClassifier, which addresses the memory limitation by loading one instance at a time into memory instead of the whole dataset. The only restriction is that you must not use k-fold cross-validation to evaluate the model, since cross-validation needs the full dataset in memory. Split the data into a training set and a test set instead.
Although the WEKA CLI makes it possible to work with big data, processing the data and training the model may take a while. To speed up computation, WEKA supports both distributed processing and real-time stream processing on one or more clusters, using either Hadoop or Apache Spark as the analytics engine.
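To make the one-instance-at-a-time idea concrete, here is a minimal sketch in plain Java. It does not call WEKA: the StreamingCentroidClassifier class, its method names and the row layout (features first, class label in the last column) are assumptions made for this illustration. The model keeps only running sums per class, so each row can be discarded as soon as it has been seen, and evaluation uses a separate test stream rather than cross-validation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical incremental learner: nearest class centroid, updated one row at a time. */
final class StreamingCentroidClassifier {
    private final Map<Integer, double[]> sums = new HashMap<>();   // per-class feature sums
    private final Map<Integer, Integer> counts = new HashMap<>();  // per-class row counts

    /** Folds one labelled row into the model; the row is not stored. */
    void update(double[] features, int label) {
        double[] sum = sums.computeIfAbsent(label, k -> new double[features.length]);
        for (int i = 0; i < features.length; i++) {
            sum[i] += features[i];
        }
        counts.merge(label, 1, Integer::sum);
    }

    /** Predicts the class whose running mean is closest to the given features. */
    int predict(double[] features) {
        if (sums.isEmpty()) {
            throw new IllegalStateException("model has not seen any training data");
        }
        int bestLabel = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (Map.Entry<Integer, double[]> entry : sums.entrySet()) {
            int n = counts.get(entry.getKey());
            double distance = 0;
            for (int i = 0; i < features.length; i++) {
                double diff = entry.getValue()[i] / n - features[i];
                distance += diff * diff;
            }
            if (distance < bestDistance) {
                bestDistance = distance;
                bestLabel = entry.getKey();
            }
        }
        return bestLabel;
    }

    /** Consumes a training stream; only one row is held in memory at a time. */
    void trainOn(Iterator<double[]> rows) {
        while (rows.hasNext()) {
            double[] row = rows.next();
            update(Arrays.copyOf(row, row.length - 1), (int) row[row.length - 1]);
        }
    }

    /** Consumes a separate test stream and returns the fraction of correct predictions. */
    double accuracyOn(Iterator<double[]> rows) {
        int seen = 0;
        int correct = 0;
        while (rows.hasNext()) {
            double[] row = rows.next();
            int predicted = predict(Arrays.copyOf(row, row.length - 1));
            if (predicted == (int) row[row.length - 1]) {
                correct++;
            }
            seen++;
        }
        return seen == 0 ? 0.0 : (double) correct / seen;
    }

    public static void main(String[] args) {
        // Toy rows invented for illustration; a real stream would come from a file reader.
        Iterator<double[]> training = Arrays.asList(
                new double[]{0, 0, 0}, new double[]{1, 1, 0},
                new double[]{10, 10, 1}, new double[]{11, 11, 1}).iterator();
        Iterator<double[]> test = Arrays.asList(
                new double[]{1, 0, 0}, new double[]{10, 11, 1}).iterator();

        StreamingCentroidClassifier model = new StreamingCentroidClassifier();
        model.trainOn(training);
        System.out.println("Holdout accuracy: " + model.accuracyOn(test));
    }
}
```

Because the model never needs to revisit a row, memory use depends on the number of classes and attributes, not on the number of rows. This is also why a single training and test split fits this style of learning better than k-fold cross-validation, which would require several passes over the same data.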
- WEKA API:
The WEKA API is the Java API of WEKA, which lets you integrate all WEKA functions into your Java application. Using the WEKA API should be the final step of your Knowledge Discovery development life cycle. Before reaching this step and integrating an Artificial Intelligence feature into your Java application, you should already have used the WEKA GUI tools to quickly find the algorithm that best fits your business data.
By Hamza EL RHAZI (Data Scientist Engineer)