Data Mining Hub for Scientists

Published in

Data Mining Hub

Sep 18, 2014

Hi everybody!

We have launched the Data Mining Hub and would like to tell you what it is and how useful it can be!

Data Mining Hub (DMH) is a platform for developing data mining and machine learning algorithms through an iterative approach. It is also a business tool that helps analyze large amounts of data and extract useful and important information from it.

How DMH differs from similar platforms, such as Kaggle and algomost:

a task is divided into iterations
an author owns the algorithm code, and a customer only rents it
calculations, evaluation and payments are managed by DMH
a scientist does not need to verify their qualifications

There are two roles in DMH: the customer (business) side and the scientist side. A customer describes a task (problem), and a scientist tries to solve it.

DMH allows scientists to take part in solving interesting problems, compete with other scientists and, of course, get paid if a customer chooses their algorithm. An algorithm that was not selected in one iteration can still be selected in the next one. DMH automatically migrates the algorithm results from the last iteration to the new one if the original data has not changed. A scientist can also improve the algorithm and get paid in the next iteration.

For a customer, DMH is a single integration point with a large number of scientists and an easy way to apply different algorithms to the same data.

In short, the DMH operating principle can be described as follows:

A customer creates a task, provides a description, and defines an acceptable budget, a duration and a decision-making period for each iteration
The customer uploads the data that scientists will then use
The customer confirms the task, after which the data becomes available to scientists
Using the data, scientists create their algorithms, upload them to DMH and set the cost of using each algorithm
The customer chooses the algorithm they like and then transfers the payment to the scientist

Anyone can go to www.datamininghub.com/invite/me and request an invitation to DMH by entering an e-mail address.

Let’s see what a scientist should do to take part in solving a task. It is quite simple: select a task, create an algorithm and try it on the input data. If the result is satisfactory, the algorithm usage cost can be set.

Here are the participation steps in more detail.

After you log in at datamininghub.com, the start page opens and lists all tasks that need solving. Choose a task and download the input data from the Data Set section.

Next, develop the algorithm using any tool. The main requirement is that the algorithm must be packaged as a jar file (or several jar files) that can be run as a job on Hadoop.

A simple example algorithm written in Scala is available at: github.com/datamininghub/example-algorithm

A realistic example of a task solution is available at: github.com/datamininghub/example-bill-status-prediction

The task description: https://www.datamininghub.com/task/1/details

To upload your algorithm, perform the following steps:

Select Algorithms in the menu at the top. The algorithms page opens, listing all of your algorithms:

Click add new algorithm in the upper right corner of the page.

If your user profile is not yet linked to an AWS account, the system asks you to link it at this step:

If you do not have an AWS account, you need to register one.

A new account can be registered at http://aws.amazon.com/free/, and the free usage limits apply for one year.

Once the AWS account is registered, follow the link Sign up for Amazon S3 - Find my keys to create the keys, which should then be entered on the user profile info page in DMH.

Once the user profile is linked to the AWS account, the Algorithm details page opens.

On the Algorithm details page, the default algorithm name DataMiningHub algorithm %N% for Hadoop 1.0.3 is displayed.

Click Edit on the page, and the Algorithm edit page opens.

On the Algorithm edit page, you can change the algorithm name and the Hadoop version.

Click Add step to add a step. One step is one uploaded jar file containing algorithm code, together with the arguments that the jar file is launched with.

The Add file page opens.

On the Add file page, select a jar file to upload and click the Upload button, or enter an S3 link to the file.

In this example, the bill-status-prediction.jar file is used.

When the file upload is finished, the Step algorithm edit page opens.

The file upload can take some time!

On the Step algorithm edit page, set the arguments that the jar file will be launched with, and click the Save button:

The following arguments are used as an example:

-o {output} --events {events} --bill_deputy {bill_deputy} -f
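A jar launched with arguments like these has to read them in its entry point. The following is a minimal sketch in Java of one way to do that. It assumes that -o, --events and --bill_deputy each take a path as a value and that -f is a flag without a value; the meaning of each option is an assumption, and the example project may parse its arguments differently. The class name StepArguments is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that parses the arguments a step's jar is launched with. */
final class StepArguments {

    /**
     * Turns ["-o", "out", "--events", "ev", "-f"] into {o=out, events=ev, f=true}.
     * An option followed by another option (or by nothing) is treated as a flag.
     * Limitation of this sketch: a value may not itself start with "-".
     */
    static Map<String, String> parse(String[] args) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            if (!arg.startsWith("-")) {
                throw new IllegalArgumentException("Value without an option: " + arg);
            }
            // Strip one or two leading hyphens to get the option name.
            String name = arg.replaceFirst("^-{1,2}", "");
            boolean hasValue = i + 1 < args.length && !args[i + 1].startsWith("-");
            parsed.put(name, hasValue ? args[++i] : "true");
        }
        return parsed;
    }

    public static void main(String[] args) {
        Map<String, String> parsed = parse(args);
        System.out.println("output path: " + parsed.get("o"));
        System.out.println("events path: " + parsed.get("events"));
        System.out.println("bill_deputy path: " + parsed.get("bill_deputy"));
        System.out.println("flag f set: " + parsed.containsKey("f"));
    }
}
```

By the time the jar starts, the placeholders in curly braces have presumably been replaced with real locations, so the jar itself only ever sees concrete values.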

When the arguments are set, the Algorithm edit page opens again, now showing the step information you entered. If required, more jar files can be uploaded by repeating the steps starting from the “add file” step.

Now click bet on the navigation panel of the Algorithm edit page to set the algorithm usage cost and run the calculation.

On the Algorithm bet page, select the task the algorithm was developed for:

In the example, only one task is available: Prediction if a bill becomes the law in future or not.

When the task is selected, the Add new bet use algorithm %algorithm_name% page opens.

On the Add new bet use algorithm %algorithm_name% page, set the desired algorithm usage cost and click the bet it button.

When the algorithm usage cost is set, the Edit calculation page opens.

On the Edit calculation page, map every argument from every step to the input data: click assign next to each argument in the Mapping section and select the appropriate data source. Then click calculate on the navigation panel to start the calculation. If required, the calculation input can be saved by clicking the Save button.
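The mapping step can be pictured as filling in the placeholders of a step's argument template. The sketch below, in Java, is only an illustration of that idea and not DMH's actual code: it assumes that each {name} placeholder is replaced with the location of the data source assigned to it. The class name ArgumentMapping and the S3 locations are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration of mapping step arguments to data sources. */
final class ArgumentMapping {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Lists the placeholder names found in an argument template, in order. */
    static List<String> placeholders(String template) {
        List<String> names = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(template);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    /** Replaces every {name} with its assigned source; fails if one is unassigned. */
    static String resolve(String template, Map<String, String> assignments) {
        Matcher m = PLACEHOLDER.matcher(template);
        StringBuffer resolved = new StringBuffer();
        while (m.find()) {
            String source = assignments.get(m.group(1));
            if (source == null) {
                throw new IllegalStateException(
                        "No data source assigned to {" + m.group(1) + "}");
            }
            // quoteReplacement keeps '$' and '\' in the source from being interpreted.
            m.appendReplacement(resolved, Matcher.quoteReplacement(source));
        }
        m.appendTail(resolved);
        return resolved.toString();
    }

    public static void main(String[] args) {
        String template = "-o {output} --events {events} --bill_deputy {bill_deputy} -f";
        Map<String, String> assignments = Map.of(
                "output", "s3://example-bucket/result",
                "events", "s3://example-bucket/events.csv",
                "bill_deputy", "s3://example-bucket/bill_deputy.csv");
        System.out.println(placeholders(template));
        System.out.println(resolve(template, assignments));
    }
}
```

Failing on an unassigned placeholder mirrors the requirement that all arguments from all steps are mapped before the calculation is started.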

The Calculation details page opens.

The Calculation details page displays the calculation state. When the calculation is finished, its result is sent to the e-mail address specified in the user profile.

Example of a started calculation:

Example of a finished calculation:

When the calculation is finished, its result is displayed in the task description together with the algorithm usage cost, and the customer can choose the algorithm as the task solution.

It is possible to check an algorithm on any input data before specifying the algorithm usage cost. Click try it on the navigation panel of the Algorithm details page, and the Edit calculation page opens, where all arguments should be mapped to the desired data in the Mapping section. Then click calculate on the navigation panel to start the calculation.


Written by Kirill A. Korinsky