Democratizing Datasets

Real-world machine learning differs from data science competitions such as Kaggle and DrivenData. A competition typically optimizes a single metric, such as AUC, log-loss, or RMSE, whereas real-world models need a more thorough comparison across a variety of metrics. To practice real data science, one needs to understand and compare the different metrics for each submission. Arithmetica’s platform reports results on a broad set of metrics for each submission, in addition to comparing your submission’s performance on a single metric.
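To make the point about single-metric comparison concrete, here is a minimal sketch in plain Python. It is not Arithmetica’s scoring code; the function names, the toy labels, and the two sets of predicted probabilities are all made up for illustration. It scores two hypothetical binary-classification submissions on several metrics at once.

```python
import math

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def rmse(y_true, y_prob):
    """Root mean squared error between labels and predicted probabilities."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true))

def auc(y_true, y_prob):
    """Chance that a random positive outscores a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def report(y_true, y_prob):
    """Collect all metrics for one submission in a single dictionary."""
    return {
        "auc": auc(y_true, y_prob),
        "accuracy": accuracy(y_true, y_prob),
        "log_loss": log_loss(y_true, y_prob),
        "rmse": rmse(y_true, y_prob),
    }

# Toy data, invented for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
model_a = [0.90, 0.40, 0.80, 0.60, 0.30, 0.20]
model_b = [0.99, 0.45, 0.55, 0.51, 0.49, 0.01]

for name, probs in (("model_a", model_a), ("model_b", model_b)):
    print(name, report(y_true, probs))
```

Both toy submissions rank every positive above every negative, so they tie on AUC and accuracy, yet their log-loss and RMSE differ because model_b is barely confident on several examples. A leaderboard that shows only AUC would hide that difference.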

Arithmetica lets you practice, compete, and create practice challenges of your own, truly democratizing data science.

The following are a few of the public datasets available for practice:

Classify Sarcoidosis-specific markers

Sarcoidosis is a disease involving abnormal collections of inflammatory cells that form lumps known as granulomas. The challenge is to classify sarcoidosis-specific markers from whole blood gene expression. The dataset owners hypothesized that microarray analyses of whole blood gene expression would identify patterns of gene expression useful in the diagnosis of sarcoidosis and identify inflammatory mediators relevant to the underlying pathophysiology. They analyzed whole blood RNA from 37 patients with sarcoidosis, 20 healthy controls and 6 patients with hypersensitivity pneumonitis using genome-wide expression microarrays.

Classify Cover Types

The goal for this UCI dataset was to predict forest cover type from cartographic variables only (no remotely sensed data). The actual forest cover type for a given observation (30 x 30 meter cell) was determined from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. Independent variables were derived from data originally obtained from the US Geological Survey (USGS) and the USFS. The data is in raw form (not scaled) and contains binary (0 or 1) columns for the qualitative independent variables (wilderness areas and soil types).
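Because the data arrives unscaled and with the qualitative variables spread over binary columns, a first preprocessing step might look like the sketch below. The column names (`elevation`, `wilderness_1` to `wilderness_4`) and the three rows are hypothetical, invented for illustration; they are not taken from the actual file, and this is only one of several reasonable ways to prepare such data.

```python
import math

# Hypothetical column names for the binary wilderness-area indicators.
WILDERNESS_COLS = ["wilderness_1", "wilderness_2", "wilderness_3", "wilderness_4"]

def decode_one_hot(row, columns):
    """Return the name of the single column in `columns` that is set to 1."""
    active = [c for c in columns if row[c] == 1]
    if len(active) != 1:
        raise ValueError(f"expected exactly one active column, got {active}")
    return active[0]

def standardize(values):
    """Rescale raw values to zero mean and unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance) or 1.0  # guard against constant columns
    return [(v - mean) / std for v in values]

# Toy rows, invented for illustration only.
rows = [
    {"elevation": 2600, "wilderness_1": 1, "wilderness_2": 0, "wilderness_3": 0, "wilderness_4": 0},
    {"elevation": 2900, "wilderness_1": 0, "wilderness_2": 0, "wilderness_3": 1, "wilderness_4": 0},
    {"elevation": 3200, "wilderness_1": 0, "wilderness_2": 0, "wilderness_3": 0, "wilderness_4": 1},
]

# Collapse the binary indicator columns into one categorical value per row.
areas = [decode_one_hot(row, WILDERNESS_COLS) for row in rows]

# Scale the raw quantitative column before feeding it to a scale-sensitive model.
scaled_elevation = standardize([row["elevation"] for row in rows])

print(areas)
print(scaled_elevation)
```

Tree-based models can usually consume the raw columns directly, so scaling matters mainly for distance-based or gradient-based learners.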

Diabetes 130-US hospitals for years 1999–2008

The goal for this UCI dataset was to predict A1C results. This data-set was derived from the paper Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records.

Finally, there are many other datasets available for practice. The goal is to learn and discuss. Arithmetica’s free platform also invites people to submit ready-to-use datasets to improve the collective intelligence of our generation.