Data and its modeling
Partial Automation of Data Selection, Data Visualisation and Machine Learning Model Building using Python
Recently, one of my teachers suggested that I document the procedure for a project my friends and I are working on. This article summarises that procedure in the form of a formal report.
ABSTRACT
Machine learning and data science have become buzzwords in the past few years, and it is only natural for learners to be curious and dive into these concepts. We set out as a group of explorers, and on our journey we discovered that these are quite difficult skills to acquire. Furthermore, a wide range of machine learning and data-processing tasks requires rigorous programming and hours of effort to produce valuable results. How much time could we save by automating some of these tasks? How much stress, confusion and frustration could be avoided by eliminating the problems caused by syntax errors? To answer these questions, and possibly provide a solution, we built a project around partial automation of the data-processing and machine-learning model-building process.
INTRODUCTION
In our project, titled ML On Click, we have tried to incorporate the idea described in the abstract. This paper introduces the concepts used in:
- Partial automation of selection of data
- Its visualisation, through both brute-force and fine-tuned methods
- Recommendations on machine learning models based on simple threshold conditions
- Preliminary, downloadable Machine Learning models
Apart from these, some existing methods that fulfil these requirements are covered as part of our research, followed by a discussion of how our solution differs from them.
This paper suggests a procedure for automating these tasks in Python, a commonly used and renowned language, and establishes certain boundaries and limitations to explain how this approach may be improved in the future.
RESEARCH AND DOCUMENTATION
PRE-EXISTING METHODS
The problem statement discussed here has been solved to a certain extent by large technology firms like Microsoft, Google and Amazon; however, their solutions come at a cost. This is understandable, for the quality of processing, customisation and scalability offered by these software solutions is commendable and reliable.
Most of these solutions are available on cloud platforms and can be used for large-scale tasks and automations with ease.
OUR PROPOSED METHOD
We are developing a solution to reduce the time spent typing repetitive code, evaluating data quality by measuring the quantity of unusable data points, and visualising the data points to obtain a basic idea of the dataset.
We do this by following these steps:
- Upload the dataset in a relevant file format, to the temporary memory of the application
o Good file formats include ‘.csv’ and ‘.txt’
o File sizes of around 200 MB can be processed so far
- Display the first 10 rows of data in a tabular format
o This allows the user to understand the dataset and recognise what kind of values are contained within
- Run a test on the dataset to count ‘null’, ‘not a number’ or missing values (a minimal sketch of these checks appears after this list)
o If count of missing values in a column of the table is greater than one third of the column size, mark the column as ‘Not Recommended’
o If the data through the column is not consistently of the same type, mark the column as ‘Not Recommended’
o If there exists a row with 50% or more missing column values, mark the row for deletion
o If the dataset contains more than 30% missing cell values, notify the user of the low quality of data
o If the dataset passes all of the above-mentioned tests, display a list of ‘Recommended’ columns to the user
- Request the user to choose a target variable for machine learning
o From the list of all columns, even if not recommended.
o For predictor and classifier models, one target variable must be chosen
- Request the user to choose independent variable(s)
o At least one must be chosen. Multiple choices are allowed, and the selection changes which models are applicable to the data
- Depending on the choice of variables, conduct a brute-force visualisation of the selected data (see the pair-plot sketch after this list)
o This is facilitated by the Python library Seaborn, using its pairplot function to obtain a detailed scatter plot or distribution of all selected data points
o Using this visualisation, the user can recognise the relationship patterns in the data, find outliers and trends and decide which model would be best suited to this dataset
o One limitation is that no automated action can be taken on the visualised data yet. Once an outlier is detected, the user must analyse the cause of the anomaly and clean the data manually
- Once the user decides upon a model to use, process the selected data using that algorithm
Some automations for hyperparameter tuning can be introduced (see the grid-search sketch after this list):
i. Grid searching to find the best value of α and the degree of the polynomial in Polynomial Regression
ii. Backward Feature Elimination based on the χ² test, for dimensionality and parametric reduction in Logistic Regression by evaluating p-values
- If any of the processes fails, notify the user, but continue training the machine learning model without the failed feature
- Once the model is trained, test the model on test data
o If the model accuracy is above 70%, make it available for download in the form of a ‘.pkl’ file using the Python Pickle library (see the post-training sketch after this list)
o If the model accuracy is below 70%, do not allow the download, as the model may produce false predictions or anomalies when used
o Display a visualisation of the model performance
- If using a Classification model, display an annotated heatmap of the confusion matrix using the Seaborn library along with the SciKit Learn library
- If using a Regression model, display a regression plot: a scatter plot overlaid with the regression line derived from the model coefficients
- If using a Polynomial model, display a Distribution plot demonstrating how close/far the predictions are from the actual distribution
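The data-quality tests above translate naturally into pandas. The following is a minimal sketch, assuming the uploaded dataset is held in a DataFrame; the function names are illustrative, not the application's actual code.

```python
import pandas as pd

def recommended_columns(df: pd.DataFrame) -> list:
    """Columns that pass the missing-value and type-consistency rules."""
    recommended = []
    for col in df.columns:
        # More than one third of the column missing -> 'Not Recommended'
        if df[col].isna().sum() > len(df) / 3:
            continue
        # Mixed value types within the column -> 'Not Recommended'
        if df[col].dropna().map(type).nunique() > 1:
            continue
        recommended.append(col)
    return recommended

def rows_marked_for_deletion(df: pd.DataFrame) -> pd.Index:
    # Rows with 50% or more missing values are marked for deletion
    return df.index[df.isna().mean(axis=1) >= 0.5]

def is_low_quality(df: pd.DataFrame) -> bool:
    # Dataset-level warning: more than 30% of all cells are missing
    return df.isna().to_numpy().mean() > 0.30
```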
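The brute-force visualisation step amounts to a single Seaborn call. A minimal sketch, using a built-in Seaborn dataset as a stand-in for the user's upload and an illustrative list of selected columns:

```python
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset("penguins")  # stand-in for the uploaded dataset
selected = ["bill_length_mm", "bill_depth_mm", "body_mass_g"]

# Pairwise scatter plots of every selected column against every other,
# with univariate distributions on the diagonal
sns.pairplot(df[selected].dropna(), diag_kind="hist")
plt.show()
```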
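For the grid search over α and the polynomial degree, scikit-learn's GridSearchCV can sweep both at once. A minimal sketch on toy data, assuming Ridge regression as the regularised model whose α is tuned:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data as a stand-in for the user's selected columns
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 3 - X[:, 0] + rng.normal(scale=1.0, size=200)

# Polynomial feature expansion followed by a regularised linear model
pipe = Pipeline([
    ("poly", PolynomialFeatures()),
    ("ridge", Ridge()),
])

# Search jointly over the polynomial degree and the regularisation strength α
params = {
    "poly__degree": [1, 2, 3, 4],
    "ridge__alpha": [0.01, 0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, params, cv=5)
search.fit(X, y)
print(search.best_params_)
```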
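The post-training steps, namely the 70% accuracy gate, the Pickle export and the confusion-matrix heatmap, are sketched below using the iris dataset and a logistic-regression classifier as stand-ins for the user's data and chosen model; all names are illustrative.

```python
import pickle
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data and model; in the application these come from the user's choices
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Gate the download on the 70% accuracy threshold described above
if accuracy > 0.70:
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)  # serialised model offered for download
else:
    print(f"Accuracy {accuracy:.0%} is below 70%; download withheld.")

# Annotated heatmap of the confusion matrix (for classification models)
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.xlabel("Predicted label")
plt.ylabel("Actual label")
plt.show()
```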
DISCUSSION
We now understand the procedure followed to partially automate the ML process. The solution proposed in our project spares the user from writing hundreds of lines of code for repetitive tasks such as building the same models again and again. The time saved can be spent analysing the visualisations the application provides, and then invested in understanding any anomalies that are observed.
The downloadable machine learning models can be integrated into Python environments using the Pickle library itself, by un-pickling the file and using the restored model to predict on new data without having to re-train it, as the sketch below shows.
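A minimal sketch of this reuse, assuming ‘model.pkl’ was produced by the export step above; the feature values are illustrative:

```python
import pickle

# Restore the trained model from the downloaded file
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

# Predict on new data without re-training (iris-style feature values)
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```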
Fine-tuning certain hyperparameters without having to set up the entire environment and code for it gives the user a better-performing model without re-working the basics.
CONCLUSION
With this project, paper and documentation of our procedure, we wish to take our first steps towards supporting the computer science community by contributing to it. Open-source software binds the community together by providing valuable resources free of cost, and our effort to bring such an application to life marks our endeavour to grow as computer scientists and engineers. There is still a lot left to complete in this project, and encountering such difficult tasks is part of a budding developer's learning curve.
We are hopeful that this will help us as well as the people around us.
TECHNOLOGY AND TOOLS USED
- Python
- Streamlit — API, Module and Cloud Service
- Matplotlib
- Seaborn
- Pandas
- NumPy
- SciKit Learn
- Statsmodels — API and Module
- Google Colaboratory
- GitHub — Version control and deployment
- Visual Studio Code — Editor