ChiPy Mentorship Part II

This is the second installment for the Spring 2017 mentorship program.

Since the last post, I have had quite a few headaches and growing pains, but these past couple of weeks have been a great learning experience. It turns out pandas did not resolve my issue: the DataFrame did not take too kindly to an embedded array in a column. So I went back to the methods of old, parsing the file and extracting the data needed to begin the analysis. I created a parsing script that iterates through all of the JSON objects in the file obtained from the EIA website and extracts the monthly data I have been using in my analysis. Let the fun begin!
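For reference, here is a minimal sketch of the kind of parsing script I mean. The field names ("series_id", "f", "data") are my assumptions about the layout of a newline-delimited JSON bulk file, not a documented schema:

```python
import json

def parse_monthly_series(path):
    """Collect monthly observations from a file of newline-delimited
    JSON objects (one series per line)."""
    monthly = {}
    with open(path) as f:
        for line in f:
            series = json.loads(line)
            # "f" == "M" flags a monthly series; these key names are
            # assumptions about the bulk-file layout.
            if series.get("f") == "M" and "data" in series:
                monthly[series["series_id"]] = series["data"]
    return monthly
```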

We hear these terms thrown around: “Big Data”, “Data Science”, “Machine Learning”. Working with large datasets can be both exciting and intimidating. You want to evaluate the data, but you do not know where to begin; there is so much information that it is hard to tell which parts are valuable. The data you will be evaluating also needs to be arranged in a way the computer can easily read. That format is typically a table, where the columns are known as features and the rows are referred to as samples.
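As a toy illustration of that layout (the numbers here are made up, not my EIA data), a pandas DataFrame holds samples as rows and features as columns:

```python
import pandas as pd

# Each row is a sample (one month); each column is a feature.
df = pd.DataFrame(
    {"generation_mwh": [320, 305, 298],
     "avg_temp_f": [48.2, 55.1, 61.7],
     "price_cents_kwh": [9.8, 9.6, 9.9]},
    index=["2017-01", "2017-02", "2017-03"],
)
print(df.shape)  # (3, 3): three samples, three features
```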

There are multiple approaches to Machine Learning: broadly, Supervised vs. Unsupervised learning, and each type has many different methods. There is a great resource on the Scikit-learn website: Choose the right estimator.

The image there is a great representation of how the different methods in the ML package are categorized. Four categories are used for ML analysis: Classification, Clustering, Regression, and Dimensionality Reduction. These can be grouped according to the prediction outcome you are looking for.

Supervised learning is used when looking for an equation/function and variables that may be applied to a test set. Unsupervised learning may be used when searching for patterns in the data without a known target. Classification and Clustering might be used when predicting the category of a sample, while Regression and Dimensionality Reduction can be used for predicting values or a quantity. Since I am trying to predict a specific set of values, Supervised Learning is the method I plan to use. Somewhere in the dataset there are some number of features with a direct relation to the value you are trying to predict, and determining the relevance of those features is the underlying concept we are analyzing. So far, this is where the bulk of my time on the project has been spent.
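To make that concrete, here is a minimal supervised regression sketch. The synthetic data is only a stand-in for my real feature matrix: fit on one slice of the samples, then predict values for held-out samples.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a feature matrix X and target values y.
X, y = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=0)

# Hold out a test set, fit on the rest, and score on the unseen samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on the held-out samples
```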

I have no previous experience with Machine Learning, but I am determined to learn and understand the methods for each ML model. My mentor Patrick and I spent a while discussing the best direction to take when analyzing my dataset. He has sent me many articles on various topics in Machine Learning, which has been a huge help.

When you are scouring the web for answers, you need to know how to ask the correct questions. A lot of the articles I have read helped solidify my understanding of the terminology used in ML. I started by reading through the documentation on the Scikit-Learn website. After spending some time with the examples, trying to understand each part and function, I decided to tackle my own dataset.

Understanding how Dimensionality Reduction or Feature Selection would impact the remaining features of the dataset was a major factor in the method I chose. Methods like PCA, which are used for Dimensionality Reduction, transform your features from a multi-dimensional space down to a simpler case by compressing the layers to a lower-dimensional form. During this compression, many individual features are merged together, which removes the redundancy of some features in the dataset. Removing these redundant features allows the models to then run the analysis on a smaller feature set. The only issue is that your feature set can no longer be viewed on a feature-by-feature basis; all features have been combined into composite indicators. For my analysis, I needed to know the specific features that have the greatest impact on my target data, so I needed to find another method.
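A quick sketch on synthetic data shows the trade-off: PCA shrinks the feature matrix, but the resulting components are mixtures of the original columns rather than the columns themselves.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA

# Synthetic data with redundant (low effective rank) features.
X, _ = make_regression(n_samples=200, n_features=30, effective_rank=5,
                       random_state=0)

# Compress the 30 features into 5 principal components.
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # (200, 5)
print(pca.explained_variance_ratio_.sum())  # share of variance retained
```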

Feature selection is the next method I am trying. Feature selection allows you to enter the number of relevant features you want, then uses computational methods to rank the features based on a specified estimator. The estimator has a lot to do with how the dataset should be normalized. It was not until recently that I realized the estimator was the main factor making it so difficult to pass through my own data. I would try to pass my dataset through the feature selection examples with no results! It would just lag, with no result ever coming up.
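Scikit-learn has several selectors of this kind; recursive feature elimination (RFE) is the one I will sketch here, though the same shape applies to the others. You hand it the number of features to keep and an estimator to rank them with, which is exactly why the choice of estimator (and how the data is scaled for it) matters so much:

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data where only 5 of the 30 features carry signal.
X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       random_state=0)

# Keep the 5 most relevant features, ranked by a linear estimator.
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(X, y)
print(selector.support_)  # boolean mask of the kept columns
print(selector.ranking_)  # 1 marks a selected feature
```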

Knowing and understanding the intricacies of machine operations and data processing can be very helpful when researching and tuning your ML models. Now that I have a better understanding of what the estimator objects are doing, I can try different combinations and compare the results. Picking the right combination, and understanding why a certain combination of models is best for my dataset, will be the most important factor in predicting the quantities and values for my project.
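One way to keep those comparisons honest is to cross-validate each candidate estimator inside the same pipeline, so every combination sees identically scaled data. A sketch, with two linear estimators standing in for whatever combinations I end up trying:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=200, n_features=30, noise=5.0, random_state=0)

# Score each estimator behind the same scaling step, on the same folds.
for estimator in (LinearRegression(), Ridge(alpha=1.0)):
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X, y, cv=5)
    print(type(estimator).__name__, scores.mean())
```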

I have also started to research the functionality of Dask. The number of features in my dataset is quite large, which can make the compute time for this analysis very expensive. I plan to cut down this time by splitting the machine learning computations across Raspberry Pis. Hopefully, by splitting the tasks, the runtime can be reduced.
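I have not built this yet, so the following is only a sketch of the idea, assuming a dask.distributed scheduler on one machine and a dask-worker process on each Pi; the address and the fit_candidate function are placeholders:

```python
from dask.distributed import Client

# Placeholder scheduler address; each Raspberry Pi would run a
# dask-worker process pointed at this same scheduler.
client = Client("tcp://192.168.0.10:8786")

def fit_candidate(x):
    # Stand-in for one real model-fitting task.
    return x * x

# Fan the tasks out across the workers and collect the results.
futures = client.map(fit_candidate, range(100))
results = client.gather(futures)
print(sum(results))
```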

There is still a lot of work to be done, so there will be more to talk about in the next post. Until next time!