Predictive Models with Supervised Learning in R
The concept of statistical learning grew out of the method of least squares, developed in the early 1800s, which led to the invention of linear regression. Most of the early applications were in astronomy. The evolution of simple and multiple linear regression methods gave rise to quantitative statistical computing. Statistical learning divides most problems into two categories: supervised and unsupervised learning. In supervised learning, each predictor measurement p_l, l = 1, …, n, is paired with a response measurement r_l. The goal with large data sets is to build a model that relates the response to the predictors, so that future responses can be predicted accurately from the current predictor measurements and the error in those predictions can be measured.
The type of the response variable determines the type of problem. Quantitative variables take numerical values. Qualitative variables take values in one of several categories, such as the brand of a company or the type of plant used for manufacturing a product. Problems with a quantitative response are referred to as regression problems, and problems with a qualitative response, such as the brand or the manufacturing plant, are referred to as classification problems. The distinction is not always sharp: logistic regression, despite its name, is typically used with a qualitative response. In regression, the mean squared error (MSE) measures how closely the predicted responses match the observed responses. In classification, the Bayes classifier, which assigns each observation to its most likely class given the predictors, achieves the lowest possible error rate, and K-nearest neighbors (KNN) approximates it by estimating those conditional probabilities from nearby observations.
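As a small illustration of the mean squared error, the following sketch computes the MSE for a handful of made-up responses and predictions. The vectors actual and predicted and the helper mse() are hypothetical names introduced here only for illustration, and the values are not from any real data set.

```r
# Hypothetical observed responses and model predictions (illustrative values only)
actual    <- c(3.0, 5.0, 2.5, 7.0)
predicted <- c(2.5, 5.0, 4.0, 8.0)

# Mean squared error: the average of the squared differences
# between the observed and the predicted responses
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

mse(actual, predicted)
```

A smaller MSE indicates predictions that are closer, on average, to the observed responses.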
Linear regression is a supervised learning method for predicting a quantitative response. In simple linear regression, there is a single quantitative response Q and a single predictor variable Z, and the model assumes an approximately linear relationship between them: Q ≈ β0 + β1 Z. This is described as regressing Q onto Z. In R, the model is fitted with the lm() function by giving a formula and a data set: lm(Q ~ Z, data = dataset). For example, the Boston data set in the MASS package contains the variables lstat and medv. With lstat as the predictor and medv as the quantitative response, the fitted model can be stored in an object, here named lm.fit. Applying summary() to lm.fit reports the coefficient estimates with their standard errors, t-statistics, and p-values, a summary of the residuals (minimum, quartiles, median, and maximum), and the R-squared and F-statistic. The practical challenge is deciding which variable is the response and which are the predictors. The confint() function returns confidence intervals for the coefficient estimates and can be applied to lm.fit. The predict() function returns confidence intervals and prediction intervals for the response when given lm.fit and a data frame of predictor values. The abline() function draws the least squares regression line on a plot, or any line with a given intercept and slope. The residuals() function returns the residuals of the fitted model.
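The functions described above can be sketched on the Boston data as follows. This is a minimal example that assumes the MASS package is installed. The three lstat values in new_data are arbitrary choices made for illustration.

```r
library(MASS)  # provides the Boston data set

# Regress medv (response) onto lstat (predictor);
# lm.fit here is the name of the fitted model object
lm.fit <- lm(medv ~ lstat, data = Boston)

summary(lm.fit)   # coefficients, standard errors, R-squared, residual summary
coef(lm.fit)      # estimated intercept and slope
confint(lm.fit)   # 95% confidence intervals for the coefficients

# Confidence and prediction intervals for three chosen lstat values
new_data <- data.frame(lstat = c(5, 10, 15))
predict(lm.fit, new_data, interval = "confidence")
predict(lm.fit, new_data, interval = "prediction")

# Scatter plot of the data with the least squares line drawn on top
plot(Boston$lstat, Boston$medv, xlab = "lstat", ylab = "medv")
abline(lm.fit, col = "red")

# First few residuals of the fitted model
head(residuals(lm.fit))
```

For the same predictor values, the prediction intervals are wider than the confidence intervals because they also account for the variability of individual responses around the regression line.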
When setting up a hypothesis, it is important to define the dependent and independent variables that belong in the model. In practice, a regression often has more than one predictor variable. One strategy is to fit a separate simple regression for each predictor, but this produces a bevy of single-predictor models, each of which ignores the other predictors. Instead, a single multiple regression model can accommodate all the predictors. The earlier equation extends with one regression coefficient per predictor: Q ≈ β0 + β1 Z1 + β2 Z2 + … + βx Zx. Several potential problems affect the accuracy of such a model. The relationship between the predictors and the response may be non-linear, and in R residual plots help reveal non-linearity in the data. The error terms may also show non-constant variance. Data quality matters as well: if the data collection is not accurate, an observed response can lie far from the value the model predicts. Such an observation is called an outlier. Collinearity refers to a close association between two or more predictor variables, which makes their individual effects hard to separate. Multiple linear regression also uses the lm() function, with several predictor variables on the right-hand side of the formula, and the functions described for simple linear regression apply to it as well. The anova() function computes an analysis of variance table. Applied to two nested models fitted with lm(), it tests the null hypothesis that both models fit the data equally well against the alternative that the larger model fits better.
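A multiple regression on the Boston data might be sketched as follows, assuming the MASS package is installed. The choice of age as the second predictor and the object names fit1, fit2, and fit_all are assumptions made for illustration.

```r
library(MASS)  # provides the Boston data set

# Two nested models: lstat alone, then lstat plus age
fit1 <- lm(medv ~ lstat, data = Boston)
fit2 <- lm(medv ~ lstat + age, data = Boston)
summary(fit2)

# All predictors at once: the dot stands for every other column in Boston
fit_all <- lm(medv ~ ., data = Boston)

# F-test comparing the nested models; the null hypothesis is that
# the smaller model fits the data as well as the larger one
anova(fit1, fit2)

# Residuals against fitted values: a curved pattern suggests non-linearity,
# and a funnel shape suggests non-constant variance
plot(fitted(fit2), residuals(fit2),
     xlab = "Fitted values", ylab = "Residuals")

# Pairwise correlation as a rough first check for collinearity
cor(Boston$lstat, Boston$age)
```

A pairwise correlation only detects collinearity between two predictors at a time, so it is a starting point and not a complete diagnostic.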
Reducing prediction error is a significant step in handling the complexity of large data sets. A model should be tuned to reduce the error on future data, not only on the data used to fit it. Most fitting procedures minimize the error on the current (training) data. As a result, the error on new data is typically higher than the error on the training data. Optimizing the model against held-out data helps keep the error on future data low. A complex prediction model may involve several dependent and independent variables, and adding parameters to a simple or multiple regression can improve accuracy, although too many parameters can end up fitting noise in the training data.
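The gap between the error on the fitting data and the error on new data can be illustrated with simulated data. The sketch below is built entirely on assumptions: the sine-shaped relationship, the noise level, the seed, and the range of polynomial degrees are arbitrary choices, and mse() is a hypothetical helper.

```r
set.seed(1)  # arbitrary seed so the simulation is reproducible

# Simulated data: a sine curve plus noise (assumed purely for illustration)
n <- 200
x <- runif(n, 0, 3)
y <- sin(2 * x) + rnorm(n, sd = 0.3)
sim <- data.frame(x = x, y = y)

train <- sim[1:100, ]    # data used to fit the models
test  <- sim[101:200, ]  # new data the models have not seen

mse <- function(actual, predicted) mean((actual - predicted)^2)

degrees   <- 1:8
train_mse <- numeric(length(degrees))
test_mse  <- numeric(length(degrees))

# Fit polynomial regressions of increasing complexity and record both errors
for (i in seq_along(degrees)) {
  d <- degrees[i]
  fit <- lm(y ~ poly(x, d), data = train)
  train_mse[i] <- mse(train$y, predict(fit, newdata = train))
  test_mse[i]  <- mse(test$y, predict(fit, newdata = test))
}

data.frame(degree = degrees, train_mse, test_mse)
```

Because each polynomial model contains the previous one, the training error cannot increase as the degree grows. The error on the new data carries no such guarantee, which is why it is the more honest measure of predictive accuracy.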
In statistics, random sampling plays a critical role in validating a model, reducing error rates, and improving the accuracy of predictions. Drawing a random sample from a very large data set gives a sample that tends to be representative of the whole, which supports accurate predictions. Sampling methods fall into two classes: probability sampling and non-probability sampling.
In probability sampling, every individual in a large population has a known probability of being selected. Probability sampling can be classified into five categories (The Pennsylvania State University, 2016).
Simple random sampling
In this sampling method, every individual in a large population has an equal chance of being selected.
Stratified random sampling
In this method, the population is divided into smaller subsets called strata, so that individuals who share a relevant characteristic fall into the same stratum. Random samples are then drawn from each stratum and combined into one larger random sample for statistical computing. For example, a population might be stratified by time zone, or athletes might be stratified by sport.
Cluster sampling
In this method, the population is divided into multiple clusters instead of strata. A random sample of clusters is selected, and the individuals in the selected clusters make up the sample, so the data represent a collection of disparate population clusters.
Systematic sampling
This method selects a random starting individual in the population and then selects every k-th individual after it, where the interval k is a fixed integer.
Multistage sampling
This method combines the sampling methods above and applies them in successive stages.
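The five probability sampling methods can be sketched in base R as follows. The population data frame, its four regions, the sample sizes, and the seed are all hypothetical choices made for illustration.

```r
set.seed(42)  # arbitrary seed so the draws are reproducible

# Hypothetical population of 1,000 individuals in four regions
population <- data.frame(
  id     = 1:1000,
  region = rep(c("north", "south", "east", "west"), each = 250)
)

# Simple random sampling: every individual is equally likely to be chosen
srs <- population[sample(nrow(population), 100), ]

# Stratified random sampling: draw 25 individuals from each region (stratum)
strata <- split(population, population$region)
stratified <- do.call(rbind, lapply(strata, function(s) {
  s[sample(nrow(s), 25), ]
}))

# Cluster sampling: pick two regions at random and keep everyone in them
chosen_clusters <- sample(unique(population$region), 2)
cluster <- population[population$region %in% chosen_clusters, ]

# Systematic sampling: random start, then every k-th individual
k <- 10
start <- sample(k, 1)
systematic <- population[seq(start, nrow(population), by = k), ]

# Multistage sampling: choose two regions, then draw a simple
# random sample of 50 individuals within each chosen region
stage1 <- sample(unique(population$region), 2)
multistage <- do.call(rbind, lapply(stage1, function(r) {
  members <- population[population$region == r, ]
  members[sample(nrow(members), 50), ]
}))
```

Each method here returns a subset of the rows of population, so the resulting samples can be passed directly to a modeling function such as lm().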
Non-probability samples may contain bias when building a hypothesis. The sample might consist of people who volunteered to respond to a survey, and not people selected by a systematic, unbiased methodology. Because these are not random samples, they can skew the outcome when applied to predictive models. It is recommended to avoid non-probability sampling in order to improve the accuracy of the prediction model.
In R, the set.seed() function sets the seed of the random number generator so that random draws are reproducible. The sample() function can then split the observations at random into two halves: a training set used to fit the model and a validation set. Subsequently, the predict() function can compute predictions for the entire data set from the regression fitted on the training half, and the error on the validation half estimates the error on new data.
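The validation split described above might look like the following sketch on the Boston data, assuming the MASS package is installed. The seed value and the choice of medv and lstat as response and predictor are assumptions made for illustration.

```r
library(MASS)  # provides the Boston data set

set.seed(1)  # fixes the random number generator so the split is reproducible

n <- nrow(Boston)
train <- sample(n, n / 2)  # row indices of the randomly chosen training half

# Fit the regression on the training half only
lm.fit <- lm(medv ~ lstat, data = Boston, subset = train)

# Predict for every observation, then measure the error
# on the held-out (validation) half only
pred <- predict(lm.fit, Boston)
validation_mse <- mean((Boston$medv - pred)[-train]^2)
validation_mse
```

A different seed gives a different split and therefore a somewhat different validation error, which is worth keeping in mind when comparing models on a single split.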
Fortmann-Roe, S. (2012). Accurately Measuring Model Prediction Error. Retrieved January 11, 2016, from http://scott.fortmann-roe.com/docs/MeasuringError.html
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics) (5 ed.). New York:
The Pennsylvania State University (2016). 3.5 Simple Random Sampling and Other Sampling Methods. Retrieved January 13, 2016, from https://onlinecourses.science.psu.edu/stat100/node/18