R has knives out.

Figure 1. Hortonworks and Revolution Analytics have teamed up to bring R.

Introduction

Bell Labs conceived the S language in the mid-1970s to address problems in data analysis and statistics. The project began as an internal statistical analysis environment built on Fortran libraries. Early versions of S did not include functions for statistical modeling. In the late 1980s, S was rewritten in C, and subsequent releases through the late 1990s added the advanced functionality that R also offers. Ownership of S then passed through a series of acquisitions and mergers involving Bell Labs, Alcatel-Lucent, Insightful Corp., and TIBCO. More recent versions such as S-PLUS added an array of new features, mainly graphical user interfaces in place of command-line systems. S thus grew out of data analysis and statistical computing rather than out of a need to invent a classical programming language, which kept it ahead of the pack in statistical computing for decades (Peng, 2015).

In the early 1980s, the spread of computers led to a collaborative project at MIT to provide free software that the general public could enhance and redistribute under the General Public License (GPL). Most versions of S, including S-PLUS, were commercial distributions. In the early 1990s, R was created at Auckland as open-source software under the GPL, with advanced graphics and statistical computing. Although the syntax of S and R looks very similar, the two languages differ considerably in their semantics. R runs on a wide range of operating systems and has improved steadily through enhancements to its open source code. Once the R maintainers approve changes to the source, a new release becomes available for distribution. Unlike commercial distributions, which depend on infrequent major releases, R's frequent bug-fix releases give the public new features and prompt resolution of bugs. R's statistical graphics are more sophisticated than those of most other statistical computing languages (Peng, 2015).

Statistical aspects of R

Descriptive statistics typically follow data wrangling: the data are first transformed, and summary statistics then turn the raw data into insight by describing each variable with a brief summary. R provides statistical functions for the mean, median, standard deviation, variance, median absolute deviation, quantiles, range, sum, lagged differences, minimum, maximum, and scaling (Kabacoff, 2011).

Descriptive statistics can be computed for several variables at once with sapply(), or with functions from the packages Hmisc, pastecs, and psych (Quick-R, n.d.).
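
As a minimal sketch of the functions listed above, the following uses the built-in mtcars dataset, which is chosen here only for illustration. The helper my_stats() is a hypothetical name and not a function from any package.

```r
# Descriptive statistics on one variable of the built-in mtcars dataset
mpg <- mtcars$mpg

mean(mpg)                       # arithmetic mean
median(mpg)                     # median
sd(mpg)                         # standard deviation
var(mpg)                        # variance
mad(mpg)                        # median absolute deviation
quantile(mpg, c(0.25, 0.75))    # first and third quartiles
range(mpg)                      # minimum and maximum
sum(mpg)                        # total
diff(mpg, lag = 1)              # lagged differences
mpg_scaled <- scale(mpg)        # center to mean 0 and scale to sd 1

# my_stats() is a hypothetical helper written for this example
my_stats <- function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}

# Apply the helper to several columns at once with sapply()
stats_table <- sapply(mtcars[c("mpg", "hp", "wt")], my_stats)
print(stats_table)
```

The result is a small matrix with one column per variable and one row per statistic. Packages such as Hmisc, pastecs, and psych offer richer summaries of the same kind.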

R also supports many probability functions for distributions such as the beta, binomial, Cauchy, chi-squared, exponential, Wilcoxon signed rank, gamma, hypergeometric, lognormal, and geometric (Kabacoff, 2011).

R also supports rich visualization of statistically analyzed data, including bar charts, pie charts, histograms, kernel density plots, box plots, parallel box plots, violin plots, and dot plots (Kabacoff, 2011).
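
The probability functions and plot types mentioned above can be tried together in a short sketch. The sample below is simulated purely for illustration. The plotting summaries are computed without drawing so that the code also runs in a non-interactive session.

```r
set.seed(1)                     # make the random draws reproducible

# Normal distribution: d = density, p = cumulative probability,
# q = quantile, r = random generation
dnorm(0)                        # density at 0
pnorm(1.96)                     # P(X <= 1.96)
qnorm(0.975)                    # 97.5th percentile
x <- rnorm(500, mean = 50, sd = 10)

# Other distributions follow the same naming scheme
dbinom(3, size = 10, prob = 0.5)   # P(exactly 3 successes in 10 trials)
pexp(2, rate = 1)                  # P(X <= 2) for an exponential variable

# Summaries behind a histogram, a kernel density plot, and a box plot
h <- hist(x, plot = FALSE)      # bin counts without drawing
d <- density(x)                 # kernel density estimate
b <- boxplot.stats(x)$stats     # the five numbers a box plot displays

# In an interactive session, plot(h), plot(d), and boxplot(x) draw the figures
```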

Descriptive statistics in R can also be summarized by group for specific elements of the data. R computes correlation coefficients to describe associations between quantitative variables, including Pearson, Spearman, Kendall, partial, polychoric, and polyserial correlations (Kabacoff, 2011). R also supports a large set of regression models for quantitative analysis, such as simple linear, polynomial, multiple linear, multivariate, nonlinear, nonparametric, robust, time-series, and Cox proportional hazards regression (Kabacoff, 2011).
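
A brief sketch of three of the correlation types named above, using two columns of the built-in mtcars dataset as an illustrative choice:

```r
# Correlation between car weight and fuel economy in mtcars
pearson  <- cor(mtcars$wt, mtcars$mpg, method = "pearson")
spearman <- cor(mtcars$wt, mtcars$mpg, method = "spearman")
kendall  <- cor(mtcars$wt, mtcars$mpg, method = "kendall")

print(c(pearson = pearson, spearman = spearman, kendall = kendall))

# Test whether the Pearson correlation differs from zero
ct <- cor.test(mtcars$wt, mtcars$mpg)
print(ct$p.value)
```

All three coefficients are negative here because heavier cars in this dataset tend to have lower fuel economy.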

R also provides regression diagnostics. For example, the car package provides functions such as avPlots(), ncvTest(), qqPlot(), crPlots(), outlierTest(), scatterplotMatrix(), durbinWatsonTest(), and vif(). The vif() function computes variance inflation factors, which help detect multicollinearity among the predictors of a regression (Kabacoff, 2011).

Whether applied to the datasets bundled with R packages or to external datasets, these statistical tools serve to build predictive models from historical patterns and trends in the data (James, Witten, Hastie, & Tibshirani, 2013).

Such statistical methods serve a range of applied fields, including science, mathematics, finance, and business. A few of these methods are support vector methods, regression modeling, and sparse regression (James, Witten, Hastie, & Tibshirani, 2013).

Statistical methods can be broadly classified as supervised or unsupervised learning. A supervised method takes one or more input variables and models an output, which supports prediction and estimation. An unsupervised method also works with multiple inputs, but there is no supervising output; the aim instead is to understand the relationships and structure among the data (James, Witten, Hastie, & Tibshirani, 2013).

R supports simple and multiple linear regression, which use one predictor variable or several. For example, if a corporation generates revenue through Internet, radio, and newspaper channels, a simple regression model with a single predictor cannot show how each medium contributes to projected revenue. A multiple linear regression model has the advantage of including all the predictors, with a regression coefficient estimate for each (James, Witten, Hastie, & Tibshirani, 2013).
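
The contrast between simple and multiple linear regression can be sketched with lm(). The data frame below is simulated: the column names and the coefficients used to generate revenue are assumptions made up for this example, not figures from any real corporation.

```r
set.seed(42)
n <- 200

# Hypothetical spending per channel (simulated, for illustration only)
ads <- data.frame(
  internet  = runif(n, 0, 100),
  radio     = runif(n, 0, 50),
  newspaper = runif(n, 0, 30)
)

# Simulated revenue: depends on internet and radio, not on newspaper
ads$revenue <- 5 + 0.05 * ads$internet + 0.20 * ads$radio + rnorm(n, sd = 1)

# Simple linear regression: one predictor
simple_fit <- lm(revenue ~ internet, data = ads)

# Multiple linear regression: one coefficient estimate per predictor
multiple_fit <- lm(revenue ~ internet + radio + newspaper, data = ads)

print(coef(multiple_fit))
print(summary(simple_fit)$r.squared)
print(summary(multiple_fit)$r.squared)
```

In this simulation the multiple regression explains more of the variance, since the simple model cannot account for the radio channel.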

R can perform analysis of variance on dependent and independent variables, including one-way ANOVA, two-way factorial ANOVA, one-way ANCOVA with one covariate, two-way factorial ANCOVA with two covariates, multivariate analysis of variance (MANOVA), and robust MANOVA (Kabacoff, 2011).
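
Several of these designs can be sketched with aov() and manova() from base R. The built-in datasets and the variables chosen below are illustrative assumptions only.

```r
# One-way ANOVA: plant weight by treatment group (PlantGrowth)
one_way <- aov(weight ~ group, data = PlantGrowth)
p_one_way <- summary(one_way)[[1]][["Pr(>F)"]][1]

# Two-way factorial ANOVA: tooth length by supplement and dose (ToothGrowth)
two_way <- aov(len ~ supp * factor(dose), data = ToothGrowth)
print(summary(two_way))

# One-way ANCOVA: mpg by cylinder count, with weight as the covariate
ancova <- aov(mpg ~ wt + factor(cyl), data = mtcars)
print(summary(ancova))

# One-way MANOVA: two response variables by species (iris)
manova_fit <- manova(cbind(Sepal.Length, Petal.Length) ~ Species, data = iris)
print(summary(manova_fit))
```

In the ANCOVA formula the covariate is listed before the factor, so the group effect is assessed after adjusting for the covariate.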

Programming features of R

R code can be run at the R console or in RStudio, either interactively or as an R script. Each input must be a complete expression to be syntactically valid. Comments begin with the hash symbol (#). R evaluates each input expression and returns the result (Peng, 2015).

Every data element in R belongs to one of five basic (atomic) classes that determine its type.

i. Character, which stores text (string) data

ii. Numeric, which represents real numbers

iii. Integer

iv. Complex

v. Logical (Boolean), which represents an outcome such as a true or false value (Peng, 2015).
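
The five types can be illustrated in a few lines. The variable names are arbitrary.

```r
# Each assignment is a complete expression; text after # is a comment
txt  <- "statistics"   # character
num  <- 3.14           # numeric (real number)
int  <- 42L            # integer (the L suffix requests an integer)
cplx <- 2 + 3i         # complex
flag <- TRUE           # logical

# class() reports the type of each object
types <- sapply(list(txt, num, int, cplx, flag), class)
print(types)
```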

R programming attributes and functions

R objects can carry attributes, which act as metadata describing the object. Attributes describe elements such as the names of an object or the dimensions of an array, and they can be retrieved with the attributes() function. The c() function concatenates multiple objects into a vector. Beyond plain vectors, R provides related structures such as matrices, factors, and lists. Another programming feature of R is the data frame, which stores tabular data. Data frames are commonly created by reading .csv files with read.csv() or other tabular text files with read.table(). Dates and times are handled by the Date class and by the POSIXct and POSIXlt classes (Peng, 2015).
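
A short sketch of these objects and their attributes follows. All values are made up for illustration.

```r
# c() concatenates values into a vector
v <- c(1.5, 2.5, 3.5)

# A matrix is a vector with a dimension attribute
m <- matrix(1:6, nrow = 2, ncol = 3)
print(attributes(m))                # shows $dim

# A factor stores categorical data with a fixed set of levels
f <- factor(c("low", "high", "low"))

# A list can mix elements of different types
l <- list(id = 1L, label = "a", ok = TRUE)

# A data frame holds tabular data; its names are stored as attributes
df <- data.frame(name = c("x", "y"), value = c(10, 20))
print(attributes(df))               # names, class, row.names

# table() tabulates how often each level occurs
counts <- table(f)

# Date and time classes
d <- as.Date("2016-01-05")
when <- as.POSIXlt("2016-01-05 10:30:00", tz = "UTC")
print(when$hour)                    # POSIXlt exposes components by name
```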

R mathematical functions

R provides numerous built-in mathematical functions, similar to those of other programming languages, for computing absolute values, the smallest integer not less than a value (ceiling), the largest integer not greater than a value (floor), rounding, calculus functions, and logarithms. Other features include signal processing and random number generation (Kabacoff, 2011).
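
A few of these functions in use, with input values picked arbitrarily:

```r
abs(-3.7)                        # absolute value
up    <- ceiling(3.2)            # smallest integer not less than 3.2
down  <- floor(3.8)              # largest integer not greater than 3.8
cut   <- trunc(-3.7)             # drop the fractional part
near  <- round(3.14159, 2)       # round to two decimal places
root  <- sqrt(16)                # square root
lg    <- log(100, base = 10)     # logarithm to base 10
e_val <- exp(1)                  # exponential function

set.seed(123)                    # reproducible random numbers
u <- runif(5)                    # five uniform draws between 0 and 1
print(c(up, down, cut, near, root, lg))
```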

R programming features for read and write functionality

Read functions

  • R can read tabular data from text-based files with the read.csv function
  • The readLines function reads the lines of a text-based file
  • The source and dget functions read R code from files
  • The load function restores a saved workspace
  • The unserialize function reads individual R objects stored in binary form (Peng, 2015).

Write functions

  • R can write tabular data to text-based files with the write.table function
  • The writeLines function writes character data line by line to a text file
  • The dump and dput functions write a textual representation of R objects
  • The save function stores objects in a compressed binary form
  • The serialize function converts an object to binary form for output to a connection or a file (Peng, 2015).
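
The read and write functions pair up naturally, which the following round-trip sketch tries to show. It writes only to temporary files, and the small data frame is invented for the example. write.csv() is used here as the comma-separated convenience wrapper around write.table().

```r
df <- data.frame(id = 1:3, score = c(90.5, 82, 77.25))

# Tabular text: write.csv() and read.csv()
csv_path <- tempfile(fileext = ".csv")
write.csv(df, csv_path, row.names = FALSE)
df_csv <- read.csv(csv_path)

# Plain lines: writeLines() and readLines()
txt_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), txt_path)
lines_in <- readLines(txt_path)

# Textual representation of an object: dput() and dget()
dput_path <- tempfile(fileext = ".R")
dput(df, file = dput_path)
df_dget <- dget(dput_path)

# Binary workspace file: save() and load()
rda_path <- tempfile(fileext = ".rda")
save(df, file = rda_path)
env <- new.env()
load(rda_path, envir = env)      # restores df into env

# Binary form of a single object: serialize() and unserialize()
raw_bytes <- serialize(df, connection = NULL)
df_unser <- unserialize(raw_bytes)
```

Loading into a separate environment avoids silently overwriting objects of the same name in the current workspace.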

Memory architecture of R

R is not a database that stores data on disk for INSERT, DELETE, and UPDATE operations. Instead, R loads the entire dataset into main memory, much like an in-memory database, except that the data usually comes from files. A dataset with roughly one and a half million records and more than 100 columns (for example, 120 numeric columns at 8 bytes per value) requires in excess of 1.3 GB of RAM. Even if the machine has more than 1.3 GB of memory installed, the amount available depends on what other applications consume at runtime on the operating system. A machine that hosts several operating systems at once, for example on Mac OS X, consumes memory for each of them separately and leaves less for R than a machine running a single operating system such as Windows. The most reliable way to run a large dataset in R is to limit the number of other applications running on the machine (Peng, 2015).
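
The memory figure above can be checked with simple arithmetic. The sketch below assumes 120 columns, all numeric at 8 bytes per value. The column count is an assumption chosen to match "100+ columns", and estimate_gb() is a hypothetical helper.

```r
# Rough memory estimate for an all-numeric dataset
# (hypothetical helper; 8 bytes per double-precision value)
estimate_gb <- function(rows, cols, bytes_per_value = 8) {
  rows * cols * bytes_per_value / 2^30
}

needed <- estimate_gb(1.5e6, 120)   # assumes 120 numeric columns
print(round(needed, 2))

# object.size() reports the memory used by an existing object
x <- numeric(1e6)
print(format(object.size(x), units = "Mb"))
```

The estimate covers only the raw data. R often needs additional working memory while reading and transforming a dataset.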

External connection integration for interfaces

R can connect to external files in a variety of formats with the file function and related import tools, including delimited text files, Excel files, XML, data interchange format (DIF), DBF, binary files, image files, and SPSS files.

R can connect to many database sources through the RODBC (R open database connectivity) package, such as SAP HANA, Oracle, PostgreSQL, MySQL, SQLite, and IBM DB2, on multiple operating systems. R can also be called from a series of programming languages such as Java, C, C++, and Python (CRAN-R, n.d.).

  • R can connect to a network socket for socket programming (CRAN-R, n.d.).
  • R can read and write files in compressed formats such as gzip and bzip2 with the gzfile and bzfile functions
  • R can connect to the URL of a web page to retrieve data with the url function (Peng, 2015).
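
A minimal sketch of the connection interface, using a gzip-compressed temporary file with invented contents. The url() call is left as a comment because it needs network access, and the address shown there is a placeholder.

```r
# Write compressed text through a gzfile() connection
gz_path <- tempfile(fileext = ".gz")
con <- gzfile(gz_path, open = "w")
writeLines(c("id,value", "1,10", "2,20"), con)
close(con)

# Read the lines back through a second connection
con <- gzfile(gz_path, open = "r")
lines_in <- readLines(con)
close(con)

# Parse the recovered text as comma-separated data
df <- read.csv(text = lines_in)
print(df)

# Reading from a web page follows the same pattern (not run here):
# con <- url("http://www.example.com/")   # placeholder address
# page <- readLines(con); close(con)
```

Closing each connection explicitly releases the underlying file handle.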

Packages

R ships with a set of default packages that are vetted and delivered with the base distribution. Packages let R development proceed at a rapid pace, since each one targets a specific area of statistical computing. Packages are written mainly in R, often with performance-critical parts in compiled languages such as C, C++, or Fortran. Not every package comes with the installation of R; additional packages can be installed from CRAN with the install.packages function. Once installation completes, a package has to be loaded into the current R session with the library function (Peng, 2015).
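
The install-then-load workflow might look like the sketch below. The package name is only an example, and the install call is commented out so that the code runs without network access.

```r
pkg <- "psych"   # example package name; other CRAN packages work the same way

# Check whether the package is already installed
if (!requireNamespace(pkg, quietly = TRUE)) {
  # install.packages(pkg)   # uncomment to download the package from CRAN
  message("Package '", pkg, "' is not installed")
}

# Packages that ship with R only need to be loaded with library()
library(stats)
loaded <- "package:stats" %in% search()
print(loaded)
```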

R programming control structures

R has several control structures that govern how expressions are executed. They are similar to those found in many other programming languages:

  • The if-else statement tests a condition and runs one of two branches; statements can be nested to check multiple conditions
  • The for statement runs a block of code a fixed number of times, once for each element in a range of data, and for loops can be nested
  • The while statement runs a block only as long as a condition is true
  • The repeat statement runs a loop indefinitely until a break statement ends it
  • The break statement exits a loop when a specific condition is met
  • The next statement skips the current iteration of a loop for a particular condition (Peng, 2015).
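
Each of the control structures above appears once in the following sketch. The values are arbitrary.

```r
# if-else: choose a branch based on a condition
x <- 7
size <- if (x > 5) "large" else "small"

# for: iterate over a range of values; next skips an iteration
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next    # skip even numbers
  total <- total + i
}

# while: run as long as the condition holds
n <- 0
while (n < 3) {
  n <- n + 1
}

# repeat: loop until break is called
k <- 1
repeat {
  k <- k * 2
  if (k > 100) break       # exit once the condition is met
}

print(c(total = total, n = n, k = k))
```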

R debugging

R is an advanced programming language with debugging facilities for tracing issues that occur during execution, including the traceback, debug, browser, trace, and recover functions (Peng, 2015).
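
Most of these tools are interactive, so the sketch below only shows how a function is flagged for debugging. The function scale_values() is a hypothetical example, and the interactive calls are listed as comments.

```r
# A small function to inspect (hypothetical example)
scale_values <- function(x, factor) {
  x * factor
}

debug(scale_values)                       # flag the function for step-through debugging
was_flagged <- isdebugged(scale_values)   # TRUE while the flag is set
undebug(scale_values)                     # remove the flag again

# Not run here because they need an interactive session:
# traceback()                  # print the call stack after an error
# browser()                    # pause inside a function when reached
# options(error = recover)     # choose a frame to inspect after an error
```

While the flag is set, any call to the function opens the step-through browser, which is why the flag is removed before the function is used again.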

R for machine learning

R supports machine learning through a range of packages, a principal one being nnet. The package implements neural networks, computational models built from interconnected nodes that are loosely patterned on neurons linked by axons and synapses. R also provides packages for recursive partitioning, random forests, regularized and shrinkage methods, boosting, support vector machines, Bayesian methods, genetic algorithms, fuzzy rule-based systems, model selection, and meta packages (Hothorn, 2015).

R programming features

List of R programming features

  • R programming environment: R Console and RStudio
  • Data types of R: character, numeric, integer, complex, and logical (Boolean)
  • R attributes and data structures: for example vectors, attributes, matrices, factors, data frames, tables, and lists
  • Mathematical functions in R: logarithms, calculus functions, random number generation, rounding, ceiling, floor, truncation, absolute value, minimum and maximum, and other features such as signal processing
  • R read functions: read.csv, readLines, source, dget, load, and unserialize
  • R write functions: write.table, writeLines, dump, dput, save, and serialize
  • In-memory architecture of R: data loads into RAM (physical memory)
  • External connections to files: text files, Excel spreadsheets, tab-delimited text files, .csv files, DIF, DBF, binary files, image files, gzip, bzip2, and SPSS files
  • R connectivity to databases: IBM DB2, Microsoft SQL Server, MySQL, PostgreSQL, Oracle, and SAP HANA
  • R connectivity to programming languages: R can be invoked from Java, C, C++, VC++, and Python
  • R network socket programming: socket programming can be performed in R
  • R web page connectivity: R can connect to URLs and retrieve data from web pages

R crunching petabytes of data

R can help process very large datasets, including high-performance computing on petabytes of data. Large datasets can be handled through both user-controlled mechanisms and system-level parallelization, given ample RAM. R can also break the data into chunks with MapReduce techniques, which are tactics for handling large volumes of data rather than architectural strategies. For such datasets R offers packages such as bigmemory and ff, as well as direct connections to external databases. For Hadoop, connection and parallelization can be achieved through the packages HadoopStreaming and Rhipe. Further packages such as bigtabulate, bigvideo, bigalgebra, biganalytics, and synchronicity help handle enormous volumes of data, each providing specific matrix functions (Rosario, 2010).

By default R processes data in main memory on a single core, regardless of the number of cores available. To overcome this limitation, MapReduce is used; it is an abstract parallelization paradigm rather than hardware parallelization across cores. The data is split into chunks, each chunk produces its own output, and the individual outputs are then combined into a single result. The HadoopStreaming package processes map and reduce batch jobs, written in any compatible programming language, on Hadoop (Rosario, 2010). On January 4, 2016, Microsoft released a commercial distribution of R Server for platforms such as Hadoop on Red Hat Linux, Teradata, and SUSE Linux, reinventing R for processing large datasets (Foley, 2016).
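
The split, map, and reduce steps described above can be imitated in base R on a single machine. This is only a sketch of the idea with simulated data. It does not use Hadoop, HadoopStreaming, or Rhipe, and the chunk size is an arbitrary choice.

```r
set.seed(7)
x <- rnorm(10000)                       # stand-in for a large dataset

# Split: break the data into chunks of 1,000 values
chunk_id <- ceiling(seq_along(x) / 1000)
chunks <- split(x, chunk_id)

# Map: compute a partial result for each chunk independently
partials <- lapply(chunks, function(ch) c(sum = sum(ch), n = length(ch)))

# Reduce: combine the partial results into a single output
combined <- Reduce(`+`, partials)
chunked_mean <- combined[["sum"]] / combined[["n"]]

print(chunked_mean)
```

Because each chunk is processed independently, the lapply() call could be replaced by a parallel variant such as parallel::mclapply() to spread the map step across cores.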

R weds big data, a perfect integration

Today, most corporations use big data to gain a competitive advantage over other enterprises in the same field or to propel their year-over-year growth, driven by the convergence of IoT and big data. R analytics can help corporations tame and wrangle their data. Algorithms applied to very large datasets can provide more accurate and precise information than the same algorithms applied to the smaller datasets held in structured relational database management systems (RDBMS). Corporations that apply such algorithms can capture a major share of their market. The rise of the algorithmic economy, leveraged through R, has helped corporations turn raw data into counterintuitive insights. The standard built-in algorithms of classical database systems cannot provide insights of the same depth. R analytics performs statistical analysis with quantitative methods to fuel innovation and exploratory data analysis in the organization. R thus combines in-memory processing with the ability to connect to a range of databases and programming languages, to run MapReduce methods, and to draw data from the open-source Hadoop platform or from commercial distributions such as Cloudera, MapR, and Hortonworks. R is well suited to big data analytics because it can perform in-file or in-database analytics with world-class quantitative methods (Revolution Analytics, n.d.).

References

CRAN-R (n.d.). R Data Import/Export. Retrieved January 5, 2016, from https://cran.r-project.org/doc/manuals/r-devel/R-data.html#Imports

Foley, M. J. (2016). Microsoft rolls out its R Server big-data analytics line-up. Retrieved January 5, 2016, from http://www.zdnet.com/article/microsoft-rolls-out-its-r-server-big-data-analytics-line-up/

Hothorn, T. (2015). CRAN Task View: Machine Learning & Statistical Learning. Retrieved January 5, 2016, from https://cran.r-project.org/web/views/MachineLearning.html

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R (Springer Texts in Statistics) (5 ed.). New York:

Kabacoff, R. (2011). R in Action (1 ed.). New York: Manning Publications.

Peng, R. D. (2015). R Programming for Data Science. Retrieved January 3, 2016, from https://www.cs.upc.edu/~robert/teaching/estadistica/rprogramming.pdf

Quick-R (n.d.). Descriptive Statistics. Retrieved January 5, 2016, from http://www.statmethods.net/stATS/descriptives.html

Revolution Analytics (n.d.). Advanced ’Big Data’ Analytics with R and Hadoop. Retrieved January 5, 2016, from http://www.revolutionanalytics.com/whitepaper/advanced-big-data-analytics-r-and-hadoop

Revolution Analytics (2013). The Modern Data Architecture for Predictive Analytics with Hortonworks and Revolution Analytics. Retrieved August 31, 2016, from http://www.slideshare.net/RevolutionAnalytics/modern-data-architecture-for-predictive-analytics-hortonworks-and-revolution-analytics

Rosario, R. R. (2010). Taking R to the Limit, Part II, Working with Large Datasets. Retrieved January 5, 2016, from http://www.slideshare.net/bytemining/r-hpc