Data Analysis in R
How we analyse data has changed dramatically with the rise of open source tools such as Python and R. Almost everything we say and do in our daily activities is now recorded and used for some purpose, thanks to advances in personal computer processing power and the growth of internet access and speed. The amount of data collected by companies keeps growing rapidly. The sciences of data analysis, whether statistics, psychometrics, econometrics or machine learning, have kept pace with the growing interest in and demand for data-driven solutions.
Today I want to show two of the coolest and most versatile technologies side by side, from the start. Python is an object-oriented programming language that many data scientists use every day to analyse data, build models, predict business outcomes and solve real-world, organization-specific problems. In the same manner, R is a language and environment for statistical computing and graphics. The crazy part about R is that new packages and methods become available to download on an almost weekly basis. R also has state-of-the-art visualization capabilities that need only a few tweaks and some tuning. R is case sensitive, and it can be quite confusing if you spend most of your time in Python and then start using R.
As far as I know, there are two ways to enter commands in RStudio: type them one at a time at the command prompt (>), or run a set of commands from a file in the text editor. The first thing to do when starting with R is to check your working directory with getwd(), which returns the current working directory:
> getwd()
You can also set your working directory before loading a file into R using setwd(), which changes the working directory to the folder you give it, for example setwd(dir = "C:/Users/kiraz/Documents/projectmedium"). Once the working directory is set, it's time to start slicing the cake, by which I mean working on the data. For this post I am using a Wisconsin hospital utilization and cost dataset that I downloaded from the internet; the dataset is very small.
Each step is run at the command prompt. To read the file, use read_csv (from the readr package; base R offers read.csv) and pass it the path to the file.
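As a minimal sketch of the reading step, the block below writes a tiny made-up CSV to a temporary folder and reads it back with base R's read.csv. The file name hospital_sample.csv and the values in it are placeholders, not the real hospital data; only the column names AGE and TOTCHG come from the dataset used in this post. With the readr package installed, read_csv(csv_path) is called the same way.

```r
# Toy stand-in for the hospital file (values are invented for illustration).
toy <- data.frame(AGE = c(0, 0, 1, 17), TOTCHG = c(1200, 800, 300, 900))

# Hypothetical file name inside R's per-session temporary directory.
csv_path <- file.path(tempdir(), "hospital_sample.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the file back into a data frame, as you would with the real dataset.
hops <- read.csv(csv_path)
str(hops)
```

Writing to tempdir() keeps the example from touching your actual project folder; with real data you would point csv_path at the downloaded file instead.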
What does the histogram indicate? From the graph we can see that infants have the highest frequency of hospital visits, going above 300. The summary of the AGE attribute (after converting AGE from numeric to factor) gives the numerical counts, and we can see that there are 307 entries in the 0–1 year range.
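The two steps described above, a histogram of AGE and a summary of AGE as a factor, could be sketched like this. The data frame here is a toy stand-in with invented values, so the counts will not match the 307 from the real file; the plot title and axis label are my own choices.

```r
# Toy ages (invented); the real dataset has a few hundred rows.
hops <- data.frame(AGE = c(0, 0, 0, 1, 17, 17))

# Histogram of the numeric ages.
hist(hops$AGE, col = "steelblue",
     main = "Frequency of hospital visits by age", xlab = "Age")

# Converting AGE to a factor makes summary() count the rows per age value.
age_counts <- summary(as.factor(hops$AGE))
print(age_counts)
```

On a numeric column summary() would report quartiles and the mean, which is why the conversion to factor is needed to get per-age counts.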
Let's first see what my dataset looks like using the head function, which works much like head in Python. In Python (pandas) I would use hops.head()[:2], or simply hops.head(2), which shows me the first two rows of the hops dataset. In R you just call head on the data frame and, if you want, pass the number of rows to show.
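A quick sketch of the R side, again on a toy hops data frame with invented values rather than the real file:

```r
# Toy stand-in for the hospital data (values are invented).
hops <- data.frame(AGE = c(0, 0, 0, 1, 17, 17),
                   TOTCHG = c(1200, 800, 500, 300, 900, 400))

first_rows <- head(hops)     # first six rows by default
first_two <- head(hops, 2)   # only the first two rows
print(first_two)
```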
If you want to collapse (aggregate) the data, R makes it pretty handy with a function of the following form:
> aggregate(x, by, FUN)
Here x is the data to be collapsed, by is a list of grouping variables that define the new observations, and FUN is the function used to calculate the summary statistic for each group. In Python there are many ways to do this, such as a group by, or simply describe to see the summary statistics.
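To make the x, by, FUN form concrete, here is a small sketch on the built-in mtcars dataset, used as a stand-in because the hospital file is not bundled with R. The name mpg_by_cyl is just my label for the result.

```r
# Mean miles per gallon for each cylinder count in the built-in mtcars data.
# x   = the values to collapse
# by  = a named list of grouping variables
# FUN = the summary function applied to each group
mpg_by_cyl <- aggregate(x = mtcars$mpg,
                        by = list(cyl = mtcars$cyl),
                        FUN = mean)
print(mpg_by_cyl)
```

The result has one row per group; the grouping column takes the name given in the list (cyl) and the summarised values land in a column called x.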
Now I want to find the age category with the maximum expenditure. That means I need to add up the expenditure for each age and then find the maximum of those sums. I will use the aggregate function to sum the total charges (TOTCHG) by age (AGE).
> aggregate(TOTCHG ~ AGE, FUN = sum, data = hops)
Running this in RStudio returns the total charges for each age.
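Here is the same call as a runnable sketch. The hops data frame below is a toy with invented charges, so the totals only illustrate the shape of the output, not the real hospital figures.

```r
# Toy stand-in for the hospital data (values are invented).
hops <- data.frame(AGE = c(0, 0, 0, 1, 17, 17),
                   TOTCHG = c(1200, 800, 500, 300, 900, 400))

# Formula interface: sum TOTCHG within each value of AGE.
totals <- aggregate(TOTCHG ~ AGE, FUN = sum, data = hops)
print(totals)
```

The formula form reads as "TOTCHG grouped by AGE" and keeps the original column names in the result, which makes the output easier to read than the x, by, FUN form.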
On the hospital data, the result shows that infants, the age zero category, also have the maximum hospital costs (in line with their frequency of visits). Similar readings can be made for the other age categories, depending on the interests of the facility collecting the data and of the insurance company paying for the services on record. Another way to find the maximum value is to call the max function on the result of the aggregate call above.
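A minimal sketch of that idea, on the same kind of toy data (invented values). I apply max to the TOTCHG column of the aggregated result and use which.max to recover the matching age; the variable names are my own.

```r
# Toy stand-in for the hospital data (values are invented).
hops <- data.frame(AGE = c(0, 0, 0, 1, 17, 17),
                   TOTCHG = c(1200, 800, 500, 300, 900, 400))

totals <- aggregate(TOTCHG ~ AGE, FUN = sum, data = hops)

# Largest summed charge across all ages.
highest_total <- max(totals$TOTCHG)

# Row of the aggregated table that holds that maximum.
top_age <- totals[which.max(totals$TOTCHG), ]
print(top_age)
```

Picking the column explicitly matters: calling max on the whole data frame would also compare the AGE values, which is rarely what you want.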
This max function is similar to the Python max function for comparing the elements of a list. In Python, with data = [1, 2, [3, 4], 5, [7, 8, 9]], a plain max(data) fails because Python 3 cannot compare an int with a list, so the key has to reduce each item to a number: max(data, key=lambda item: max(item) if isinstance(item, list) else item), which returns [7, 8, 9].
Data visualisations in R
Data visualisation is one of the most exciting and innovative ways of presenting results to users or readers without too much technical jargon. Most data visualisation in Python uses Seaborn and Matplotlib, and for interactive visualisation Bokeh and Plotly are also great. Today I am not focusing on Python-based visualisation but on visualisation in R. I am going to use the datasets available in RStudio to show the visualisation steps with some of the most commonly used R functions and packages. The first thing to do is to install RStudio for the type of machine you are using.
Once it is installed, open RStudio; the next step is loading the datasets library. The examples I am going to show all use datasets from the datasets package, which ships with R and is therefore available in a default RStudio setup.
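Loading the library and taking a first look at the two datasets used in the rest of this post might look like this. The datasets package is normally attached already when R starts, so the library call is mostly there for clarity.

```r
# Attach the datasets package (usually attached by default in an R session).
library(datasets)

# Make the two built-in datasets used in this post available by name.
data(mtcars)
data(airquality)

# Size of each dataset: rows, then columns.
dim(mtcars)
dim(airquality)

# First three rows of the automobile data.
head(mtcars, 3)
```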
I assume everyone has at least some familiarity with the basic charts from school or college. To keep things flowing, here are the definitions:
A bar chart is a type of chart that shows the values of different categories of data as rectangular bars of different lengths. It is a one-dimensional diagram, also known as a pillar diagram or column graph. A histogram looks like a bar chart but shows how data is spread over a range of values.
The concept of the bar chart in R is the same as anywhere else: to show a categorical comparison between two or more variables. However, there are several different types of bar chart to know and understand.
Horizontal and vertical bar charts are already familiar; they are standard formats in most academic and professional presentations. R also provides a stacked bar chart that lets you break each category down by another variable. To show this I am going to use the mtcars automobile dataset, the kind of small built-in dataset that anyone who has worked with the scikit-learn datasets in Python will find familiar.
> Numbers <- table(mtcars$cyl, mtcars$gear)
> barplot(Numbers, main = 'Automobile cylinder number grouped by number of gears',
+         col = c('red', 'orange', 'steelblue'),
+         legend = rownames(Numbers), xlab = 'Number of Gears', ylab = 'count')
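The call above stacks the cylinder counts inside each gear bar. As a sketch of one of the other bar chart types, passing beside = TRUE to the same barplot function draws the groups side by side instead; the title text and the name mids are my own choices.

```r
# Cross-tabulate cylinders (rows) against gears (columns).
Numbers <- table(mtcars$cyl, mtcars$gear)
print(Numbers)

# beside = TRUE places one bar per cylinder count next to each other
# within every gear group, instead of stacking them.
mids <- barplot(Numbers, beside = TRUE,
                col = c("red", "orange", "steelblue"),
                legend.text = rownames(Numbers),
                main = "Cylinders by number of gears (grouped)",
                xlab = "Number of Gears", ylab = "count")
```

barplot returns the bar midpoints, which is handy if you later want to add text labels above the bars.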
The next visualisation I am going to show is the histogram. So what is the difference between bar charts and histograms? Although they look alike, they have one important difference: they plot different types of data. Plot discrete data on a bar chart, and plot continuous data on a histogram.
To show the histogram I used the airquality dataset from the built-in datasets library, which records temperature in degrees Fahrenheit. The histogram shows the frequency of the daily temperature values in the 'airquality' dataset.
> hist(airquality$Temp,col='steelblue',main='Maximum Daily Temperature',xlab='Temperature (degrees Fahrenheit)')
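To see how the continuous temperatures are cut into bins, you can ask hist for the numbers without drawing anything. This is a small sketch using plot = FALSE; temp_hist is just my name for the result.

```r
# Compute the histogram for the temperature column without plotting it.
temp_hist <- hist(airquality$Temp, plot = FALSE)

# Bin edges chosen by hist() and the number of days falling in each bin.
print(temp_hist$breaks)
print(temp_hist$counts)
```

There is always one more break than there are counts, because each bar sits between two neighbouring edges; on a bar chart, by contrast, each bar belongs to exactly one category.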
A heat map is a graphical representation of data in which the individual values contained in a matrix are represented as colours. Fractal maps and tree maps often use a similar system of colour-coding to represent the values taken by a variable in a hierarchy. There are multiple ways to build heat maps depending on the language used; here I focus on R, which uses colour intensity to visualise relationships between multiple variables. The heat map is two-dimensional, with an X and a Y axis, which makes it easy to grasp and interpret the relationship between the variables under study. As a basic example, a heat map can highlight the popularity of competing items by ranking them according to their original market launch date, and then break that down further with sales figures over the course of time.
# simulate a dataset of 10 points
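One way the simulation hinted at in the comment above could look is sketched here. This is my own assumption of the setup, not the original code: ten simulated points measured on four made-up variables, drawn with base R's heatmap function. The seed, the names point1..point10 and var1..var4, and the colour palette are all arbitrary choices.

```r
# Reproducible random numbers for the illustration.
set.seed(42)

# 10 simulated points (rows) on 4 hypothetical variables (columns).
sim <- matrix(rnorm(10 * 4), nrow = 10,
              dimnames = list(paste0("point", 1:10), paste0("var", 1:4)))

# Draw the heat map; rows and columns are reordered by hierarchical clustering.
hm <- heatmap(sim, scale = "column", col = heat.colors(16),
              main = "Simulated data")

# Row order used in the plot.
print(hm$rowInd)
```

heatmap clusters rows and columns by default, so similar points end up next to each other and blocks of similar colour become easy to spot.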
A correlogram is an image of correlation statistics. For example, in time series analysis a correlogram, also known as an autocorrelation plot, is a plot of the sample autocorrelations against the time lags. If cross-correlation is used, the result is called a cross-correlogram. Correlated data is well suited to visualisation with corrplot. Many correlograms highlight the amount of correlation between datasets at various points in time; comparing sales data between different months or years is a basic example. Below I use the same mtcars dataset I used for the bar plots to show the correlogram in RStudio.
> corr_matrix <- cor(mtcars)
> corrplot::corrplot(corr_matrix,method = 'number',type ='lower')
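If the corrplot package is not installed, the numbers behind the plot can still be inspected with base R alone. The sketch below rebuilds the same correlation matrix and looks up the most strongly correlated pair of variables; the names lower, idx and strongest are mine.

```r
# Pairwise correlations between all columns of mtcars.
corr_matrix <- cor(mtcars)

# Correlation of fuel economy with weight, horsepower and cylinder count.
print(round(corr_matrix["mpg", c("wt", "hp", "cyl")], 2))

# Keep only the lower triangle, mirroring type = 'lower' in corrplot.
lower <- corr_matrix
lower[upper.tri(lower, diag = TRUE)] <- NA

# Locate the pair with the largest absolute correlation.
idx <- which(abs(lower) == max(abs(lower), na.rm = TRUE), arr.ind = TRUE)
strongest <- c(rownames(lower)[idx[1, "row"]], colnames(lower)[idx[1, "col"]])
print(strongest)
```

Blanking the upper triangle and the diagonal avoids reporting each pair twice or picking up the trivial correlation of a variable with itself.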