How to write a good statistical report for free (and make your boss happy).

Over the last few years I worked at a multinational company in a role between logistics/warehouse support and IT, so I had data, data and more data to analyze and convert into useful KPIs and charts.

At the same time I was following the Coursera Data Science specialization, a 9-module program created by Johns Hopkins University and full of useful information about statistics, data science, machine learning and more. I had always been curious about the R language, and in this course I used it in depth with the help of the RStudio environment.

This document is meant as a quick tutorial on what you need to set up the platform and become productive within hours or days. It is based on my personal experience; if you have suggestions, please write a comment!

Software and hardware

As a first step you can use your personal computer with RStudio Desktop, a very useful IDE specially designed for R and statistical work. To install it you will need R itself and then RStudio Desktop.

If you have access to a shared folder, it will be useful for easily sharing scripts and results.

If you have space on an intra-factory server and can install applications on it, it may be time to try RStudio Server: it is the same as the desktop version, but it runs on the server and you access it from your browser.

Not bad at all!

The developing process

To develop a complex report and make it work I followed these steps.

Create the scripts

The first step is to download and import data from the various information systems and convert it into a format R can use. In my case I needed to process data from:

  • CSV files downloaded from some shared folders
  • HTML tables served by some PHP pages
  • Query results from some remote MySQL databases

R can easily manage all of this and much more with the libraries provided by the CRAN archive.

The main idea in this step is to write several scripts (perhaps one per system or per kind of data), each able to perform these steps:

  • download the data needed,
  • convert the data into a data.frame and adjust the format of the variables (e.g. a timestamp has to be converted into a native time format),
  • clean, convert and reformat any data that needs it (e.g. convert …),
  • summarize the dataset if needed (e.g. calculate the sum by month),
  • save all the datasets in the folder “data”, so that the report can load its data from a common place.
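The steps above can be sketched with base R alone. This is only a minimal illustration, not my actual script: the column names (timestamp, quantity), the sample rows and the use of a temporary folder are assumptions made so that the example runs anywhere.

```r
# Hypothetical raw export; a real script would download it from a shared folder.
raw_file <- tempfile(fileext = ".csv")
writeLines(c(
  "timestamp,quantity",
  "2017-01-05 08:30:00,10",
  "2017-01-20 14:00:00,5",
  "2017-02-02 09:15:00,7"
), raw_file)

# Read the data into a data.frame.
movements <- read.csv(raw_file, stringsAsFactors = FALSE)

# Convert the timestamp text into a native date-time type.
movements$timestamp <- as.POSIXct(movements$timestamp,
                                  format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Summarize: total quantity per month.
movements$month <- format(movements$timestamp, "%Y-%m")
monthly <- aggregate(quantity ~ month, data = movements, FUN = sum)

# Save the result in a common "data" folder (placed under a temp dir here).
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
save(monthly, file = file.path(data_dir, "dataset_1.Rdata"))
```

A report can later call load() on the saved file and find the monthly data.frame ready to use.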

Create the reports

The output has to be a single file, from which images can be extracted if needed; we can use the knitr package to generate HTML reports.

The report is a document made of text in Markdown format and pieces of code (called chunks) that print formatted data, tables, charts, etc. R Markdown also lets us pass arguments to the report (e.g. use the same report to visualize warehouse movements between two dates passed as arguments).

The report can have a structure similar to the following skeleton:

---
title: "My useful report"
author: "Valerio Vaccaro"
date: "Jan 2017"
output:
  html_document:
    fig_height: 10
    fig_width: 14
params:
  today: !r "2017-01-01"
---
```{r setup, include=FALSE}
setwd("/our_path/our_folder")
# load libraries
library(…)
# load data
load(file="dataset_1.Rdata")
# elaborate data
```
## Executive summary
## Chapter 1
```{r ggplot_chart_1, echo=FALSE}
```
## Chapter 2
```{r ggplot_chart_2, echo=FALSE}
```
## Annex

In which we can recognize:

  • the code for loading the datasets
  • the code for processing the loaded data
  • an introduction/executive summary

Plus the text with:

  • charts generated with the ggplot2 package
  • tables generated with the xtable package
  • other information generated by the scripts
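Inside a chunk, the arguments declared under params in the header are available through the params list, so a chunk can filter the data for the requested day. The sketch below only imitates that situation: the params list is created by hand, and the movements data.frame with its day and pallets columns is a hypothetical stand-in for a loaded dataset.

```r
# Stand-in for the `params` list that R Markdown builds from the YAML header.
params <- list(today = "2017-01-01")

# Hypothetical dataset; in the real report it would come from load().
movements <- data.frame(
  day = as.Date(c("2016-12-30", "2016-12-31", "2017-01-01", "2017-01-01")),
  pallets = c(12, 7, 5, 9)
)

# Keep only the rows for the day passed as an argument.
report_day <- as.Date(params$today)
todays_movements <- movements[movements$day == report_day, ]
total_pallets <- sum(todays_movements$pallets)

# Build a chart object only if ggplot2 is installed (a chunk would print it).
if (requireNamespace("ggplot2", quietly = TRUE)) {
  chart <- ggplot2::ggplot(movements, ggplot2::aes(x = day, y = pallets)) +
    ggplot2::geom_col()
}
```

In a real report the first assignment is not needed, because knitr provides params automatically when the document is rendered.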

Automagically make it work

When the scripts and the report are ready, we usually want to run everything automatically, generating the report and saving it in a shared folder. We may also want to repeat the same approach for the documents of the last 7 days, updating them all in a single shot; to do this we can use the following code.

# set the working folder
setwd("/our_path/our_folder/")
# call the update scripts
source("/our_path/our_folder/update_dataset_1.R")
source("/path/update_dataset_2.R")
# calculate the dates of the last 7 days (plus today) in order to iterate on them
today <- Sys.Date()
dates <- seq(today - 7, today, by = "day")
for (i in seq_along(dates)) {
  date <- as.character(dates[i])
  # generate the document by calling the R Markdown render function
  # and passing the date as an argument
  rmarkdown::render("my_report.Rmd", params = list(today = date))
  # rename the generated document
  file.rename("my_report.html", paste0(date, "_my_report.html"))
}
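As a possible variant of the loop above, rmarkdown::render also accepts an output_file argument, so the rename step can be skipped. This is just a sketch: the can_render guard is my own addition so the snippet does nothing harmful when the rmarkdown package or my_report.Rmd is missing.

```r
# Build the dates and output file names for the previous 7 days plus today.
today <- Sys.Date()
dates <- seq(today - 7, today, by = "day")
output_files <- paste0(format(dates, "%Y-%m-%d"), "_my_report.html")

# Render only when rmarkdown and the report source are actually available.
can_render <- requireNamespace("rmarkdown", quietly = TRUE) &&
  file.exists("my_report.Rmd")

for (i in seq_along(dates)) {
  if (can_render) {
    # Write each report directly to its dated file name.
    rmarkdown::render("my_report.Rmd",
                      output_file = output_files[i],
                      params = list(today = as.character(dates[i])))
  }
}
```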

We only need to call Rscript with this script as its argument from cron to automagically get our report (maybe one hour before we reach the workplace :)), make the boss very happy and make our life easier!