Towards Data Analytics: Data literacy in R
Original article by Ryan Garnett, republished with permission
1.0 Purpose
Data is everywhere, and is a critical component in every facet of work. Understanding data is essential for evidence-based decision making. Within most organizations Excel is King, the application that is used to collect, share, manipulate, analyse, and communicate information from data. While great for many things, Excel has limitations related to data analytics. This story aims to demystify the perceived difficulty associated with data analytics. Focusing on understanding data, and introducing the principal elements that data practitioners face daily, the hands-on approach will help those looking to improve their analysis abilities beyond using Excel.
1.1 Learning Outcomes
The story will focus solely on using RStudio, an open source industry leader for data analytics. Specific focus areas will include:
- Importing data
- Exploring data
- Working with data
- Performing calculations
- Visualizations
- Next steps in the RStudio journey
1.2 Data within RStudio
RStudio has the ability to work with a wide range of data types and formats. Structured data capabilities within RStudio include:
- reading files (csv, xls, etc.)
- connecting to databases
- pulling from an application programming interface (API)
- reading geospatial data
Data imported into RStudio is saved as an object in memory, meaning it is not a physical file. In-memory objects allow for increased speed during data analysis. The objects are stored within RStudio and can be referenced, manipulated, queried, or visualized at any time. Assigning data to an object is performed as follows:
object <- source or action
exampleData <- read_csv("C:/Temp/sourceData.csv")
The <- operator acts as an assignment, for example “using the source csv file, assign it to the object exampleData”. Within this story additional examples of assigning data to an object will be covered.
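As a minimal sketch of assignment that runs without any source file, the following uses base R only. The object names and values are made up for illustration.

```r
# Assign a single value to an object (names here are illustrative)
answer <- 42

# Assign a vector of values to an object
temperatures <- c(18.5, 21.0, 19.2)

# Objects stay in memory and can be referenced or reused at any time
meanTemperature <- mean(temperatures)
print(meanTemperature)

# ls() lists the objects currently held in memory
print(ls())
```

In RStudio the same objects also appear in the Environment pane once they are assigned.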
1.3 Understanding Data
RStudio has five (5) main data types:
- character (text, string, etc.)
- complex
- integer (number without decimals)
- logical (boolean)
- numeric (number including those with decimals)
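The five types listed above can be checked with the base R function class(). This is a small sketch; the object names and values are illustrative only.

```r
# One example value for each of the five main data types
textValue    <- "Toronto"   # character
complexValue <- 2 + 3i      # complex
integerValue <- 7L          # integer (the L suffix requests an integer)
logicalValue <- TRUE        # logical
numericValue <- 3.14        # numeric (stored as a double)

# class() reports the data type of each object
typeNames <- c(
  class(textValue),
  class(complexValue),
  class(integerValue),
  class(logicalValue),
  class(numericValue)
)
print(typeNames)
# "character" "complex" "integer" "logical" "numeric"
```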
Dates are a special sub type of characters or doubles. There are a number of different date formats available within RStudio.
Data analysis methods depend on the data type. Some analysis methods are not available for every data type.
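To illustrate how a date differs from plain text, here is a short sketch using only base R. The date values are arbitrary examples, and the format codes shown are just two of the many layouts as.Date() can read.

```r
# A date typed as text is only a character string at first
dateText <- "2019-04-17"
textClass <- class(dateText)                            # "character"

# as.Date() converts text to a Date; format describes how the text is laid out
isoDate <- as.Date(dateText)                            # year-month-day is the default
dmyDate <- as.Date("17/04/2019", format = "%d/%m/%Y")   # day/month/year
dateClass <- class(isoDate)                             # "Date"

# Internally a Date is a number of days since 1970-01-01
storageType <- typeof(isoDate)                          # "double"

# Date arithmetic works once the value is stored as a Date
daysBetween <- as.numeric(as.Date("2019-04-24") - isoDate)   # 7

# format() writes the Date back out in a chosen layout
asDayMonthYear <- format(isoDate, "%d/%m/%Y")           # "17/04/2019"
```

The same subtraction attempted on the two character strings would fail, which is one example of an analysis method depending on the data type.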
1.3.1 Tidy Data
The concept of tidy data is to provide a consistent organization and structure to data that reduces the time required for data cleaning, allowing more time for data analysis. Tidy datasets provide benefit through their structure, making the data easier to manipulate, model and visualize. The data are arranged such that each variable is a column and each observation is a row.
There are three interrelated rules which make a dataset tidy:
- each variable must have its own column
- each observation must have its own row
- each value must have its own cell
(Tidy information, definitions, and graphics referenced from: https://r4ds.had.co.nz/tidy-data.html)
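The three rules can be sketched with a small example. The table below is made up for illustration: the year variable is spread across column names, so it is reshaped with pivot_longer() from the tidyr package (part of the tidyverse) so that each variable has its own column.

```r
library(tidyr)

# An untidy table: the year variable is spread across column names
# (city names and counts are made-up illustration values)
untidy <- data.frame(
  city   = c("Toronto", "Ottawa"),
  `2018` = c(10, 4),
  `2019` = c(12, 5),
  check.names = FALSE
)

# Gather the year columns into one "year" column and one "count" column
tidy <- pivot_longer(
  untidy,
  cols = c("2018", "2019"),
  names_to = "year",
  values_to = "count"
)
print(tidy)
```

After reshaping, each row is one observation (a city in a year) and each value sits in its own cell.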
2.0 Setting up the Environment
RStudio, similar to other software applications, has the capability to extend functionality via add-ins. Within the world of R, add-ins are called “packages”. A package is similar to a library in other programming languages (e.g. Python), or extensions in Microsoft Excel.
As of April 17, 2019, there were 13,802 packages available on the official repository. A package must be installed before it can be used, which is done with the following line of code: install.packages("packageName"), e.g. install.packages("tidyverse"). Once a package is installed, it must be loaded using a command called library. Within the story we will be using three packages: tidyverse, esquisse, and DataExplorer.
tidyverse: The tidyverse is a collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures. The core tidyverse includes the packages that you’re likely to use in everyday data analyses. As of tidyverse 1.2.0, the following packages are included:
- ggplot2
- dplyr
- tidyr
- readr
- purrr
- tibble
- stringr
- forcats
esquisse: The purpose of this plugin is to let you explore your data quickly to extract the information they hold. You can only create simple plots, you won’t be able to use custom scales and all the power of ggplot2.
This package allows you to interactively explore your data by visualizing it with the ggplot2 package. It allows you to draw bar graphs, curves, scatter plots, histograms, then export the graph or retrieve the code generating the graph.
DataExplorer: Automated data exploration process for analytic tasks and predictive modeling, allowing users to focus on understanding data and extracting insights. The package scans and analyzes each variable, and visualizes them with typical graphical techniques. Common data processing methods are also available to treat and format data.
2.1 Packages
The following code snippets illustrate how to access packages within RStudio.
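A minimal sketch of installing and loading the three packages used in this story. It assumes an internet connection for the install step; the install lines are commented out so they only run when needed.

```r
# Install once per machine; uncomment the line for any package that is missing
# install.packages("tidyverse")
# install.packages("esquisse")
# install.packages("DataExplorer")

# Load the packages in each new R session before using their functions
library(tidyverse)
library(esquisse)
library(DataExplorer)
```

Installing is a one-time step, while library() has to be called again whenever RStudio is restarted.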
2.2 Import Data
The following code snippets illustrate how to import a .CSV file within RStudio.
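A minimal sketch of a CSV import with read_csv() from the readr package (part of the tidyverse). So that the example runs on its own, it first writes a small made-up CSV to a temporary file; in practice csvPath would be the path to your own file.

```r
library(readr)

# Write a small CSV to a temporary file so the example is self-contained
# (hypothetical data; replace csvPath with the path to your own file)
csvPath <- tempfile(fileext = ".csv")
writeLines(
  c("department,employees,startDate",
    "Finance,12,2019-01-15",
    "Networking,8,2018-11-01"),
  csvPath
)

# Read the file and assign the result to an in-memory object
exampleData <- read_csv(csvPath)
print(exampleData)
```

read_csv() guesses a data type for each column and reports its guesses when the file is read, which is worth checking before any analysis.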
3.0 Explore Data
Within analytics one of the first steps once a dataset is imported is to explore the data elements. Data exploration can be performed within a table or through visualization, which will be discussed later in the story. The purpose of exploring the data is to find anomalies or trends within the dataset. The findings from the data exploration exercise will influence data cleaning, model development and the creation of data visualizations.
There are many different approaches to exploring data, and depending on the purpose (e.g. data cleaning, model development, data visualization, etc.) the techniques will differ. A general rule for data exploration is to:
- A: look at the structure and composition of the entire dataset
- B: look at the elements and records in each column
The following two tables outline techniques for performing data exploration at the dataset and column level.
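A minimal sketch of both levels of exploration, using base R functions and the airquality dataset that ships with R. These are common choices and not necessarily the exact techniques the original tables listed.

```r
# airquality ships with R, so no import is needed for this sketch
data(airquality)

# A: structure and composition of the entire dataset
datasetSize <- dim(airquality)   # number of rows and columns
print(datasetSize)
str(airquality)                  # column names, data types, first values
print(head(airquality, 5))       # first five records
print(summary(airquality))       # summary statistics for every column

# Count the missing values in every column
missingPerColumn <- colSums(is.na(airquality))
print(missingPerColumn)

# B: elements and records within a single column
monthValues <- unique(airquality$Month)   # distinct values in the column
monthCounts <- table(airquality$Month)    # number of records per value
tempRange   <- range(airquality$Temp)     # smallest and largest value
print(monthCounts)
```

Columns with many missing values, or values outside an expected range, are typical findings that feed into the data cleaning step.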
4.0 Cleaning Data
When date information is imported, it may be converted to a different type (e.g. integer or string) depending on how it was entered in the data source. lubridate is a package developed specifically for working with date information. To take advantage of date analysis functions, the information must be stored as a date value.
Typical data cleaning tasks:
- correct mistakes (e.g. spelling)
- change character cases (e.g. upper case to lower case)
- convert data types (e.g. character to numeric, character to date, etc.)
- identify missing values
- populate missing values
- tidy data (e.g. split a single column into multiple columns, such as “I&T - Networking” into two columns, “I&T” and “Networking”)
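Several of the tasks above can be sketched in one short pipeline with dplyr and tidyr. The data frame, column names, and the choice to fill a missing budget with 0 are all assumptions made for illustration; the right fill value depends on the dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical raw data with typical quality problems
raw <- data.frame(
  team   = c("I&T - Networking", "I&T - SECURITY", "I&T - networking"),
  budget = c("1200", "950", NA),
  stringsAsFactors = FALSE
)

cleaned <- raw %>%
  # tidy: split one column into two at the " - " separator
  separate(team, into = c("division", "unit"), sep = " - ") %>%
  mutate(
    unit = tolower(unit),                        # change character case
    budget = as.numeric(budget),                 # convert character to numeric
    budgetMissing = is.na(budget),               # identify missing values
    budget = if_else(is.na(budget), 0, budget)   # populate missing values (0 is an assumption)
  )

print(cleaned)
```

Keeping a flag column such as budgetMissing records which values were filled in, so the substitution stays visible later in the analysis.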
Ensuring data quality is a large and varied process, typically requiring multiple steps. In-depth data cleaning is outside the scope of this story, but a common data type conversion will be explored.
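A common conversion is character to date. The sketch below uses the lubridate package with made-up date strings; the parser to use (ymd, mdy, or dmy) depends on the order in which the source data stores year, month, and day.

```r
library(lubridate)

# Dates imported as text (hypothetical values)
startText <- c("2019-01-15", "2018-11-01")

# ymd() parses year-month-day text into Date values
startDate <- ymd(startText)
print(class(startDate))    # "Date"

# Other orderings have matching parsers
mdyDate <- mdy("04/17/2019")   # month/day/year
dmyDate <- dmy("17/04/2019")   # day/month/year

# Date functions are available once the values are stored as dates
startYear  <- year(startDate)    # 2019 2018
startMonth <- month(startDate)   # 1 11
print(wday(mdyDate, label = TRUE))   # day of the week
```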
5.0 Working with Data — Filters
When working with data, a common task is to limit the number of records to analyze. A common approach to achieve this is filtering data based on some value, for example “show me only records that are greater than some value”, or “only records that are equal to some value”. When filtering data it is important to remember the data type, as some operators do not work with specific data types. The following table outlines common filtering operations.
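A minimal sketch of common filtering operations with filter() from dplyr. The staff data frame is made up for illustration, and the operators shown are a typical selection rather than the full set from the original table.

```r
library(dplyr)

# Hypothetical data for illustration
staff <- data.frame(
  name = c("Ana", "Ben", "Chloe", "Dev"),
  department = c("Finance", "Networking", "Finance", "Legal"),
  salary = c(52000, 61000, 75000, 58000),
  stringsAsFactors = FALSE
)

highEarners <- filter(staff, salary > 60000)            # greater than
financeOnly <- filter(staff, department == "Finance")   # equal to
notFinance  <- filter(staff, department != "Finance")   # not equal to

# Value is one of a set
selected <- filter(staff, department %in% c("Finance", "Legal"))

# Conditions separated by a comma must all be true
combined <- filter(staff, department == "Finance", salary >= 60000)

print(combined)
```

Comparison operators such as > and >= suit numeric and date columns, while == and %in% are the usual choices for character columns.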
6.0 Calculations
Creating new values is common when working with data; in the data analytics domain it is referred to as “feature engineering”. The creation of new values will typically result in a new column. mutate is a function in dplyr, which is part of the tidyverse, that adds new variables and preserves existing columns. New values can be calculated using a range of operators, such as arithmetic operators (+, -, *, /), properties (length, nchar, class), conditionals (if_else, recode, case_when), etc.
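A minimal sketch of mutate() with one example of each kind of calculation mentioned above. The orders data frame and the thresholds are made up for illustration.

```r
library(dplyr)

# Hypothetical data for illustration
orders <- data.frame(
  item = c("desk", "chair", "lamp"),
  quantity = c(2, 10, 4),
  unitPrice = c(250, 45, 30),
  stringsAsFactors = FALSE
)

orders <- orders %>%
  mutate(
    total = quantity * unitPrice,                        # arithmetic operator
    nameLength = nchar(item),                            # property of a value
    orderType = if_else(quantity >= 5, "bulk", "standard"),   # two-way conditional
    priceBand = case_when(                               # multi-way conditional
      unitPrice >= 100 ~ "high",
      unitPrice >= 40  ~ "medium",
      TRUE             ~ "low"
    )
  )

print(orders)
```

The original three columns are preserved and four new columns are added to the right of them.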
7.0 Visualizations
Visualizations are an essential component of data analytics. There are a number of different visualization packages available; however, ggplot2, which is part of the tidyverse, is the most commonly used visualization package within RStudio. ggplot2 is a powerful package that allows for customized visualization products. ggplot2 uses the “grammar of graphics” syntax, which can be difficult to learn. For this reason the story will use a package called esquisse, which is a drag-and-drop graphical user interface for ggplot2.
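For reference, this is a sketch of the kind of ggplot2 code that a drag-and-drop session produces. It uses the mtcars dataset that ships with R; the choice of columns and labels is an assumption made for illustration. The commented line shows how the esquisse interface can be opened on the same data.

```r
library(ggplot2)

# mtcars ships with R; wt is vehicle weight and mpg is fuel economy
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    title = "Fuel economy by vehicle weight",
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

print(p)

# To build a similar plot by drag and drop, open the esquisse interface:
# esquisse::esquisser(mtcars)
```

The plot is built in layers joined with +: the data and axis mappings first, then the point geometry, then the labels.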
8.0 Resources
The following links are great resources for transitioning into data analytics using R: