Many Ways of Reading Data Into R — 1

Vivekanandan Srinivasan
Published in Analytics Vidhya
12 min read · Nov 30, 2019

This series is intended for data science beginners who are looking for a quick introduction to data science using R. It covers all the basics to advanced concepts in statistics and machine learning necessary to build data science projects.

Whenever someone wants to start a career in data science, a vast number of online resources is available. But most of that content is fragmented, offering bits and pieces. This series is written to give a comprehensive introduction to all the important topics in statistics and machine learning using R.

This series assumes you know the basics of R programming, including R data structures, control statements, loops and writing R functions. If you are new to R and looking for the basics, kindly go through the following series where we discussed the fundamentals — Basics of R — 1

The contents of this article are drawn from a couple of books I was introduced to during my IIM-B days.

R for Everyone — Jared P. Lander

Practical Data Science with R — Nina Zumel & John Mount

All the code blocks discussed in the article are available as R Markdown in the GitHub link.

The very first step in any data science project is to load the data into the environment. As with everything in R, there are numerous ways to get data. This article focuses entirely on different ways of loading data into the R environment. The following are some of the most common ways of doing it.

  • Reading from CSV data
  • Reading from database
  • R binary files
  • Data included with R
  • Data from other statistical tools
  • Extract data from websites
  • Reading data from AWS S3

Reading From CSV Data

The best way to read data from a CSV file is to use read.table. It might be tempting to use read.csv, but that is more trouble than it is worth, and all it does is call read.table with some arguments preset. The result of read.table is a data.frame.

Any CSV will work but for explanatory purposes, let’s use an incredibly simple CSV at http://www.jaredlander.com/data/Tomato%20First.csv. Let’s read that data into R using read.table.

url <- "http://www.jaredlander.com/data/Tomato%20First.csv"
tomato <- read.table(file = url, header = TRUE, sep = ",")
head(tomato)

The second argument, header, indicates that the first row of the data holds the column names. The third argument gives the delimiter separating the data cells. For example, changing this to "\t" or ";" allows it to read tab-separated or semicolon-separated files.

One often overlooked argument that is helpful to use is stringsAsFactors. Setting this to FALSE (the default is TRUE) prevents character columns from being converted into factor columns. This saves computation time — the savings can be dramatic for large data with many character columns holding many unique values.

There are numerous other arguments to read.table, the most useful being quote and colClasses, specifying the character used for enclosing cells and the data type for each column, respectively.
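Putting these arguments together, a call might look like the following sketch; the file name and column types here are assumptions for illustration, not part of the original example:

```r
# Hypothetical file and column layout, shown only to illustrate the arguments
tomatoTyped <- read.table(
  file = "tomato.csv",
  header = TRUE,
  sep = ",",
  quote = "\"",              # character used to enclose cells
  stringsAsFactors = FALSE,  # keep character columns as character
  colClasses = c("character", "numeric", "numeric")  # one type per column
)
```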

Sometimes CSVs are poorly built, where the cell separator has been used inside a cell. In such cases read.csv2 or read.delim2 should be used instead of read.table. R’s built-in read.table command can be made to read most separated-value formats.

Large files can be slow to read into memory using read.table, and fortunately, there are alternatives available. The two most prominent functions for reading large CSVs — and other text files — are read_delim from the readr package by Hadley Wickham and fread from the data.table package by Matt Dowle. Both are very fast, and neither converts character data to factors automatically.

The readr package provides a family of functions for reading text files. The most commonly used will be read_delim for reading delimited files such as CSVs. Its first argument is the full filename or URL of the file to be read. The col_names argument is set to TRUE by default to specify that the first row of the file holds the column names.

library(readr)
theUrl <- "http://www.jaredlander.com/data/TomatoFirst.csv"
tomato2 <- read_delim(file = theUrl, delim = ',')

When read_delim is executed, a message is displayed that shows the column names and the type of data they store. The data can be displayed using head. read_delim, and all the data-reading functions in readr, return a tibble, which is an extension of data.frame. The most obvious visual change is that metadata is displayed such as the number of rows and columns and the data types of each column. tibbles also intelligently only print as many rows and columns as will fit on the screen.

There are helper functions in readr that are wrappers around read_delim with specific delimiters preset, such as read_csv and read_tsv.
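For the tomato CSV above, the wrapper lets us drop the delim argument:

```r
library(readr)
theUrl <- "http://www.jaredlander.com/data/TomatoFirst.csv"
# read_csv presets delim = "," — equivalent to the read_delim call above
tomato2csv <- read_csv(file = theUrl)
```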

Another option for reading large data quickly is fread from the data.table package. The first argument is the full filename or URL of the file to be read. The header argument indicates that the first row of the file holds the column names and sep specifies the field delimiter. This function has a stringsAsFactors argument that is set to FALSE by default.

library(data.table)
theUrl <- "http://www.jaredlander.com/data/TomatoFirst.csv"
tomato3 <- fread(input = theUrl, sep = ',', header = TRUE)

This is also faster than read.table and results in a data.table object, another special object that extends and improves upon data.frame. Both read_delim and fread are fast, capable functions, so the choice between them depends on whether dplyr or data.table is preferred for data manipulation.

Reading From Excel

Excel may be the world’s most popular data analysis tool, and while that has benefits and disadvantages, it means that R users will sometimes be required to read Excel files. Fortunately, for anyone tasked with using Excel data, the package readxl, by Hadley Wickham, makes reading Excel files, both .xls and .xlsx, easy.

The main function is read_excel, which reads the data from a single Excel sheet. Unlike read.table, read_delim and fread, read_excel cannot read data directly from the Internet, so the files must be downloaded first. We could do this in a browser, or we can stay within R and use download.file.

download.file(url = 'http://www.jaredlander.com/data/ExcelExample.xlsx',
              destfile = '../data/ExcelExample.xlsx', method = 'libcurl')

After we download the file we check the sheets in the Excel file.

library(readxl)
excel_sheets('../data/ExcelExample.xlsx')

By default read_excel reads the first sheet, in this case, the one holding the tomato data. The result is a tibble rather than a traditional data.frame. Specifying which sheet to read can be done by supplying either the sheet position as a number or the sheet name as a character.

wineXL1 <- read_excel('../data/ExcelExample.xlsx', sheet=2)
head(wineXL1)

Since wineXL1 is a tibble, only the columns that fit on the screen (or in this case the page) are printed. This will vary depending on how wide the terminal is set.
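Selecting a sheet by name works the same way; the sheet name 'Wine' below is an assumption about the example workbook, so check excel_sheets first:

```r
library(readxl)
# Sheet selected by name rather than position (name assumed; verify with excel_sheets)
wineXL2 <- read_excel('../data/ExcelExample.xlsx', sheet = 'Wine')
```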

Reading From Databases

Databases arguably store the vast majority of the world’s data. Most of these, whether Microsoft SQL Server, DB2 or Microsoft Access, provide an ODBC connection. The most popular open-source databases have their own packages, such as RPostgreSQL and RMySQL. Other databases without a specific package can make use of the more generic, and aptly named, RODBC package. Database connectivity can be difficult, so the DBI package was written to create a uniform experience while working with different databases.

Setting up a database is beyond the scope of this article so we use a simple SQLite database although these steps will be similar for most databases. First, we download the database file using download.file.

download.file("http://www.jaredlander.com/data/diamonds.db",destfile = "../data/diamonds.db", mode='wb')

Since SQLite has its own R package, RSQLite, we use that to connect to our database, otherwise, we would use RODBC.

library(RSQLite)

To connect to the database we first specify the driver using dbDriver. The function’s main argument is the type of driver, such as “SQLite” or “ODBC”.

drv <- dbDriver('SQLite')
class(drv)

We then establish a connection to the specific database with dbConnect. The first argument is the driver. The most common second argument is the DSN connection string for the database, or the path to the file for SQLite databases. Additional arguments are typically the database username, password, host and port.

con <- dbConnect(drv, '../data/diamonds.db')
class(con)

Now that we are connected to the database we can learn more about the database, such as the table names and the fields within tables, using functions from the DBI package.

dbListTables(con)
dbListFields(con, name='diamonds')

At this point, we are ready to run a query on that database using dbGetQuery. This can be any valid SQL query of arbitrary complexity. dbGetQuery returns an ordinary data.frame, just like any other. Fortunately, dbGetQuery has the stringsAsFactors argument. Again, setting this to FALSE is usually a good idea, as it will save processing time and keep character data as character.

longQuery <- "SELECT * FROM diamonds, DiamondColors WHERE diamonds.color = DiamondColors.Color"
diamondsJoin <- dbGetQuery(con, longQuery, stringsAsFactors = FALSE)
head(diamondsJoin)

The first step to reading from an ODBC database is to create a DSN. The procedure differs by operating system but should result in a string name for that connection.

While it is not strictly necessary, it is good practice to close the connection using dbDisconnect, although it will close automatically when R closes or when we open another connection using dbConnect. Only one connection may be open at a time.
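Closing the connection from the example above is a single call:

```r
# Close the SQLite connection once the queries are done
dbDisconnect(con)
```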

R Binary Files

When working with other R programmers, a good way to pass around data — or any R objects such as variables and functions — is to use RData files. These are binary files that represent R objects of any kind. They can store one or more objects and can be passed among Windows, Mac or Linux without a problem.

First, let’s create an RData file, remove the object that created it, and then read it back into R.

save(tomato, file= "data/tomato.rdata")
rm(tomato)
# Check that it no longer exists — this should throw an error
head(tomato)

Now let’s load it again from the RData file.

load("data/tomato.rdata")
# Check that it exists now
head(tomato)

Now let’s create multiple objects, store them in a single RData file, remove them and then load them again.

n <- 20
r <- 1:10
w <- data.frame(n,r)
# let's check them out
head(w)
# save them
save(n,r,w, file="data/multiple.rdata")
# delete them
rm(n,r,w)
# now load them back from rdata
load("data/multiple.rdata")
# print out n and r
n
r

Data Included With R

In addition to external data sources, R and some packages come with data included, so we easily have data to use. Accessing these data is simple as long as we know what to look for. ggplot2, for instance, comes with a dataset about diamonds. It can be loaded using the data function as shown below.

require(ggplot2)
data(diamonds)
head(diamonds)

To find the list of available datasets, simply type data() into the console.
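Both base R's datasets and those in other installed packages can be listed:

```r
# Datasets in base R and every attached package
data()
# Datasets in all installed packages, attached or not
data(package = .packages(all.available = TRUE))
```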

Extract Data From Websites

Information is often scattered about in tables, divs, spans or other HTML elements. As an example, we put the menu and restaurant details for Ribalta, a beloved New York pizzeria, into an HTML file. The address and phone number are stored in an ordered list, section identifiers are in spans and the items and prices are in tables.

We use Hadley Wickham’s rvest package to extract the data into a usable format. The file can be read directly from the URL, or from disc, using read_html. This creates an xml_document object that holds all of the HTML.

library(rvest)
ribalta <- read_html('http://www.jaredlander.com/data/ribalta.html')
class(ribalta)
ribalta

By exploring the HTML we see that the address is stored in a span, which is an element of an ordered list. First, we use html_nodes to select all span elements within ul elements.
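That first, more brittle selection can be written as a pipeline on the ribalta object read above:

```r
# Select every span element that sits inside a ul element
ribalta %>%
  html_nodes('ul') %>%
  html_nodes('span')
```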

Following a highly nested hierarchy of HTML elements can be bothersome and brittle, so we instead identify the element of our interest by specifying its class, in this case ’street’. In HTML, class is denoted with a period (.) and ID is denoted with a hash (#). Instead of having html_nodes search for a span, we have it search for any element with class ’street’.

ribalta %>% html_nodes('.street') %>% html_text()

In this particular file, a lot of information is stored in tables with class ’food-items’, so we specify that html_nodes should search for all tables with class ’food-items’. Since multiple tables exist, we specify that we want the fifth table using the extract2 function from the magrittr package. The data is finally extracted and stored in a data.frame using html_table. In this case, the tables do not have headers, so the columns of the data.frame have generic names.

ribalta %>%
html_nodes('table.food-items') %>%
magrittr::extract2(5) %>%
html_table()

Data From Other Statistical Tools

In an ideal world, no tool besides R would ever be needed, but in reality, data are sometimes locked in a proprietary format such as those from SAS, SPSS or Octave. The foreign package provides a number of functions similar to read.table for reading data from other tools.

A partial list of functions for reading data from commonly used statistical tools is given below. The arguments to these functions are generally similar to those of read.table. These functions usually return the data as a data.frame, but they do not always succeed.

  • read.spss — SPSS
  • read.dta — Stata
  • read.ssd — SAS
  • read.octave — Octave
  • read.mtp — Minitab
  • read.systat — Systat

While read.ssd can read SAS data, it requires a valid SAS license. This can be sidestepped by using Revolution R from Revolution Analytics, with its special RxSasData function in the RevoScaleR package.

Reading From AWS S3

One common way to read data from AWS S3 is to use the client package aws.s3, a simple client for the AWS S3 REST API. While other packages connect R to S3, they do so incompletely (mapping only some of the API endpoints to R), and most implementations rely on the AWS command-line tools, which users may not have installed on their system.

This package is not yet on CRAN. To install the latest development version you can install from the cloudyr drat repository.

# latest stable version
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"))
# on windows you may need:
install.packages("aws.s3", repos = c("cloudyr" = "http://cloudyr.github.io/drat"), INSTALL_opts = "--no-multiarch")

Once done, you can access S3 (assuming you’ve got the permissions right) using the functions described in the examples.

If you do not have an access key and secret key, kindly follow the link to get your credentials.

For example, listing buckets as follows worked for me:

Sys.setenv("AWS_ACCESS_KEY_ID" = "XXXXXXXXX",
           "AWS_SECRET_ACCESS_KEY" = "yyyyyyyy",
           "AWS_DEFAULT_REGION" = "us-west-1")
library("aws.s3")
bucketlist()
# To get a listing of all objects in a public bucket
files <- get_bucket(bucket = 'bucket name')

S3 can be a bit picky about region specifications. bucketlist() will return buckets from all regions, but all other functions require specifying a region. A default "us-east-1" is relied upon if none is specified explicitly and the correct region can’t be detected automatically.

Some of the commonly used functions to read objects from S3 and write objects into S3 are as follows.

# save an in-memory R object into S3
s3save(mtcars, bucket = "my_bucket", object = "mtcars.Rdata")
# `load()` R objects from the file
s3load("mtcars.Rdata", bucket = "my_bucket")
# get file as raw vector
get_object("mtcars.Rdata", bucket = "my_bucket")
# alternative 'S3 URI' syntax:
get_object("s3://my_bucket/mtcars.Rdata")
# save file locally
save_object("mtcars.Rdata", file = "mtcars.Rdata", bucket = "my_bucket")
# put local file into S3
put_object(file = "mtcars.Rdata", object = "mtcars2.Rdata", bucket = "my_bucket")

Reading data is the first step of any analysis, which makes it one of the most important steps in a data science project. In this article, we covered some of the most common ways of reading data into the R environment. In the next article, we will see how to analyze the data and produce quality supporting graphics using R.

Statistical Visualization in R — 2

Do share your thoughts and support by commenting and sharing the article among your peer groups.



An analytics professional with over six years of experience spanning across predictive modelling, statistical analysis and big data technologies.