DAY 6: Getting data in R for exploratory data analysis

SaiGayatri Vadali
BuzzRobot
Published in
8 min readDec 27, 2017

This article is sixth one in the series “Getting started with data science in 30 days using R programming”. You can find all the other articles here.

Data comes in many forms and formats. Hence, data collection forms an extremely important step in data science. Today let’s see most common files and methods of getting data from them in R. The following are the most commonly available files from which we collect data.

  1. Flat files
  2. Excel files
  3. MySql files
  4. JSON objects
  5. text files
  6. HTML and XML files
  7. Other files and web scrapping

Installing packages and getting the workspace ready!!

Before, we get along this article, it is needed to understand that we don’t have many of the packages given below in our system or workspace.

Installing package from CRAN:

For this, you need to use following command in your R console to get the packages.

>install.packages("<packagename")

The names of packages can be case sensitive.

While this step is enough to be done once in life time, we always need to load the packages into our workspace before using them.

Loading the package into workspace:

This is like ‘import’ statement in Python. It can done as follows

>library("<package name>")

Now, let us see how we can read data from different files in R.

FLAT FILES:

Flat files include comma separated, text, html files . read.table() , read.csv(), read.csv2() functions are used to read data from them.

read.table() :

read.table() is the used to import data from a file in table format. It returns a data frame. The important arguments of this function include the file name, separator and header. ‘Header’ indicates the first row of the data which gives the names of attributes. If it is set to ‘FALSE’, the header column is removed.

Here is the basic syntax of using it.

mydata <- read.table("tabulardata.txt")

The CSV files:

Comma separated files are the most common variant of flat files.

They are imported using read.csv() apart from read.table() .

read.csv() also has similar syntax as that of read.table() and returns a data frame. You can set header to ‘FALSE’ or ‘TRUE’ even in it. You can explicitly set ‘sep’ argument to ‘,’ too.

> filename <-read.csv("/Users/saigayatri/name.csv")
> filename
John
1 Jack
2 John "Da Man"
3 Stephen
4
5 Joan "the bone", Anne
6
Doe X120.jefferson.st.
1 McGinnis 220 hobo Av.
2 Repici 120 Jefferson St.
3 Tyler 7452 Terrace "At the Plaza" road
4 Blankman
5 Jet 9th, at Terrace plc
6
Riverside NJ X08075
1 Phila PA 9119
2 Riverside NJ 8075
3 SomeTown SD 91234
4 SomeTown SD 298
5 Desert City CO 123
6 NA

read.csv2() is another variant used when the delimiter is a semi-colon -‘;’ .

Apart from above variants, there is another way of reading flat files too which is rather fast compared to inbuilt methods of R. It is provided by readr package which can give useful results during certain times.

>library(readr)
>my_data <- read_csv("filename.csv",col_types="ccccnnnnn")

col_types arguments gives the classes of all the columns of the dataset. For example the first letter ‘c’ indicates character type of column1.

Excel files:

Here, I am providing three ways to read excel files.

XLConnect package :

XLConnect is a java based solution.It yields satisfactory results with small data sets but may be slow with larger ones.

>library(XLConnect)
>df <- readWorksheetFromFile("<file name and extension>",
sheet=1,
startRow = 4,
endCol = 2)

xlsx package :

xlsx package provides read.xlsx() or read.xlsx2() functions to read data into R.

read.xlsx():

read.xlsx() is good for smaller data sets and is similar to read.table() in terms of arguments. It is to be noted that sheet index or sheet name are mandatory while using read.xlsx()

> library(xlsx)
>df <- read.xlsx("<filename >",
sheetIndex = 1)

read.xlsx2():

This function is good for larger data sets and provides faster computation than the other two.

>df <- read.xlsx2("<filename>", 
sheetIndex = 1,
startRow=2,
colIndex = 2)

MySql files:

In R, we have ‘RMySQL’ package to perform operations on MySql databases. We can connect to already existing databases in our system using ‘dbConnect’ function.

Once library is loaded into workspace, create the connection as follows

>mydatabase = dbConnect(MySQL(), user='user', password='password', dbname='database_name', host='host')

We can create tables in data base as follows

dbWriteTable(mydatabase, name='table1', value=data.frame.name)

Now, We can retrieve data from the table as follows

rs = dbSendQuery(mydatabase, 'select * from the table1')

We need to use fetch() function to convert results into a dataframe

data = fetch(rs, n=-1)

Let’s see an example :

I created a database namely ‘students’ from MySQLWorkbench. Now let’s see how to connect with this database and perform operations on it:

#connect with the database students
mydatabase = dbConnect(MySQL(), user='root', password='1234', dbname='students', host='localhost')
#get all tables in the database using dbListTables(mydatabase)
dbListTables(mydatabase)
[1] "mytable" "student"
#retrieve data from the student table
rs = dbSendQuery(mydatabase, 'select * from student')
data = fetch(rs, n=-1)
print(data)
user pwd email phno
1 123 temp1234 kesh@123 1234
2 123456 1234 1234 22222
3 123456 121232343 12312342342 22222
4 123456 12312432 jahsdf 22222
5 123456 ftgy jahsdf 22222
6 123456 dfefvgf ahsdf@123 22222
7 123456 svfsf ahsdf@123 22222
8 123456 rgaehetH ahsdf@123 22222
9 123456 hdhgb ahsdf@123 22222
10 123 12345678 fadf 222
11 123 12345678 fadf 222
12 11111 ngdj thweweh 1111
#let's convert a dataframe into mysql table
#Glimpse of dataframe
head(df)
YEAR Y W R L K
1 1948 1.214 0.243 0.1454 1.415 0.612
2 1949 1.354 0.260 0.2181 1.384 0.559
3 1950 1.569 0.278 0.3157 1.388 0.573
4 1951 1.948 0.297 0.3940 1.550 0.564
5 1952 2.265 0.310 0.3559 1.802 0.574
6 1953 2.731 0.322 0.3593 1.926 0.711
dbWriteTable(mydatabase, name='Newtable', value=df)
[1] TRUE
dbListTables(mydatabase)
[1] "Newtable" "mytable" "student"
#successfully added

json files:

In order to read JSON files, you need to add ‘rjson’ package to your workspace. If you don’t have it already in your packages, you can get it with the following code snippet.

> install.packages("rjson")

After installing rjson package, load it into workspace with library(“rjson”). Then with ‘fromJSON()’ function, we can load the JSON file.

>library("rjson")
> json_R_object <- fromJSON(file="/Users/saigayatri/Documents/r30.json")
> print(json_R_object)

Text files:

Text files can be read using read.table() just like how we read csv and tab delimited files. Here is a code snippet explaining the same.

> text_data <-read.table("/Users/saigayatri/Desktop/rprogramming.txt",header=FALSE)
> text_data
V1 V2 V3 V4 V5 V6
1 Hello people, welcome to this tutorial-
V7
1 Getting started with data science in 30 days using R programming
V8
1 .
> class(text_data)
[1] "data.frame"

Now, let us use a package ‘readtext ‘ to read text files

The ‘readtext’ package:

‘readtext’ package currently can handle plain text files, json, csv, PDF, microsoft word, files from a url, archive files. readtext package comes with a directory containing examples of all the above files. Let us read a text file using readtext package.

> DATA_DIR <- system.file("extdata/", package = "readtext")
> library(readtext)
> readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
readtext object consisting of 13 documents and 0 docvars.
# data.frame [13 x 2]
doc_id
<chr>
1 UDHR_chinese.txt
2 UDHR_czech.txt
3 UDHR_danish.txt
4 UDHR_english.txt
5 UDHR_french.txt
6 UDHR_georgian.txt
# ... with 7 more rows, and 1 more variables:
# text <chr>

Now, let us read our earlier ‘rprogramming.txt’ file and print it.

> read_text <-readtext("/Users/saigayatri/Desktop/rprogramming.txt")
> print(read_text)
readtext object consisting of 1 document and 0 docvars.
# data.frame [1 x 2]
doc_id text
<chr> <chr>
1 rprogramming.txt "\"Hello peop\"..."
> class(read_text)
[1] "readtext" "data.frame"

You can read more about readtext package here.

HTML and XML files:

HTML and XML files can be read with ‘XML’ library. Firstly, download it and load it into your workspace.

>install.packages("XML")
>library(XML)

Html file:

Now, let us read html file using this library. For this ,we use htmlTreeParse() function. The first argument of it takes the link to the url.

>doc <- htmlTreeParse("https://medium.com/@TheDataGyan/getting-started-with-data-science-in-30-days-with-r-programming-day-1-e59dd8d377e7",useInternal =TRUE)

Now, we will see what is the class of this ‘doc’ object.

> class(doc)
[1] "HTMLInternalDocument" "HTMLInternalDocument"
[3] "XMLInternalDocument" "XMLAbstractDocument"

Let us unlist it to get a character vector

>doc.text = unlist(xpathApply(doc, '//p', xmlValue))

now, let us put it in a data frame

> class(doc.text)
[1] "character"
> df <- data.frame(doc.text)

XML file:

Now, let us get data from an xml file and store into the most convenient format- a data frame

> url <-"/Users/saigayatri/Desktop/notes.xml"
> data_df <- xmlToDataFrame(url)
> data_df
text
1 Readers
2 TheDATAGyan
3 R PROGRAMMING
4 It is interesting!!

Other files, sources and web scrapping:

Let’s see how to get data from SAS and STATA files in R.

Importing data from SAS files:

In order to read ‘sas7bdat’ files, we need ‘sas7bdat’ package. Install it and load it into the workspace, then use following code snippet to load data from a SAS file into R.

> mySASDATA <- read.sas7bdat("airline.sas7bdat")
> mySASDATA
YEAR Y W R L K
1 1948 1.214 0.243 0.1454 1.415 0.612
2 1949 1.354 0.260 0.2181 1.384 0.559
3 1950 1.569 0.278 0.3157 1.388 0.573
4 1951 1.948 0.297 0.3940 1.550 0.564
5 1952 2.265 0.310 0.3559 1.802 0.574
6 1953 2.731 0.322 0.3593 1.926 0.711
7 1954 3.025 0.335 0.4025 1.964 0.776
8 1955 3.562 0.350 0.3961 2.116 0.827
9 1956 3.979 0.361 0.3822 2.435 0.800
10 1957 4.420 0.379 0.3045 2.707 0.921
11 1958 4.563 0.391 0.3284 2.706 1.067
12 1959 5.385 0.426 0.3856 2.846 1.083
13 1960 5.554 0.441 0.3193 3.089 1.481
14 1961 5.465 0.460 0.3079 3.122 1.736
15 1962 5.825 0.485 0.3783 3.184 1.926
16 1963 6.876 0.506 0.4180 3.263 2.041
17 1964 7.823 0.538 0.5163 3.412 1.997
18 1965 9.120 0.564 0.5879 3.623 2.257
19 1966 10.512 0.586 0.5369 4.074 2.742
20 1967 13.020 0.622 0.4443 4.710 3.564
21 1968 15.261 0.666 0.3052 5.217 4.767
22 1969 16.313 0.731 0.2332 5.569 6.511
23 1970 16.002 0.831 0.1883 5.495 7.627
24 1971 15.876 0.906 0.2023 5.334 8.673
25 1972 16.662 1.000 0.2506 5.345 8.331
26 1973 17.014 1.056 0.2668 5.662 8.557
27 1974 19.305 1.131 0.2664 5.729 9.508
28 1975 18.721 1.247 0.2301 5.722 9.062
29 1976 19.250 1.375 0.3452 5.762 8.262
30 1977 20.647 1.544 0.4508 5.877 7.474
31 1978 22.726 1.703 0.5877 6.108 7.104
32 1979 23.619 1.779 0.5346 6.852 6.874

Reading stata file:

We need ‘foreign’ package to read Stata files. Note that Stata files end with .dta extension. Install ‘foreign package’ , load it into workspace and read the data using read.dta() function. Note, I used the same airline dataset in Stata format here.

> library(foreign)
> mydata <- read.dta("airline.dta")
> mydata
year y w r l k
1 1948 1.214 0.243 0.1454 1.415 0.612
2 1949 1.354 0.260 0.2181 1.384 0.559
3 1950 1.569 0.278 0.3157 1.388 0.573
4 1951 1.948 0.297 0.3940 1.550 0.564
5 1952 2.265 0.310 0.3559 1.802 0.574
6 1953 2.731 0.322 0.3593 1.926 0.711
7 1954 3.025 0.335 0.4025 1.964 0.776
8 1955 3.562 0.350 0.3961 2.116 0.827
9 1956 3.979 0.361 0.3822 2.435 0.800
10 1957 4.420 0.379 0.3045 2.707 0.921
11 1958 4.563 0.391 0.3284 2.706 1.067
12 1959 5.385 0.426 0.3856 2.846 1.083
13 1960 5.554 0.441 0.3193 3.089 1.481
14 1961 5.465 0.460 0.3079 3.122 1.736
15 1962 5.825 0.485 0.3783 3.184 1.926
16 1963 6.876 0.506 0.4180 3.263 2.041
17 1964 7.823 0.538 0.5163 3.412 1.997
18 1965 9.120 0.564 0.5879 3.623 2.257
19 1966 10.512 0.586 0.5369 4.074 2.742
20 1967 13.020 0.622 0.4443 4.710 3.564
21 1968 15.261 0.666 0.3052 5.217 4.767
22 1969 16.313 0.731 0.2332 5.569 6.511
23 1970 16.002 0.831 0.1883 5.495 7.627
24 1971 15.876 0.906 0.2023 5.334 8.673
25 1972 16.662 1.000 0.2506 5.345 8.331
26 1973 17.014 1.056 0.2668 5.662 8.557
27 1974 19.305 1.131 0.2664 5.729 9.508
28 1975 18.721 1.247 0.2301 5.722 9.062
29 1976 19.250 1.375 0.3452 5.762 8.262
30 1977 20.647 1.544 0.4508 5.877 7.474
31 1978 22.726 1.703 0.5877 6.108 7.104
32 1979 23.619 1.779 0.5346 6.852 6.874

Apart from these methods, with the help of rvest package in R, we can scrape java script data into R. Also, there are some excellent packages like tm() which help in getting data from text files and storing them as corpus which plays a very significant role in text mining.

Did you find the article helpful? If so, hit the ‘clap’ button so that it can reach to many.

--

--

SaiGayatri Vadali
BuzzRobot

An inquisitive Machine Learning Engineer, yoga trainer, fitness freak and a passionate writer!