Preparing for the MCSA Exam 70–773
Part 1 | Reading and Exploring Data — FREE NOTES
When we work with files in R, we normally import CSV/text files and then convert them to XDF.
Introduction
Hi everyone! I'm Deep. I am currently studying for a bachelor's degree in Finance and Business Information Systems at Monash University. These are the notes I took while preparing for my 70–773 exam. I hope that they help at least a few people!
Prerequisites
You will need to download Microsoft’s Machine Learning Server. You can find the install instructions for Windows here. Currently, I have some beginner knowledge of R and some understanding of the basic concepts in Machine Learning. Let’s see if this is enough to pass the exam!
Importing using a CSV as a source
# Creates a POINTER to the source
input_csv = file.path(rxGetOption("sampleDataDir"), "claims.txt")
text = RxTextData(input_csv)
text
RxTextData(input_csv) does not read the data yet. It creates a data source object in R that points to the CSV file. Next, we will import this source into an XDF file and call the result our workingData variable, to be used in the future.
# Reads from the actual file and writes it into a new file
new_file = file.path(tempdir(), "importedClaims.xdf")
new_file
workingData = rxImport(inData = text,
                       outFile = new_file,
                       overwrite = TRUE)
There are other params for rxImport that can be found using:
?rxImport
Data Frame as source
An in-memory data frame object can also be used as a source:
# data frame -> xdf
# (columns are named so the XDF file gets readable variable names)
data_frame_to_import = data.frame(number = 1:10, letter = letters[1:10])
data_frame_to_import
# note the outFile can be written inline as well, without using file.path:
imported_from_data_frame = rxImport(inData = data_frame_to_import,
                                    outFile = "dataframe.xdf",
                                    overwrite = TRUE)
ODBC
When connecting to an ODBC data source, we can import from a table or from a query:
Table
Import the entire table into a variable, much like how the built-in iris dataset is held as a data frame.
Query
Run a query against the database and then use the data frame generated from the result.
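The two ODBC options above could be sketched roughly as follows. This is a minimal sketch, not a tested recipe: the helper name import_from_odbc, the connection string, the table name and the SQL statement are all made up for illustration, and it assumes RevoScaleR is installed. RxOdbcData takes either a table or a sqlQuery argument.

```r
library(RevoScaleR)

# Hypothetical helper: import either a whole table or the result of a query.
# Supply exactly one of `table` or `query`.
import_from_odbc <- function(conn_str, table = NULL, query = NULL) {
  odbc_source <- RxOdbcData(table = table,
                            sqlQuery = query,
                            connectionString = conn_str)
  # No outFile is given, so rxImport returns an in-memory data frame.
  rxImport(inData = odbc_source)
}

# Usage (assumed connection string and table name, replace with your own):
# conn_str <- "Driver=SQL Server;Server=localhost;Database=ClaimsDb;Trusted_Connection=True"
# whole_table  <- import_from_odbc(conn_str, table = "claims")
# query_result <- import_from_odbc(conn_str, query = "SELECT age, cost FROM claims")
```

Passing outFile to rxImport instead would write the result to an XDF file, as in the CSV example.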
Summarising Data using Rx Functions
Summaries give you descriptive statistics (such as counts, means, standard deviations, minimums and maximums) for your data. There are plenty of functions in Microsoft's add-ons to R to help with this:
rxSummary
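As a rough sketch of rxSummary, the snippet below summarises the sample claims data used earlier in these notes. It assumes RevoScaleR is installed and that the sample file contains the columns cost, number and age; the variable names are just for illustration.

```r
library(RevoScaleR)

# Import the sample claims file into a data frame (no outFile given).
# stringsAsFactors = TRUE turns the text columns into factors.
claims_csv <- file.path(rxGetOption("sampleDataDir"), "claims.txt")
claims <- rxImport(inData = claims_csv, stringsAsFactors = TRUE)

# One-sided formula: list the variables to summarise on the right of the ~.
claims_summary <- rxSummary(~ cost + number + age, data = claims)
claims_summary
```

Numeric variables get statistics such as the mean and standard deviation, while factor variables get category counts.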
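Group By: Using rxCrossTabs and rxCube
Both functions group the data by one or more factors. rxCrossTabs displays the result in table/grid form, while rxCube returns it in long form, with one row per combination of factor levels.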
Common Errors
Error in doTryCatch(return(expr), name, parentenv, handler) :
All independent variables must be factors for rxCube and rxCrossTabs: “age”, “car.age”. Use ‘F(x)’ to declare that a continuous variable x is to be treated as a factor.
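This happens because age and car.age have not been converted to factors. The working data columns that have been imported from the CSV file come in with a class of character by default. Use as.factor() on those columns of a data frame, or import the file with stringsAsFactors = TRUE.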
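One way to avoid this error is to make the columns factors at import time. The sketch below assumes RevoScaleR is installed and uses the sample claims file; the output file name claimsFactors.xdf is made up for illustration.

```r
library(RevoScaleR)

claims_csv <- file.path(rxGetOption("sampleDataDir"), "claims.txt")

# stringsAsFactors = TRUE stores the text columns (age, car.age, type) as factors.
claims_xdf <- rxImport(inData = claims_csv,
                       outFile = file.path(tempdir(), "claimsFactors.xdf"),
                       stringsAsFactors = TRUE,
                       overwrite = TRUE)

# Check the column types before grouping.
rxGetVarInfo(claims_xdf)

# Group cost by age and car.age; both are factors now, so no F() is needed.
cross_tabs <- rxCrossTabs(cost ~ age:car.age, data = claims_xdf)
cube <- rxCube(cost ~ age:car.age, data = claims_xdf)
cross_tabs
cube
```

F(x) from the error message is meant for numeric columns that should be treated as categories, which is why it is not used here.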
The Means Parameter
What are the defaults for rxCrossTabs and rxCube?
- rxCrossTabs gives sums by default (means = FALSE)
- rxCube gives means by default (means = TRUE)
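The defaults above can be flipped with the means argument. A small sketch, assuming RevoScaleR is installed and using the sample claims data (the variable names are illustrative):

```r
library(RevoScaleR)

claims_csv <- file.path(rxGetOption("sampleDataDir"), "claims.txt")
claims <- rxImport(inData = claims_csv, stringsAsFactors = TRUE)

# rxCrossTabs: sums unless means = TRUE is requested.
sums_tab  <- rxCrossTabs(cost ~ age:car.age, data = claims)
means_tab <- rxCrossTabs(cost ~ age:car.age, data = claims, means = TRUE)

# rxCube: means unless means = FALSE is requested.
means_cube <- rxCube(cost ~ age:car.age, data = claims)
sums_cube  <- rxCube(cost ~ age:car.age, data = claims, means = FALSE)

means_tab
sums_cube
```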
Using dplyrXdf
dplyrXdf is a package that lets us use dplyr verbs (and the pipe) on XDF files, on top of RevoScaleR.
The %>% operator
The output on the left is passed in as the first parameter to the function on the right. This operator makes it easier to chain operations.
For example, the following are equivalent:
- iris %>% head()
- head(iris)
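The equivalence can be checked directly. This sketch loads %>% from magrittr, which is an assumption made here to keep the example small; loading dplyr makes the same operator available.

```r
library(magrittr)

piped  <- iris %>% head()
nested <- head(iris)
identical(piped, nested)  # TRUE

# Chaining reads left to right instead of inside out:
setosa_rows <- iris %>% subset(Species == "setosa") %>% nrow()
setosa_rows               # same as nrow(subset(iris, Species == "setosa"))
```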
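Using rxQuantile to compute quantiles
rxQuantile returns the values at the given probability quantiles (for example, the quartiles).
When can this be inaccurate?
- When there is a big outlier. rxQuantile doesn't sort; it bins (indexes) your data and approximates the quantiles from the bins, because it's made to be as fast as possible.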
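To see how close the approximation is, rxQuantile can be compared with base R's quantile, which does sort. This is only a sketch: the data frame is synthetic and made up for illustration, RevoScaleR is assumed to be installed, and the size of any gap depends on the data.

```r
library(RevoScaleR)

set.seed(42)
# Synthetic data (an assumption): 1,000 ordinary values plus one big outlier.
sample_df <- data.frame(x = c(rnorm(1000, mean = 50, sd = 10), 1e6))

probs <- seq(0, 1, by = 0.25)
approx_q <- rxQuantile("x", data = sample_df, probs = probs)  # approximate, no sort
exact_q  <- quantile(sample_df$x, probs = probs)              # base R, sorts the data

# Side by side; any gap between the rows is approximation error.
rbind(rxQuantile = approx_q, quantile = exact_q)
```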
Creating Visualizations
A cross tab gives us a lot of information about the data. It can be used, for example, to draw a heatmap.
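A minimal sketch of such a heatmap, built around the crossTabs$means[[1]] expression used in these notes. It assumes RevoScaleR is installed, loads %>% from magrittr, and switches off heatmap's row and column clustering (Rowv = NA, Colv = NA) so that empty cells do not get in the way; those settings are a choice made here, not necessarily the original ones.

```r
library(RevoScaleR)
library(magrittr)  # provides %>%

claims_csv <- file.path(rxGetOption("sampleDataDir"), "claims.txt")
claims <- rxImport(inData = claims_csv, stringsAsFactors = TRUE)

# means = TRUE so the cross tab reports the mean cost per age / car.age cell.
crossTabs <- rxCrossTabs(cost ~ age:car.age, data = claims, means = TRUE)

# The first element of $means is a matrix with age in rows and car.age in columns.
mean_cost <- crossTabs$means[[1]]
mean_cost %>% heatmap(xlab = "car age", ylab = "age", Rowv = NA, Colv = NA)
```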
Common Errors
Error in crossTabs$means[[1]] %>% heatmap(xlab = "car age", ylab = "age", : could not find function "%>%"
This is usually the result of not loading dplyrXdf (or another package that provides %>%, such as dplyr or magrittr) with library() before running the command. Please see block1.r for more info about installing dplyrXdf.
Making ggplot Plots
- Use rxCube (ggplot needs the data in long format, not as a matrix)
- Then run ggplot on the rxCube result
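The two steps above might look like the sketch below. It assumes RevoScaleR and ggplot2 are installed, and that the data frame returned by rxResultsDF has columns named age, car.age and cost (named after the formula terms). The choice of geom_tile is just one option.

```r
library(RevoScaleR)
library(ggplot2)

claims_csv <- file.path(rxGetOption("sampleDataDir"), "claims.txt")
claims <- rxImport(inData = claims_csv, stringsAsFactors = TRUE)

# Step 1: rxCube gives one row per age / car.age combination (long format).
cube <- rxCube(cost ~ age:car.age, data = claims)
cube_df <- rxResultsDF(cube)  # convert the cube to a plain data frame

# Step 2: plot mean cost as a tile per combination.
claims_plot <- ggplot(cube_df, aes(x = car.age, y = age, fill = cost)) +
  geom_tile() +
  labs(x = "car age", y = "age", fill = "mean cost")
claims_plot
```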