Preparing for the MCSA Exam 70–773

Part 1 | Reading and Exploring Data — FREE NOTES

Deep Bhattacharyya
DeeplyDiligent Blog
Jan 20, 2019

When we work with files in R, we normally import CSV/text files and then convert them to XDF.

Introduction

Hi everyone! I’m Deep. I am currently studying for a bachelor’s degree in Finance and Business Information Systems at Monash University. These are the notes I took while preparing for my 70–773 exam. I hope that they help at least a few people!

Prerequisites

You will need to download Microsoft’s Machine Learning Server. You can find the install instructions for Windows here. Currently, I have some beginner knowledge of R and some understanding of the basic concepts in Machine Learning. Let’s see if this is enough to pass the exam!

Importing using a CSV as a source

RxTextData(csvFile) creates a data source object that points at the CSV file. Next, rxImport converts this source into an XDF file, which we store in our workingData variable to be used in the future.

The other parameters for rxImport are listed on its help page, which you can open with ?rxImport.
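As a minimal sketch of this step (the sample data, column names, and temporary file paths are assumptions made up for illustration, not the data used in these notes), the import could look like this:

```r
library(RevoScaleR)  # ships with Microsoft Machine Learning Server

# Hypothetical sample data written to a temp CSV so the sketch is self-contained
csvFile <- tempfile(fileext = ".csv")
write.csv(
  data.frame(age = c("17-20", "21-24", "17-20"), cost = c(120, 80, 95)),
  csvFile,
  row.names = FALSE
)

# Describe the CSV as a data source, then convert it to an XDF file
textSource  <- RxTextData(csvFile)
xdfFile     <- tempfile(fileext = ".xdf")
workingData <- rxImport(inData = textSource, outFile = xdfFile, overwrite = TRUE)

# Inspect what was imported (variable names and types)
rxGetInfo(workingData, getVarInfo = TRUE)
```

Because outFile is given, workingData is a reference to the XDF file on disk rather than an in-memory data frame.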

Data Frame as source

An in-memory data frame object can also be used as a source:
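A short sketch of this, using the built-in iris data frame as a stand-in (the output path is a temporary file chosen for illustration):

```r
library(RevoScaleR)

# Write the in-memory iris data frame out as an XDF file
irisXdf  <- tempfile(fileext = ".xdf")
irisData <- rxImport(inData = iris, outFile = irisXdf, overwrite = TRUE)

# Show the variables and the first few rows of the new XDF file
rxGetInfo(irisData, getVarInfo = TRUE, numRows = 3)
```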

ODBC

When connecting to an ODBC data source, we can import from a table or a query:

Table
Import the entire table into a variable, much like the built-in iris data frame.

Query
Run a query on the database and then use the data frame it generates.
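Both options can be sketched with RxOdbcData. The connection string, table name, and query below are hypothetical placeholders, so swap in your own before running the import:

```r
library(RevoScaleR)

# Hypothetical connection string; replace with one for your own database
connStr <- "Driver={SQL Server};Server=myServer;Database=myDb;Trusted_Connection=yes"

# Table: point the data source at a whole table (hypothetical table name)
tableSource <- RxOdbcData(table = "insurance", connectionString = connStr)

# Query: point the data source at the result of a SQL query (hypothetical query)
querySource <- RxOdbcData(
  sqlQuery = "SELECT age, cost FROM insurance WHERE cost > 100",
  connectionString = connStr
)

# Wrapped in a function so the import only runs when you call it
# against a database you can actually reach
importFromOdbc <- function(odbcSource) {
  rxImport(inData = odbcSource)  # no outFile, so a data frame is returned
}
```

Calling importFromOdbc(tableSource) or importFromOdbc(querySource) would then pull the rows into memory.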

Summarising Data using Rx Functions

Summary functions give you a quick statistical overview of your data. There are plenty of functions in Microsoft’s add-ons to R to help with this:

rxSummary
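A minimal sketch of rxSummary, run on a small made-up data frame (the age and cost columns are assumptions for illustration):

```r
library(RevoScaleR)

# Hypothetical data: age is a factor, cost is numeric
claims <- data.frame(
  age  = factor(c("17-20", "17-20", "21-24", "21-24")),
  cost = c(100, 200, 300, 400)
)

# Summary statistics for cost, plus category counts for age
costSummary <- rxSummary(~ cost + age, data = claims)
print(costSummary)

# Summary statistics for cost within each age category
costByAge <- rxSummary(cost ~ age, data = claims)
print(costByAge)
```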

Group By: Using rxCrossTabs and rxCube

Both functions group the data by one or more factors and then display the result in table/grid form.
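A sketch of both calls on made-up data (the age, car.age, and cost columns here are illustrative assumptions):

```r
library(RevoScaleR)

# Hypothetical data with two grouping factors and one numeric column
claims <- data.frame(
  age     = factor(c("17-20", "17-20", "21-24", "21-24")),
  car.age = factor(c("0-3", "4-7", "0-3", "4-7")),
  cost    = c(100, 200, 300, 400)
)

# Cross tabulation: printed as a grid of age by car.age
crossTab <- rxCrossTabs(cost ~ age:car.age, data = claims)
print(crossTab)

# Cube: the same grouping, printed in long format (one row per combination)
cube <- rxCube(cost ~ age:car.age, data = claims)
print(cube)
```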

Common Errors

This error happens because age and car.age have not been converted with as.factor(). The working data columns imported from the CSV file come in with a class of character, and the grouping functions need factors.
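One way to avoid the problem, sketched here as an alternative to calling as.factor() yourself, is to ask rxImport to convert strings to factors at import time. The file contents are made up for illustration:

```r
library(RevoScaleR)

# Hypothetical CSV in which age and car.age are stored as text
csvFile <- tempfile(fileext = ".csv")
write.csv(
  data.frame(
    age     = c("17-20", "17-20", "21-24", "21-24"),
    car.age = c("0-3", "4-7", "0-3", "4-7"),
    cost    = c(100, 200, 300, 400)
  ),
  csvFile,
  row.names = FALSE
)

# stringsAsFactors = TRUE turns the character columns into factors on import
xdfFile     <- tempfile(fileext = ".xdf")
workingData <- rxImport(
  inData = csvFile, outFile = xdfFile,
  stringsAsFactors = TRUE, overwrite = TRUE
)

# The grouping now works because age and car.age are factors
crossTab <- rxCrossTabs(cost ~ age:car.age, data = workingData)
print(crossTab)
```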

The Means Parameter

Defaults for rxCrossTabs and rxCube

  • rxCrossTabs gives the sum by default (means = FALSE)
  • rxCube gives the mean by default (means = TRUE)

rxCube when means = FALSE (the cost is a sum)
rxCube when means = TRUE (the cost is a mean)
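The effect of the parameter can be sketched on a tiny made-up data frame (column names are assumptions for illustration):

```r
library(RevoScaleR)

# Hypothetical data: two age categories with two costs each
claims <- data.frame(
  age  = factor(c("17-20", "17-20", "21-24", "21-24")),
  cost = c(100, 200, 300, 400)
)

# means = FALSE: the cost column holds the total per age category
cubeSums <- rxCube(cost ~ age, data = claims, means = FALSE)
print(rxResultsDF(cubeSums))

# means = TRUE (the rxCube default): the cost column holds the average
cubeMeans <- rxCube(cost ~ age, data = claims, means = TRUE)
print(rxResultsDF(cubeMeans))
```

rxCrossTabs accepts the same means argument, so passing means = TRUE there switches it from sums to averages.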

Using dplyrXdf

dplyrXdf is a package that lets us use dplyr-style verbs on XDF files, with RevoScaleR doing the work underneath.

The %>% operator

The value on the left is passed in as the first argument to the function on the right. This makes it easier to chain function calls.

For example, the following are equivalent:

  • iris %>% head()
  • head(iris)
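A slightly longer chain shows why this helps. This sketch loads magrittr directly for the operator (dplyr re-exports the same one) and checks that the nested and piped forms give the same result:

```r
library(magrittr)  # provides %>%

# Nested form: read from the inside out
nested <- head(subset(iris, Species == "setosa"), 3)

# Piped form: read from left to right
piped <- iris %>%
  subset(Species == "setosa") %>%
  head(3)

# Both forms produce the same data frame
identical(nested, piped)
```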

Using rxQuantile to compute quantiles

rxQuantile returns approximate quantiles of a variable for a given set of probabilities.

When can this be inaccurate?

  • When there is a big outlier. rxQuantile doesn’t sort your data, it just bins it, because it’s made to be as fast as possible.
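To see how close the approximation is on your own data, you can compare it with base R's exact quantile(). This is a sketch on made-up costs that include one large outlier:

```r
library(RevoScaleR)

# Hypothetical costs with one very large outlier
costs <- data.frame(cost = c(seq(100, 1000, by = 100), 1e6))
probs <- c(0.25, 0.5, 0.75)

# Approximate quantiles, computed without sorting
approxQuantiles <- rxQuantile("cost", data = costs, probs = probs)
print(approxQuantiles)

# Exact quantiles from base R, for comparison
exactQuantiles <- quantile(costs$cost, probs = probs)
print(exactQuantiles)
```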

Creating Visualizations

A cross tab gives us a lot of information about the data. It can be used, for example, to draw a heatmap with the heatmap(…) function.
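One possible way to do this, sketched on made-up data, is to pull the table of sums out of the rxCrossTabs result and hand it to base R's heatmap(). Taking the first element of the sums component is an assumption about which table you want to plot:

```r
library(RevoScaleR)

# Hypothetical data with every age / car.age combination present
claims <- data.frame(
  age     = factor(rep(c("17-20", "21-24", "25-29"), each = 2)),
  car.age = factor(rep(c("0-3", "4-7"), times = 3)),
  cost    = c(100, 200, 300, 400, 500, 650)
)

crossTab <- rxCrossTabs(cost ~ age:car.age, data = claims)

# The sums component is a list of tables; take the first one as a matrix
sumsMatrix <- as.matrix(crossTab$sums[[1]])

# Rowv/Colv = NA keeps the original ordering; scale = "none" plots raw sums
heatmap(sumsMatrix, Rowv = NA, Colv = NA, scale = "none")
```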

Common Errors

This could be a result of not loading dplyrXdf prior to running the command. Please see block1.r for more info about installing dplyrXdf.

Making GGPlot Plots

  1. Use rxCube (the data needs to be in the long format, not a matrix)
  2. Then run ggplot on the rxCube result

This will make the same heatmap as the heatmap(…) function above, except it will be prettier (ggplot2) and offer more controls.
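The two steps can be sketched as follows. The data is made up, and the column names in the aes() call assume that the cube's data frame keeps the variable names from the formula, so check names(cubeDF) on your own data:

```r
library(RevoScaleR)
library(ggplot2)

# Hypothetical data with every age / car.age combination present
claims <- data.frame(
  age     = factor(rep(c("17-20", "21-24", "25-29"), each = 2)),
  car.age = factor(rep(c("0-3", "4-7"), times = 3)),
  cost    = c(100, 200, 300, 400, 500, 650)
)

# Step 1: rxCube returns the groups in long format; convert to a data frame
cube   <- rxCube(cost ~ age:car.age, data = claims)
cubeDF <- rxResultsDF(cube)

# Step 2: draw one tile per age / car.age combination, coloured by cost
costPlot <- ggplot(cubeDF, aes(x = age, y = car.age, fill = cost)) +
  geom_tile()
print(costPlot)
```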

Making RxHistograms

Shows counts for each cost at each age category from 17–60+
Shows counts for each cost from the workingData
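Both kinds of plot can be sketched with rxHistogram on made-up data (the column names and values are assumptions for illustration):

```r
library(RevoScaleR)

# Hypothetical data: cost is numeric, age is a factor
claims <- data.frame(
  age  = factor(rep(c("17-20", "21-24", "60+"), each = 4)),
  cost = c(100, 150, 200, 250, 300, 320, 400, 450, 500, 550, 600, 900)
)

# One histogram panel of cost per age category
rxHistogram(~ cost | age, data = claims)

# A single histogram of cost across the whole data set
rxHistogram(~ cost, data = claims)
```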

Deep Bhattacharyya
DeeplyDiligent Blog

Full Stack Developer at Learnmate, Australia's Largest Tutoring Agency. I love to share my passion in tech and finance. https://deeplydiligent.github.io/