Maximum & Fast readability of multivariate data vs Label
Nov 13, 2016
Laurae: This post is about plotting data for maximum readability, so you can quickly read multivariate data against a single label. Obviously, interactions between features are harder to notice this way; for those you would go with regression coefficients, decision trees, or other statistics. The example used is House Prices: Advanced Regression Techniques, which has 80 predictors and a skewed label. The post was originally published on Kaggle.
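To see why a skewed label is worth log-transforming before plotting against it, here is a minimal sketch on simulated data (made-up log-normal "prices", not the Kaggle file). The `skewness` helper is a hypothetical one written for this example, not a function from any package.

```r
# Simulated right-skewed "prices": made-up data, not the Kaggle file
set.seed(42)
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Sample skewness as the third standardised moment
# (hypothetical helper written for this sketch)
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(price)       # positive: long right tail
skew_log <- skewness(log(price))  # much closer to zero after the log
print(c(raw = skew_raw, log = skew_log))
```

With a more symmetric label, the sorted bins are spread more evenly over the label's range instead of being squeezed together at the low end.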
Plotting all data using tabplots
Objective: find out some of the good features visually =)
RMarkdown code (more than half of it is there just to load the data =) ):
# Plotting all data using tabplots
Objective: find out some of the good features visually =)
```{r, fig.width = 11, fig.height = 5.5, echo = FALSE, message = FALSE, warning = FALSE}
invisible(library(tabplot))
invisible(library(data.table))
# Column classes for the 81 columns of train.csv, in file order
columns <- c("numeric",
             rep("character", 2),
             rep("numeric", 2),
             rep("character", 12),
             rep("numeric", 4),
             rep("character", 5),
             "numeric",
             rep("character", 7),
             "numeric",
             "character",
             rep("numeric", 3),
             rep("character", 4),
             rep("numeric", 10),
             "character",
             "numeric",
             "character",
             "numeric",
             rep("character", 2),
             "numeric",
             "character",
             rep("numeric", 2),
             rep("character", 3),
             rep("numeric", 6),
             rep("character", 3),
             rep("numeric", 3),
             rep("character", 2),
             "numeric")
# data.table = FALSE makes fread return a plain data.frame
data <- fread("../input/train.csv", data.table = FALSE, header = TRUE, sep = ",", colClasses = columns)
# Work on log(SalePrice): the competition scores RMSE on the log of the price
data$SalePrice <- log(data$SalePrice)
# Turn every character column into a factor; missing values get their own "" level
for (i in 1:80) {
  if (is.character(data[, i])) {
    data[is.na(data[, i]), i] <- ""
    data[, i] <- as.factor(data[, i])
  }
}
# 16 plots of 5 predictors each, plus log(SalePrice) (column 81).
# sortCol = 6 points at the 6th selected column, i.e. SalePrice.
for (i in 1:16) {
  cols <- ((i - 1) * 5 + 1):(i * 5)
  plot(
    tableplot(data, select = c(cols, 81), sortCol = 6, nBins = 73, plot = FALSE),
    fontsize = 12,
    title = paste0("log(SalePrice) vs ", paste(colnames(data)[cols], collapse = "+")),
    showTitle = TRUE,
    fontsize.title = 12
  )
}
```
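If you wonder what a tableplot actually shows, here is a rough base-R sketch of the idea as I understand it: sort the rows by the label, cut them into equally sized row bins, then summarise every column per bin (mean for numeric columns, share of each level for factors). The toy data frame and all its column names are made up for illustration; this is not tabplot's own code.

```r
# Toy data, made up for illustration only (not the Kaggle file)
set.seed(1)
n <- 730
toy <- data.frame(
  area    = runif(n, 50, 300),
  quality = factor(sample(c("low", "mid", "high"), n, replace = TRUE))
)
toy$label <- 10 + 0.005 * toy$area + 0.3 * (toy$quality == "high") + rnorm(n, sd = 0.1)

n_bins <- 73  # same bin count as in the post; 10 rows per bin here

# 1. Sort rows by the label, highest first
toy <- toy[order(toy$label, decreasing = TRUE), ]

# 2. Assign consecutive, equally sized row bins
toy$bin <- rep(seq_len(n_bins), each = n / n_bins)

# 3. Summarise each column per bin
label_means <- tapply(toy$label, toy$bin, mean)          # the sorted label
bin_means   <- tapply(toy$area, toy$bin, mean)           # numeric: mean per bin
bin_props   <- prop.table(table(toy$bin, toy$quality), 1) # factor: share of each level per bin

head(data.frame(label = label_means, area = bin_means, bin_props[, c("low", "mid", "high")]))
```

A feature is interesting when its per-bin summary drifts along with the sorted label: here `area` shrinks from the top bins to the bottom ones, and `high` quality is concentrated in the top bins.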