If you know sql, you know R

Oscar Rojo
4 min readSep 21, 2020

--

The R sqldf package allows us to execute SQL statements. The clauses, the expressions, the predicates,… are the salvation for many mediocre types such as the now-writer. sqldf is an indispensable module, even Falete’s boyfriend is able to program in SQL.

Photo by Zhang Kenny on Unsplash

Obviously I’m not going to teach you how to do queries, but I do want to show you some of the possibilities that this R package offers. “e will work with examples.

# Detach package
# detach("package:some_package", unload = TRUE)
# Clean the working memory
rm(list = ls())
# Gargabe colector
# gc()
# Set working directory
setwd("~/Documentos/Medium/SQLdf/")
getwd()

‘/home/oscar/Documentos/Medium/SQLdf’

# Install packages
packages <- c("sqldf")
newpack = packages[!(packages %in% installed.packages()[,"Package"])]

if(length(newpack)) install.packages(newpack)
a=lapply(packages, library, character.only=TRUE)
rm(packages)

The first step is to create a data.frame:

saldo1=runif(100,0,1)*1000
saldo2=runif(100,0,0.5)*10000
saldos=data.frame(cbind(saldo1,saldo2))
saldos$id_cliente=c(1:100)
saldos$alto=as.factor(ifelse(saldo1+saldo2>=4000,1,2))
summary(saldos)
saldo1 saldo2 id_cliente alto
Min. : 32.75 Min. : 2.52 Min. : 1.00 1:28
1st Qu.:270.74 1st Qu.:1123.23 1st Qu.: 25.75 2:72
Median :519.33 Median :2644.90 Median : 50.50
Mean :504.42 Mean :2513.83 Mean : 50.50
3rd Qu.:727.64 3rd Qu.:3790.58 3rd Qu.: 75.25
Max. :998.53 Max. :4947.22 Max. :100.00

We create a structure with two numerical variables saldo1 and saldo2, a variable client_id (autonumeric) and a factor that indicates whether the balances are alto (1) or bajo (2). The first step is to summarize the balances by the factor high. Programming in R a mediocre type as I would use the aggregate function:

d1=aggregate(saldos$saldo1,list(saldos$alto),FUN="max")
names(d1)=c("alto" ,"max_saldo1")
d2=aggregate(saldos$saldo1,list(saldos$alto),FUN="min")
names(d2)=c("alto" ,"min_saldo1")
d3=aggregate(saldos$saldo2,list(saldos$alto),FUN="max")
names(d3)=c("alto" ,"max_saldo2")
d4=aggregate(saldos$saldo2,list(saldos$alto),FUN="min")
names(d4)=c("alto" ,"min_saldo2")
agr=cbind(d1,d2,d3,d4);rm(d1,d2,d3,d4)
agr

It doesn’t look exactly like it should, but it’s a quick and simple code on which we can create a function and which is perfectly parameterisable. However Raul Vaquerizo, someone who believes John Locke influenced Richard Alpert more than Hume, prefers to use SQL to add data:

# library(sqldf)
agr2=sqldf(
'select alto,
max(saldo1) as max_saldo1,
min(saldo1) as min_saldo1,
max(saldo2) as max_saldo2,
min(saldo2) as min_saldo2
from saldos
group by alto;'
)
agr2

Simple code. Perfectly understandable by those less skilled in R. Migrating from another application to R may be less complicated than we think, we may lose the fear of a hypothetical migration. Obviously we can also make joins between tables. We compare the code in R with the analogous code in sqldf and so we learn to use the merge function:

#We created a random sample of 50 records
muestra=data.frame(sample(c(1:100),50))
names(muestra)=c("muestra")
#In this union we keep the sample observations
saldos.muestra=merge(saldos,muestra,by.x="id_cliente",by.y="muestra",all.y)

If this union is attempted by someone who believes that the Philadelphia experiment was real, the first thing he does is a mythical left join (the maximum expression of the ineffective information manager):

saldos.muestra2=sqldf('
select b.*
from muestra a left join saldos b
on muestra=id_cliente;')

The essential inner join in this monographic:

#Empleando merge:
muestra2=data.frame(sample(c(1:1000),100))
names(muestra2)=c("muestra")
saldos.muestra.21=merge(saldos,muestra2,by.x="id_cliente",
by.y="muestra",all.x)
#Empleando sqldf:
saldos.muestra.22=sqldf(
'select a.*
from saldos a , muestra2 b
where id_cliente=muestra;'
)
summary(saldos.muestra.21)
summary(saldos.muestra.22)
id_cliente saldo1 saldo2 alto
Min. : 3.00 Min. :141.6 Min. : 523.4 1:1
1st Qu.: 8.00 1st Qu.:353.2 1st Qu.: 781.9 2:7
Median :52.50 Median :613.8 Median :1613.1
Mean :47.12 Mean :552.1 Mean :1951.3
3rd Qu.:82.25 3rd Qu.:739.6 3rd Qu.:2735.5
Max. :90.00 Max. :855.6 Max. :4942.2



saldo1 saldo2 id_cliente alto
Min. :141.6 Min. : 523.4 Min. : 3.00 1:1
1st Qu.:353.2 1st Qu.: 781.9 1st Qu.: 8.00 2:7
Median :613.8 Median :1613.1 Median :52.50
Mean :552.1 Mean :1951.3 Mean :47.12
3rd Qu.:739.6 3rd Qu.:2735.5 3rd Qu.:82.25
Max. :855.6 Max. :4942.2 Max. :90.00

And finally the full join:

saldos.muestra.31=merge(saldos,
muestra2,
by.x="id_cliente",
by.y="muestra",
all=TRUE)
head(saldos.muestra.31,30)

Conclusion

No, I didn’t forget to put the analog code in for sqldf. It’s not possible to do full join or right join in sqldf, but you won’t deny me that the code in R is very simple. Please correct me if I’m wrong and we can do full join with sqldf and I’ll modify the monographic immediately.

I hope you like it.

No matter what books or blogs or courses or videos one learns from, when it comes to implementation everything can look like “Outside the Curriculum”.

The best way to learn is by doing! The best way to learn is by teaching what you have learned!

Never give up!

See you on Linkedin!

--

--