Connect Google BigQuery to IBM SPSS Modeler using JDBC with R
If your data is stored in the Google Cloud and you want to mine it with IBM SPSS Modeler, you can, and in this post I will show you how.
I am not going to explain how valuable it is to use the cloud, or how easy it is to set up a Hadoop cluster on IBM, Amazon, Google, or any other provider's infrastructure. In seconds you can have your environment ready to use. So if you are dealing with large amounts of data you might need to mine it…and for that…the best tool to use is IBM SPSS Modeler!
There are, in my opinion, four different ways to connect IBM SPSS Modeler to Google BigQuery:
1. R →There is a package called bigrquery https://github.com/hadley/bigrquery
You cannot use it in Modeler 16 yet, because Modeler 16 bundles R 2.15 and this package requires R 3.1. But with the next release we will be able to install the package, and connecting to BigQuery and creating an extension node will be very easy.
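For reference, here is a minimal sketch of what a bigrquery call could look like once Modeler bundles R 3.1. It runs one of the Shakespeare sample queries from later in this post; `<projectID>` is a placeholder for your own Google Cloud project, and `query_exec()` will prompt for OAuth authorization in the browser on first use:

```r
# Sketch only: assumes an R >= 3.1 environment with bigrquery installed
# install.packages("bigrquery")
library(bigrquery)

# <projectID> is a placeholder for your own Google Cloud project ID
project <- "<projectID>"

sql <- "SELECT corpus, SUM(word_count) AS wordcount
        FROM [publicdata:samples.shakespeare]
        GROUP BY corpus
        ORDER BY wordcount DESC"

# Runs the query in BigQuery and returns the result as an R data frame
data <- query_exec(sql, project = project)
head(data)
```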
2. JDBC → There is a JDBC Open Source driver to do that, and you can find it here: https://code.google.com/p/starschema-bigquery-jdbc/
I created an extension for SPSS Modeler that uses R to connect to BigQuery through JDBC. It is less direct than the bigrquery package from the previous point, but still quite easy to set up. Here you can see what it looks like: using the Custom Dialog Builder I created the user interface, and it is as simple as selecting your projectID, UserID, and KeyID, and then writing the query.
3. If you are willing to pay, some companies have developed ODBC drivers to connect to BigQuery: http://www.simba.com/connectors/google-bigquery-odbc
4. The best way might be to use IBM SPSS Analytic Server, but BigQuery is not supported yet (though it should be possible to implement).
How to get started?
If you want to test it, just go to the Google Cloud Platform and get started for free. If you are only testing, I doubt you will reach the number of requests that requires payment; in any case, Google offers $300 of free credit to start using their cloud.
When you create your BigQuery instance, the publicdata:samples dataset is already available, so you don't have to upload your own data. You can find all the information about the included sample tables here: https://cloud.google.com/bigquery/docs/sample-tables
Name          Summary                                                # Rows
gsod          Samples from US weather stations since 1929            115M
mlab          Measurement data of broadband connection performance   240B
natality      Birth information for the United States, 1969 to 2008  68M
shakespeare   Word index for works of Shakespeare                    164K
wikipedia     Revision information for Wikipedia articles            314M
Some sample queries you can use:
- A few simple SELECT queries. Heaviest 10 children:
SELECT weight_pounds, state, year, gestation_weeks FROM publicdata:samples.natality
ORDER BY weight_pounds DESC LIMIT 10;
- A popular myth debunked!
SELECT word FROM publicdata:samples.shakespeare WHERE word="huzzah";
- A few count queries. How many works of Shakespeare are there?
SELECT corpus FROM publicdata:samples.shakespeare GROUP BY corpus;
- Which are the longest works?
SELECT corpus, sum(word_count) AS wordcount FROM publicdata:samples.shakespeare GROUP BY corpus ORDER BY wordcount DESC;
The R code used in the extension is the following:
# Load the RJDBC package and register the open-source BigQuery JDBC driver
library(RJDBC)
drv <- JDBC("net.starschema.clouddb.jdbc.BQDriver", "bqjdbc-1.4.jar", identifier.quote="`")
# <projectID>, <userID> and <keyID> are placeholders for your own credentials
conn <- dbConnect(drv, "jdbc:BQDriver:<projectID>?withServiceAccount=false", "<userID>", "<keyID>")
# Run a query and fetch the result into an R data frame
data <- dbGetQuery(conn, "SELECT year, month, day, weight_pounds FROM [publicdata:samples.natality] LIMIT 5")
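To make the query result visible to downstream Modeler nodes, the R script in an extension node also needs to populate Modeler's `modelerData` and `modelerDataModel` variables. The sketch below assumes the query above was just run; the field names and storage types listed are assumptions matching that query:

```r
# Sketch: expose the result to downstream SPSS Modeler nodes.
# modelerData holds the rows, modelerDataModel describes the fields.
modelerData <- data

# Field names/storages are assumptions based on the natality query above
modelerDataModel <- data.frame(
  fieldName    = c("year", "month", "day", "weight_pounds"),
  fieldLabel   = "",
  fieldStorage = c("integer", "integer", "integer", "real"),
  fieldMeasure = "",
  fieldFormat  = "",
  fieldRole    = "",
  stringsAsFactors = FALSE
)

# Close the JDBC connection when done
dbDisconnect(conn)
```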