Using JDBC and Apache Spark

Overview

The Problem

val df = sqlContext.read.format(“jdbc”).options(Map(“url” -> “jdbc:vertica://[URL]:5433/warehouse”,“dbtable” -> “ratings”,“user” -> “REPLACEME”,“password” -> “REPLACEME”)).load()
val df = sqlContext.read.format(“jdbc”).options(Map(“url” -> “jdbc:vertica://[replaceme]:5433/warehouse”,“dbtable” -> “ratings”,“partitionColumn” -> “tenant_id”,“lowerBound” -> “1”,“upperBound” -> “10000”,“numPartitions” -> “10”,“user” -> “REPLACEME”,“password” -> “REPLACEME”)).load()

Ways of looking at the data

df.mapPartitions(iter => Array(iter.size).iterator).take(300)
  • partition field
  • min
  • max
  • numPartitions

The real problem

First pass at a solution (sorting, etc)

  1. Hopefully we are collecting more data now than we were a year ago, so the data won’t be evenly spread across time slots.
  2. It requires our DB to sort the entire data set. This requires a lot of work (let’s face, there is a reason why data processing systems have a sorting competition) This is only going to get worse as our data set increases. This is expensive, since it requires sending data from node to node over the network.

Second/Final Solution — the Hashing trick

val df = sqlContext.read.format(“jdbc”).options(Map(“url” -> “jdbc:vertica://[replaceme]:5433/warehouse”,“dbtable” -> “ratings”,“partitionColumn” -> “(hash(location_id) % 10) as bucket_id”,“lowerBound” -> “1”,“upperBound” -> “10000”,“numPartitions” -> “10”,“user” -> “REPLACEME”,“password” -> “REPLACEME”)).load()

Hashing in SQL

partition = 100sql = “””select(1 + mod(hash(username), %(partition)s)) as hash_code,username,first_name,last_name,agefrom user_table“”” % {“partition”: partition}df = sqlContext.read.jdbc(url = “jdbc:vertica://db_host:db_port/db_name”,table = “(%s) as tmp” % sql,properties = {“user”: user, “password”: pw},column = ‘hash_code’, lowerBound=1, upperBound=params[‘partition’], numPartitions=partition)df.registerTempTable(“users”)sqlContext.sql(“select * from users where first_name = ‘John’”).show()

Why isn’t the built in to Apache Spark

Related

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store