sparkavro: Manipulate Apache Avro files with sparklyr

Aki Ariga
Published in Democratizing Data
Mar 26, 2017

I created a simple sparklyr extension for handling Apache Avro files. It is a thin wrapper around Databricks' spark-avro, and it is listed in the official documentation of sparklyr extensions.

Installation

Use {devtools} to install sparkavro.

devtools::install_github("chezou/sparkavro")

Simple usage

You can read and write Avro files as follows:

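library(sparklyr)
library(sparkavro)
# Connect to the cluster; replace HOST:PORT with your Spark master address
sc <- spark_connect(master = "spark://HOST:PORT")
# Read an Avro file and register it in Spark as "test_table"
df <- spark_read_avro(sc, "test_table", "/user/foo/test.avro")
# Write the data back out in Avro format
spark_write_avro(df, "/tmp/output")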
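
If you do not have a cluster at hand, a round trip in local mode is a quick way to try the two functions. The following is only a minimal sketch under a few assumptions that are not part of the original post: a local Spark installation of a version that spark-avro supports, the dplyr package for `copy_to()`, and the arbitrary names `mtcars_tbl`, `mtcars_avro`, and `out_path`.

```r
library(sparklyr)
library(sparkavro)  # loaded before connecting, as in the example above
library(dplyr)

# Assumption: Spark is installed locally (for example via sparklyr::spark_install()).
sc <- spark_connect(master = "local")

# Copy a small built-in data frame into Spark; the table name is arbitrary.
cars_tbl <- copy_to(sc, mtcars, "mtcars_tbl", overwrite = TRUE)

# Write to a fresh temporary path so that no existing output is in the way.
out_path <- tempfile("avro_out_")
spark_write_avro(cars_tbl, out_path)

# Read the Avro output back and register it under a new table name.
cars_back <- spark_read_avro(sc, "mtcars_avro", out_path)

# Compare row counts before and after the round trip.
n_written <- sdf_nrow(cars_tbl)
n_read <- sdf_nrow(cars_back)
stopifnot(n_written == n_read)

spark_disconnect(sc)
```

Writing to a temporary path keeps the sketch free of write options, which is the area where this first version is most likely to have rough edges.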

This is the very first version, so there may be bugs, especially around options. If you find one, please open an issue on GitHub.

Aki Ariga
Democratizing Data

ML Engineer at Arm Treasure Data. Previously Cloudera. Love machine learning, data analysis, Ruby and Python.