Accessing S3 data through SQL with Presto
This post aims to cover our initial experience of providing Presto as a service to access S3 data through SQL.
Context
Schibsted is an international media group with 8000 employees in 22 countries, and we manage a huge volume of data across our Marketplace, Media and smart services companies.
Our people have very diverse skill sets; however, one pain point we struggle with is the gap between data and the people who need it. How can we be data-driven and data-informed if we don’t provide the right tools to access the data?
The first step we took to close this gap was to provide Notebooks as a Service through Jupyter. That solution requires Scala/Python & Spark skills, so for users who are more comfortable with SQL, we needed to figure out how to offer data access in a friendly and useful way.
This is where Presto comes onto the scene. After analyzing different existing solutions (e.g. Snowflake, Impala, Redshift Spectrum, Athena & Google Cloud) in a technology assessment covering file system support, resource management, security, interaction, infrastructure, and privacy compliance, we decided to bet on Presto.
Note: If you want to know more about the technology assessment we did to access data, stay tuned: we will publish another article about that soon.
Providing Presto as a Service
To deploy your own Presto cluster, you need to think about how you are going to solve each piece of the puzzle. We stepped back to identify the systems that would make up our service, from the query engine to the system that handles access.
It’s important to know which query engine is going to be used to access the data (Presto, in our case); however, there are several other challenges, such as controlling who can access which data and what each user is allowed to do. Traceability of the system you build also helps you understand how the service is performing.
For Access (Auth*) and Traceability, Presto allows us to define plugins that run inside the Presto cluster. These plugins implement interfaces defined by the SPI contract (Ref: Presto // docs // SPI Overview).
In the next sections we are going to explain how we have solved each of these challenges in our Presto Cluster implementation:
Authentication
Since we provide the Presto service to multiple companies from the Schibsted group, we need to authenticate users. By verifying their identity, we can handle the authorisation to access their company’s data. This step also allows us to prevent impersonation.
To identify which user is querying Presto, we implemented an IAM Authenticator that validates the user keys with an IAM client.
Given an AWS Access Key & AWS Secret Key, the IAM client requests the user information:
{
  "User": {
    "UserName": "albert.franzi",
    "PasswordLastUsed": "2018-05-31T09:15:20Z",
    "CreateDate": "2015-10-13T06:32:28Z",
    "UserId": "XXXXXXXXXXXXXXXX",
    "Path": "/",
    "Arn": "arn:aws:iam::0000000000:user/albert.franzi"
  }
}
The GetUser method returns non-sensitive data, including the user ARN. This UserArn is used to verify the account accessing the Presto cluster, so it’s not enough to present any valid AWS keys: they must belong to an expected AWS account.
To avoid validating the keys against AWS on every request, we use a LoadingCache from Google // Guava. This cache stores an object after executing a load function and removes it once an expiration time has passed, so we avoid keeping outdated users around in memory.
You can check the code on the following Gist // afranzi // Presto - IAM Authenticator.
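To make the caching idea concrete, here is a minimal sketch assuming a hypothetical CachedIamUserResolver class with an AwsKeys holder and a fetchUserArn helper (illustrative names, not our production authenticator; see the Gist above for the real code):

// CachedIamUserResolver.java (illustrative sketch only)
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.identitymanagement.AmazonIdentityManagement;
import com.amazonaws.services.identitymanagement.AmazonIdentityManagementClientBuilder;
import com.amazonaws.services.identitymanagement.model.GetUserRequest;
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.Objects;
import java.util.concurrent.TimeUnit;

public class CachedIamUserResolver
{
    // Immutable cache key holding the caller's AWS key pair.
    public static final class AwsKeys
    {
        final String accessKey;
        final String secretKey;

        AwsKeys(String accessKey, String secretKey)
        {
            this.accessKey = accessKey;
            this.secretKey = secretKey;
        }

        @Override
        public boolean equals(Object o)
        {
            if (!(o instanceof AwsKeys)) {
                return false;
            }
            AwsKeys other = (AwsKeys) o;
            return accessKey.equals(other.accessKey) && secretKey.equals(other.secretKey);
        }

        @Override
        public int hashCode()
        {
            return Objects.hash(accessKey, secretKey);
        }
    }

    // Cache key pair -> user ARN so IAM is not called on every Presto request;
    // entries expire after a while, so outdated users drop out of memory.
    private final LoadingCache<AwsKeys, String> userArnCache = CacheBuilder.newBuilder()
            .expireAfterWrite(15, TimeUnit.MINUTES)
            .build(new CacheLoader<AwsKeys, String>()
            {
                @Override
                public String load(AwsKeys keys)
                {
                    return fetchUserArn(keys);
                }
            });

    public String resolveUserArn(String accessKey, String secretKey)
    {
        return userArnCache.getUnchecked(new AwsKeys(accessKey, secretKey));
    }

    // GetUser, called with the caller's own keys, returns the identity that owns
    // those credentials; the ARN tells us who is really querying Presto and lets
    // us reject keys belonging to an unexpected AWS account.
    private static String fetchUserArn(AwsKeys keys)
    {
        AmazonIdentityManagement iam = AmazonIdentityManagementClientBuilder.standard()
                .withCredentials(new AWSStaticCredentialsProvider(
                        new BasicAWSCredentials(keys.accessKey, keys.secretKey)))
                .build();
        return iam.getUser(new GetUserRequest()).getUser().getArn();
    }
}

In the real plugin, the resolved ARN is what gets checked against the expected AWS account before the principal is accepted.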
Authorization
Once you know who is trying to access the data, you need to authorise the request; that means validating whether that specific user can access the requested data. Since Presto already holds the Principal of the user account launching the query, with some internal patches we managed to assume the user’s role for each query.
This assume-role action delegates the authorization to AWS, where each user role contains its own access rules (e.g. AmazonS3 // Example bucket policies).
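As an illustration of what assuming the user’s role looks like (our actual changes are internal Presto patches), the sketch below obtains per-query session credentials from AWS STS given the role ARN associated with the authenticated principal; the UserRoleAssumer class and the session-name convention are hypothetical:

// UserRoleAssumer.java (illustrative sketch only)
import com.amazonaws.auth.AWSSessionCredentials;
import com.amazonaws.auth.BasicSessionCredentials;
import com.amazonaws.services.securitytoken.AWSSecurityTokenService;
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder;
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest;
import com.amazonaws.services.securitytoken.model.Credentials;

public class UserRoleAssumer
{
    private final AWSSecurityTokenService sts = AWSSecurityTokenServiceClientBuilder.defaultClient();

    // Assume the per-user role: from here on, S3 and Glue access is governed
    // entirely by the policies attached to that role, so authorization is
    // effectively delegated to AWS.
    public AWSSessionCredentials credentialsFor(String userRoleArn, String queryId)
    {
        AssumeRoleRequest request = new AssumeRoleRequest()
                .withRoleArn(userRoleArn)
                .withRoleSessionName("presto-" + queryId) // visible in CloudTrail
                .withDurationSeconds(3600);
        Credentials credentials = sts.assumeRole(request).getCredentials();
        return new BasicSessionCredentials(
                credentials.getAccessKeyId(),
                credentials.getSecretAccessKey(),
                credentials.getSessionToken());
    }
}

Tagging the session name with the query id also makes it easier to correlate CloudTrail events with individual Presto queries.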
Traceability
Presto plugins enable monitoring by tracking all user activity and reporting it to Datadog. This gives us usage data that we can monitor and review through Datadog dashboards.
We report metrics inside the following plugins:
- IamAuthenticator.
- SystemAccessControl: It allows everything, but it acts like a man in the middle, reporting all the user interactions with Presto to Datadog.
- EventListener (Logs): Reports all created and completed queries to Datadog (see the sketch after this list).
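As a rough sketch of the EventListener side, a listener built on the Presto SPI could look like the following; the DatadogEventListener class and the MetricsReporter interface are hypothetical placeholders for whatever client ships the metrics to Datadog, and in a real plugin it would be registered through an EventListenerFactory:

// DatadogEventListener.java (illustrative sketch only)
import com.facebook.presto.spi.eventlistener.EventListener;
import com.facebook.presto.spi.eventlistener.QueryCompletedEvent;
import com.facebook.presto.spi.eventlistener.QueryCreatedEvent;

public class DatadogEventListener
        implements EventListener
{
    // Hypothetical abstraction over the Datadog client (e.g. dogstatsd).
    public interface MetricsReporter
    {
        void increment(String metric, String... tags);
    }

    private final MetricsReporter reporter;

    public DatadogEventListener(MetricsReporter reporter)
    {
        this.reporter = reporter;
    }

    @Override
    public void queryCreated(QueryCreatedEvent event)
    {
        // Tag by user so dashboards can break activity down per account.
        reporter.increment("presto.query.created",
                "user:" + event.getContext().getUser());
    }

    @Override
    public void queryCompleted(QueryCompletedEvent event)
    {
        reporter.increment("presto.query.completed",
                "user:" + event.getContext().getUser(),
                "query_id:" + event.getMetadata().getQueryId());
    }
}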
Metastore
Presto, like Hive, needs a metastore to manage all the metadata. This metastore can be local or remote. In our case we use a remote one, the AWS Glue Metastore, which allows us to centralise it across services and clusters.
# etc/catalog/hive.properties
hive.metastore=glue
hive.metastore.glue.region=eu-west-1
hive.metastore.glue.assume-role=arn:aws:iam::0000000000:role/presto
The current Glue implementation only allows using the instance credentials or static AWS keys. Since our Glue Metastore lives in another AWS account, we decided to extend the Glue support by adding the ability to assume a role (PR: github.com/prestodb/presto/pull/10864).
Assuming an AWS role from another account lets Presto consume the Glue Hive Metastore of that account; otherwise, it uses the one from the account where the instance is running (it would be easier if AWS provided a URL per metastore instead of a global one per account).
Other Patches applied
- Pushdown dereference expression #10064 - Push down all used dereference expressions in the first Project above TableScan
- Support Nested Schema Evolution in Parquet for Presto #10158.
In the end, our Github overview looks like this:
Dev / CI / CD
For development, we used Docker containers, allowing us to set up a local image with Presto and all the required dependencies (e.g. Java 8, Python, Jinja2).
These Docker containers let us experiment and iterate on our local machines without requiring any extra setup.
The Docker image created from Docker-prestodb was later used as the base image for the Presto plugins, simply downloaded from our Artifactory.
In addition, the Presto configuration is generated with Jinja2, a templating language for Python. It allows us to define all the configuration as templates that are filled in and built during the Presto start-up process.
The template structure is quite basic: all files inside the etc.template folder without ‘template’ in their path are copied into the etc folder unchanged, while the files ending with ‘template’ are processed by Jinja.
Our config.properties looks like the following file, where we require the coordinator and HTTP properties to be declared.
# etc.template/config.properties.template
presto.version={{PRESTO_VERSION}}.schibsted
{% include "config.properties.template.coordinator" %}
{% include "config.properties.template.http" %}
{% include "config.properties.template.https" ignore missing %}
...
Then, in the HTTPS configuration file, if HTTPS is defined and enabled, the template fills in the corresponding properties.
# etc.template/config.properties.template.https
{%- if HTTPS_ENABLED is defined and HTTPS_ENABLED %}
http-server.https.enabled=true
http-server.https.port={{HTTPS_PORT}}
http-server.https.keystore.path=/opt/keystore/keystore.jks
http-server.https.keystore.key=presto
{% endif %}
{%- if AUTHENTICATION_TYPE is defined %}
http-server.authentication.type={{AUTHENTICATION_TYPE}}
{% endif %}
The Jinja script can be found in the following Gist: Gist // afranzi // gen_config.py.
For CI, we use Travis to build the Docker image and Debian package.
The Debian build is done with FPM (Packaging made simple). The Debian packages are meant to provide all the required dependencies and configuration to have a Presto cluster working properly.
Then, for CD, we use Spinnaker: by defining a Spinnaker pipeline, we get continuous deployment of our Presto developments into the production cluster.
This pipeline is triggered every time we push commits to the Docker-Prestodb and Presto-Plugins repos, which means every time a Travis build finishes successfully.
Spinnaker allows triggering the pipeline on a Git release or on a specific branch.
All the infrastructure is defined in a CloudFormation template.
The master instance can only be accessed from an external DNS entry over port 443, while the worker instances communicate with the master through an internal DNS entry on port 8080. That allows us to protect all the communication inside the cluster.
All the cluster instances have an instance profile that can assume a role to access the data and the Glue Metastore.
Next steps
- Make more datasets available from other data sources (e.g. Redshift).
- Have an autoscaling cluster, or try to run Presto on our Shared cluster using Apache Slider.
We wrote this post to share the effort we put into getting a Presto cluster up and running and to gather feedback from the community. So feel free to ask us for more information, and don’t hesitate to share your thoughts & experiences about using Presto to access data.
Links of Interest
Special thanks to all the people who shared their work and experiences with Presto.
- Docker // shawnzhu // prestodb — Simple PrestoDB cluster build from Facebook Maven releases.
- Presto Glue Hive Metastore Integration — Documentation about how Glue is integrated into Presto.
- Medium // Presto Parquet Reader
- Github // Teradata // Presto — Github Presto repo from Teradata
- Teradata // Presto // Properties — A list and description of the most important Presto properties that may be used to tune Presto or alter its behaviour when required.
- Teradata // Presto-Admin // EMR — Setting up Presto Admin on an Amazon EMR cluster
- Medium // How to install Presto over a cluster (2014 - Presto 0.60)
- Uber // Engineering Data Analytics with Presto and Parquet at Uber
- Prestodb.io // Presto-YARN Integration