AWS Glue feature overview

Glue is a fully managed ETL service on AWS. It provides crawlers that index data from files in S3 or from relational databases and infer the schema using built-in or custom classifiers. The indexed metadata is stored in the Data Catalog, which can also be used as a Hive metastore. Jobs, written in PySpark and scheduled time-based or event-based, transform the data on a fully managed Spark execution engine. PySpark code can be tested in Zeppelin notebooks or in a REPL shell that connects to the Glue service.

Crawlers

Indexing

  • only indexes data; does not actually process the data
  • can detect both new data and changes to existing data
  • automatically adds new tables, new partitions, new versions of table definitions
  • automatic schema discovery via classifiers
  • semi-structured data (e.g. mixed types in a single column, nested columns) can be automatically converted to structured, relational data (see the Relationalize sketch below)
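
On the ETL side, this flattening is what Glue's Relationalize transform does. A minimal sketch, assuming a hypothetical catalog table mydb.nested_events containing nested JSON and a hypothetical S3 staging path:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import Relationalize

glueContext = GlueContext(SparkContext.getOrCreate())

# Hypothetical catalog table containing nested JSON records
nested = glueContext.create_dynamic_frame.from_catalog(
    database="mydb", table_name="nested_events")

# Flatten nested/array columns into a collection of relational tables;
# the staging path is a hypothetical scratch location in S3
flat = Relationalize.apply(
    frame=nested, staging_path="s3://my-bucket/tmp/", name="root")

print(flat.keys())  # one DynamicFrame per flattened (sub)table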

Crawlable Datastores

  • S3 (Several tables can be created automatically from one bucket)
  • Redshift
  • Amazon Relational Database Service (MySQL, PostgreSQL, Aurora, and MariaDB)
  • JDBC (accessible from VPC)
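
Creating and starting a crawler is a single API call, e.g. via boto3; the crawler name, role, database, and bucket path below are hypothetical:

import boto3

glue = boto3.client("glue")

# Crawl an S3 prefix and register the discovered tables in a catalog database
glue.create_crawler(
    Name="my-s3-crawler",
    Role="arn:aws:iam::111122223333:role/GlueCrawlerRole",
    DatabaseName="mydb",
    Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/"}]})

glue.start_crawler(Name="my-s3-crawler")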

Classifiers

A crawler runs through a prioritized list of classifiers and chooses the first matching classifier to infer the schema and parse the data.

Supported formats

  • CSV, JSON, Avro, Parquet, …
  • Predefined data formats: Apache log, Linux kernel log, Ruby log, …
  • Relational database schemas
  • Supported compression codecs: ZIP, BZIP, GZIP, LZ4, Snappy (not Hadoop Snappy)
    Note: if a file is compressed, it must be downloaded before it can be processed

Custom classifier
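
Custom classifiers are written as grok patterns (XML, JSON, and CSV classifiers are also supported) and attached to a crawler. A minimal boto3 sketch for a hypothetical application log format:

import boto3

glue = boto3.client("glue")

# Custom grok classifier; the name, classification label, and
# log format are hypothetical
glue.create_classifier(
    GrokClassifier={
        "Name": "my-app-log-classifier",
        "Classification": "my-app-log",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}"})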

Data Catalog

Metastore

  • tables and databases are only metadata objects in the Data Catalog; they do not contain the actual data
  • can be used as a drop-in replacement for the Hive metastore in EMR (configuration sketch below)
  • metadata sources: Glue crawlers, Hive DDL, bulk import from an existing Hive metastore
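
For the EMR drop-in replacement, the cluster is launched with a hive-site configuration that points Hive at the Glue Data Catalog:

[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]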

History per table

  • job execution times
  • number of added rows
  • run time
  • schema version history

Partitions

  • only partitioned tables created by crawlers can be used
  • S3 partitions need to have the same file schema, file format and compression format
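
In an ETL job, catalog partitions can be pruned at read time with a pushdown predicate, so only the matching S3 prefixes are loaded. A sketch, assuming a hypothetical table mydb.events partitioned by year and month:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Only the partitions matching the predicate are read from S3
events = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="events",
    push_down_predicate="year == '2018' and month == '06'")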

Table metadata example

{
  "StorageDescriptor": {
    "cols": {
      "FieldSchema": [
        {
          "name": "primary-1",
          "type": "CHAR",
          "comment": ""
        },
        {
          "name": "second",
          "type": "STRING",
          "comment": ""
        }
      ]
    },
    "location": "s3://aws-logs-111122223333-us-east-1",
    "inputFormat": "",
    "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
    "compressed": "false",
    "numBuckets": "0",
    "SerDeInfo": {
      "name": "",
      "serializationLib": "org.apache.hadoop.hive.serde2.OpenCSVSerde",
      "parameters": {
        "separatorChar": "|"
      }
    },
    "bucketCols": [],
    "sortCols": [],
    "parameters": {},
    "SkewedInfo": {},
    "storedAsSubDirectories": "false"
  },
  "parameters": {
    "classification": "csv"
  }
}

ETL

Transform script
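
A transform script reads and writes DynamicFrames, Glue's schema-flexible wrapper around Spark DataFrames. A minimal sketch following the generated-script skeleton; the database, table, column mappings, and output path are hypothetical:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read a table that a crawler registered in the Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database="mydb", table_name="mytable")

# Rename and cast columns: (source name, source type, target name, target type);
# OpenCSVSerde yields strings, hence the string source types here
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[("primary-1", "string", "id", "string"),
              ("second", "string", "value", "string")])

# Write the result back to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-output-bucket/mytable/"},
    format="parquet")

job.commit()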

Development Environment

  • Developer Endpoints to connect to the Glue service (see the sketch after this list)
  • can be tested in a Zeppelin notebook (hosted on EC2) or in a REPL shell
  • can be connected to an IDE
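
A development endpoint can be provisioned via the API; the endpoint name, role, and key file below are hypothetical:

import boto3

glue = boto3.client("glue")

# Provision a development endpoint to connect notebooks / a REPL to Glue
glue.create_dev_endpoint(
    EndpointName="my-dev-endpoint",
    RoleArn="arn:aws:iam::111122223333:role/GlueDevEndpointRole",
    PublicKey=open("my_key.pub").read())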

Error handling

  • retries 3 times before sending error notification
  • statistics and errors are sent to CloudWatch
  • error/success notifications can also be used to trigger Lambda functions (see the event pattern below)
  • filtering bad data is handled automatically
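
Job state changes are emitted as CloudWatch events; a rule matching a pattern like the following can route failures (or successes) to a Lambda function:

{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": {
    "state": ["FAILED"]
  }
}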

Scheduler

Trigger

  • time-based (cron)
  • event-based
  • on-demand
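
These trigger types map directly to the API. A boto3 sketch of the first two, with hypothetical job and trigger names:

import boto3

glue = boto3.client("glue")

# Time-based: run my-etl-job every day at 12:00 UTC
glue.create_trigger(
    Name="daily-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 12 * * ? *)",
    Actions=[{"JobName": "my-etl-job"}],
    StartOnCreation=True)

# Event-based: run downstream-job once my-etl-job has succeeded
glue.create_trigger(
    Name="on-success-trigger",
    Type="CONDITIONAL",
    Predicate={"Conditions": [{
        "LogicalOperator": "EQUALS",
        "JobName": "my-etl-job",
        "State": "SUCCEEDED"}]},
    Actions=[{"JobName": "downstream-job"}],
    StartOnCreation=True)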

Execution

  • smallest possible execution frequency is 5 minutes; for streaming use Kinesis etc.
  • handles dataset dependencies automatically (trigger next step in pipeline on completion etc.)
  • automatic retry
  • Bookmarks store which partitions have already been processed (see the sketch below)
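
Bookmarks are enabled per job with the job argument --job-bookmark-option job-bookmark-enable; inside the script, each source then needs a transformation_ctx so Glue can track what it has already read. A sketch reusing the glueContext setup from the transform script above, with hypothetical table names:

# With bookmarks enabled, this reads only data not seen in earlier runs;
# the transformation_ctx is the key Glue uses to store bookmark state
source = glueContext.create_dynamic_frame.from_catalog(
    database="mydb",
    table_name="events",
    transformation_ctx="source_events")

# job.commit() at the end of the script persists the new bookmark state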