AWS Glue: Crawler Creation (Step-by-Step)

Emrah DABAN
Published in cloudnesil
Jun 1, 2020

AWS Glue is a serverless ETL (extract, transform, and load) service on the AWS cloud. Because there are no servers to provision and you pay only while crawlers and jobs run, Glue is an attractive, cost-effective choice for infrequent ETL pipelines. It makes it easy for customers to prepare their data for analytics.

Components of AWS Glue

  • Data catalog: The data catalog holds the metadata and the structure (schema) of the data.
  • Database: A database is a logical grouping of tables in the data catalog, used by both sources and targets.
  • Table: A table is the metadata definition of a data set. We create one or more tables in a database for the source and target to use.
  • Crawler and Classifier: A crawler scans the data at the source and infers its format and schema using built-in or custom classifiers. It then creates or updates metadata tables in the data catalog.
  • Job: A job is the business logic that carries out an ETL task. This logic is written in Python or Scala and runs on Apache Spark.
  • Trigger: A trigger starts the ETL job execution on demand or at a specific time.

Step-by-step Glue Crawler

Before creating a crawler, we need some data in our S3 bucket. The data does not have to be publicly accessible, because the crawler reads it through an IAM role. I uploaded an example data set called ‘animal.csv’ to my bucket.
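If you prefer to upload the file from code rather than from the console, here is a minimal sketch with boto3. The bucket name in the usage note is a placeholder I made up, and the script assumes boto3 is installed and AWS credentials are already configured.

```python
from pathlib import Path


def s3_uri(bucket: str, key: str) -> str:
    """Build the s3:// URI that the crawler will later be pointed at."""
    return f"s3://{bucket}/{key.lstrip('/')}"


def upload_sample(local_path: str, bucket: str, key: str) -> str:
    """Upload a local file to S3 and return its s3:// URI."""
    import boto3  # imported lazily so s3_uri() works without boto3 installed

    if not Path(local_path).is_file():
        raise FileNotFoundError(local_path)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return s3_uri(bucket, key)
```

For example, `upload_sample("animal.csv", "my-example-bucket", "animal.csv")` would upload the file and return the path to give the crawler (the bucket name is hypothetical).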

Then we can create a crawler…

In this step, we specify the S3 path of our data…

I will choose my ‘animal.csv’ file…

The crawler needs an IAM role to access our S3 bucket. If you don’t have a suitable role, create one from the “Create an IAM role” section.

I’m choosing my role from the “Choose an existing IAM role” section…
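For readers who want to create the role from code, the sketch below shows roughly what such a role needs: a trust policy that lets the Glue service assume it, the AWS managed policy `AWSGlueServiceRole`, and read access to the bucket. The inline policy name `crawler-s3-read` and the helper names are my own, and the exact S3 permissions you need may differ, so treat this as a starting point rather than the console’s exact behavior.

```python
import json

# AWS managed policy with the baseline permissions Glue needs.
GLUE_MANAGED_POLICY_ARN = "arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole"


def glue_trust_policy() -> dict:
    """Trust policy that allows the Glue service to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def s3_read_policy(bucket: str) -> dict:
    """Inline policy granting read-only access to one bucket (an assumption)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


def create_crawler_role(role_name: str, bucket: str) -> str:
    """Create the role, attach both policies, and return the role ARN."""
    import boto3  # imported lazily; requires configured AWS credentials

    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(glue_trust_policy()),
    )
    iam.attach_role_policy(RoleName=role_name, PolicyArn=GLUE_MANAGED_POLICY_ARN)
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="crawler-s3-read",  # hypothetical policy name
        PolicyDocument=json.dumps(s3_read_policy(bucket)),
    )
    return role["Role"]["Arn"]
```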

Then we can adjust our schedule…
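Outside the console, a crawler schedule is written as a cron expression with six fields (minutes, hours, day of month, month, day of week, year), evaluated in UTC. The small helper below is only an illustration of that format; its name is my own. For a “run on demand” crawler, you simply leave the schedule out.

```python
def daily_schedule(hour_utc: int, minute: int = 0) -> str:
    """Return a Glue cron expression that fires once a day at the given UTC time."""
    if not (0 <= hour_utc <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute 0-59")
    # Fields: minutes hours day-of-month month day-of-week year
    return f"cron({minute} {hour_utc} * * ? *)"
```

For example, `daily_schedule(12)` describes a run every day at 12:00 UTC.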

We must have a database for the crawler’s output, so we can either add a new database or choose an existing one. I’ve chosen sampledb, which I created earlier.

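By the way, we don’t choose the table names ourselves: the crawler generates them automatically, based on the S3 path (prefix or folder name) of the data. We can add a prefix to these names…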
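The choices made so far (data path, role, schedule, database, and table prefix) map onto the parameters of the Glue `create_crawler` API. Here is a sketch with boto3; the helper names are my own, and the crawler name, role name, bucket, and prefix in the usage note are placeholders, while `sampledb` and `animal.csv` come from this walkthrough.

```python
from typing import Optional


def crawler_request(
    name: str,
    role: str,
    database: str,
    s3_path: str,
    table_prefix: str = "",
    schedule: Optional[str] = None,
) -> dict:
    """Build the keyword arguments for glue.create_crawler()."""
    request = {
        "Name": name,
        "Role": role,  # role name or ARN
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if table_prefix:
        request["TablePrefix"] = table_prefix
    if schedule:
        # Omitting Schedule leaves the crawler as "run on demand".
        request["Schedule"] = schedule
    return request


def create_crawler(request: dict) -> None:
    """Make sure the output database exists, then create the crawler."""
    import boto3  # imported lazily; requires configured AWS credentials

    glue = boto3.client("glue")
    try:
        glue.create_database(DatabaseInput={"Name": request["DatabaseName"]})
    except glue.exceptions.AlreadyExistsException:
        pass  # database was created earlier, as in this walkthrough
    glue.create_crawler(**request)
```

A call such as `create_crawler(crawler_request("animal-crawler", "my-glue-role", "sampledb", "s3://my-example-bucket/animal.csv", table_prefix="demo_"))` would mirror the console steps above.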

Finally, our crawler is created. If you chose the “Run on demand” schedule, the crawler does not start by itself: select it and run it manually. When the run finishes, it creates a table under the database you chose.
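Running an on-demand crawler can also be scripted. The sketch below starts the crawler, polls its state until it returns to `READY`, and then lists the tables in the database. The function names, polling interval, and timeout are my own choices, and `get_tables` is read without pagination here, which is fine for a small demo database.

```python
import time
from typing import Callable, List


def wait_until_ready(
    get_state: Callable[[], str],
    poll_seconds: float = 10.0,
    timeout_seconds: float = 600.0,
) -> None:
    """Poll get_state() until it returns "READY" or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while get_state() != "READY":
        if time.monotonic() >= deadline:
            raise TimeoutError("crawler did not finish in time")
        time.sleep(poll_seconds)


def run_crawler(name: str, database: str) -> List[str]:
    """Start the crawler, wait for it to finish, and return the table names."""
    import boto3  # imported lazily; requires configured AWS credentials

    glue = boto3.client("glue")
    glue.start_crawler(Name=name)
    # The crawler state goes RUNNING -> STOPPING -> READY.
    wait_until_ready(lambda: glue.get_crawler(Name=name)["Crawler"]["State"])
    tables = glue.get_tables(DatabaseName=database)["TableList"]  # first page only
    return [table["Name"] for table in tables]
```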

Hopefully this article gives you the information you need on AWS Glue crawlers. I will also publish an article soon about using AWS Glue with the Scala SDK.
