How to run AWS Glue Crawler

Kevin Tang
Published in NTT DATA Cloud
Jan 22, 2021

Prepare your data

Today we are going to create an AWS Glue crawler. The first step is to prepare your data. I downloaded mine from Taiwan’s open data platform, https://data.gov.tw/, but you may pick any topic you are interested in. In my case, I use the survey on National Health Insurance (NHI), a national survey of the health insurance of individual Taiwanese citizens. The data should be in CSV or JSON format.

Upload data to Amazon S3

You need an AWS account. Create an S3 bucket (mine is named kevintang.data.lake; bucket names are globally unique, so choose your own), then upload the files you have just downloaded to it.
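If you prefer to script the upload instead of using the console, the step above can be sketched with boto3 as follows. This is a minimal sketch, not the way the files in this article were actually uploaded: the local folder, the nhi prefix, and the idea of keeping one folder per data format are my assumptions, and the client is passed in (for example boto3.client("s3")) so that the planning part can be tried without AWS credentials.

```python
from pathlib import Path

BUCKET = "kevintang.data.lake"  # bucket name used in this article; replace with your own


def plan_uploads(local_dir, prefix="nhi"):
    """Map each CSV/JSON file in local_dir to an S3 key, grouped by format.

    The "nhi" prefix and the per-format folders are assumptions for illustration.
    """
    plan = []
    for path in sorted(Path(local_dir).iterdir()):
        suffix = path.suffix.lower().lstrip(".")
        if path.is_file() and suffix in ("csv", "json"):
            plan.append((str(path), f"{prefix}/{suffix}/{path.name}"))
    return plan


def upload_all(s3_client, local_dir, bucket=BUCKET, prefix="nhi"):
    """Upload every planned file; s3_client is e.g. boto3.client("s3")."""
    for filename, key in plan_uploads(local_dir, prefix):
        s3_client.upload_file(filename, bucket, key)
```

Calling upload_all(boto3.client("s3"), "./data") would then send the files to the bucket, assuming your credentials are allowed to write to it.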

Setting up IAM Role

You also need to create an IAM role that allows AWS Glue to access Amazon S3. AWSGlueServiceRole is the first policy you can apply; it is an AWS managed policy. You can then create a second policy, named AWSGlueServiceRole-leo-tokyo in this example.

The policy in AWSGlueServiceRole-leo-tokyo is scoped to the S3 bucket, which is kevintang.data.lake in this example.
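The policy itself was shown as a screenshot, so the exact statements are not reproduced in the text. As a rough sketch of what such a bucket-scoped policy usually looks like, the snippet below builds a policy document that grants object-level read and write access to one bucket. The two actions listed are my assumption, not a copy of the original policy.

```python
import json


def s3_access_policy(bucket):
    """Build an IAM policy document granting object access to a single bucket.

    The action list is an assumption for illustration; adjust it to your needs.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            }
        ],
    }


# Print the JSON you could paste into the IAM policy editor.
print(json.dumps(s3_access_policy("kevintang.data.lake"), indent=2))
```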

Create AWS Crawler

Open Crawlers in the AWS Glue console and click Add crawler

Enter a crawler name

Specify the crawler source type

Pick the S3 bucket you have just created

Pick the IAM role you have just created

Pick the run frequency

Output the data to a database in the AWS Glue Data Catalog
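The console steps above map fairly directly onto the parameters of the Glue create_crawler API call. The following is a sketch under assumptions: the crawler name nhi-crawler and the database name nhi_db are hypothetical, and I assume the role carries the same name as the policy in this example. Leaving out the schedule corresponds to running the crawler on demand.

```python
def crawler_settings(name, role, database, bucket, schedule=None):
    """Collect the arguments for glue_client.create_crawler()."""
    settings = {
        "Name": name,
        "Role": role,              # IAM role name or ARN
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": f"s3://{bucket}/"}]},
    }
    if schedule:
        # Cron expression such as "cron(0 12 * * ? *)"; omit to run on demand.
        settings["Schedule"] = schedule
    return settings


def create_crawler(glue_client, **kwargs):
    """Create the crawler; glue_client is e.g. boto3.client("glue")."""
    glue_client.create_crawler(**crawler_settings(**kwargs))


# Hypothetical names, for illustration only.
example = crawler_settings(
    name="nhi-crawler",
    role="AWSGlueServiceRole-leo-tokyo",
    database="nhi_db",
    bucket="kevintang.data.lake",
)
```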

Run Crawler

After creating the AWS Glue crawler, click “Run crawler” at the top

You can also drill down into the logs on the right-hand side. If there are no errors, the crawler has run successfully.
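The same run-and-check step can be scripted. The sketch below starts the crawler, polls until it returns to the READY state, and then reports the status of the last crawl. The polling interval and timeout are arbitrary values I picked, and the crawler name is whatever you entered earlier.

```python
import time


def run_crawler(glue_client, name, poll_seconds=15, timeout_seconds=900):
    """Start a crawler and wait until it is idle again.

    Returns the last crawl status reported by Glue (for example "SUCCEEDED"),
    or None if Glue reports no last crawl. glue_client is e.g.
    boto3.client("glue").
    """
    glue_client.start_crawler(Name=name)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        crawler = glue_client.get_crawler(Name=name)["Crawler"]
        if crawler["State"] == "READY":  # no longer RUNNING or STOPPING
            return crawler.get("LastCrawl", {}).get("Status")
        time.sleep(poll_seconds)
    raise TimeoutError(f"crawler {name!r} did not finish in {timeout_seconds}s")
```

A returned status other than SUCCEEDED is a hint to look at the crawler logs.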

Go back to Data Catalog > Tables and you will see two tables, one for each data format.
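To look at the same tables without the console, you can read them back from the Data Catalog. This sketch lists each table with the classification the crawler recorded and its column names and types. The database name is whichever one you chose as the crawler output, and treating classification as always present is an assumption, so the code falls back to None.

```python
def list_tables(glue_client, database):
    """Return (table name, classification, [(column, type), ...]) per table.

    glue_client is e.g. boto3.client("glue").
    """
    tables = []
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            tables.append(
                (
                    table["Name"],
                    table.get("Parameters", {}).get("classification"),
                    [(col["Name"], col["Type"]) for col in columns],
                )
            )
    return tables
```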

Drill down into the CSV table and check its properties and schema. This means that the crawler has successfully cataloged your data in AWS Glue. You can now use this data for further transformation or analysis. For example, you can combine AWS Glue with AWS Lambda to parse the data and count how many people in Taipei City visited a clinic more than 5 times in one year, which may indicate the severity of a disease outbreak.
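To make the clinic example concrete, here is a small sketch of the counting logic such a Lambda function might contain. The column names city and clinic_visits are hypothetical, and the sample rows are invented for illustration only; they are not taken from the NHI survey.

```python
import csv
import io


def count_frequent_visitors(rows, city="Taipei", threshold=5):
    """Count rows for the given city with more than `threshold` clinic visits."""
    count = 0
    for row in rows:
        if row["city"] == city and int(row["clinic_visits"]) > threshold:
            count += 1
    return count


# Invented sample rows; the real survey uses its own column names and values.
sample = io.StringIO(
    "city,clinic_visits\n"
    "Taipei,7\n"
    "Taipei,2\n"
    "Kaohsiung,9\n"
    "Taipei,6\n"
)
print(count_frequent_visitors(csv.DictReader(sample)))  # prints 2
```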

Thanks for reading. If you have further questions about this topic, please feel free to contact me.
