How to run AWS Glue Crawler
Prepare your data
Today we are going to build an AWS Glue Crawler. The first step is to prepare your data. The data I downloaded comes from Taiwan’s open data platform: https://data.gov.tw/. You may pick any topic you are interested in. In my case, I use the survey on National Health Insurance (NHI), the national survey of the health insurance of individual Taiwanese citizens. The data format should be CSV or JSON.
Upload data to Amazon S3
You need an AWS account. Create an S3 bucket named kevintang.data.lake, then upload the files you just downloaded to S3.
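If you prefer the command line, the same step can be done with the AWS CLI. The bucket name is the one used in this walkthrough; the local filename `nhi_survey.csv` is a placeholder for whatever file you downloaded:

```shell
# Create the bucket used in this walkthrough (skip if it already exists).
aws s3 mb s3://kevintang.data.lake

# Upload the downloaded NHI file; "nhi_survey.csv" is a placeholder name.
aws s3 cp nhi_survey.csv s3://kevintang.data.lake/nhi/

# Confirm the object landed in the bucket.
aws s3 ls s3://kevintang.data.lake/nhi/
```

These commands assume the AWS CLI is installed and configured with credentials that can manage S3.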
Setting up IAM Role
You also need to create an IAM role that allows AWS Glue to access Amazon S3. AWSGlueServiceRole is the first policy you can attach; it is an AWS managed policy. You can then create a second policy, named AWSGlueServiceRole-leo-tokyo in this example.
The policy in AWSGlueServiceRole-leo-tokyo takes the following form, where kevintang.data.lake is the S3 bucket name.
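A minimal sketch of what that policy might look like — the exact actions depend on your needs, but granting read access to the walkthrough’s bucket would be along these lines:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::kevintang.data.lake",
        "arn:aws:s3:::kevintang.data.lake/*"
      ]
    }
  ]
}
```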
Create an AWS Glue Crawler
Go to AWS Glue > Crawlers and click Add crawler
Enter crawler name
Specify crawler source type
Pick the S3 bucket you just created
Pick the IAM role you just created
Pick the frequency
Output the data to a database in the AWS Glue Data Catalog
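The console steps above correspond to Glue’s CreateCrawler API. Below is a minimal sketch of the request you could pass to boto3’s `glue.create_crawler`; the crawler name, database name, and the account ID in the role ARN are hypothetical placeholders, while the role and bucket names follow this walkthrough:

```python
import json

# Parameters for boto3: glue_client.create_crawler(**crawler_params).
# "nhi-crawler", "nhi_db", and the account ID 123456789012 are placeholders.
crawler_params = {
    "Name": "nhi-crawler",
    "Role": "arn:aws:iam::123456789012:role/AWSGlueServiceRole-leo-tokyo",
    "DatabaseName": "nhi_db",  # Data Catalog database the tables go into
    "Targets": {"S3Targets": [{"Path": "s3://kevintang.data.lake/nhi/"}]},
    "Schedule": "cron(0 0 * * ? *)",  # optional: run daily at midnight UTC
}

print(json.dumps(crawler_params, indent=2))
```

With boto3 installed and credentials configured, `boto3.client("glue").create_crawler(**crawler_params)` would create the same crawler the console wizard does.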
Run Crawler
After creating the AWS Glue crawler, you can click “Run crawler” at the top.
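The same run can be triggered from the AWS CLI; the crawler name here is the hypothetical one used in the sketch above:

```shell
# Start the crawler (name is a placeholder from this walkthrough).
aws glue start-crawler --name nhi-crawler

# Check progress; the state returns to READY once the run finishes.
aws glue get-crawler --name nhi-crawler --query 'Crawler.State'
```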
You can also drill down into the logs on the right-hand side. If there are no errors, the crawler has run successfully.
Go back to Data Catalog > Tables; you will see two tables, one for each data format.
Drill into the CSV table and check the table properties and schema. This means you have successfully imported the data into AWS Glue via the crawler. You can use this data for further transformation or analysis. For example, you could pair AWS Glue with AWS Lambda to count how many people visited a clinic more than five times in Taipei City in one year, a possible indicator of a disease outbreak.
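As a toy version of that analysis, here is a minimal sketch that counts people with more than five clinic visits in Taipei City. The column names and sample rows are entirely hypothetical, since the real NHI file layout isn’t shown here:

```python
import csv
import io

# Hypothetical sample in the shape the NHI CSV might take.
sample_csv = """city,person_id,clinic_visits
Taipei City,A001,7
Taipei City,A002,3
New Taipei City,A003,9
Taipei City,A004,12
"""

def count_frequent_visitors(csv_text, city="Taipei City", threshold=5):
    """Count rows for `city` whose clinic_visits exceed `threshold`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(
        1
        for row in reader
        if row["city"] == city and int(row["clinic_visits"]) > threshold
    )

print(count_frequent_visitors(sample_csv))  # 2: persons A001 and A004
```

Inside a Lambda function, the same logic would read the object from S3 (or query the Glue table through Athena) instead of a hard-coded string.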
Thanks for reading. If you have further questions on this topic, please feel free to contact me.