Audit RedShift Historical Queries With pgbadger

Bhuvanesh · Mar 9

AWS RedShift is one of the most commonly used services in data analytics. In a busy RedShift cluster we run tons of queries every day. As a data warehouse admin, you can do real-time monitoring with the nice graphs provided by AWS, and the historical data is available on the console, so at any time we can go and search the queries. But often we don't need to see all the queries; we just need a consolidated report of the queries in a particular time frame, and the console doesn't let us filter the queries based on users. It's always a good practice to audit RedShift historical queries, which helps you understand who is running what kind of queries.

PgBadger:

Those of you with experience running PostgreSQL in production may have heard about pgbadger. It's an open-source tool to analyze PostgreSQL logs, and it gives you a nice overview of the PostgreSQL cluster, including query metrics. Since RedShift has PostgreSQL under the hood, we used pgbadger to explore and analyze RedShift logs.

Get the Logs:

In RedShift we can export all the queries that ran in the cluster to an S3 bucket, but it's not in real time: every hour we get the previous hour's logs. There are three different log files.

  • Connection log — logs authentication attempts, and connections and disconnections.
  • User log — logs information about changes to database user definitions.
  • User activity log — logs each query before it is run on the database.

We are only interested in analyzing the SQL queries, so we can parse the user activity log alone and ignore the rest for now. This log is not enabled by default; it needs to be enabled manually.

Enable the activity log:

  • Create a custom parameter group.
  • Set enable_user_activity_logging to true.
  • Then go to your logging S3 bucket and attach the bucket policy below. Make sure you replace the bucket name and the region-acc-id with the proper AWS account ID for your region, which is listed in the AWS documentation.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Put bucket policy needed for audit logging",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::region-acc-id:user/logs"
      },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::your-logging-bucket/*"
    },
    {
      "Sid": "Get bucket policy needed for audit logging",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::region-acc-id:user/logs"
      },
      "Action": "s3:GetBucketAcl",
      "Resource": "arn:aws:s3:::your-logging-bucket"
    }
  ]
}
  • Then go to your cluster → Maintenance and monitor → Audit logging and enable it.
  • Once it's done, the log files will start appearing in the bucket within the next hour. The same steps can also be scripted with the AWS CLI, as sketched below.
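Here is a minimal AWS CLI sketch of those steps. The parameter group name, cluster identifier, bucket name and key prefix are placeholders for your own values, and enable_user_activity_logging is a static parameter, so the cluster needs a reboot for it to take effect.

# Create a custom parameter group and turn on user activity logging
aws redshift create-cluster-parameter-group \
    --parameter-group-name redshift-audit-params \
    --parameter-group-family redshift-1.0 \
    --description "Custom parameter group with user activity logging"

aws redshift modify-cluster-parameter-group \
    --parameter-group-name redshift-audit-params \
    --parameters ParameterName=enable_user_activity_logging,ParameterValue=true

# Attach the parameter group to the cluster (reboot required for the static parameter)
aws redshift modify-cluster \
    --cluster-identifier my-redshift-cluster \
    --cluster-parameter-group-name redshift-audit-params

# Point the cluster's audit logs at the logging bucket (the bucket policy above must already be in place)
aws redshift enable-logging \
    --cluster-identifier my-redshift-cluster \
    --bucket-name your-logging-bucket \
    --s3-key-prefix redshift-logs/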

Install PgBadger:

pgbadger is available in the official PostgreSQL repositories.

# Ubuntu
apt-get install pgbadger
# CentOS
yum install pgbadger

Download the log files:

The logs are stored in a partitioned format (yyyy/mm/dd). In our case we do this analysis on a daily basis, so we download the files once a day (UTC time).
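As a rough sketch (the bucket name, key prefix and local path are placeholders, and the date arithmetic assumes GNU date), the daily download can be a single aws s3 cp with an include filter for the user activity log:

#!/bin/bash
# Download the previous day's user activity logs from the audit bucket (placeholder names).
DAY=$(date -u -d "yesterday" +%Y/%m/%d)
aws s3 cp "s3://your-logging-bucket/redshift-logs/" /tmp/redshift-logs/ \
    --recursive \
    --exclude "*" \
    --include "*${DAY}*useractivitylog*"

# The log files arrive gzipped; unpack them before the reformatting step below.
find /tmp/redshift-logs -name "*.gz" -exec gunzip -f {} \;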

Update: the RedShift log format is now officially supported by pgbadger (see the update below).

The challenge here was that the native format of the RedShift logs is not accepted by pgbadger, but with a couple of changes we can make it parsable.

'2020-03-07T14:42:14Z UTC [ db=dev user=rdsdb pid=16750 userid=1 xid=5301 ]' LOG: SELECT 1
  • 2020-03-07T14:42:14Z — remove the T and the Z
  • LOG: — add statement: after it

This line should be formatted as below.

'2020-03-07 14:42:14 UTC [ db=dev user=rdsdb pid=16750 userid=1 xid=5301 ]' LOG: statement: SELECT 1
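Both changes can be scripted with sed. The command below is only a sketch (GNU sed, in-place edit, and the file name is a placeholder); multi-line queries may need extra care, so test it on a copy of a real log first.

# Strip the T and Z from the timestamp and insert "statement:" after LOG:
sed -i \
    -e "s/^'\([0-9-]*\)T\([0-9:]*\)Z /'\1 \2 /" \
    -e "s/]' LOG: /]' LOG: statement: /" \
    useractivitylog_file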

Host the HTML file:

Once pgbadger has analyzed the file, it generates the output as an HTML file. To view it, we can host it on a tiny EC2 instance or use S3 static website hosting. Here we used S3 static hosting to avoid unnecessary costs.

Caution: opening this data to the public is not recommended, so apply proper security hardening.
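As an illustration only (the report bucket name is a placeholder, and out.html is pgbadger's default output file name in our setup), publishing the report can be a couple of CLI calls; keep the bucket restricted by a bucket policy rather than world-readable.

# One-time: enable static website hosting on the reports bucket
aws s3 website s3://your-report-bucket/ --index-document index.html

# After every pgbadger run, upload the generated HTML report
aws s3 cp out.html s3://your-report-bucket/index.html --content-type text/html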

Script to download and parse the logs:

Update: 2020-03-24

The Redshift log format is now officially supported by pgbadger. So once you have downloaded the log file, instead of customizing it, we can run the following command to generate the report.

pgbadger -f redshift log_file

Reports:

Now you can hit the S3 URL to view your reports. Since we mainly care about the query-level metrics, go directly to the queries tab.

Conclusion:

If you want the analysis every hour, download the new log files each hour (you can use S3 event triggers). We refresh the data on a daily basis, but each day we only want to see the last 24 hours of data. If you want to keep the past N days or months, use the --incremental option. To learn more about the pgbadger options, read its documentation page.
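For example, a daily job could look roughly like the sketch below; download_logs.sh, the local paths and the bucket name are placeholders for your own setup.

#!/bin/bash
# Daily job (e.g. run from cron): refresh the incremental pgbadger report and publish it.
/opt/redshift-audit/download_logs.sh
pgbadger -f redshift --incremental --outdir /var/www/pgbadger \
    $(find /tmp/redshift-logs -name "*useractivitylog*")
aws s3 sync /var/www/pgbadger s3://your-report-bucket/ --delete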
