Querying RDS PostgreSQL logs using Amazon Athena

PostgreSQL logs contains useful information which can help in troubleshooting performance issues,auditing and performing historical analysis. Some of items which can be logged are listed below

Client connection and disconnection messages
Failed DB connection request
Database sql statement errors
Database Fatal Errors
Auditing Information
Database Lock history
Long running queries and Query plan logging
High Temp/Sort Usage
Checkpoint Information
Long running Auto-Vaccum operations

RDS PostgreSQL is AWS managed web service which provides access to relational database server but not the underlying hosted operation system. Therefore, we need to interact with DB using AWS console/API’s for any maintenance activity or to view and download logs. If we now want to use the logs for troubleshooting, we will have to download the logs to our machines and go through logs. This is not scalable solution as each user will be copying logs on their desktops. Another challenge is retention of logs, as RDS retains these logs for max 7 days and rotates after it. To solve these challenges, we can devise a way to store these logs on S3 for long term retention and will use Amazon Athena to read these logs directly. Let’s do this now 🙂

As mentioned, PostgreSQL logs can be downloaded using console or API’s e.g you can use download_db_log_file_portion method in boto3 (python library provided by AWS) to download Database logs in 1MB chunk. Here is sample code which uses boto3 to copy these logs (you can use this to build custom solution which suits your need.) We can schedule this as lambda function at regular interval using cloudwatch events.

Following parameters are required to log extra information in PostgreSQL logs.

log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 10240
log_autovacuum_min_duration = 1000
log_min_duration_statement = 5000 [Milli seconds. Modify as per your environment]
pgaudit.log =’ddl,role’
shared_preload_libraries =‘auto_explain,pg_stat_statements,pgaudit’
auto_explain.log_min_duration = 5000 [Milli seconds. Modify as per your environment]
log_duration = off

Typical PostgreSQL logs look like below.

2017-07-30 06:53:50 UTC:10.0.1.2(7969):dba_user@postgres:[8611]:LOG: LOG: duration: 65311.253 ms statement: update test_lock set b=2 where a=2;

Next step is to analyze these logs using Amazon Athena .Amazon Athena is an AWS service , built on presto, which allows to directly read data from S3 and supports various data formats like CSV, JSON, ORC, Avro, and Parquet. We will use Hive regex serde to split the fields into useful column.

CREATE EXTERNAL TABLE `postgres_logs`(
 `logtime` timestamp COMMENT 'Log timestamp', 
 `tz` string COMMENT 'Log timezone', 
 `client` string COMMENT 'Client IP or hostname', 
 `clientport` int COMMENT 'Client port', 
 `username` string COMMENT 'DB username making connection to database', 
 `dbname` string COMMENT ' database name', 
 `serverport` int COMMENT ' server port', 
 `log_level` string COMMENT ' Indicating log level i.e LOG,ERROR,FATAL,DETAILED', 
 `log_type1` string COMMENT ' Classification of event i.e connection, disconnection , audit', 
 `duration` decimal(38,6) COMMENT ' Applicable for timed queries (ms)', 
 `log_type2` string COMMENT '', 
 `message` varchar(40000) COMMENT ' Postgresql log message')
PARTITIONED BY ( 
 `year` int, 
 `month` int, 
 `day` int, 
 `hour` int)
ROW FORMAT SERDE 
 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
 'input.regex'='^(\\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2})\\s(\\S+):\\[?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}|[\\w\\.-]+)?\\]?\\(?(\\d+)?\\)?:\\[?(\\w+)?\\]?@\\[?(\\w+)?\\]?:\\[?(\\d+)?\\]?:(\\w+)?:\\s*(\\w+):?\\s*(\\d+\\.\\d+)?(?:\\s\\w+)?\\s*(\\w+)?:?(.*)', 
 'timestamp.formats'='yyyy-MM-dd HH:mm:ss ') 
STORED AS INPUTFORMAT 
 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
 's3://<bucket>';

In above ddl, we have made it as partitioned table. This helps to minimise the data access which improves query performance. This also reduces AWS bill 🙂 as athena billing is done on amount of data scanned . To make it work, Log copy lambda function will have to modified to store the data in buckets using timestamp prefix e.g s3://db-logs/dbname/2018/04/26/4. You can refer to this blog article for Athena Performance tuning techniques for understanding more on this. We can now read logs from s3 bucket and answer lot of questions.e.g

How many DB connections we had in particular hour?

SELECT count(*) FROM postgres_logs where log_type1='connection' and log_type2='authorized' and year=2018 and month=4 and day=26 and hour=8

Queries which ran more than 5 seconds

SELECT * FROM postgres_logs where duration>5000

Any fatal or error messages in logs?

select * from postgres_logs where log_level in ('ERROR','FATAL')

Querying Audit logs

select * from postgres_logs where log_type1 ='AUDIT'

Logs with queries spawning multiple lines will not be shown properly. I searched for fix but came across article which talks about this being Hive Regex Serde limitation.

Currently Hive does not support recognition of embedded newlines in text formatted data, even via its OpenCSV implementation. This is noted at: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

The reason the RegEx does not work is cause its applied on top of the record reader, which gives the RegEx only a single line input (cause its broken in an upper layer).

Tags: athena, aws, PostgreSQL, rds

Querying RDS PostgreSQL logs using Amazon Athena

Related

Leave a Reply Cancel reply