PostgreSQL

Identifying PostgreSQL Bloat and fixing it without downtime

MVCC (Multi-Version Concurrency Control) feature allows databases to provide concurrent access to data. This allows each SQL statement to see a snapshot of data as it was some time ago, regardless of the current state of the underlying data. This prevents statements from viewing inconsistent data produced by concurrent transactions performing updates on the same data rows, providing transaction isolation for each database session. To summarize “Readers don’t block writers and writers don’t block readers”

If you are coming from Oracle or MySQL background, you would be aware that during an update/delete ,DML activity will make changes to rows and use separate store called “Undo” to track the old image of data. If any other session want to get consistent image, then database uses undo to provide consistent snapshot of data. Like many other databases, PostgreSQL also supports MVCC but takes different approach to store the old changes.

In PostgreSQL, update or delete of a row (tuple in PostgreSQL) does not immediately remove the old version of the row. When you update a row, it will create a copy of the row with new changes and insert in the table. Then, it will update the old copy header to make it invisible for future transactions. Similarly for delete, it won’t delete the row but update metadata to make them invisible. Eventually, these old rows will no longer be required by transactions and will have to be cleaned up. This cleanup is handled by “Vacuum”. Note that apart from increasing the total database size, table or index bloat also impacts query performance as database need to process bigger objects.

=>vacuum table 'bloated_table';

If you run above command, it will remove dead tuples in tables and indexes and marks the space available for future reuse. But this will not release the space to operating system. As vacuum is manual approach, PostgreSQL has a background process called “Autovacuum” which takes care of this maintenance process automatically. Autovacuum process to delete rows is controlled by 2 parameters autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor (There are other parametes like autovacuum_freeze_max_age which can trigger autovacuum)

vacuum threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of tuples

You can find this values by querying pg_settings. For RDS, autovacuum_vacuum_threshold is 50 rows and autovacuum_vacuum_scale_factor is 0.1 i.e 10% of the table size.

select name,setting from pg_settings where name in ('autovacuum_vacuum_threshold','autovacuum_vacuum_scale_factor');
name | setting
--------------------------------+---------
autovacuum_vacuum_scale_factor | 0.1
autovacuum_vacuum_threshold | 50

This means that if there is table with 100 M rows, you should have ~10M changes ( 50+0.1*100M ) before autovacuum is triggered. Additionally, there are limited number of autovacuum worker processes and if autovacuum is not tuned properly, table could have much higher dead rows. It’s advisable to reduce the scale factor to lower value, either at table level or at database level to prevent bloat. You would also need to tune the autovacuum process settings to improve the cleanup process. There is an excellent blog article titled “A Case Study of Tuning Autovacuum in Amazon RDS for PostgreSQL” on AWS database blog which helps to tune autovacuum.

If you wish to reclaim OS space, then you need to execute “Vacuum Full” which will compact tables by writing a complete new version of the table file with no dead tuples. But this comes at a cost. Vacuum full requires “Exclusive lock” on the table and blocks any DML on the table, leading to downtime. This can be problematic as large tables with multiple indexes can take really long time (7-8 hours) to rebuild.

Enter pg_repack !! This utility helps to perform Full vacuum without downtime by making use of trigger to take care of changes happening on parent table. Below table compares the internal working on Table vs Index rebuilds

Table Rebuild Index Rebuild
To perform a full-table repack, pg_repack will:

1. create a log table to record changes made to the original table
2. add a trigger onto the original table, logging INSERTs, UPDATEs and DELETEs into our log table
3.create a new table containing all the rows in the old table
4. build indexes on this new table
5. apply all changes which have accrued in the log table to the new table
6. swap the tables, including indexes and toast tables, using the system catalogs
7. drop the original table

pg_repack will only hold an ACCESS EXCLUSIVE lock for a short period during initial setup (steps 1 and 2 above) and during the final swap-and-drop phase (steps 6 and
For the rest of its time, pg_repack only needs to hold an ACCESS SHARE lock on the original table, meaning INSERTs, UPDATEs, and DELETEs may proceed as usual.
To perform an index-only repack, pg_repack will:

1. create new indexes on the table using CONCURRENTLY matching the definitions of the old indexes
2. swap out the old for the new indexes in the catalogs
3. drop the old indexes

Identifying Bloat!

I have used table_bloat_check.sql and index_bloat_check.sql to identify table and index bloat respectively. Below snippet displays output of table_bloat_check.sql query output. As per the results, this table is around 30GB and we have ~7.5GB of bloat.

databasename | schemaname | tablename | can_estimate | est_rows  | pct_bloat | mb_bloat | table_mb
--------------+------------+-----------+--------------+-----------+-----------+----------+-----------
postgres | askdba | bloated | t | 547962000 | 25 | 7828.23 | 30877.930

Similarly, when we run index_bloat_check.sql query to identify index bloat , we see that pkey_bloated is 65G and has bloat of ~54GB.

database_name | schema_name |      table_name      |        index_name         | bloat_pct | bloat_mb |  index_mb  |  table_mb  | index_scans
---------------+-------------+----------------------+---------------------------+-----------+----------+------------+------------+-------------
postgres | askdba | bloated | pkey_bloated | 83 | 54125 | 65137.813 | 30877.930 | 3820674

Let’s use pg_repack to clear this bloat. To use pg_repack, you need to install extension and a client utility. RDS PostgreSQL supports pg_repack for installations having version of 9.6.3 and higher.Extension version will differ depending on your RDS PostgreSQL version. e.g RDS PostgreSQL 9.6.3 installs pg_repack 1.4.0 extension, whereas 9.6.11 installs pg_repack 1.4.3 extension.

To create extension, connect as master user for RDS database and run create extension command,

postgres=>  create extension pg_extension;
postgres=> \dx pg_repack
List of installed extensions
Name | Version | Schema | Description
-----------+---------+--------+--------------------------------------------------------------
pg_repack | 1.4.3 | public | Reorganize tables in PostgreSQL databases with minimal locks

To install pg_repack client, download the tar bar from here and build the utility. You need to ensure that extension and client utility version matches. pg_repack provides option to perform full vacuum at table level, index level or table+index . If you want to perform vacuum full for table and associated indexes, then it can be done by using below statement

./pg_repack -h [rds-host-name] -p [dbPort] -d [dbname] -U [superuser] --table=[schema.tablename] --no-order -k
Password:
Password:

-k flag is important as RDS master user does not have Postgresql superuser role and omitting this option leads to error “ERROR: pg_repack failed with error: You must be a superuser to use pg_repack”

To run index-only repack, use following statement

./pg_repack -h [rds-host-name] -p [dbPort] -d [dbname] -U [superuser] --table=[schema.tablename] --only-indexes -k

Above statement will create new indexes and will drop the older indexes after all indexes are recreated. If you are performing this action on really big tables, it will take time and will slow down DML activity on the table as you will have 2*n-1 indexes before last one is created (n=number of indexes).Therefore, if there are multiple indexes on the table, it would be better to recreate index one by one using –index clause

./pg_repack -h [rds-host-name] -p [dbPort] -d [dbname] -U [superuser] --index=[schema.indexname] -k

In my scenario, I went with table+index vacuum option.After rebuild, actual table size reduction was 10% instead of 24% and for index , it was 75% instead of 85%. As you can see, there could be 10-20% variance between actual object size (post vacuum) vs estimated size. Therefore, it would be good to carry out testing in clone environment before committing on size savings.

Object Name Original Size Estimated Size after removing bloat Size Actual size after rebuild
bloated 30GB 23GB 27G
pkey_bloated 64GB 10GB 16GB

Pg_repack creates the objects under repack schema and later moves it to the correct schema. To monitor the pg_repack session, use pg_stat_activity view

select * from pg_stat_activity where application_name='pg_repack'

Speeding up Initial data load for Oracle to PostgreSQL using Goldengate and copy command

Oracle Goldengate supports Oracle to PostgreSQL migrations by supporting PostgreSQL as a target database, though reverse migration i.e PostgreSQL to Oracle is not supported. One of the key aspect of these database migrations is initial data load phase where full tables data have to copied to the target datastore. This can be a time consuming activity with time taken to load varying based on the table sizes. Oracle suggests to use multiple Goldengate processes to improve the database load performance or to use native database utilities to perform faster bulk-loads.

To use a database bulk-load utility, you use an initial-load Extract to extract source records from the source tables and write them to an extract file in external ASCII format. The file can be read by Oracle’s SQL*Loader, Microsoft’s BCP, DTS, or SQL Server Integration Services (SSIS) utility, or IBM’s Load Utility (LOADUTIL).

Goldengate for PostgreSQL doesn’t provide native file loader support like bcp for MS SQL and sqlloader for Oracle. As an alternative, we can use FORMATASCII option to write data into csv files (or any custom delimiter) and then load them using PostgreSQL copy command .This approach is not automated approach and you will have to ensure that all files are loaded into target database.

In this post, we will evaluate 2 approaches i.e using Multiple replicat Processes and using ASCII dump files with PostgreSQL copy command to load data and compare their performance. Below diagram shows both the approaches

Description of initsyncbulk.jpg follows
Ref: -https://docs.oracle.com/goldengate/1212/gg-winux/GWUAD/wu_initsync.htm#GWUAD561

To compare the scenarios, I created a test table with 200M rows(12GB) and used a RDS PostgreSQL instance (db.r3.4xlarge with 10k PIOPS)

CREATE TABLE scott.Big_table (
id NUMBER,
small_number NUMBER(5),
big_number NUMBER,
short_string VARCHAR2(50),
created_date DATE,
CONSTRAINT big_table_pkey PRIMARY KEY (id)
) tablespace LRG_TBSP;

--create sequence for PK

create sequence scott.big_table_seq start with 1 increment by 1 cache 500 nomaxvalue;
-- Load data
INSERT /*+ APPEND */ INTO scott.Big_table
SELECT scott.big_table_seq.nextval AS id,
TRUNC(DBMS_RANDOM.value(1,5)) AS small_number,
TRUNC(DBMS_RANDOM.value(100,10000)) AS big_number,
DBMS_RANDOM.string('L',TRUNC(DBMS_RANDOM.value(10,50))) AS short_string
TRUNC(SYSDATE + DBMS_RANDOM.value(0,366)) AS created_date
FROM dual
CONNECT BY level <= 10000;
COMMIT;

INSERT /*+ APPEND */ INTO scott.Big_table
SELECT scott.big_table_seq.nextval AS id,
small_number,
big_number,
short_string,
long_string,
created_date
FROM scott.Big_table;
COMMIT;

--PostgreSQL Table

CREATE TABLE booker.Big_table (id bigint,
small_number int,
big_number bigint,
short_string VARCHAR(50),
created_date TIMESTAMP,
CONSTRAINT big_table_pk PRIMARY KEY (id) )

Approach 1 : Using Oracle Goldengate multiple replicat processes to load data

In this approach, I used multiple Oracle Goldengate Replicat processes (8) using @range filter to load data into PostgreSQL.

We were able to get 5k inserts/sec per thread and were able to load the table in ~88 mins with 8 replicat processes.
One key point to remember is that if you are working with EC2 and RDS databases, you should have EC2 machine hosting trail files and RDS instance in same AZ. During the testing, we noticed that insert rate dropped drastically (~800 insert per sec) when using cross AZ writes. Below is replicat parameter file used for performing data load.

SpecialRUN
END Runtime
SETENV (NLSLANG=AL32UTF8)
SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
SETENV ( PGCLIENTENCODING = "UTF8" )
SETENV (ODBCINI="/opt/app/oracle/product/ggate/12.2.0.1/odbc.ini" )
TARGETDB GG_Postgres, USERIDALIAS pguser
Extfile /fs-a01-a/databases/ggate/initload/i5
HANDLECOLLISIONS
DISCARDFILE /opt/app/oracle/product/ggate/12.2.0.1/direrr/rinit1.dsc, APPEND, megabytes 20
reportcount every 60 seconds, rate
BATCHSQL;
MAP scott.big_table, TARGET scott.big_table, FILTER (@RANGE (1,8));;

You will need to create additional replicat process files by making change to the range clause.e.g FILTER (@RANGE (2,8)), FILTER (@RANGE (3,8)), etc.

Approach 2: Data load using PostgreSQL copy command

In second approach, we used parameter file with FORMATASCII option(refer to below snippet) for creating a Goldengate Extract process which dumped the data with ‘|’ delimiter and then used PostgreSQL copy command to load data from these dump files.

Extract Parameter file

SOURCEISTABLE
SETENV (ORACLE_SID=ggpoc)
SETENV (NLSLANG=AL32UTF8)
SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
SETENV (ORACLE_HOME=/opt/app/oracle/product/11.2.0.4/A10db)
SETENV (TNS_ADMIN=/opt/app/oracle/local/network/)
USERIDALIAS gguser
RMTHOST xx.222.xx.78, MGRPORT 8200, TCPBUFSIZE 100000, TCPFLUSHBYTES 300000
RMTHOSTOPTIONS ENCRYPT AES256
FORMATASCII,NOHDRFIELDS,NOQUOTE,NONAMES, DELIMITER '|'
RMTFILE /fs-a01-a/databases/ggate/initload/i4 , megabytes 1000
reportcount every 60 seconds, rate
TABLE scott.BIG_TABLE;

With above parameter file, Goldengate Extract process would send data to remote system and store the data in dump files. These files are then loaded into PostgreSQL using \copy command.

psql> \copy scott.big_table from '/fs-a01-a/databases/ggate/initload/i4000000' with DELIMITER '|';

Data load took  21 mins, which is nearly 4x faster than initial approach. If you remove the Primary key index, then it drops the time taken to ~9 mins to load 200M POC table.

 

Update:

Oracle GoldenGate 19.1 comes with DataDirect 7.1 PostgreSQL Wire Protocol ODBC driver for PostgreSQL connectivity. You can now add a parameter “BatchMechanism=2” to speed up the inserts. After setting this parameter, odbc driver will start batching inserts into memory buffer and will insert them together instead of single row inserts. You can find details here

To add the parameter, update odbc.ini and add BatchMechanism under the database section

 

[apgtarget]
Driver=/oracle/product/ggate/19.1.0.0/lib/GGpsql25.so
Description=DataDirect 7.1 PostgreSQL Wire Protocol
BatchMechanism=2
Database=pgbench
HostName=xxx.xxxxxx.us-east-1.rds.amazonaws.com
PortNumber=5432


In my testing, I saw that insert rate increased from 5.3K rows per second to 35K i.e nearly 7x increase. I also noticed that WriteIOPS on this Aurora Instance increased from 20K to 80-100K IOPS

Querying RDS PostgreSQL logs using Amazon Athena

PostgreSQL logs contains useful information which can help in troubleshooting performance issues,auditing and performing historical analysis. Some of items which can be logged are listed below

  • Client connection and disconnection messages
  • Failed DB connection request
  • Database sql statement errors
  • Database Fatal Errors
  • Auditing Information
  • Database Lock history
  • Long running queries and Query plan logging
  • High Temp/Sort Usage
  • Checkpoint Information
  • Long running Auto-Vaccum operations

RDS PostgreSQL is AWS managed web service which provides access to relational database server but not the underlying hosted operation system.  Therefore, we need to interact with DB using AWS console/API’s for any maintenance activity or to view and download logs. If we now want to use the logs for troubleshooting, we will have to download the logs to our machines and go through logs. This is not scalable solution as each user will be copying logs on their desktops. Another challenge is retention of logs, as RDS retains these logs for max 7 days and rotates after it. To solve these challenges, we can devise a way to store these logs on S3 for long term retention and will use Amazon Athena to read these logs directly. Let’s do this now 🙂

As mentioned, PostgreSQL logs can be downloaded using console or API’s e.g you can use download_db_log_file_portion method in boto3 (python library provided by AWS) to download Database logs in 1MB chunk.  Here is sample code which uses boto3 to copy these logs (you can use this to build custom solution which suits your need.) We can schedule this as lambda function at regular interval using cloudwatch events.

Following parameters are required to log extra information in PostgreSQL logs.

log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 10240
log_autovacuum_min_duration = 1000
log_min_duration_statement = 5000 [Milli seconds. Modify as per your environment]
pgaudit.log =’ddl,role’
shared_preload_libraries =‘auto_explain,pg_stat_statements,pgaudit’
auto_explain.log_min_duration = 5000 [Milli seconds. Modify as per your environment]
log_duration = off

Typical PostgreSQL logs look like below.

2017-07-30 06:53:50 UTC:10.0.1.2(7969):dba_user@postgres:[8611]:LOG: LOG: duration: 65311.253 ms statement: update test_lock set b=2 where a=2;

Next step is to analyze these logs using Amazon Athena .Amazon Athena is an AWS service , built on presto, which allows to directly read data from S3 and supports various data formats like CSV, JSON, ORC, Avro, and Parquet. We will use Hive regex serde to split the fields into useful column.

CREATE EXTERNAL TABLE `postgres_logs`(
 `logtime` timestamp COMMENT 'Log timestamp', 
 `tz` string COMMENT 'Log timezone', 
 `client` string COMMENT 'Client IP or hostname', 
 `clientport` int COMMENT 'Client port', 
 `username` string COMMENT 'DB username making connection to database', 
 `dbname` string COMMENT ' database name', 
 `serverport` int COMMENT ' server port', 
 `log_level` string COMMENT ' Indicating log level i.e LOG,ERROR,FATAL,DETAILED', 
 `log_type1` string COMMENT ' Classification of event i.e connection, disconnection , audit', 
 `duration` decimal(38,6) COMMENT ' Applicable for timed queries (ms)', 
 `log_type2` string COMMENT '', 
 `message` varchar(40000) COMMENT ' Postgresql log message')
PARTITIONED BY ( 
 `year` int, 
 `month` int, 
 `day` int, 
 `hour` int)
ROW FORMAT SERDE 
 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
 'input.regex'='^(\\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2})\\s(\\S+):\\[?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}|[\\w\\.-]+)?\\]?\\(?(\\d+)?\\)?:\\[?(\\w+)?\\]?@\\[?(\\w+)?\\]?:\\[?(\\d+)?\\]?:(\\w+)?:\\s*(\\w+):?\\s*(\\d+\\.\\d+)?(?:\\s\\w+)?\\s*(\\w+)?:?(.*)', 
 'timestamp.formats'='yyyy-MM-dd HH:mm:ss ') 
STORED AS INPUTFORMAT 
 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
 's3://<bucket>';

In above ddl, we have made it as partitioned table. This helps to minimise the data access which improves query performance. This also reduces AWS bill 🙂 as athena billing is done on amount of data scanned . To make it work, Log copy lambda function will have to modified to store the data in buckets using timestamp prefix  e.g s3://db-logs/dbname/2018/04/26/4. You can refer to this blog article for Athena Performance tuning techniques for understanding more on this. We can now read logs from s3 bucket and answer lot of questions.e.g

How many DB connections we had in particular hour?

SELECT count(*) FROM postgres_logs where log_type1='connection' and log_type2='authorized' and year=2018 and month=4 and day=26 and hour=8

Queries which ran more than 5 seconds

SELECT * FROM postgres_logs where duration>5000

Any fatal or error messages in logs?

select * from postgres_logs where log_level in ('ERROR','FATAL')

Querying Audit logs

select * from postgres_logs where log_type1 ='AUDIT'

Logs with queries spawning multiple lines will not be shown properly. I searched for fix but came across article which talks about this being Hive Regex Serde limitation.

Currently Hive does not support recognition of embedded newlines in text formatted data, even via its OpenCSV implementation. This is noted at: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

The reason the RegEx does not work is cause its applied on top of the record reader, which gives the RegEx only a single line input (cause its broken in an upper layer).

Migrating Function based indexes from Oracle to PostgreSQL

In this post, I will share  issue faced while working with function based index in PostgreSQL . I was performing database migration from Oracle to PostgreSQL and used AWS Schema Conversion Tool(SCT) for converting the database schema.AWS Schema conversion tool takes care of automatically converting schema from one database engine to other. You can get more information about it on AWS documentation or  AWS Database Blogs articles

AWS SCT converted a function based index but while executing ddl in PostgreSQL, it failed to execute  with error

 “ERROR: functions in index expression must be marked IMMUTABLE”

-- Oracle Index DDL
 
CREATE UNIQUE INDEX "APP"."STUDENT_IDX1" ON "APP"."STUDENT" (CASE "IS_VALID" WHEN 1 THEN TO_CHAR("STUDENT_ID")||'-'||TO_CHAR("COURSE_ID") ELSE NULL END ) TABLESPACE "USERS" ; 
 
-- PostgreSQL DDL (Recommended by SCT)
 
CREATE UNIQUE INDEX student_idx1 ON student
USING BTREE ((CASE is_valid
    WHEN 1 THEN CONCAT_WS('', student_id::TEXT, '-', course_id::TEXT)
    ELSE NULL
END) ASC);

As per PostgreSQL docs, function could be of 3 types

  • Volatile – It can return different results on successive calls with the same arguments
  • Stable – It is guaranteed to return the same results given the same arguments for all rows within a single statement
  • Immutable – It is guaranteed to return the same results given the same arguments forever

Let’s understand this with example. In below example, we are using current_timestamp function and output would change based on client timezone setting

postgres=> set timezone to 'US/Pacific';
SET
postgres=> select current_timestamp;
              now
-------------------------------
 2018-01-17 20:22:32.023775-08
(1 row)
postgres=> set timezone to 'UTC';
SET
postgres=> select current_timestamp;
              now
-------------------------------
 2018-01-18 04:22:46.227855+00
(1 row)

Current_timestamp is tagged as STABLE, since their values do not change within a transaction but it will change in next transaction even though input is same.

But Immutable function won’t change result. e.g Below two sql will give same result

select * from emp where empid=4 

OR

select * from emp where empid = 2+2

In PostgreSQL, function/expression used to create function based index needs to be immutable i.e function is guaranteed to return same results on giving same arguments to avoid data corruption.

We can identify whether function is immutable/stable by quering pg_proc view . Value of provalite field will indicate the type

SELECT proname, provolatile, proargtypes, proargtypes[0]::regtype AS argtype, prosrc FROM pg_proc WHERE proname like 'concat%';
proname  | provolatile | proargtypes | argtype |     prosrc
-----------+-------------+-------------+---------+----------------
concat    | s           | 2276        | "any"   | text_concat
concat_ws | s           | 25 2276     | text    | text_concat_ws
(2 rows)

Definition from docs

provolatile tells whether the function’s result depends only on its input arguments, or is affected by outside factors. It is i for “immutable” functions, which always deliver the same result for the same inputs. It is s for “stable” functions, whose results (for fixed inputs) do not change within a scan. It is v for “volatile” functions, whose results might change at any time. (Use v also for functions with side-effects, so that calls to them cannot get optimized away.)

We can see that concat and concat_ws are listed as stable function and their value can depend on client setting. E.g setting parameters like extra_float_digits will impact the output of concat function

postgres=> select concat(1.0000000000003::float8,56);
      concat
-----------------
1.000000000000356
(1 row)
postgres=> set extra_float_digits TO 3;
SET
postgres=> select concat(1.0000000000003::float8,56);
     concat
-----------------------
1.0000000000002999856
(1 row)

We were able to workaround by creating new function with immutable type which accepts input as text

CREATE OR REPLACE FUNCTION immutable_concat ( text, text, text, text ) RETURNS text AS 'SELECT $1 || $2 || $3 || $4;' LANGUAGE SQL IMMUTABLE;

New index statement

CREATE UNIQUE INDEX student_idx1
ON student
USING BTREE ((CASE is_valid
    WHEN 1 THEN immutable_concat('', student_id::TEXT, '-', course_id::TEXT)
    ELSE NULL
END) ASC);