dbms_redefinition

Identifying PostgreSQL Bloat and fixing it without downtime

MVCC (Multi-Version Concurrency Control) feature allows databases to provide concurrent access to data. This allows each SQL statement to see a snapshot of data as it was some time ago, regardless of the current state of the underlying data. This prevents statements from viewing inconsistent data produced by concurrent transactions performing updates on the same data rows, providing transaction isolation for each database session. To summarize “Readers don’t block writers and writers don’t block readers”

If you are coming from Oracle or MySQL background, you would be aware that during an update/delete ,DML activity will make changes to rows and use separate store called “Undo” to track the old image of data. If any other session want to get consistent image, then database uses undo to provide consistent snapshot of data. Like many other databases, PostgreSQL also supports MVCC but takes different approach to store the old changes.

In PostgreSQL, update or delete of a row (tuple in PostgreSQL) does not immediately remove the old version of the row. When you update a row, it will create a copy of the row with new changes and insert in the table. Then, it will update the old copy header to make it invisible for future transactions. Similarly for delete, it won’t delete the row but update metadata to make them invisible. Eventually, these old rows will no longer be required by transactions and will have to be cleaned up. This cleanup is handled by “Vacuum”. Note that apart from increasing the total database size, table or index bloat also impacts query performance as database need to process bigger objects.

=>vacuum table 'bloated_table';

If you run above command, it will remove dead tuples in tables and indexes and marks the space available for future reuse. But this will not release the space to operating system. As vacuum is manual approach, PostgreSQL has a background process called “Autovacuum” which takes care of this maintenance process automatically. Autovacuum process to delete rows is controlled by 2 parameters autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor (There are other parametes like autovacuum_freeze_max_age which can trigger autovacuum)

vacuum threshold = autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * number of tuples

You can find this values by querying pg_settings. For RDS, autovacuum_vacuum_threshold is 50 rows and autovacuum_vacuum_scale_factor is 0.1 i.e 10% of the table size.

select name,setting from pg_settings where name in ('autovacuum_vacuum_threshold','autovacuum_vacuum_scale_factor');
name | setting
--------------------------------+---------
autovacuum_vacuum_scale_factor | 0.1
autovacuum_vacuum_threshold | 50

This means that if there is table with 100 M rows, you should have ~10M changes ( 50+0.1*100M ) before autovacuum is triggered. Additionally, there are limited number of autovacuum worker processes and if autovacuum is not tuned properly, table could have much higher dead rows. It’s advisable to reduce the scale factor to lower value, either at table level or at database level to prevent bloat. You would also need to tune the autovacuum process settings to improve the cleanup process. There is an excellent blog article titled “A Case Study of Tuning Autovacuum in Amazon RDS for PostgreSQL” on AWS database blog which helps to tune autovacuum.

If you wish to reclaim OS space, then you need to execute “Vacuum Full” which will compact tables by writing a complete new version of the table file with no dead tuples. But this comes at a cost. Vacuum full requires “Exclusive lock” on the table and blocks any DML on the table, leading to downtime. This can be problematic as large tables with multiple indexes can take really long time (7-8 hours) to rebuild.

Enter pg_repack !! This utility helps to perform Full vacuum without downtime by making use of trigger to take care of changes happening on parent table. Below table compares the internal working on Table vs Index rebuilds

Table Rebuild Index Rebuild
To perform a full-table repack, pg_repack will:

1. create a log table to record changes made to the original table
2. add a trigger onto the original table, logging INSERTs, UPDATEs and DELETEs into our log table
3.create a new table containing all the rows in the old table
4. build indexes on this new table
5. apply all changes which have accrued in the log table to the new table
6. swap the tables, including indexes and toast tables, using the system catalogs
7. drop the original table

pg_repack will only hold an ACCESS EXCLUSIVE lock for a short period during initial setup (steps 1 and 2 above) and during the final swap-and-drop phase (steps 6 and
For the rest of its time, pg_repack only needs to hold an ACCESS SHARE lock on the original table, meaning INSERTs, UPDATEs, and DELETEs may proceed as usual.
To perform an index-only repack, pg_repack will:

1. create new indexes on the table using CONCURRENTLY matching the definitions of the old indexes
2. swap out the old for the new indexes in the catalogs
3. drop the old indexes

Identifying Bloat!

I have used table_bloat_check.sql and index_bloat_check.sql to identify table and index bloat respectively. Below snippet displays output of table_bloat_check.sql query output. As per the results, this table is around 30GB and we have ~7.5GB of bloat.

databasename | schemaname | tablename | can_estimate | est_rows  | pct_bloat | mb_bloat | table_mb
--------------+------------+-----------+--------------+-----------+-----------+----------+-----------
postgres | askdba | bloated | t | 547962000 | 25 | 7828.23 | 30877.930

Similarly, when we run index_bloat_check.sql query to identify index bloat , we see that pkey_bloated is 65G and has bloat of ~54GB.

database_name | schema_name |      table_name      |        index_name         | bloat_pct | bloat_mb |  index_mb  |  table_mb  | index_scans
---------------+-------------+----------------------+---------------------------+-----------+----------+------------+------------+-------------
postgres | askdba | bloated | pkey_bloated | 83 | 54125 | 65137.813 | 30877.930 | 3820674

Let’s use pg_repack to clear this bloat. To use pg_repack, you need to install extension and a client utility. RDS PostgreSQL supports pg_repack for installations having version of 9.6.3 and higher.Extension version will differ depending on your RDS PostgreSQL version. e.g RDS PostgreSQL 9.6.3 installs pg_repack 1.4.0 extension, whereas 9.6.11 installs pg_repack 1.4.3 extension.

To create extension, connect as master user for RDS database and run create extension command,

postgres=>  create extension pg_extension;
postgres=> \dx pg_repack
List of installed extensions
Name | Version | Schema | Description
-----------+---------+--------+--------------------------------------------------------------
pg_repack | 1.4.3 | public | Reorganize tables in PostgreSQL databases with minimal locks

To install pg_repack client, download the tar bar from here and build the utility. You need to ensure that extension and client utility version matches. pg_repack provides option to perform full vacuum at table level, index level or table+index . If you want to perform vacuum full for table and associated indexes, then it can be done by using below statement

./pg_repack -h [rds-host-name] -p [dbPort] -d [dbname] -U [superuser] --table=[schema.tablename] --no-order -k
Password:
Password:

-k flag is important as RDS master user does not have Postgresql superuser role and omitting this option leads to error “ERROR: pg_repack failed with error: You must be a superuser to use pg_repack”

To run index-only repack, use following statement

./pg_repack -h [rds-host-name] -p [dbPort] -d [dbname] -U [superuser] --table=[schema.tablename] --only-indexes -k

Above statement will create new indexes and will drop the older indexes after all indexes are recreated. If you are performing this action on really big tables, it will take time and will slow down DML activity on the table as you will have 2*n-1 indexes before last one is created (n=number of indexes).Therefore, if there are multiple indexes on the table, it would be better to recreate index one by one using –index clause

./pg_repack -h [rds-host-name] -p [dbPort] -d [dbname] -U [superuser] --index=[schema.indexname] -k

In my scenario, I went with table+index vacuum option.After rebuild, actual table size reduction was 10% instead of 24% and for index , it was 75% instead of 85%. As you can see, there could be 10-20% variance between actual object size (post vacuum) vs estimated size. Therefore, it would be good to carry out testing in clone environment before committing on size savings.

Object Name Original Size Estimated Size after removing bloat Size Actual size after rebuild
bloated 30GB 23GB 27G
pkey_bloated 64GB 10GB 16GB

Pg_repack creates the objects under repack schema and later moves it to the correct schema. To monitor the pg_repack session, use pg_stat_activity view

select * from pg_stat_activity where application_name='pg_repack'

12c:Specifying lock timeout for finish_redef_table

Oracle 12c database has introduced enhancement to DBMS_REDEFINITION for specifying a a lock timeout (in seconds) which will allow FINISH_REDEF_TABLE to acquire an exclusive lock for swapping the source and interim tables.If timeout expires, then operation exits. This will help to   avoid cancellation by user or indefinite wait. This does not put our session in queue which will be shown by below example

1. Let’s test this on EMP table. I have created table in my schema and copied records.

CREATE TABLE "AMITBANS"."EMP"
   (	"EMPNO" NUMBER(4,0),
	"ENAME" VARCHAR2(10),
	"JOB" VARCHAR2(9),
	"MGR" NUMBER(4,0),
	"HIREDATE" DATE,
	"SAL" NUMBER(7,2),
	"COMM" NUMBER(7,2),
	"DEPTNO" NUMBER(2,0),
	 CONSTRAINT "PK_EMP" PRIMARY KEY ("EMPNO")
  USING INDEX) tablespace users;

2. Lets hash partition the table on empno. Note that I have excluded primary key constraint/index as we will copy it using dbms_redefinition

 CREATE TABLE "AMITBANS"."INT_EMP"
   (	"EMPNO" NUMBER(4,0),
	"ENAME" VARCHAR2(10),
	"JOB" VARCHAR2(9),
	"MGR" NUMBER(4,0),
	"HIREDATE" DATE,
	"SAL" NUMBER(7,2),
	"COMM" NUMBER(7,2),
	"DEPTNO" NUMBER(2,0)) tablespace users
  partition by hash(empno) 
  partitions 4;

3. Lets check if dbms_redefinition will work on this table

set serverout on
exec  DBMS_REDEFINITION.CAN_REDEF_TABLE('AMITBANS','EMP',DBMS_REDEFINITION.CONS_USE_PK);

There are no errors, so we are safe to proceed.

4. Next Start the redefinition

BEGIN
DBMS_REDEFINITION.START_REDEF_TABLE(UNAME =>'AMITBANS', ORIG_TABLE=>'EMP',INT_TABLE=>'INT_EMP',OPTIONS_FLAG=>dbms_redefinition.cons_use_pk);
END;
/

5. Copy dependent objects. (Automatically create any triggers, indexes, materialized view logs, grants, and constraints on AMITBANS.INT_EMP.)

DECLARE
l_num_errors PLS_INTEGER;
BEGIN
  DBMS_REDEFINITION.COPY_TABLE_DEPENDENTS(UNAME =>'AMITBANS', ORIG_TABLE=>'EMP',INT_TABLE=>'INT_EMP',COPY_INDEXES=> DBMS_REDEFINITION.CONS_ORIG_PARAMS,
   COPY_TRIGGERS=>TRUE,COPY_CONSTRAINTS=> TRUE, COPY_PRIVILEGES=>TRUE, IGNORE_ERRORS =>TRUE,NUM_ERRORS=> l_num_errors,COPY_STATISTICS=>TRUE);
END;
/

We need to check if there were any errors while copying constraints. Use below query

select object_name, base_table_name, ddl_txt from DBA_REDEFINITION_ERRORS;

6. We can sync any new inserts by issuing following command.

 BEGIN 
  DBMS_REDEFINITION.SYNC_INTERIM_TABLE(UNAME =>'AMITBANS', ORIG_TABLE=>'EMP',INT_TABLE=>'INT_EMP');
END;
/

7. Let’s do finish redefinition, but before that we will do two insert in emp table from different session without committing transaction.
We will do it in following order to show that finish_redef_table needs to have no active dml on table.

session 1: Insert record 1

insert into emp values(8000,'HENRY','ANALYST',7698,sysdate,1500,null,10);

Session 2: Issue finish_redef_table to swap tables

BEGIN
DBMS_REDEFINITION.FINISH_REDEF_TABLE(UNAME =>'AMITBANS', ORIG_TABLE=>'EMP',INT_TABLE=>'INT_EMP',dml_lock_timeout=>300);
END;
/

Session 3: Open one more session and do insert

insert into emp values(8001,'MICHAEL','ANALYST',7698,sysdate,1500,null,10);

If you now commit transaction in session 1, redefinition (session 2) will  wait  (to acquire exclusive lock) for session 3 to commit .This shows that it is not in a queue and needs all active transactions to finish. When we commit session 3,  finish_redef_table succeeds.