GoldenGate

Speeding up initial data load for Oracle to PostgreSQL using GoldenGate and the COPY command

Oracle GoldenGate supports Oracle to PostgreSQL migrations by supporting PostgreSQL as a target database, though the reverse migration, i.e. PostgreSQL to Oracle, is not supported. One of the key aspects of these database migrations is the initial data load phase, where full table data has to be copied to the target datastore. This can be a time-consuming activity, with the time taken varying based on table sizes. Oracle suggests using multiple GoldenGate processes to improve load performance, or using native database utilities to perform faster bulk loads.

To use a database bulk-load utility, you use an initial-load Extract to extract source records from the source tables and write them to an extract file in external ASCII format. The file can be read by Oracle’s SQL*Loader, Microsoft’s BCP, DTS, or SQL Server Integration Services (SSIS) utility, or IBM’s Load Utility (LOADUTIL).

GoldenGate for PostgreSQL doesn't provide native file-loader support like BCP for MS SQL Server or SQL*Loader for Oracle. As an alternative, we can use the FORMATASCII option to write data into CSV files (or files with any custom delimiter) and then load them using the PostgreSQL COPY command. This approach is not automated, and you will have to ensure that all files are loaded into the target database.

In this post, we will evaluate the two approaches, i.e. using multiple Replicat processes and using ASCII dump files with the PostgreSQL COPY command, and compare their performance. The diagram below shows both approaches.

(Figure: GoldenGate initial-load approaches, initsyncbulk.jpg)
Ref: https://docs.oracle.com/goldengate/1212/gg-winux/GWUAD/wu_initsync.htm#GWUAD561

To compare the scenarios, I created a test table with 200M rows (12 GB) and used an RDS PostgreSQL instance (db.r3.4xlarge with 10K PIOPS).

CREATE TABLE scott.Big_table (
id NUMBER,
small_number NUMBER(5),
big_number NUMBER,
short_string VARCHAR2(50),
created_date DATE,
CONSTRAINT big_table_pkey PRIMARY KEY (id)
) tablespace LRG_TBSP;

--create sequence for PK

create sequence scott.big_table_seq start with 1 increment by 1 cache 500 nomaxvalue;
-- Load data
INSERT /*+ APPEND */ INTO scott.Big_table
SELECT scott.big_table_seq.nextval AS id,
TRUNC(DBMS_RANDOM.value(1,5)) AS small_number,
TRUNC(DBMS_RANDOM.value(100,10000)) AS big_number,
DBMS_RANDOM.string('L',TRUNC(DBMS_RANDOM.value(10,50))) AS short_string,
TRUNC(SYSDATE + DBMS_RANDOM.value(0,366)) AS created_date
FROM dual
CONNECT BY level <= 10000;
COMMIT;

-- Each run doubles the row count; repeat until the table reaches ~200M rows
INSERT /*+ APPEND */ INTO scott.Big_table
SELECT scott.big_table_seq.nextval AS id,
small_number,
big_number,
short_string,
created_date
FROM scott.Big_table;
COMMIT;

--PostgreSQL Table

CREATE TABLE booker.Big_table (
id bigint,
small_number int,
big_number bigint,
short_string VARCHAR(50),
created_date TIMESTAMP,
CONSTRAINT big_table_pk PRIMARY KEY (id)
);

Approach 1: Using multiple Oracle GoldenGate Replicat processes to load data

In this approach, I used eight Oracle GoldenGate Replicat processes with the @RANGE filter to load data into PostgreSQL in parallel.

We were able to get 5K inserts/sec per thread and load the table in ~88 mins with 8 Replicat processes.
One key point to remember: if you are working with EC2 and RDS databases, the EC2 machine hosting the trail files and the RDS instance should be in the same AZ. During testing, we noticed that the insert rate dropped drastically (~800 inserts per sec) with cross-AZ writes. Below is the Replicat parameter file used for the data load.

SpecialRUN
END Runtime
SETENV (NLSLANG=AL32UTF8)
SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
SETENV ( PGCLIENTENCODING = "UTF8" )
SETENV (ODBCINI="/opt/app/oracle/product/ggate/12.2.0.1/odbc.ini" )
TARGETDB GG_Postgres, USERIDALIAS pguser
Extfile /fs-a01-a/databases/ggate/initload/i5
HANDLECOLLISIONS
DISCARDFILE /opt/app/oracle/product/ggate/12.2.0.1/direrr/rinit1.dsc, APPEND, megabytes 20
reportcount every 60 seconds, rate
BATCHSQL;
MAP scott.big_table, TARGET scott.big_table, FILTER (@RANGE (1,8));

You will need to create the additional Replicat parameter files by changing the range clause, e.g. FILTER (@RANGE (2,8)), FILTER (@RANGE (3,8)), and so on.
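Since these parameter files use SPECIALRUN, the Replicats can be launched directly from the GoldenGate home instead of through the Manager. A minimal sketch, assuming the eight files are saved as dirprm/rinit1.prm through dirprm/rinit8.prm (hypothetical names):

cd /opt/app/oracle/product/ggate/12.2.0.1
# launch all eight initial-load Replicats in parallel, one per @RANGE partition
for i in 1 2 3 4 5 6 7 8; do
  nohup ./replicat paramfile dirprm/rinit$i.prm reportfile dirrpt/rinit$i.rpt &
done
wait  # return when all eight loads have finished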

Approach 2: Data load using the PostgreSQL COPY command

In the second approach, we used a parameter file with the FORMATASCII option (refer to the snippet below) to create a GoldenGate Extract process that dumped the data with a '|' delimiter, and then used the PostgreSQL COPY command to load data from these dump files.

Extract Parameter file

SOURCEISTABLE
SETENV (ORACLE_SID=ggpoc)
SETENV (NLSLANG=AL32UTF8)
SETENV (NLS_LANG="AMERICAN_AMERICA.AL32UTF8")
SETENV (ORACLE_HOME=/opt/app/oracle/product/11.2.0.4/A10db)
SETENV (TNS_ADMIN=/opt/app/oracle/local/network/)
USERIDALIAS gguser
RMTHOST xx.222.xx.78, MGRPORT 8200, TCPBUFSIZE 100000, TCPFLUSHBYTES 300000
RMTHOSTOPTIONS ENCRYPT AES256
FORMATASCII,NOHDRFIELDS,NOQUOTE,NONAMES, DELIMITER '|'
RMTFILE /fs-a01-a/databases/ggate/initload/i4 , megabytes 1000
reportcount every 60 seconds, rate
TABLE scott.BIG_TABLE;
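This initial-load Extract (note SOURCEISTABLE) can be run directly from the command line rather than through the Manager; a sketch, assuming the parameter file is saved as dirprm/einit.prm (a hypothetical name):

# from the GoldenGate home on the source server
./extract paramfile dirprm/einit.prm reportfile dirrpt/einit.rpt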

With the above parameter file, the GoldenGate Extract process sends data to the remote system and stores it in dump files. These files are then loaded into PostgreSQL using the \copy command.

psql> \copy scott.big_table from '/fs-a01-a/databases/ggate/initload/i4000000' with DELIMITER '|';
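Since RMTFILE is capped at 1000 MB, the Extract rolls over to a sequence of files (i4000000, i4000001, and so on), and each of them has to be loaded. A simple loop, with hypothetical connection details, could automate this:

# load every dump file into the target table, one at a time
for f in /fs-a01-a/databases/ggate/initload/i4*; do
  psql -h mytarget.example.com -d mydb -c "\copy scott.big_table from '$f' with DELIMITER '|'"
done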

The data load took 21 mins, nearly 4x faster than the first approach. Dropping the primary key index before the load reduced the time to ~9 mins for the 200M-row POC table.
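To try the index-free load, drop the primary key constraint before the load and recreate it afterwards; a sketch using the constraint name from the DDL above (adjust the schema to match your target):

-- before the bulk load
ALTER TABLE scott.big_table DROP CONSTRAINT big_table_pk;
-- ... run the \copy loads ...
-- after the bulk load
ALTER TABLE scott.big_table ADD CONSTRAINT big_table_pk PRIMARY KEY (id);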


Update:

Oracle GoldenGate 19.1 ships with the DataDirect 7.1 PostgreSQL Wire Protocol ODBC driver for PostgreSQL connectivity. You can now add the parameter BatchMechanism=2 to speed up inserts. With this parameter set, the ODBC driver batches inserts into a memory buffer and sends them together instead of performing single-row inserts. Details are in the DataDirect driver documentation.

To add the parameter, update odbc.ini and add BatchMechanism under the data source section:


[apgtarget]
Driver=/oracle/product/ggate/19.1.0.0/lib/GGpsql25.so
Description=DataDirect 7.1 PostgreSQL Wire Protocol
BatchMechanism=2
Database=pgbench
HostName=xxx.xxxxxx.us-east-1.rds.amazonaws.com
PortNumber=5432


In my testing, the insert rate increased from 5.3K rows per second to 35K, i.e. nearly a 7x increase. I also noticed that WriteIOPS on this Aurora instance increased from 20K to 80-100K.

GoldenGate Setup for Real-Time Data Synchronization

In an earlier post, we discussed GoldenGate (GG) concepts and setting up GG without the Data Pump process.
Now I will talk about the Data Pump process and show how to set up GG replication that uses it.

In a GG environment, the Extract process writes data to a local trail file; the Data Pump process then reads this local trail and sends the data over the network to a remote trail file, which the Replicat process reads to apply the changes to the destination database.
The advantage of using a Data Pump process is improved overall replication performance, since it frees the Extract process from sending data to the destination.

To set this up, download and extract the GG software as discussed in the earlier post (Steps 1-7).
Once the GG software is extracted, go to the directory where it was extracted, known as the GG home directory.
In this case it will be /u01/app/GGS.

1. Configure and start Manager process:

GGSCI (gg1.company.com) 1> edit params mgr

It will open the parameter file for manager. Enter the following in the file:

PORT 7809

save and quit the file.
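Only the port is mandatory; production Manager parameter files often also purge processed trail files, for example (optional, not used in this walkthrough):

PORT 7809
PURGEOLDEXTRACTS /u01/app/GGS/dirdat/*, USECHECKPOINTS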

2. Start the Manager process:

GGSCI (gg1.company.com) 2> start mgr

Manager started.

3. Verify that the Manager is running:

GGSCI (gg1.company.com) 3> info all

Program     Status      Group       Lag           Time Since Chkpt

MANAGER     RUNNING

4. Now add and configure extract process:

The syntax to be used is:

ADD EXTRACT <extract name>, TRANLOG, BEGIN NOW

GGSCI (gg1.company.com) 2> add extract ext_gg1, tranlog, begin now
EXTRACT added.

GGSCI (gg1.company.com) 1> info *

EXTRACT    EXT_GG1   Initialized   2012-06-10 16:15   Status STOPPED
Checkpoint Lag       00:00:00 (updated 00:01:45 ago)
Log Read Checkpoint  Oracle Redo Logs
2012-06-10 16:15:20  Seqno 0, RBA 0

5. Configure the trail file to which extract will write the records:

Syntax:

ADD EXTTRAIL <extract trail path/two character trail id>, EXTRACT <extract name>, MEGABYTES <n>

GGSCI (gg1.company.com) 2> add exttrail /u01/app/GGS/dirdat/lt, extract ext_gg1, megabytes 50
EXTTRAIL added.

6. Create extract parameter file:

GGSCI (gg1.company.com) 1> edit params ext_gg1

extract ext_gg1
userid gggate, password oracle
exttrail /u01/app/GGS/dirdat/lt

7. Configure Pump Process which will read the local extract trail file and send data to remote trail:

Syntax:
ADD EXTRACT <pump name>, EXTTRAILSOURCE <extract trail Path/two character trail id>

GGSCI (gg1.company.com) 1> add extract pump_gg1, exttrailsource /u01/app/GGS/dirdat/lt
EXTRACT added.
GGSCI (gg1.company.com) 2> info *

EXTRACT    EXT_GG1   Initialized   2012-06-10 16:15   Status STOPPED
Checkpoint Lag       00:00:00 (updated 00:16:56 ago)
Log Read Checkpoint  Oracle Redo Logs
2012-06-10 16:15:20  Seqno 0, RBA 0

EXTRACT    PUMP_GG1  Initialized   2012-06-10 16:32   Status STOPPED
Checkpoint Lag       00:00:00 (updated 00:00:04 ago)
Log Read Checkpoint  File /u01/app/GGS/dirdat/lt000000
First Record  RBA 0

8. Configure a remote trail file name with Pump process:

Syntax:
ADD RMTTRAIL <pump trail path/two character trail id> ,EXTRACT <pump name>, Megabytes <n>

GGSCI (gg1.company.com) 4> add rmttrail /u01/app/GGS/dirdat/rt, extract pump_gg1
RMTTRAIL added.

9. Add remote host information to Pump process:

— Remote host and remote manager port to write the trail

RMTHOST <Remote hostname>, MGRPORT <Target manager port number>

— Remote trail info

RMTTRAIL <extract trail path/two character trail id>

GGSCI (gg1.company.com) 1> edit params pump_gg1

[oracle@gg1 dirprm]$ cat pump_gg1.prm
extract pump_gg1
passthru
userid gggate, password oracle
rmthost gg2.company.com, mgrport 7809
rmttrail /u01/app/GGS/dirdat/rt
[oracle@gg1 dirprm]$ pwd
/u01/app/GGS/dirprm

Target system:
==============
1. On the target database, we need to configure the Replicat process, which will apply
the changes to the target database after reading them from the remote trail file.

Syntax:

ADD REPLICAT <replicat name>, EXTTRAIL <trail file path/two character trail id>

GGSCI (gg2.company.com) 8> add replicat rep_gg2, exttrail /u01/app/GGS/dirdat/rt
ERROR: No checkpoint table specified for ADD REPLICAT.

2. Add a checkpoint table.

A checkpoint table is a small table in the target database used by the Replicat
process to maintain checkpoints. A checkpoint records a known position in the trail from which
to start after an expected or unexpected shutdown. By default, a record of these checkpoints is maintained
in a file on disk in the GoldenGate directory. Optionally, the checkpoint record can also be maintained in a
checkpoint table in the target database.

More information about the checkpoint table can be found in MOS note 965698.1.
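Note that ADD CHECKPOINTTABLE needs an active database login in GGSCI, a step not shown in this session; using the credentials from the parameter files, it would look something like:

GGSCI> dblogin userid gggate, password oracle
Successfully logged into database.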

GGSCI (gg2.company.com) 10>  add checkpointtable chktbl

Successfully created checkpoint table CHKTBL.

GGSCI (gg2.company.com) 13>  add replicat rep_gg2, exttrail /u01/app/GGS/dirdat/rt, checkpointtable gggate.chktbl
REPLICAT added.

GGSCI (gg2.company.com) 14> info *

REPLICAT   REP_GG2   Initialized   2012-06-10 16:58   Status STOPPED
Checkpoint Lag       00:00:00 (updated 00:00:12 ago)
Log Read Checkpoint  File /u01/app/GGS/dirdat/rt000000
First Record  RBA 0

Now start all the processes on Source system.

GGSCI (gg1.company.com) 11> start manager

Manager started.
GGSCI (gg1.company.com) 12> start extract ext_gg1

Sending START request to MANAGER ...
EXTRACT EXT_GG1 starting
GGSCI (gg1.company.com) 13> start extract pump_gg1

Sending START request to MANAGER ...
EXTRACT PUMP_GG1 starting
GGSCI (gg1.company.com) 15> info *

EXTRACT    EXT_GG1   Last Started 2012-06-10 17:02   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:00:00 ago)
Log Read Checkpoint  Oracle Redo Logs
2012-06-10 17:02:20  Seqno 108, RBA 14447104

EXTRACT    PUMP_GG1  Last Started 2012-06-10 17:02   Status RUNNING
Checkpoint Lag       00:00:00 (updated 00:30:18 ago)
Log Read Checkpoint  File /u01/app/GGS/dirdat/lt000000
First Record  RBA 0

[oracle@gg1 GGS]$ ls -ltr /u01/app/GGS/dirdat/lt000000
-rw-rw-rw- 1 oracle oinstall 983 Jun 10 17:02 /u01/app/GGS/dirdat/lt000000

GGSCI (gg1.company.com) 2> exit

Once the processes are started, we will create the sample tables for replication using the Oracle-provided scripts.
These scripts can be found under /u01/app/GGS.

[oracle@gg1 GGS]$ ls -ltr *demo*
-r--r--r-- 1 oracle oinstall 2275 Oct 15  2010 demo_ora_misc.sql
-r--r--r-- 1 oracle oinstall 4015 Oct 15  2010 demo_ora_lob_create.sql
-r--r--r-- 1 oracle oinstall  821 Oct 15  2010 demo_ora_insert.sql
-r--r--r-- 1 oracle oinstall  883 Oct 15  2010 demo_ora_create.sql
-r--r--r-- 1 oracle oinstall 2520 Oct 15  2010 demo_ora_pk_befores_updates.sql
-r--r--r-- 1 oracle oinstall 1227 Oct 15  2010 demo_ora_pk_befores_insert.sql
-r--r--r-- 1 oracle oinstall 1269 Oct 15  2010 demo_ora_pk_befores_create.sql
-r--r--r-- 1 oracle oinstall  967 Oct 15  2010 demo_more_ora_insert.sql
-r--r--r-- 1 oracle oinstall 1217 Oct 15  2010 demo_more_ora_create.sql

We will now create the tables used for mapping:

Source:
======

SQL> connect scott/tiger
Connected.
SQL> @demo_ora_create.sql
DROP TABLE tcustmer
*
ERROR at line 1:
ORA-00942: table or view does not exist

Table created.

DROP TABLE tcustord
*
ERROR at line 1:
ORA-00942: table or view does not exist

Table created.

SQL>

Target:
=======

SQL> conn scott/tiger
Connected.
SQL> @demo_ora_create.sql
DROP TABLE tcustmer
*
ERROR at line 1:
ORA-00942: table or view does not exist

Table created.

DROP TABLE tcustord
*
ERROR at line 1:
ORA-00942: table or view does not exist

Table created.

SQL> select count(*) from tcustmer;

COUNT(*)
----------
0

SQL>

Once the structure is created, we will tell the Extract, Pump, and Replicat processes
to capture/apply the changes made to these tables. For that we need to configure
the respective parameter files, which should look like:

GGSCI (gg1.company.com) 13> edit params ext_gg1

extract ext_gg1
userid gggate, password oracle
exttrail /u01/app/GGS/dirdat/lt
Table SCOTT.TCUSTMER;
Table SCOTT.TCUSTORD;

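One prerequisite not shown above: for an Oracle source, table-level supplemental logging is normally enabled so the Extract captures complete change records. In GGSCI (after DBLOGIN) this is typically:

GGSCI> dblogin userid gggate, password oracle
GGSCI> add trandata scott.tcustmer
GGSCI> add trandata scott.tcustord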

GGSCI (gg1.company.com) 14> edit params pump_gg1
extract pump_gg1
passthru
userid gggate, password oracle
rmthost gg2.company.com, mgrport 7809
rmttrail /u01/app/GGS/dirdat/rt
Table SCOTT.TCUSTMER;
Table SCOTT.TCUSTORD;

GGSCI (gg2.company.com) 4> edit params rep_gg2
REPLICAT rep_gg2
ASSUMETARGETDEFS
userid gggate, password oracle
discardfile /u01/app/GGS/discard/rep_gg2_discard.txt, append, megabytes 10
MAP SCOTT.TCUSTMER, Target SCOTT.TCUSTMER;
MAP SCOTT.TCUSTORD, Target SCOTT.TCUSTORD;
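For the new TABLE and MAP entries to take effect, restart the Extract and Pump on the source, and start the Replicat on the target (it has only been added so far, not started):

GGSCI (gg1.company.com)> stop extract ext_gg1
GGSCI (gg1.company.com)> start extract ext_gg1
GGSCI (gg1.company.com)> stop extract pump_gg1
GGSCI (gg1.company.com)> start extract pump_gg1
GGSCI (gg2.company.com)> start replicat rep_gg2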

Once the parameter files are updated with the table information, we can use the
Oracle-provided script "demo_ora_insert.sql" to test the replication.

After running the insert on the source, we can check count(*) on both source and target,
which will confirm that the changes were applied on the target side.
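A minimal end-to-end check, assuming the demo insert commits on the source:

-- on the source
SQL> @demo_ora_insert.sql
SQL> select count(*) from tcustmer;

-- on the target, after allowing a few seconds for replication
SQL> select count(*) from tcustmer;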