AWS

Using SSL to securely connect to Amazon Redshift Cluster

Amazon Redshift is a fully managed cloud data warehouse from AWS for running analytic workloads. Many customers have requirements to encrypt data in transit as part of their security guidelines. Redshift supports SSL connections to encrypt data in transit and provides server certificates so that clients can validate the server they connect to.

Before we look at how SSL works with Redshift, let's understand why we need SSL. SSL encrypts client/server communications and provides protection against three types of attack:

Eavesdropping – With an un-encrypted connection, a hacker could use network tools to inspect traffic between the client and server and steal data and database credentials. With SSL, all of the traffic is encrypted.

Man in the middle (MITM) – In this case a hacker could hijack DNS and redirect the connection to a different server than intended. SSL uses certificate verification to prevent this, by authenticating the server to the client.

Impersonation – In this case a hacker could use stolen database credentials to connect to the server and gain access to data. SSL uses client certificates to prevent this, by making sure that only holders of valid certificates can access the server.

For more details, refer to this link.

You can enable SSL on an Amazon Redshift cluster by setting require_ssl to true in the cluster's parameter group.
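If you manage the parameter group programmatically, the same setting can be applied with boto3. Below is a minimal sketch under the assumption that the cluster uses a custom parameter group (the name my-redshift-params is a hypothetical placeholder); depending on the parameter, the cluster may need a reboot for the change to take effect.

import boto3

redshift = boto3.client("redshift")

# Set require_ssl=true on the cluster's custom parameter group
# ("my-redshift-params" is a hypothetical name)
redshift.modify_cluster_parameter_group(
    ParameterGroupName="my-redshift-params",
    Parameters=[
        {"ParameterName": "require_ssl", "ParameterValue": "true"}
    ],
)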

In this blog post, I will discuss three sslmode settings – require, verify-ca, and verify-full – using psql and Python.

Setup details

I have an Amazon Redshift cluster with Publicly accessible set to “No”, and I will be accessing it from my local machine. Since the database is in a private subnet, I need to use port forwarding via a bastion host. Make sure the bastion host IP is whitelisted in the Redshift security group to allow connections.

## Add the key in ssh agent
ssh-add <your key>

## The bastion host IP is 1.2.3.4 and we want to connect to a Redshift cluster in Singapore running on port 5439. Forward traffic on localhost port 9200 to Redshift
ssh -L 9200:redshift-cluster.xxxxxx.ap-southeast-1.redshift.amazonaws.com:5439 [email protected]

When we set require_ssl to true, we instruct Redshift to accept only encrypted connections, so any client that tries to connect without SSL is rejected. To test this, let's modify the sslmode setting for the psql client by setting the PGSSLMODE environment variable.

export PGSSLMODE=disable
psql -h localhost -p 9200 -d dev -U dbmaster
psql: FATAL: no pg_hba.conf entry for host "::ffff:172.31.xx.9", user "dbmaster", database "dev", SSL off

As we can see, the non-SSL connection is rejected. Let's now discuss the sslmode settings – require, verify-ca, and verify-full.

1. sslmode=require

With an sslmode setting of require, we are saying that we need an encrypted connection. If a root certificate file is present, psql will also use it to validate the server, and the behavior will be the same as verify-ca. If the certificate file is not present, psql won't complain (unlike verify-ca) and will still connect to the Redshift cluster.

export PGSSLMODE=require
psql -h localhost -p 9200 -d dev -U dbmaster
psql (11.5, server 8.0.2)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

We can see that psql has made an SSL connection and is using TLS 1.2.

2. sslmode=verify-ca

If we use verify-ca, the server is verified by checking the certificate chain up to the root certificate stored on the client. At this point, we also need the Redshift certificate bundle, which has to be downloaded from the link provided in the Amazon Redshift documentation. To demonstrate that we really need the certificate, let's try connecting without it.

export PGSSLMODE=verify-ca 
psql -h localhost -p 9200 -d dev -U dbmaster 
psql: error: could not connect to server: root certificate file "/Users/amit/.postgresql/root.crt" does not exist
Either provide the file or change sslmode to disable server certificate verification.

psql is complaining that it couldn't find the certificate: we should either provide the SSL certificate or change the sslmode setting. Let's download the certificate bundle and store it under the home directory.

mkdir ~/.postgresql 
cd ~/.postgresql

curl -o root.crt https://s3.amazonaws.com/redshift-downloads/redshift-ca-bundle.crt

After downloading the certificate and placing it under the desired directory, our connection attempt succeeds.

3. sslmode=verify-full

Next, if we want to prevent a man-in-the-middle (MITM) attack, we need to set sslmode=verify-full. In this case the server host name provided in the psql host argument is matched against the name stored in the server certificate. If the hostname matches, the connection succeeds; otherwise it is rejected.

export PGSSLMODE=verify-full
psql -h localhost -p 9200 -d dev -U dbmaster
psql: error: could not connect to server: server certificate for "*.xxxxxx.ap-southeast-1.redshift.amazonaws.com" does not match host name "localhost"

In our test the connection fails, as we are using port forwarding and localhost doesn't match the Redshift hostname pattern in the certificate.

What this means is that if you use a service like Route 53 to create friendly names, verify-full won't work, because the hostname specified in the psql command and the hostname presented in the certificate don't match. If your security team is OK with the verify-ca option, you can revert to that setting; otherwise you will have to get rid of the aliases and use the actual hostname.

In my case, I can resolve the error by connecting to the Redshift cluster from the bastion host (instead of my local tunnel setup) and using the psql command with the actual hostname.

psql -h redshift-cluster.xxxxxx.ap-southeast-1.redshift.amazonaws.com -p 5439 -d dev -U dbmaster

SSL Connection using Python

Next, let's see how you can connect to the Redshift cluster using Python code. This is useful if you have Lambda functions or other client applications written in Python. For this example, we will use the PyGreSQL module to connect to the Redshift cluster.

$ python
Python 3.7.9 (default, Aug 31 2020, 12:42:55)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> import pgdb
>>> rs_port =5439
>>> rs_user = 'dbmaster'
>>> rs_passwd='****'
>>> rs_dbname = 'dev'
>>> rs_host='redshift-cluster.xxxxx.ap-southeast-1.redshift.amazonaws.com'
>>> rs_sslmode = 'verify-ca'

>>> conn = pgdb.connect(dbname=rs_dbname, host=rs_host, port=rs_port, user=rs_user, password=rs_passwd, sslmode=rs_sslmode)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ec2-user/miniconda3/envs/venv37/lib/python3.7/site-packages/pgdb.py", line 1690, in connect
cnx = _connect(dbname, dbhost, dbport, dbopt, dbuser, dbpasswd)
pg.InternalError: root certificate file "/home/ec2-user/.postgresql/root.crt" does not exist
Either provide the file or change sslmode to disable server certificate verification.

Before running the above code, I removed the certificate file to show that pgdb.connect requires the SSL certificate. Let's now place the certificate in a non-default location like “/home/ec2-user/root.crt” and use the sslrootcert argument to pass its location.

>>> rs_cert_path='/home/ec2-user/root.crt'
>>> conn = pgdb.connect(dbname=rs_dbname, host=rs_host, port=rs_port, user=rs_user, password=rs_passwd, sslmode=rs_sslmode, sslrootcert=rs_cert_path)
>>> cursor=conn.cursor()
>>> cursor.execute("select current_database()")
<pgdb.Cursor object at 0x7fa73bfe15d0>
>>> print(cursor.fetchone())
Row(current_database='dev')

As you can see above, after passing the SSL certificate, the connection succeeds and we can fetch data from the Redshift cluster.
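If you also want hostname verification from Python, the same connect call can be made with sslmode set to verify-full, provided the host argument is the actual cluster endpoint (not localhost or an alias) so that it matches the name in the server certificate. A minimal sketch, reusing the variables defined above:

>>> conn = pgdb.connect(dbname=rs_dbname, host=rs_host, port=rs_port,
...                     user=rs_user, password=rs_passwd,
...                     sslmode='verify-full', sslrootcert=rs_cert_path)

If the hostname does not match the certificate (for example, when connecting through the localhost tunnel), this should fail with a certificate mismatch error similar to the one psql reported earlier.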

AWS Glue Python shell job timeout with custom Libraries

This is a short post on timeout errors faced when using custom libraries with an AWS Glue Python shell job. I followed the steps listed in the AWS docs to create a custom library and submitted the job with a timeout of 5 minutes. But the job timed out without any errors in the logs. The CloudWatch log reported the following messages:


2020-06-13T12:02:28.821+05:30 Installed /glue/lib/installation/redshift_utils-0.1-py3.7.egg
2020-06-13T12:02:28.822+05:30 Processing dependencies for redshift-utils==0.1
2020-06-13T12:12:45.550+05:30 Searching for redshift-module==0.1
2020-06-13T12:12:45.550+05:30 Reading https://pypi.org/simple/redshift-module/

On searching for the error, I came across this AWS forum post, where it was recommended to use Python 3.6. I referred back to the documentation, and it confirmed that AWS Glue Python shell jobs are compatible with Python 2.7 and 3.6. I was using a Python 3.7 virtualenv for my testing, so this had to be fixed.

To easily manage multiple environments, I installed Miniconda on my Mac, which allows creating virtual environments with different Python versions. Post installation, I created a new Python 3.6 env with conda and rebuilt the egg file (a minimal setup.py sketch follows the commands).

conda create -n venv36 python=3.6
conda activate venv36
python setup.py bdist_egg
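For reference, the egg is produced by a standard setuptools setup.py. The sketch below is a minimal assumed layout; the package name and version simply mirror the redshift-module==0.1 seen in the log, and your actual project structure may differ.

# setup.py - minimal setuptools layout for building the custom library egg
# (package name and version are placeholders mirroring the log above)
from setuptools import setup, find_packages

setup(
    name="redshift_module",
    version="0.1",
    packages=find_packages(),
)

Running python setup.py bdist_egg inside the Python 3.6 environment writes the egg under dist/, which can then be uploaded to S3 and referenced in the Glue job's Python library path.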

Saving AWS RDS and Redshift Billing Costs using Reserved Instances

 

In this post, we will understand what Reserved Instances are, when you need them, and various ways to monitor them in your environment.

As more enterprises embrace AWS cloud technologies, we need to understand AWS's breadth of services and pricing options to right-size our compute resources and run efficient infrastructure. Cost optimization is one of the 5 pillars of the AWS Well-Architected Framework, a framework developed by AWS to help cloud architects build secure, high-performing, resilient, and efficient infrastructure for their applications. If you would like to read more about the cost optimization pillar, you can refer to this whitepaper.

Introduction

A Reserved Instance (RI) is a billing construct that provides pricing discounts on the running instances in your AWS accounts to help lower costs. In the case of EC2, RIs also provide a capacity reservation, i.e. AWS will ensure that you always have access to EC2 capacity when you need it, for as long as you need it. The discounted usage price is fixed for the term of the RI, allowing you to predict costs over the term of the reservation. If you expect consistent heavy use (e.g., database services), RIs can provide significant cost savings compared to On-Demand instances. An important point to note is that an RI is not a physical instance, i.e. to achieve the savings, you don't have to modify your running instance.

When you purchase an RI, it is automatically applied to running RDS instances/Redshift nodes with the same instance parameters (Region, Engine/Node Type, DB Instance Class, and Instance Count). RIs do not renew automatically; you can continue to use the RDS/Redshift instances/nodes without interruption, but you will be charged On-Demand rates if you do not have a sufficient number of RIs in your AWS account. RIs can be visualised as discount coupons that are applied to your final monthly bill.

The table below lists the parameters that you set when purchasing an RI. If there is a mismatch in any of these parameters, the RI discount will not be applied.

Database Type | RI Region | RI Type | RI Class | RI Count
Amazon RDS | Region | Engine Type | DB Instance Class | Instance Count
Amazon Redshift | Region | – | Node Type | Node Count

For example, let's consider that you have been running 2 PostgreSQL RDS (db.r3.4xlarge) instances and 2 MySQL RDS (db.r3.4xlarge) instances in us-east-1 since December 2017. On January 1, 2018, you purchased RIs for 3 PostgreSQL db.r3.4xlarge instances in us-east-1. At the end of January 2018, your monthly bill would include the RDS RI discount on 2 PostgreSQL RDS instances, since you only have 2 running PostgreSQL RDS instances in the us-east-1 region. As a result, your January 2018 monthly bill would include 2 MySQL RDS instances at the On-Demand rate and 1 unused RI available for the RDS PostgreSQL database engine (see the table below).

Region | Engine Type | DB Instance Class | Running Instances | RI Available | RI Applied
us-east-1 | PostgreSQL | db.r3.4xlarge | 2 | 3 | 2
us-east-1 | MySQL | db.r3.4xlarge | 2 | 0 | 0

In order to optimize your monthly bill for all the RDS instances, you would need to purchase additional RDS RIs. To do so, you would purchase an additional RI in us-east-1 with the RDS MySQL database engine type and the matching instance class and count.

RDS Instance Size Flexibility

Amazon RDS Reserved Instances provide size flexibility for the MySQL, MariaDB, PostgreSQL, and Amazon Aurora database engines. With size flexibility, your RI’s discounted rate will automatically apply to usage of any size in the instance family (using the same database engine).

e.g. if you are running a db.r4.8xlarge with a corresponding RI and have now decided to upgrade to a db.r4.16xlarge, you can purchase a reservation for one more db.r4.8xlarge to avoid On-Demand charges. Equally, with 1 db.r4.16xlarge reservation you can run multiple combinations – (2 db.r4.8xlarge), or (4 db.r4.4xlarge), or (1 db.r4.8xlarge and 2 db.r4.4xlarge). The sketch after the table below illustrates this arithmetic.

AWS uses Normalized units to arrive at these calculations. The following table shows the number of normalized units for each DB instance size.

Instance Size | Single-AZ Normalized Units | Multi-AZ Normalized Units
micro | 0.5 | 1
small | 1 | 2
medium | 2 | 4
large | 4 | 8
xlarge | 8 | 16
2xlarge | 16 | 32
4xlarge | 32 | 64
8xlarge | 64 | 128
10xlarge | 80 | 160
16xlarge | 128 | 256
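To make the size-flexibility arithmetic concrete, here is a small Python sketch. The normalization factors come from the Single-AZ column of the table above; the instance classes are just illustrative, and the check only holds within the same instance family, database engine, and Region.

# Normalized units per instance size (Single-AZ column of the table above)
NORMALIZED_UNITS = {
    "micro": 0.5, "small": 1, "medium": 2, "large": 4, "xlarge": 8,
    "2xlarge": 16, "4xlarge": 32, "8xlarge": 64, "10xlarge": 80, "16xlarge": 128,
}

def units(instance_class, count=1):
    # "db.r4.8xlarge" -> size "8xlarge" -> 64 normalized units per instance
    size = instance_class.split(".")[-1]
    return NORMALIZED_UNITS[size] * count

reserved = units("db.r4.16xlarge")                             # 128 units reserved
running = units("db.r4.8xlarge") + units("db.r4.4xlarge", 2)   # 64 + 64 = 128 units running
print(running <= reserved)                                     # True -> usage fully covered by the RI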

When to Purchase Reserved Instance?

In order to utilise reservations optimally, buy RIs only if you plan to use the resource for a long duration. AWS provides 1-year and 3-year offerings with No Upfront, Partial Upfront, and All Upfront payment options, and the hourly rate varies based on the option chosen. You can refer to Redshift Pricing and RDS Pricing for the rates.

Note: Redshift and RDS RIs do not come with a capacity guarantee, as they are only a billing discount.

Considering that these are fixed commitments, you will be charged regardless of usage. Therefore, if the resource is going to be short-lived, or you are not sure it's the right instance type, it is recommended to use On-Demand instances. Also, if you need to add capacity for a short duration, say 10 days in a calendar year, it makes sense to pay On-Demand charges for that extra capacity rather than purchasing a reservation. You can use the AWS Simple Monthly Calculator to arrive at the total cost and make an informed decision.

Monitoring

You should monitor your reservations, as a mismatch could impact your monthly bill. A mismatch can arise from the following scenarios:

  • Scaling: In case of a scaling activity, like adding a new RDS read replica or adding nodes to a Redshift cluster, new reservations need to be bought, else there will be a mismatch between the total running and total reserved instances/nodes.
  • Expired reservations: Reservations have a fixed term of 1 or 3 years and expire at the end of the term. They are not auto-renewed, which can leave resources running at On-Demand pricing.

I have listed below the commands to get information about running and reserved resources (RDS and Redshift). You can also get this information from the AWS console under the respective service. A boto3 sketch that compares running instances against reservations follows the commands.

List running RDS DB instances

$aws rds describe-db-instances --query  "DBInstances[*].[DBInstanceIdentifier,DBInstanceClass,Engine,DBInstanceStatus,MultiAZ]" --output table

List RDS reservations

$aws rds describe-reserved-db-instances --query "ReservedDBInstances[*].[ReservedDBInstanceId,ProductDescription,DBInstanceClass,StartTime,Duration,DBInstanceCount,MultiAZ]" --output table

List Running Redshift Clusters

$aws redshift describe-clusters --query "Clusters[*].[ClusterIdentifier,NodeType,ClusterStatus,ClusterCreateTime,NumberOfNodes]" --output table

List Active Redshift Reservations

$ aws redshift describe-reserved-nodes --query "ReservedNodes[?State=='active'].[ReservedNodeOfferingId,NodeType,StartTime,Duration,NodeCount]" --output table
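To automate the comparison, the same information can be pulled with boto3 and cross-checked. Below is a simplified sketch for RDS only; the region is an assumption, pagination is ignored, and note that the reservation's ProductDescription (e.g. 'postgresql') may need normalising before it matches the instance's Engine value (e.g. 'postgres'), so treat this as a starting point rather than a finished report.

import boto3
from collections import Counter

rds = boto3.client("rds", region_name="us-east-1")  # region is an assumption

# Count running instances per (engine, instance class)
running = Counter()
for db in rds.describe_db_instances()["DBInstances"]:
    running[(db["Engine"], db["DBInstanceClass"])] += 1

# Count active reservations per (product description, instance class)
reserved = Counter()
for ri in rds.describe_reserved_db_instances()["ReservedDBInstances"]:
    if ri["State"] == "active":
        reserved[(ri["ProductDescription"], ri["DBInstanceClass"])] += ri["DBInstanceCount"]

# Flag any instance class that is running more instances than it has reservations for
for key, count in running.items():
    if count > reserved.get(key, 0):
        print(f"Possible RI gap for {key}: {count} running, {reserved.get(key, 0)} reserved")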

References

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_WorkingWithReservedDBInstances.html

https://docs.aws.amazon.com/redshift/latest/mgmt/purchase-reserved-node-instance.html

Querying RDS PostgreSQL logs using Amazon Athena

PostgreSQL logs contain useful information which can help in troubleshooting performance issues, auditing, and performing historical analysis. Some of the items which can be logged are listed below:

  • Client connection and disconnection messages
  • Failed DB connection requests
  • Database SQL statement errors
  • Database fatal errors
  • Auditing information
  • Database lock history
  • Long-running queries and query plan logging
  • High temp/sort usage
  • Checkpoint information
  • Long-running autovacuum operations

RDS PostgreSQL is an AWS managed web service which provides access to a relational database server but not to the underlying host operating system. Therefore, we need to interact with the DB using the AWS console/APIs for any maintenance activity or to view and download logs. If we then want to use the logs for troubleshooting, we have to download them to our machines and go through them. This is not a scalable solution, as each user ends up copying logs onto their desktop. Another challenge is retention: RDS retains these logs for a maximum of 7 days and rotates them after that. To solve these challenges, we can store the logs on S3 for long-term retention and use Amazon Athena to read them directly. Let's do this now 🙂

As mentioned, PostgreSQL logs can be downloaded using the console or APIs, e.g. you can use the download_db_log_file_portion method in boto3 (the Python library provided by AWS) to download database logs in 1 MB chunks. Here is sample code which uses boto3 to copy these logs (you can use it to build a custom solution that suits your needs); a simplified sketch is also shown below. We can schedule this as a Lambda function at a regular interval using CloudWatch Events.
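For illustration, a minimal version of that approach might look like the sketch below. The instance identifier, bucket name, and key prefix are hypothetical, and the marker/pagination handling is deliberately simplified.

import boto3

rds = boto3.client("rds")
s3 = boto3.client("s3")

DB_INSTANCE = "my-postgres-instance"   # hypothetical instance identifier
BUCKET = "db-logs"                     # hypothetical S3 bucket

# List the available log files and copy each one to S3
for log in rds.describe_db_log_files(DBInstanceIdentifier=DB_INSTANCE)["DescribeDBLogFiles"]:
    name = log["LogFileName"]
    chunks, marker = [], "0"
    while True:
        # download_db_log_file_portion returns the log in portions (up to ~1 MB each)
        part = rds.download_db_log_file_portion(
            DBInstanceIdentifier=DB_INSTANCE, LogFileName=name,
            Marker=marker, NumberOfLines=5000)
        chunks.append(part.get("LogFileData") or "")
        if not part["AdditionalDataPending"]:
            break
        marker = part["Marker"]
    # Store under a prefix so the objects can later be partitioned in Athena
    s3.put_object(Bucket=BUCKET, Key=f"postgres/{name}", Body="".join(chunks).encode("utf-8"))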

The following parameters are required to log the extra information in the PostgreSQL logs (a boto3 sketch for applying them through a DB parameter group follows the list).

log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 10240
log_autovacuum_min_duration = 1000
log_min_duration_statement = 5000 [milliseconds; modify as per your environment]
pgaudit.log = 'ddl,role'
shared_preload_libraries = 'auto_explain,pg_stat_statements,pgaudit'
auto_explain.log_min_duration = 5000 [milliseconds; modify as per your environment]
log_duration = off
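Since RDS does not expose postgresql.conf, these parameters are set on the instance's DB parameter group. A hedged boto3 sketch is below; the parameter group name is hypothetical, only a few parameters are shown for brevity, and static parameters such as shared_preload_libraries only take effect after a reboot.

import boto3

rds = boto3.client("rds")

# Dynamic parameters can be applied immediately
rds.modify_db_parameter_group(
    DBParameterGroupName="my-postgres-params",   # hypothetical parameter group
    Parameters=[
        {"ParameterName": "log_connections", "ParameterValue": "1", "ApplyMethod": "immediate"},
        {"ParameterName": "log_min_duration_statement", "ParameterValue": "5000", "ApplyMethod": "immediate"},
    ],
)

# Static parameters (e.g. shared_preload_libraries) need pending-reboot and an instance reboot
rds.modify_db_parameter_group(
    DBParameterGroupName="my-postgres-params",
    Parameters=[
        {"ParameterName": "shared_preload_libraries",
         "ParameterValue": "auto_explain,pg_stat_statements,pgaudit",
         "ApplyMethod": "pending-reboot"},
    ],
)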

A typical PostgreSQL log line looks like this:

2017-07-30 06:53:50 UTC:10.0.1.2(7969):dba_user@postgres:[8611]:LOG: LOG: duration: 65311.253 ms statement: update test_lock set b=2 where a=2;

The next step is to analyze these logs using Amazon Athena. Amazon Athena is an AWS service, built on Presto, which allows you to read data directly from S3 and supports various data formats like CSV, JSON, ORC, Avro, and Parquet. We will use the Hive regex SerDe to split each log line into useful columns.

CREATE EXTERNAL TABLE `postgres_logs`(
 `logtime` timestamp COMMENT 'Log timestamp', 
 `tz` string COMMENT 'Log timezone', 
 `client` string COMMENT 'Client IP or hostname', 
 `clientport` int COMMENT 'Client port', 
 `username` string COMMENT 'DB username making connection to database', 
 `dbname` string COMMENT ' database name', 
 `serverport` int COMMENT ' server port', 
 `log_level` string COMMENT ' Indicating log level i.e LOG,ERROR,FATAL,DETAILED', 
 `log_type1` string COMMENT ' Classification of event i.e connection, disconnection , audit', 
 `duration` decimal(38,6) COMMENT ' Applicable for timed queries (ms)', 
 `log_type2` string COMMENT '', 
 `message` varchar(40000) COMMENT ' Postgresql log message')
PARTITIONED BY ( 
 `year` int, 
 `month` int, 
 `day` int, 
 `hour` int)
ROW FORMAT SERDE 
 'org.apache.hadoop.hive.serde2.RegexSerDe' 
WITH SERDEPROPERTIES ( 
 'input.regex'='^(\\d{4}-\\d{2}-\\d{2}\\s\\d{2}:\\d{2}:\\d{2})\\s(\\S+):\\[?(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}|[\\w\\.-]+)?\\]?\\(?(\\d+)?\\)?:\\[?(\\w+)?\\]?@\\[?(\\w+)?\\]?:\\[?(\\d+)?\\]?:(\\w+)?:\\s*(\\w+):?\\s*(\\d+\\.\\d+)?(?:\\s\\w+)?\\s*(\\w+)?:?(.*)', 
 'timestamp.formats'='yyyy-MM-dd HH:mm:ss ') 
STORED AS INPUTFORMAT 
 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
 's3://<bucket>';

In the above DDL, we have made it a partitioned table. This helps to minimise the data scanned, which improves query performance and also reduces the AWS bill 🙂, since Athena billing is based on the amount of data scanned. To make it work, the log copy Lambda function has to be modified to store the data in the bucket using a timestamp prefix, e.g. s3://db-logs/dbname/2018/04/26/4, and each new prefix has to be registered as a partition (see the sketch below). You can refer to this blog article on Athena performance tuning techniques to understand more about this.
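One way to register the partitions, for example at the end of the same Lambda function, is to run an ALTER TABLE ... ADD PARTITION statement through the Athena API. The sketch below assumes the hypothetical bucket/prefix layout shown earlier and a hypothetical database and query result location; adjust them to your environment.

import boto3

athena = boto3.client("athena")

year, month, day, hour = 2018, 4, 26, 4    # derived from the log timestamp
ddl = (
    "ALTER TABLE postgres_logs ADD IF NOT EXISTS "
    f"PARTITION (year={year}, month={month}, day={day}, hour={hour}) "
    f"LOCATION 's3://db-logs/dbname/{year}/{month:02d}/{day:02d}/{hour}/'"
)

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},                                  # hypothetical database
    ResultConfiguration={"OutputLocation": "s3://db-logs/athena-query-results/"},   # hypothetical location
)

Once the partitions are registered, we can read the logs from the S3 bucket and answer a lot of questions, e.g.: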

How many DB connections did we have in a particular hour?

SELECT count(*) FROM postgres_logs where log_type1='connection' and log_type2='authorized' and year=2018 and month=4 and day=26 and hour=8

Queries which ran for more than 5 seconds

SELECT * FROM postgres_logs where duration>5000

Any fatal or error messages in logs?

select * from postgres_logs where log_level in ('ERROR','FATAL')

Querying Audit logs

select * from postgres_logs where log_type1 ='AUDIT'

Logs with queries spanning multiple lines will not be parsed properly. I searched for a fix but came across an article which explains that this is a limitation of the Hive regex SerDe:

Currently Hive does not support recognition of embedded newlines in text formatted data, even via its OpenCSV implementation. This is noted at: https://cwiki.apache.org/confluence/display/Hive/CSV+Serde

The reason the RegEx does not work is cause its applied on top of the record reader, which gives the RegEx only a single line input (cause its broken in an upper layer).