This Question have 1 answers right now.

AWS Redshift JDBC insert performance

By : dty
Source: Stackoverflow.com
Question!

I am writing a proof-of-concept app which is intended to take live clickstream data at the rate of around 1000 messages per second and write it to Amazon Redshift.

I am struggling to get anything like the performance some others claim (for example, here).

I am running a cluster with 2 x dw.hs1.xlarge nodes (+ leader), and the machine that is doing the load is an EC2 m1.xlarge instance on the same VPC as the Redshift cluster running 64 bit Ubuntu 12.04.1.

I am using Java 1.7 (openjdk-7-jdk from the Ubuntu repos) and the Postgresql 9.2-1002 driver (principally because it's the only one in Maven Central which makes my build easier!).

I've tried all the techniques shown here, except the last one.

I cannot use COPY FROM because we want to load data in "real time", so staging it via S3 or DynamoDB isn't really an option, and Redshift doesn't support COPY FROM stdin for some reason.

Here is an excerpt from my logs showing that individual rows are being inserted at the rate of around 15/second:

2013-05-10 15:05:06,937 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Beginning batch of 170
2013-05-10 15:05:18,707 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:05:18,708 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Beginning batch of 712
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Done
2013-05-10 15:06:03,078 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Beginning batch of 167
2013-05-10 15:06:14,381 [pool-1-thread-2] INFO  uk.co...redshift.DatabaseWriter - Done

What am I doing wrong? What other approaches could I take?

By : dty


Answers

I've been able to achieve 2,400 inserts/second by batching writes into transactions of 75,000 records per transaction. Each record is small, as you might expect, being only about 300 bytes per record.

I'm querying a MariaDB installed on an EC2 instance and inserting the records into RedShift from the same EC2 instance that Maria is installed on.

UPDATE

I modified the way I was doing writes so that it loads the data from MariaDB in 5 parallel threads and writes to RedShift from each thread. That increased performance to 12,000 writes/second.

So yeah, if you plan it correctly you can get great performance from RedShift writes.



Video about AWS Redshift JDBC insert performance