Here at Vero, we're having a ton of fun with Amazon's RedShift database. It is a powerful cloud-based columnar database designed for analytical workloads. We love it so far, but there have been moments some of us let a few curse words slip. I'm going to share our experiences with you relating to loading data into RedShift and the lessons learned.
As a background, Vero is a Java Application designed to help analysts, business users, and business intelligence pros prepare data across data sources. You can Download a Free Trial or learn more here. Being a Java Application, we use JDBC to connect and query RedShift and the AWS SDK to interact with other Amazon services like S3.
INserts and multi-value inserts
The first lesson we learned is that there is no way to win with plain insert on big data. Unless you are inserting a handful of rows, I would stay away from plain insert statements. RedShift is not write optimized, it is a read optimized databases solution. Being compressed columnar storage type, RedShift writes can be painfully slow. If your data is less than 10K rows, you should use multi-value insert statements. As I mentioned in my Amazon RedShift Importer post, a multi-value insert is a grouping of several value sets or tuples into a single insert command as follows:
insert into mytable(col1, col2) values (row1val1, row1val2), (row2val1,row2val2), (rowNval1, rowNval2)...
Even with this strategy you have to be careful. There are limits on the statement size and there is a point of diminishing returns. The time to insert and commit a multi-value statement increases exponentially as the insert statement size increases. By size of statement, I do not mean the number rows, rather we are talking about the string size in bytes. We found that 300KB to 500KB is the optimal size for a single insert statement.
As I alluded to earlier, we had eagerly upgraded to the latest RedShift JDBC driver 1.1.13, but we found several issues. When inserting data like URL's with question marks in it, the driver craps out. Apparently, the question mark makes it think we are in a prepared statement rather than a straight multi-value insert. Next, we recommend using the Amazon RedShift JDBC Driver with version 1.1.10 for now. Some of you maybe using a Postgres 8 driver, we recommend moving onto the RedShift driver.
gotchas with multi-value inserts
- You have to escape single quotes within a string
- You can replace a single quote with '' or \\'
- If your string contains a single quote and a semicolon, you're screwed
The last issue is a bug with RedShift. There is no way to escape a single quote when a semicolon is also present within the string.
RedShift COPY Command
If you are working with big data, you will come to love the COPY command. It is the single fastest way to load data into RedShift, but requires setup and configuration beyond a traditional JDBC application. Using the COPY command allows you to bypass all of the limitations imposed by the insert statement.
Setup S3 Bucket For Data Staging
The COPY command only supports loading data from files located in an S3 bucket. It is important that this bucket be created in a specific region rather than the National region. It may be better to create the bucket in the same region as RedShift but it is not required.
Parallel Processing for Super Fast Loads
You could try to load a single massive file using the COPY command, but it would be much faster to split your data into several files. The COPY command is designed to concurrently load data from several files into a single table. We found that more files won't cause the load time to increase by much. It's nearly constant time. Here is an example COPY command:
Notice the last parameter of the from URL is a file prefix. RedShift will load all the files in the S3 bucket that starts with the prefix. For example, if your file prefix was vero_rocks then all of the following files will be loaded into the specified table:
In our tests, the optimal file size was around 20MB. We could have gone smaller but we were comfortable with the performance here. Check out this demo of us loading nearly 3M rows and 13 CSV files into RedShift to see Vero in action.
Unlike multi-value inserts, with the COPY command you don't have to escape single quotes and other characters. We found that the COPY command handles these special characters reasonably well.
COPY Command Gotchas
- You have to wrap strings with leading or trailing single quotes in double quotes. You could also strip out the single quotes
- The COPY command does not like the new line character. Multi line strings will blow up. We simply replaced new line character with a space.
Finally, you should gzip each of the files used here. It will of course improve time to upload data to S3 but surprisingly it also improves the performance of the COPY command itself. Like with anything in programming, things may go wrong here is a view you can create to make debugging your COPY operations easier:
If you would rather not write all this code, you could use the Vero Designer to make loading RedShift easy.