We recently launched LODR.IO to make loading data into RedShift dead simple. Its built on a super fast serverless backend. We've seen our own data loads get 5x faster and 100x cheaper. Even better, you don't have to define table schemas when loading data.
You can upload zip files and more directly into RedShift.
We recently announced the availability of Vero Designer 2.0, which includes many improvements but most notably our RedShift integration. Amazon RedShift is a cloud based database targeting data warehousing and analytical use cases. Data is stored in a compressed columnar format making it ideal for big data analytics. In this post, I will describe how you can use Vero to quickly load data into a RedShift table, describe some of the technology underpinning our high speed data loader, and a few gotchas we learned the hard way.
Loading a redshift table with Vero
Launch the Vero Designer and open the Datasource Manager by clicking the datasource icon at the top left corner of the navigation bar:
Click on the "Add Datasource" button beneath the search bar to configure your RedShift connection:
In the pop up datasource configuration wizard enter your RedShift credentials. Be sure to connect to a RedShift database where you have privileges to create tables. When you import data to a table, Vero will drop any existing table with the same name and create a new table for your data. You will also be able to import data into tables in a specific schema (more on that later).
CONFIGURE S3 Credentials
RedShift loads data much faster when using the COPY command, which requires data to be first loaded into an S3 bucket. Click the AWS S3 tab as show below to configure your S3 credentials:
Your Amazon infrastructure administrators should be able to provide you with an Access Key ID and Access Key. To learn more about creating and retrieving access keys checkout this tutorial on Managing Access Keys for IAM users. The keys you use here are associated with an IAM user that should have S3 upload privileges to the S3 bucket defined above. Click "Finish" at the bottom to complete the connection configuration.
Next, drop a CSV, TXT, or Log file into the file drop zone or use the browse button to locate your file:
You will initially see a pop up where you can set field and row delimiters and change the file character encoding type. Most of the time this will be auto detected and you can just click "Ok." In the next screen, you can preview the data and apply data transformations by using the wrangling menu as shown below:
Now, set the target table name by configuring the "Name" parameter on the left hand side as show below:
If you wished to import data into another schema other than the default schema (public is the default schema of RedShift), then enter the table name as schema_name.table_name. Click "Import" at the bottom left of the screen and you will see a popup dialog box as follows:
Here you can choose to import data into a Local Vero DB, export to a file, or into one of the RedShift data sources you've configured within the Vero Designer. Choose the RedShift data source we configured earlier and Click "Ok". Vero will first wrangle the data based on the transformation you may have applied and then load the data into your target table. When the data load finishes, Vero will import the table schema and metadata into the project. You can locate this table under the RedShift data source you chose earlier and right click on it to create a new report:
From this point you can use the Vero Designer as usual to build complex reports, blend data across data sources, and export reports into reporting tables for faster visualizations.
Start Importing Data into Amazon RedShift
How Vero loads Redshift
I'm going to cover this briefly here with more posts to come with more detail in the near future. There are three ways to load data into Redshift:
- Insert Statements
- Multi-value Inserts
- The COPY Command
The Vero Designer is a java application and as a result we use JDBC drivers to handle interactions with the RedShift database. Unlike the Postgres or even the Microsoft SQL Server JDBC drivers, the RedShift JDBC driver does not come with Bulk Insert API. We ended up having to manually write the bulk loader strategies for both 2 and 3.
For the most part single value insert statement (option 1) is pretty useless unless you want to import just a handful of rows. Most of the use cases with Vero involve 100s to millions of rows. In our tests, we were able to import a 1 million row csv file beginning to end in less than 2 minutes!
Multi-value Insert Statements
In most JDBC implementations there is an option to accumulate many insert statements into batches using the addBatch method, then you can push the batch by calling the executeBatch method. Amazon RedShift JDBC driver does not support this option but you can simulate the option by manually writing muti-value insert statements as follows:
insert into mytable(col1, col2) values (row1val1, row1val2), (row2val1,row2val2), (rowNval1, rowNval2)...
We found this technique works well when batching statements in chunks of 2000 and up to 10000 rows. A sample JDBC implementation may look this:
the copy command
While the multi-value option works well for sub 10K rows, it really gets painfully slow as you get to bigger and bigger data. Even 10k can seem slow as the number of columns increase. This is where the COPY command comes to the rescue. We are going to write a ton more on this topic in the future. I'll just lay out some of the highlights here:
- Your data must be located in an S3 Bucket
- The S3 bucket you create must be in a specific region rather than the National Region
- Vero splits your data into 500K row files, zips them locally, and pushes it to your configured S3 bucket
- More files doesn't seem to increase the time to run the copy command on a quiet system
- You have to take care in how you set the field delimiters of your file. We had trouble with multi-line strings and strings with quotes and single quotes
Overall, we love working with RedShift. Unlike other MPP databases like Teradata, RedShift is extremely developer friendly. The technology and documentation has improved tremendously since my first experiences with RedShift at Netflix. However, there are still problems with the JDBC driver not fully implementing the API. The driver doesn't handle plainly inserting string values with prepared statements. For example, I had to manually escape single quotes in strings even when using a JDBC prepared statement. Finally, our customers love the speed RedShift brings to their data. Their data is often very wide and a columnar storage strategy performs superbly here.