This post is part 1 of a two part series. In part 2, we will demonstrate how you can solve Fact to Fact blending with Vero.
Ooh, fact tables, an ancient artifact of a bygone Business Intelligence era. Do people still have fact tables and the sort in the age of big data?
I kid, I kid...
Maybe the Kimball vision of the Data Warehouse and Conformed Facts has evolved but we still collect transactional and event data. Whether they are highly denormalized, part of a perfect Kimball Star Schema, stored in NoSQL databases, or distributed in files on a Hadoop cluster, they have one common trait: Fact data is rich with duplicate values on purpose.
Fact data / tables
If you are new to Analytics, this might sound confusing so let me clarify. When you checkout at Target with a box of Cheerios and a pair of socks, the data gets ultimately stored in a single table. There will be an entry for each product with additional details like who made this purchase, when, and which store. If you bought a pair of socks from the same Target store everyday, each transaction would be identical except for "when".
|Date||Store Number||Customer Number||Product||Sale Amount|
|March 1, 2016||45||123||Socks||5.50|
|March 2, 2016||45||123||Socks||5.50|
|March 3, 2016||45||123||Socks||5.50|
In the above table to get to a unique event (record) you have to combine Date, Store Number, Customer Number, and Product. This is known as the granularity of the table.
On the flip side dimension / lookup data would only have one entry for each item it captures. For example, if we had a Customers table it would only have one entry to identify you. In the above case, it would have the ID 123 with your name and other attributes that describe you (i.e. address, email etc.).
Blending Fact TaBLEs In Tableau
In the real world of data and analytics you often have more than one fact table. In larger organizations you could have hundreds, each with a different granularity. When you need to build a visualization that spans fact tables you might initially be tempted to create a Tableau data source with a join between the fact tables. As you may have learned the hard way, there are several drawbacks.
Problems with Joining Fact Tables
- Direct fact to fact joins are expensive! Millions, billions, or even trillions of rows joined together make for slow queries. You might get an angry call from your DBA
- If the granularity of the fact tables are different (90% chance that is the case), then you will end up double counting your Measures
Rule of Thumb: Avoid joining Fact Tables Directly
See my article on Mastering SQL for more on the issues with fact to fact joins.
Blending Data Sources
With the idea of joining fact tables directly mostly ruled out, we are left with the option of blending data sources in Tableau. Here each fact table and relevant dimension tables would be brought in as separate data sources. In my example, I have an an orders fact table called order_yr_prd_agg with granularity Customer ID, Product ID, Year and another fact table called product_complaints with granularity Customer ID, Product ID, Complaint Date.
Here is how I added the OrderYrPrdAgg data source:
Notice that I included the necessary dimension tables in this step. Tableau will automatically include exclude these tables based on the needs of the Visualization. This is known as join culling (make sure you set your data source to "Assume Referential Integrity").
Here is the other fact table added as the ProductComplaints data source:
With the data sources setup we will create a simple Visualization that displays Product Name, Revenue (from OrderYrPrdAgg), and Complaints (from ProductComplaints). We will start with OrderYrPrdAgg and set it as our primary data source. Here is the resulting visualization:
Notice the orange link icon on the secondary data source browser. This indicates how the two sources will ultimately be joined together. Tableau made this decision automatically based on the Dimensionality of the my visualization. In other words, I have Product Name in the rows shelf and it is the only dimension so the results will be aggregated on this key.
Internals of a Data Blend with Tableau
Internally, Tableau executes a query for each data source and joins the two sources inside the visualization (your desktop or your Tableau server).
Tableau SQL for OrderYrPrdAgg
Tableau SQL for ProductComplaints
Awesome, we have our fully blended Visualization! Not quite. When blending data in Tableau, there are many considerations to be made; not least of all is the volume of data being blended.
Tableau Data Blending Issues
Join Type Between Sources
Tableau does a left join between the primary data source and the secondary. In our case, if we had products that never sold, but those products have complaints, we won't see them in the final result.
To get all products in the final result we need to execute a full join and to only get products that exist on both sides we need an inner join. As far as I know, there is no solution for this besides writing custom SQL.
If you had millions of rows being blended between the two sources, you could kiss performance good bye. Tableau blends data at the Visualization layer. It might be your laptop or it might be in your Tableau Server. Needless to say, this is not going to scale.
There is only one way to solve this in Tableau and that is by writing custom SQL. Instead of using two data sources, I will create one data source with a custom SQL that combines the two SQL queries above into a single pass as follows:
Problem solved? Unfortunately not. When you use custom SQL, you lose Join Culling. This means every request to your database would run the SQL above with some variation in the selections and outer filters. You are likely to feel the most pain with slow quick filters. Additionally, you are hard coding your query logic inside your visualization, which is a recipe for silo proliferation and maintenance headaches.
Finally, what if the two fact tables were in different sources? Say our Product Complaints fact data was in a CSV file or another database, perhaps MySQL. Custom SQL could not help us here. We would have to use two separate data sources and incur a performance cost of blending data inside Tableau.
Data preparation solutions like Vero, helps deal with these issues and more. In part 2 of this post, I will show you how you can solve this problem with the Vero Designer. Vero has automatic in database blending, which will make scenarios like this a non-issue. Stay tuned!
If you thought this post was great, then you will love our new service Lodr.IO. Loading files into RedShift just got easy.