Fact to Fact Blending in Tableau

This post is part 1 of a two part series. In part 2, we will demonstrate how you can solve Fact to Fact blending with Vero.

Fact to Fact Direct Joins Leads to Duplicates

Ooh, fact tables, an ancient artifact of a bygone Business Intelligence era. Do people still have fact tables and the sort in the age of big data?

I kid, I kid...

Maybe the Kimball vision of the Data Warehouse and Conformed Facts has evolved but we still collect transactional and event data.  Whether they are highly denormalized, part of a perfect Kimball Star Schema, stored in NoSQL databases, or distributed in files on a Hadoop cluster, they have one common trait: Fact data is rich with duplicate values on purpose. 

Fact data / tables

If you are new to Analytics, this might sound confusing so let me clarify.  When you checkout at Target with a box of Cheerios and a pair of socks, the data gets ultimately stored in a single table.  There will be an entry for each product with additional details like who made this purchase, when, and which store.  If you bought a pair of socks from the same Target store everyday, each transaction would be identical except for "when".  

Date Store Number Customer Number Product Sale Amount
March 1, 2016 45 123 Socks 5.50
March 2, 2016 45 123 Socks 5.50
March 3, 2016 45 123 Socks 5.50

In the above table to get to a unique event (record) you have to combine Date, Store Number, Customer Number, and Product.  This is known as the granularity of the table.

On the flip side dimension / lookup data would only have one entry for each item it captures.  For example, if we had a Customers table it would only have one entry to identify you.  In the above case, it would have the ID 123 with your name and other attributes that describe you (i.e. address, email etc.).

Blending Fact TaBLEs In Tableau

Joining Tables

In the real world of data and analytics you often have more than one fact table.  In larger organizations you could have hundreds, each with a different granularity. When you need to build a visualization that spans fact tables you might initially be tempted to create a Tableau data source with a join between the fact tables.  As you may have learned the hard way, there are several drawbacks.

Problems with Joining Fact Tables

  1. Direct fact to fact joins are expensive!  Millions, billions, or even trillions of rows joined together make for slow queries.  You might get an angry call from your DBA
  2. If the granularity of the fact tables are different (90% chance that is the case), then you will end up double counting your Measures

Rule of Thumb: Avoid joining Fact Tables Directly

See my article on Mastering SQL for more on the issues with fact to fact joins. 

Blending Data Sources

With the idea of joining fact tables directly mostly ruled out, we are left with the option of blending data sources in Tableau.  Here each fact table and relevant dimension tables would be brought in as separate data sources.  In my example, I have an an orders fact table called order_yr_prd_agg with granularity Customer ID, Product ID, Year and another fact table called product_complaints with granularity Customer ID, Product ID, Complaint Date.

Here is how I added the OrderYrPrdAgg data source:

setting up a fact table joined with dimension tables in tableau

Notice that I included the necessary dimension tables in this step.  Tableau will automatically include exclude these tables based on the needs of the Visualization. This is known as join culling (make sure you set your data source to "Assume Referential Integrity").

Here is the other fact table added as the ProductComplaints data source:

Fact to fact join in tableau - setup up second fact

With the data sources setup we will create a simple Visualization that displays Product Name, Revenue (from OrderYrPrdAgg), and Complaints (from ProductComplaints).  We will start with OrderYrPrdAgg and set it as our primary data source.  Here is the resulting visualization:

Notice the orange link icon on the secondary data source browser.  This indicates how the two sources will ultimately be joined together.  Tableau made this decision automatically based on the Dimensionality of the my visualization. In other words, I have Product Name in the rows shelf and it is the only dimension so the results will be aggregated on this key.

Internals of a Data Blend with Tableau

Internally, Tableau executes a query for each data source and joins the two sources inside the visualization (your desktop or your Tableau server).

Tableau SQL for OrderYrPrdAgg

Tableau SQL for ProductComplaints

Awesome, we have our fully blended Visualization! Not quite. When blending data in Tableau, there are many considerations to be made; not least of all is the volume of data being blended.

Tableau Data Blending Issues

Join Type Between Sources

Tableau does a left join between the primary data source and the secondary.  In our case, if we had products that never sold, but those products have complaints, we won't see them in the final result. 

From MySQLTutorial.com

From MySQLTutorial.com

To get all products in the final result we need to execute a full join and to only get products that exist on both sides we need an inner join.  As far as I know, there is no solution for this besides writing custom SQL.

Scale

If you had millions of rows being blended between the two sources, you could kiss performance good bye.  Tableau blends data at the Visualization layer.  It might be your laptop or it might be in your Tableau Server.  Needless to say, this is not going to scale.

There is only one way to solve this in Tableau and that is by writing custom SQL. Instead of using two data sources, I will create one data source with a custom SQL that combines the two SQL queries above into a single pass as follows:

Problem solved?  Unfortunately not. When you use custom SQL, you lose Join Culling.  This means every request to your database would run the SQL above with some variation in the selections and outer filters.  You are likely to feel the most pain with slow quick filters.  Additionally, you are hard coding your query logic inside your visualization, which is a recipe for silo proliferation and maintenance headaches.

Finally, what if the two fact tables were in different sources? Say our Product Complaints fact data was in a CSV file or another database, perhaps MySQL. Custom SQL could not help us here.  We would have to use two separate data sources and incur a performance cost of blending data inside Tableau.

Ultimate Solution

Data preparation solutions like Vero, helps deal with these issues and more.  In part 2 of this post, I will show you how you can solve this problem with the Vero Designer.  Vero has automatic in database blending, which will make scenarios like this a non-issue.  Stay tuned!


If you thought this post was great, then you will love our new service Lodr.IO. Loading files into RedShift just got easy.