
An Introduction to Data Blending – Part 3 (Benefits of Blending Data)

Readers:

In Part 2 of this series on data blending, we delved deeper into understanding what data blending is. We also examined how data blending is used in Hans Rosling’s well-known Gapminder application.

Today, in Part 3 of this series, we will dig even deeper by examining the benefits of blending data.

Again, much of Parts 1, 2 and 3 is based on a research paper written by Kristi Morton of the University of Washington (and others) [1].

You can learn more about Ms. Morton’s research as well as other resources used to create this blog post by referring to the References at the end of the blog post.

Best Regards,

Michael

Benefits of Blending Data

In this section, we will examine the advantages of using the data blending feature for integrating datasets. Additionally, we will review another illustrative example of data blending using Tableau.

Integrating Data Using Tableau

In Ms. Morton’s research, Tableau was equipped with two ways of integrating data. First, where the datasets are collocated (or can be collocated), Tableau formulates a single query that joins them to produce a visualization. Second, where the datasets are not collocated (or cannot be collocated), Tableau federates queries to each data source and creates a dynamic, blended view that consists of the joined result sets of those queries. For exploratory visual analytics, Ms. Morton et al. found that data blending is a complementary technology to the standard collocated approach, with the following benefits:

  • Resolves many data granularity problems
  • Resolves collocation problems
  • Adapts to needs of exploratory visual analytics

Figure 1 - Company Tables

Image: Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau. [1]

Resolving Data Granularity Problems

Often a user wants to combine data that is not at the same granularity (i.e., the datasets have different primary keys). For example, let’s say that an employee at company A wants to compare its yearly sales growth with that of a competitor, company B. The dataset for company B (see Figure 1 above) contains detailed quarterly sales growth for B (the primary key is quarter, year), while company A’s dataset includes only yearly sales (the primary key is year). If the employee simply joins these two datasets on year, then each row from A will be duplicated for each quarter in B for a given year, resulting in an inaccurate overestimate of A’s yearly earnings.

This duplication problem can be avoided if, for example, company B’s sales dataset is first aggregated to the level of year and then joined with company A’s dataset. In this case, data blending detects that the datasets are at different granularities by examining their primary keys, and notes that the common field on which to join them is year. To join them on year, an aggregation query is issued to company B’s dataset, which returns the sales aggregated up to the yearly level, as shown in Figure 1. This result is blended with company A’s dataset to produce the desired visualization of yearly sales for companies A and B.

The blending feature does all of this on the fly, without user intervention.
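To make the duplication problem concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (`company_a`, `company_b`) and all sales figures are made up for illustration, and both tables live in one in-memory database purely for convenience. This is not how Tableau implements blending; it only shows why aggregating to the common field (year) before joining avoids the overestimate.

```python
import sqlite3

# Hypothetical data: company A is yearly, company B is quarterly.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company_a (year INTEGER PRIMARY KEY, sales REAL);
    CREATE TABLE company_b (quarter INTEGER, year INTEGER, sales REAL,
                            PRIMARY KEY (quarter, year));
    INSERT INTO company_a VALUES (2010, 100.0), (2011, 120.0);
    INSERT INTO company_b VALUES
        (1, 2010, 20.0), (2, 2010, 25.0), (3, 2010, 30.0), (4, 2010, 25.0),
        (1, 2011, 30.0), (2, 2011, 30.0), (3, 2011, 35.0), (4, 2011, 35.0);
""")

# Naive join on year: every yearly row of A is repeated once per quarter of B,
# so summing A's sales counts each year four times.
naive = conn.execute("""
    SELECT a.year, SUM(a.sales), SUM(b.sales)
    FROM company_a a
    JOIN company_b b ON a.year = b.year
    GROUP BY a.year
    ORDER BY a.year
""").fetchall()

# Aggregate B up to the yearly level first, then join on year.
blended = conn.execute("""
    SELECT a.year, a.sales, b.sales
    FROM company_a a
    JOIN (SELECT year, SUM(sales) AS sales
          FROM company_b
          GROUP BY year) b ON a.year = b.year
    ORDER BY a.year
""").fetchall()

print(naive)    # [(2010, 400.0, 100.0), (2011, 480.0, 130.0)]
print(blended)  # [(2010, 100.0, 100.0), (2011, 120.0, 130.0)]
```

In the naive result, company A's 2010 sales appear as 400 instead of 100, which is exactly the overestimate described above.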

Resolves Collocation Problems

As mentioned in Part 1, a managed repository is expensive and untenable. In other cases, the data repository may have a rigid structure, as with cubes, to ensure performance, support security, or protect data quality. Furthermore, it is often unclear whether it is worth the effort to integrate an external dataset of uncertain value. The user may not know whether the data has enough value to justify the time needed to integrate and load it into her repository until she has started exploring it.

Thus, one of the paramount benefits of data blending is that it allows users to start exploring their data quickly; as they explore, the integration happens automatically as a natural part of the analysis cycle.

An interesting final benefit of the blending approach is that it enables users to seamlessly integrate across different types of data (which usually exist in separate repositories) such as relational, cubes, text files, spreadsheets, etc.
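As a rough illustration of integrating across different types of data, the sketch below queries a relational source (an in-memory SQLite table) and a text source (CSV content held in a string) separately, then combines the two result sets on a shared field. The `sales` table, the `target` column, the region names, and the helper functions are all hypothetical. The left-join behavior follows the description in this series rather than any Tableau API.

```python
import csv
import io
import sqlite3
from collections import defaultdict

# Primary source: a relational table (hypothetical data).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('East', 10.0), ('East', 15.0), ('West', 7.0);
""")

# Secondary source: a text file, simulated here with an in-memory CSV.
targets_csv = io.StringIO("region,target\nEast,20\nWest,10\nNorth,5\n")


def query_primary():
    """Aggregate the relational source to the common field (region)."""
    rows = db.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
    return dict(rows.fetchall())


def query_secondary(handle):
    """Aggregate the CSV source to the same common field."""
    totals = defaultdict(float)
    for row in csv.DictReader(handle):
        totals[row["region"]] += float(row["target"])
    return dict(totals)


def blend(primary, secondary):
    """Left join: keep every primary key; unmatched secondary keys are dropped."""
    return {key: (value, secondary.get(key)) for key, value in primary.items()}


blended = blend(query_primary(), query_secondary(targets_csv))
print(blended)  # {'East': (25.0, 20.0), 'West': (7.0, 10.0)}
```

Neither source had to be loaded into the other's repository; only the small, already-aggregated result sets are combined.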

Adapts to Needs of Exploratory Visual Analytics

A key benefit of data blending is its flexibility: it gives users the freedom to view their blended data at different granularities and to control how data is integrated on the fly. The blended views are created dynamically as the user visually explores the datasets. For example, the user can drill down, roll up, pivot, or filter any blended view as needed during exploratory analysis. This is useful for data exploration and what-if analysis.
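One way to picture these dynamic views is a function that rebuilds the blended result whenever the user changes the level of detail. The sketch below is only an assumption about the general idea, not Tableau's implementation; the `primary` and `secondary` rows and the helper names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical quarterly data for two sources.
primary = [
    {"year": 2010, "quarter": 1, "sales": 10.0},
    {"year": 2010, "quarter": 2, "sales": 12.0},
    {"year": 2011, "quarter": 1, "sales": 14.0},
]
secondary = [
    {"year": 2010, "quarter": 1, "sales": 8.0},
    {"year": 2010, "quarter": 2, "sales": 9.0},
    {"year": 2011, "quarter": 1, "sales": 11.0},
    {"year": 2011, "quarter": 2, "sales": 13.0},
]


def aggregate(rows, keys, measure):
    """Sum `measure` for each distinct combination of `keys`."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[k] for k in keys)] += row[measure]
    return dict(totals)


def blended_view(keys):
    """Re-aggregate both sources at the requested level, then left join."""
    left = aggregate(primary, keys, "sales")
    right = aggregate(secondary, keys, "sales")
    return {key: (value, right.get(key)) for key, value in sorted(left.items())}


yearly = blended_view(["year"])                # roll up
quarterly = blended_view(["year", "quarter"])  # drill down

print(yearly)     # {(2010,): (22.0, 17.0), (2011,): (14.0, 24.0)}
print(quarterly)  # {(2010, 1): (10.0, 8.0), (2010, 2): (12.0, 9.0), (2011, 1): (14.0, 11.0)}
```

Each call issues fresh aggregations at the chosen granularity, so rolling up or drilling down never reuses a join computed at the wrong level.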

Another Illustrative Example of Data Blending

Figure 2 (below) illustrates the possible outcomes of an election for District 2 Supervisor of San Francisco. With this type of visualization, the user can select different election styles and see how their choice affects the outcome of the election.

What’s interesting from a blending standpoint is that this is an example of a many-to-one relationship between the primary and secondary datasets. This means that the fields being left-joined in from the secondary data source match multiple rows in the primary dataset, which results in these values being duplicated. Any subsequent aggregation would then reflect the duplicated data, resulting in overestimates. The blending feature, however, prevents this scenario by performing all aggregation before the left join duplicates any data.
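The difference between joining first and aggregating first can be sketched with a small, made-up dataset. The `votes` and `spending` tables, the candidates X and Y, and every number below are hypothetical and are not taken from the San Francisco election data; the example only shows the many-to-one effect described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Primary dataset: many rows per candidate (one per precinct).
    CREATE TABLE votes (precinct TEXT, candidate TEXT, votes INTEGER);
    -- Secondary dataset: one row per candidate.
    CREATE TABLE spending (candidate TEXT PRIMARY KEY, spend INTEGER);
    INSERT INTO votes VALUES
        ('P1', 'X', 100), ('P2', 'X', 150),
        ('P1', 'Y', 120), ('P2', 'Y', 90);
    INSERT INTO spending VALUES ('X', 500), ('Y', 300);
""")

# Join first, aggregate second: each spend value is repeated once per
# precinct row, so SUM(spend) is inflated.
join_then_aggregate = conn.execute("""
    SELECT v.candidate, SUM(v.votes), SUM(s.spend)
    FROM votes v
    LEFT JOIN spending s ON v.candidate = s.candidate
    GROUP BY v.candidate
    ORDER BY v.candidate
""").fetchall()

# Aggregate each side to the join field first, then left join.
aggregate_then_join = conn.execute("""
    SELECT v.candidate, v.votes, s.spend
    FROM (SELECT candidate, SUM(votes) AS votes
          FROM votes GROUP BY candidate) v
    LEFT JOIN (SELECT candidate, SUM(spend) AS spend
               FROM spending GROUP BY candidate) s
      ON v.candidate = s.candidate
    ORDER BY v.candidate
""").fetchall()

print(join_then_aggregate)  # [('X', 250, 1000), ('Y', 210, 600)]
print(aggregate_then_join)  # [('X', 250, 500), ('Y', 210, 300)]
```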

Figure 2 - San Francisco Election

Image: Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau. [1]

Next: Data Blending Design Principles

——————————————————————————————————–

References:

[1] Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau, University of Washington and Tableau Software, Seattle, Washington, March 2012, http://homes.cs.washington.edu/~kmorton/modi221-mortonA.pdf.

[2] Hans Rosling, Wealth & Health of Nations, Gapminder.org, http://www.gapminder.org/world/.

Bryan Redux #1: Left Joins in MicroStrategy

Readers:

I am occasionally going to re-blog posts from my friend Bryan Brandow’s MicroStrategy site.

I consider Bryan one of the best in the business, but his passions lie in other areas these days.

I will denote these blogs by beginning them with “Bryan Redux.” If you want to visit Bryan’s old site, the URL is http://www.bryanbrandow.com.

Best Regards,

Michael

Bryan posted this on Tuesday, March 22, 2011.

Left Joins in MicroStrategy

An interesting stance by MicroStrategy is that they really push you for proper warehouse modeling (or at least what they consider proper). At the same time, the tool’s flexibility can really handle just about any model, and I’ve seen the SQL Engine come through in some amazing scenarios where other vendors laughed and walked out. One commonly requested feature is the ability to left join two tables in a report. That is, left join Dimension to Fact, not left joining multiple passes. There are plenty of valid reasons you would need this feature, and for many years I would joke “MicroStrategy can do everything ... except Left Joins”. Imagine my surprise when I discovered an extremely buried feature that does enable left joins! I stumbled on this several months ago and have no idea if it’s been there all along or was introduced recently. Based on forum and friend activity, not many other people are aware of it either. Today, I’ll show you the secret.

Build a normal report with Attribute1, Attribute2 and a Metric.  The SQL will come out like this:

select a12.Attribute1 Attribute1, a13.Attribute2 Attribute2, sum(a11.Fact) Metric
from FactTable a11
join DimAttribute1 a12 on (a11.Attribute1Key = a12.Attribute1Key)
join DimAttribute2 a13 on (a11.Attribute2Key = a13.Attribute2Key)
group by a12.Attribute1, a13.Attribute2

But let’s say that you need to left join DimAttribute2 to FactTable.  Simply follow these steps:

Step 1: Edit the Attribute

  1. In the attribute editor, go to Tools -> VLDB Properties.
  2. Change the property Joins -> Preserve all final pass result elements to the third option, “Preserve all elements of final pass result table with respect to lookup table but not relationship table”.
  3. Update Schema.

Step 2: Edit the Report

  1. In the report editor, go to Data -> VLDB Properties.
  2. Change the property Joins -> Preserve all final pass result elements to the fourth option, “Do not listen to per report level setting, preserve elements of the final pass according to the setting at the attribute level”.

With those two options combined, the resulting report now generates this SQL:

select a12.Attribute1 Attribute1, a13.Attribute2 Attribute2, sum(a11.Fact) Metric
from FactTable a11
join DimAttribute1 a12 on (a11.Attribute1Key = a12.Attribute1Key)
left outer join DimAttribute2 a13 on (a11.Attribute2Key = a13.Attribute2Key)
group by a12.Attribute1, a13.Attribute2
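To see what the left outer join changes in practice, here is a small sketch that runs both versions of the query against SQLite from Python. The table and column names come from the SQL above; the sample rows, including a fact row whose Attribute2Key (99) has no match in DimAttribute2, are invented for illustration. SQLite stands in for the warehouse here, so this only demonstrates the join semantics, not MicroStrategy itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimAttribute1 (Attribute1Key INTEGER PRIMARY KEY, Attribute1 TEXT);
    CREATE TABLE DimAttribute2 (Attribute2Key INTEGER PRIMARY KEY, Attribute2 TEXT);
    CREATE TABLE FactTable (Attribute1Key INTEGER, Attribute2Key INTEGER, Fact REAL);
    INSERT INTO DimAttribute1 VALUES (1, 'A1');
    INSERT INTO DimAttribute2 VALUES (10, 'B1');
    -- The last fact row points at Attribute2Key 99, which has no dimension row.
    INSERT INTO FactTable VALUES (1, 10, 5.0), (1, 10, 7.0), (1, 99, 3.0);
""")

inner_sql = """
    select a12.Attribute1 Attribute1, a13.Attribute2 Attribute2, sum(a11.Fact) Metric
    from FactTable a11
    join DimAttribute1 a12 on (a11.Attribute1Key = a12.Attribute1Key)
    join DimAttribute2 a13 on (a11.Attribute2Key = a13.Attribute2Key)
    group by a12.Attribute1, a13.Attribute2
"""

left_sql = """
    select a12.Attribute1 Attribute1, a13.Attribute2 Attribute2, sum(a11.Fact) Metric
    from FactTable a11
    join DimAttribute1 a12 on (a11.Attribute1Key = a12.Attribute1Key)
    left outer join DimAttribute2 a13 on (a11.Attribute2Key = a13.Attribute2Key)
    group by a12.Attribute1, a13.Attribute2
"""

inner_rows = conn.execute(inner_sql).fetchall()
left_rows = conn.execute(left_sql).fetchall()

# The inner join silently drops the unmatched fact row (3.0);
# the left outer join keeps it, with Attribute2 returned as None.
print(inner_rows)
print(left_rows)
```

With the inner join, the report total is 12.0; with the left outer join, the extra 3.0 shows up under a null Attribute2, so the metric reconciles with the fact table.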

Conclusion

Note that since you need to turn on a report-level setting, changing the attribute won’t modify your entire system. This is nice because you can choose to let some reports left join on that attribute while others do not. One side effect I have experienced is that this attribute is no longer eligible for Intelligent Cubes. If you can live with that, this becomes a pretty handy trick.

Bryan’s Blog Entry Link:  http://www.bryanbrandow.com/2011/03/left-joins-in-microstrategy.html