Tag Archives: Data

Infographic: The World’s Biggest Data Breaches (Information is Beautiful)

Readers:

Happy New Year!

In late November, presumed North Korean hackers targeted Sony Pictures Entertainment in an unprecedented cyber attack. The attack exposed thousands of sensitive emails from Sony executives, accompanied by threats to release more if the release of the film “The Interview” wasn’t canceled.

While this breach was indeed historically devastating, it’s not the first successful cyber attack on a big corporate powerhouse.

David McCandless and the folks over at Information Is Beautiful have put together an amazing infographic with the biggest data breaches in recent history. You can see when the attack happened, who it happened to, and how large the impact was.

I always encourage my social media friends to reset all of their passwords at the start of each new year. Now is the time to do so.

Safe blogging.

Michael

[Click on image to use the interactive version]


12 JavaScript Libraries for Data Visualization

Readers:

This is from a blog post by Thomas Greco.

Thomas is a web developer and graphic designer living in New York City. When Thomas isn’t striving toward front-end perfection, he enjoys hanging out with friends, going to concerts, and exploring the wilderness!

Thomas has provided twelve JavaScript frameworks that are extremely useful for data visualization. Thomas feels that an increasingly heavy focus is being placed on JavaScript as a data visualization tool.

I tried the demos for these JavaScript frameworks and they are very impressive. I hope you enjoyed this information as much as I did.

Best regards,

Michael

Dygraphs.js

The Dygraphs.js library allows developers to create interactive charts that plot data along the X and Y axes to produce powerful diagrams. The more data being parsed, the more useful the graph becomes. With that in mind, Dygraphs was built so that these visualizations can contain a multitude of views. For example, Dygraphs.js makes it possible to analyze separate portions of a dataset, such as specific months, in addition to the timeframe in its entirety. The Dygraphs.js library is also compatible across all major web browsers and responds to touch input, making it a thoroughly solid choice as a data visualization framework.
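To make this concrete, here is a minimal sketch of the typical Dygraphs call – a container element, CSV data, and an options object. The element ID, data values, and the showRangeSelector option are illustrative assumptions, not code from the article:

// Minimal Dygraphs sketch: container, CSV data, options (values are assumptions).
var g = new Dygraph(
  document.getElementById("chart"),  // container <div>
  "Date,Sales\n" +                   // CSV string; first column is the x-axis
  "2014-01-01,120\n" +
  "2014-02-01,95\n" +
  "2014-03-01,143\n",
  { showRangeSelector: true }        // lets the viewer zoom into, e.g., specific months
);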

D3.js

Eventually becoming the successor to Protovis.js, D3 is capable of creating stunning graphics by dynamically updating the DOM. Short for Data-Driven Documents, D3.js makes use of chained methods when scripting visualizations, producing dynamic code that is also reusable. Because of its reliance on the DOM, D3 was created in accordance with W3C web standards so that the library renders correctly across web browsers. Lastly, D3’s path generator, d3.svg.line(), gives developers the ability to produce a variety of SVG shapes by defining different paths and their properties.
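As a rough illustration of the chained-method style and the d3.svg.line() generator mentioned above (the data and pixel sizes here are made up):

// D3 (v3-era) sketch: build a line path with d3.svg.line() using chained methods.
var data = [{x: 0, y: 30}, {x: 1, y: 70}, {x: 2, y: 45}];

var line = d3.svg.line()                       // the path generator
    .x(function(d) { return d.x * 100; })      // map data to pixel coordinates
    .y(function(d) { return 200 - d.y; });

d3.select("body").append("svg")                // chained methods update the DOM
    .attr("width", 300).attr("height", 200)
  .append("path")
    .attr("d", line(data))
    .attr("stroke", "steelblue")
    .attr("fill", "none");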

InfoVis

Commonly referred to as InfoVis, the JavaScript InfoVis Toolkit (JIT) has also earned its stripes as a JavaScript library for data visualization. Equipped with WebGL support, InfoVis has been trusted by names like Mozilla and Al Jazeera, showing its solidity as a visualization tool. Like the D3 framework, InfoVis makes use of chained methods to manipulate the DOM, making it a reliable library for developers of any skill level.

The Google Visualization API

Hailing from the Google Developers Console (GDC), Google’s Visualization API can be called with barely any code. In addition to easy DOM modification, this Google API makes it easy for users to define custom modifier functions that can then be placed into custom groups. Furthermore, this interface’s usability, matched with the support of Google’s open-source network, places it near the top of the list of data visualization tools.
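A minimal sketch of the classic jsapi loading pattern is below; the chart type, element ID, and data are illustrative assumptions:

// Google Visualization API sketch (assumes <script src="https://www.google.com/jsapi">).
google.load("visualization", "1", {packages: ["corechart"]});
google.setOnLoadCallback(drawChart);

function drawChart() {
  var data = google.visualization.arrayToDataTable([
    ["Category", "Revenue"],
    ["Books",  1170],
    ["Music",   660],
    ["Movies", 1030]
  ]);
  var chart = new google.visualization.PieChart(document.getElementById("chart"));
  chart.draw(data, {title: "Revenue by Category"});
}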

Springy.js

Springy.js is a JavaScript library that relies on a force-directed algorithm to create graphs, resulting in nodes that react in a spring-like manner on the web page. Although Springy.js comes configured with a predefined algorithm, options such as spring stiffness and damping can easily be passed as parameters. Springy.js was developed by Dennis Hotson as a library for developers to build on top of – a fact he makes clear.
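A minimal sketch of that usage follows, assuming springy.js plus its jQuery renderer (springyui.js) are loaded; the node labels and canvas ID are made up:

// Springy sketch: build a graph, then let the force-directed layout animate it.
var graph = new Springy.Graph();

var spruce = graph.newNode({label: "Spruce"});
var fir    = graph.newNode({label: "Fir"});
var cedar  = graph.newNode({label: "Cedar"});

graph.newEdge(spruce, fir);
graph.newEdge(spruce, cedar);

// The springyui.js jQuery plugin renders to a <canvas id="springy-canvas">.
jQuery("#springy-canvas").springy({ graph: graph });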

Polymaps.js

Polymaps.js makes use of SVGs to generate interactive web maps with cross-browser compatibility in mind. At the heart of Polymaps lie vector tiles, which help ensure both optimal load speeds and optimal zoom functionality. Although it comes configured with components, Polymaps.js is easily customized and can read data in the form of vector geometry, GeoJSON files, and more. One example is its map of the U.S. built from U.S. Census Bureau data.

Dimple

This past January, the Dimple API was released so that analysts at Align-Alytics could build strong data visualizations without much development knowledge. With that aim, Dimple makes it easy for anyone, analyst or not, to develop stunning graphics without any real JavaScript training. Moreover, dimplejs.org displays several demonstrations that can easily be fed one’s own data to render a graph with the same configuration but different values. So, if you, or anyone you know, is trying to segue into the depths of JavaScript, these examples are perfect for beginners to visit and poke around.
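To give a feel for how little code is needed, here is a hedged sketch built around dimple’s documented pattern (newSvg, chart, axes, series); the data values are invented:

// dimple sketch (dimple builds on D3, so both libraries must be loaded).
var data = [
  {Month: "Jan", Channel: "Domestic",      Sales: 120},
  {Month: "Jan", Channel: "International", Sales:  80},
  {Month: "Feb", Channel: "Domestic",      Sales: 140},
  {Month: "Feb", Channel: "International", Sales:  95}
];

var svg   = dimple.newSvg("#chart", 600, 400);
var chart = new dimple.chart(svg, data);

chart.addCategoryAxis("x", "Month");
chart.addMeasureAxis("y", "Sales");
chart.addSeries("Channel", dimple.plot.bar);  // swap in your own data to reuse the configuration
chart.draw();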

Sigma.js

For people looking to build highly interactive network graphs, Sigma.js provides an unbelievable number of settings inside its library and its plug-ins. With a motto of “Dedicated to Graph Drawing,” developers using Sigma.js cannot help but feel they have chosen a reliable library to work with. Moreover, Sigma’s developers encourage people to reconfigure the library and create plug-ins, which has resulted in a large open-source network. Having said all that, I was extremely pleased with various aspects of Sigma, and it is among my favorite libraries for creating graphical representations in JavaScript.
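For reference, here is a minimal sigma.js (v1-style) sketch; the node positions, IDs, and container name are arbitrary example values:

// sigma.js sketch: a three-node network rendered into <div id="graph-container">.
var s = new sigma({
  container: "graph-container",
  graph: {
    nodes: [
      {id: "n0", label: "A", x: 0, y: 0, size: 1},
      {id: "n1", label: "B", x: 1, y: 1, size: 1},
      {id: "n2", label: "C", x: 2, y: 0, size: 1}
    ],
    edges: [
      {id: "e0", source: "n0", target: "n1"},
      {id: "e1", source: "n1", target: "n2"}
    ]
  }
});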

Raphael.js

The Raphael.js library was created with an emphasis on browser compatibility. The framework follows the SVG W3C Recommendation, a set of standards that ensures images are completely scalable and free of pixelation. In addition to SVGs, Raphael.js even reverts to Vector Markup Language (VML) when rendering in Internet Explorer browsers prior to IE9. Although VML is very rarely used today, supporting it shows the attention to detail the Raphael.js team put into the library.
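A minimal sketch of the Raphael paper/shape pattern (the container ID, sizes, and colors are assumptions):

// Raphael sketch: the same code draws SVG where supported and VML in IE < 9.
var paper = Raphael("paper", 320, 200);   // container id, width, height

var circle = paper.circle(160, 100, 40);  // cx, cy, r
circle.attr({fill: "#3b7", stroke: "#333", "stroke-width": 2});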

gRaphaël

Although Raphael.js is a library used for the creation of SVGs, it was not built with a total focus on representing large datasets; the gRaphaël JavaScript library was created to fill that gap. Weighing in at a mere 10KB, gRaphaël.js has proven a worthy extension to Raphael.js. Although it was not built around a force-directed algorithm, nor does it come pre-configured with any physics properties, gRaphaël is still a well-respected library for reasons ranging from its cross-compatible SVG structure to its ease of use. As long as it suits the task at hand, gRaphaël.js should always be considered a viable resource for completing a project.
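As a sketch of how gRaphaël extends a Raphael paper (this assumes raphael.js, g.raphael.js, and the g.pie.js chart module are all loaded; the values are invented):

// gRaphaël sketch: a pie chart drawn on a Raphael paper.
var paper = Raphael("holder", 400, 300);
paper.piechart(200, 150, 100, [55, 20, 13, 32, 5]);  // cx, cy, r, values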

Leaflet

Whether developing for a smartphone, tablet, or desktop, the Leaflet JavaScript library ranks atop the list of interactive mapping libraries for several reasons. Led by Leaflet’s creator, Vladimir Agafonkin of MapBox, the Leaflet team of developers worked to create a library “designed with simplicity, performance, and usability in mind.” Like Polymaps, Leaflet can render SVG patterns via vector tiles; however, only Leaflet has been developed to support Retina displays. Furthermore, Leaflet can interpret various forms of data such as GeoJSON, making it perfect for a number of tasks.
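Here is a minimal Leaflet sketch; the coordinates are arbitrary, the tile URL is OpenStreetMap’s public template, and someGeoJson stands in for data you would supply:

// Leaflet sketch (assumes leaflet.js/leaflet.css and a <div id="map"> with a height).
var map = L.map("map").setView([51.505, -0.09], 13);  // center (lat, lng) and zoom

L.tileLayer("http://{s}.tile.osm.org/{z}/{x}/{y}.png", {
  attribution: "&copy; OpenStreetMap contributors"
}).addTo(map);

L.geoJson(someGeoJson).addTo(map);  // Leaflet reads GeoJSON directly; someGeoJson is assumed defined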

Ember Charts

For those who already use the juggernaut that is Ember.js, the developers at Addepar Open Source have created a few add-on libraries to extend the Ember experience: Ember Table, Ember Widgets, and Ember Charts. A child of Ember.js and D3.js, Ember Charts embraces flat design. Although limited, the library does have a handful of options for properties such as color and size, making it fairly simple to create impressive visualizations. Ember’s presence on the front end could really help Ember Charts’ popularity in the future.

Stephen Few: Why Do We Visualize Quantitative Data?

Readers:

It has been a while since I have discussed some of the latest creative thoughts on data visualization from Stephen Few. I have read all of Steve’s books, attended several of his classes, and religiously follow the blog and newsletter on his website, Perceptual Edge.

For those of you who don’t know, Stephen Few is the Founder & Principal of Perceptual Edge. Perceptual Edge, founded in 2003, is a consultancy that was established to help organizations learn to design simple information displays for effective analysis and communication.

Steve has stated that his company will probably always be a company of one or two people, which is the perfect size for him. With 25 years of experience as an innovator, consultant, and educator in the fields of business intelligence and information design, he is now considered the leading expert in data visualization for data sense-making and communication.

Steve writes a quarterly Visual Business Intelligence Newsletter, speaks and teaches internationally, and provides design consulting. In 2004, he wrote the first comprehensive and practical guide to business graphics entitled Show Me the Numbers, now in its second edition. In 2006, he wrote the first and only guide to the visual design of dashboards, entitled Information Dashboard Design, also now in its second edition. In 2009, he wrote the first introduction for non-statisticians to visual data analysis, entitled Now You See It.

Here are his latest thoughts from his newsletter.

Best regards,

Michael

 

Why Do We Visualize Quantitative Data?

Per Stephen Few, we visualize quantitative data to perform three fundamental tasks in an effort to achieve three essential goals:

[Image: the three fundamental tasks of data visualization – explore, make sense of, and communicate – and their corresponding goals]

These three tasks are so fundamental to data visualization, Steve used them to define the term, as follows:

Data visualization is the use of visual representations to explore, make sense of, and communicate data.

Steve poses the question: why must we sometimes use graphical displays to perform these tasks rather than other forms of representation? Why not always express values as numbers in tables? Why express them visually rather than audibly?

Essentially, there is only one good reason to express quantitative data visually: some features of quantitative data can be best perceived and understood, and some quantitative tasks can be best performed, when values are displayed graphically. This is so because of the ways our brains work. Vision is by far our dominant sense. We have evolved to perform many data sensing and processing tasks visually. This has been so since the days of our earliest ancestors who survived and learned to thrive on the African savannah. What visual perception evolved to do especially well, it can do faster and better than the conscious thinking parts of our brains. Data exploration, sensemaking, and communication should always involve an intimate collaboration between seeing and thinking (i.e., visual thinking).

Despite this essential reason for visualizing data, people often do it for reasons that are misguided. Steve dispels a few common myths about data visualization.

Myth #1: We visualize data because some people are visual learners.

While it is true that some people have greater visual thinking abilities than others and that some people have a greater interest in images than others, all people with normal perceptual abilities are predominantly visual. Everyone benefits from data visualization, whether they consider themselves visual learners or not, including those who prefer numbers.

Myth #2: We visualize data for people who have difficulty understanding numbers.

While it is true that some people are more comfortable with quantitative concepts and mathematics than others, even the brightest mathematicians benefit from seeing quantitative information displayed visually. Data visualization is not a dumbed-down expression of quantitative concepts.

Myth #3: We visualize data to grab people’s attention with eye-catching but inevitably less informative displays.

Visualizations don’t need to be dumbed down to be engaging. It isn’t necessary to sacrifice content for the sake of appearance. Data can always be displayed in ways that are optimally informative, pleasing to the eye, and engaging. To engage with a data display without being well informed of something useful is a waste.

Myth #4: The best data visualizers are those who have been trained in graphic arts.

While training in graphic arts can be useful, it is much more important to understand the data and be trained in visual thinking and communication. Graphic arts training that focuses on marketing (i.e., persuading people to buy or do something through manipulation) and artistry rather than communication can actually get in the way of effective data visualization.

Myth #5: Graphics provide the best means of telling stories contained in data.

While it is true that graphics are often useful and sometimes even essential for data-based storytelling, it isn’t storytelling itself that demands graphics. Much of storytelling is best expressed in words and numbers rather than images. Graphics are useful for storytelling because some features of data are best understood by our brains when they’re presented visually.

We visualize data because the human brain can perceive particular quantitative features and perform particular quantitative tasks most effectively when the data is expressed graphically. Visual data processing provides optimal support for the following:

1. Seeing the big picture

Graphs reveal the big picture: an overview of a data set. An overview summarizes the data’s essential characteristics, from which we can discern what’s routine vs. exceptional.

The series of three bar graphs below provides an overview of the opinions that 15 countries had about America in 2004, not long after the events of 9/11 and the military campaigns that followed.

graph-of-country-opinions

Steve first discovered this information in the following form on the website of PBS:

table-of-country-opinions

Based on this table of numbers, he had to read each value one at a time and, because working memory is limited to three or four simultaneous chunks of information at a time, he couldn’t use this display to construct and hold an overview of these countries’ opinions in his head. To solve this problem, he redisplayed this information as the three bar graphs shown above, which provided the overview that he wanted. Steve was able to use it to quickly get a sense of these countries’ opinions overall and in comparison to one another.

Bonus: Here is a link to where Steve discusses the example above on his website.

2. Easily and rapidly comparing values

Try to quickly compare the magnitudes of values using a table of numbers, such as the one shown above. You can’t, because numbers must be read one at a time and only two numbers can be compared at a time. Graphs, however, such as the bar graphs above, make it possible to see all of the values at once and to easily and rapidly compare them.

3. Seeing patterns among values

Many quantitative messages are revealed in patterns formed by sets of values. These patterns describe the nature of change through time, how values are distributed, and correlations, to name a few.

Try to construct the pattern of monthly change in either domestic or international sales for the entire year using the table below.

table-of-sales-data

Difficult, isn’t it? The line graph below, however, presents the patterns of change in a way that can be perceived immediately, without conscious effort.

graph-of-sales-data

You can thank processes that take place in your visual cortex for this. The visual cortex perceives patterns and then the conscious thinking parts of our brains make sense of them.

4. Comparing patterns

Visual representations of patterns are easy to compare. Not only can the independent patterns of domestic and international sales be easily perceived by viewing the graph above, but they can also be compared to one another to determine how they are similar and different.

In Summary

These four quantitative features and activities require visual displays. This is why we visualize quantitative data.

Data Blending: Why are Some Metric Values Blank in Documents Using Multiple Datasets in MicroStrategy Analytics Enterprise 9.4.1 (Part 8)

MicroStrategy Analytics Enterprise

Introduction

Starting with MicroStrategy Analytics Enterprise 9.4.1, Report Services documents can contain grids with objects coming from more than one dataset.

Multiple Datasets in a Single Grid/Graph/Widget Object in MicroStrategy Web 9.4 [2]

Users now have the ability to add attributes and/or metrics from multiple datasets to a single grid, graph, or widget. For example, if Dataset #1 contains Category and Revenue and Dataset #2 contains Category and Profit, a grid can be created which contains Category, Revenue, and Profit.

Part 8 - 1

Administrators can control the use of multiple datasets in a single grid, graph, or widget through the Analytical Engine VLDB properties window at the project level.

  1. Right mouse click (RMC) on the project name.
  2. Select Project Configuration.
  3. Click on Project Definition.
  4. Select ‘Advanced’.
  5. Click “Configure” under the Analytical engine VLDB.

Part 8 - 2

NOTE:

The default value is set to: “Objects in document grids must come from the grid’s source dataset only”.

Users can set the source of the grid to a particular dataset or choose no dataset (in which case the MicroStrategy engine will determine the best-suited dataset). [1]

The MicroStrategy Analytical Engine displays no data for metrics in ambiguous cases or when there is a conflict. Ambiguous cases can arise when multiple datasets contain the same objects. Examples based on the MicroStrategy Tutorial project are provided below to explain this behavior.

Note: When the MicroStrategy Analytical Engine cannot resolve the correct dataset, as explained in the cases below, the data displayed will correspond to the value chosen for the missing object display under Project Configuration > Report definition > Null values > Missing Object Display. The default value for this is blank.

Case 1:

Multiple datasets have the same metric. Only one dataset does not contain this metric and this dataset is set as the source of the grid.

This case is explained with an example based on the MicroStrategy Tutorial project.

1. Create the following objects:

a. Dataset DS1 with the attribute ‘Year’ and metric ‘Profit’.

b. Dataset DS2 with the attribute ‘Year’ and metrics ‘Profit’, ‘Revenue’.

c. Dataset DS3 with the attribute ‘Quarter’ and metric ‘Cost’.

2.  Create a document based on the above datasets and create a grid object on the document with the following objects: ‘Year’, ‘Quarter’, ‘Profit’. Set the source of this grid to be the dataset ‘DS3’.

3. In the executed document, no data is displayed for the metric ‘Profit’ as shown below.

Part 8 - 3

In the above example, the metric ‘Profit’ does not exist in the source dataset ‘DS3’ but exists in more than one dataset in the document, i.e., in both ‘DS1’ and ‘DS2’. Since the engine cannot just randomly pick one of the two available datasets, it chooses not to display any data for this metric. If users do not want such blank columns to be displayed, they should set the source dataset so that such ambiguity does not arise.

Case 2:

The same metric exists multiple times on the grid. For example, users can have a smart compound metric and a component metric of this compound smart metric on the grid in the document. The smart metric and the component metric are from different datasets.

This case is also explained with an example based on the MicroStrategy Tutorial project.

1. Create the following objects:

a. Dataset DS1 with attribute ‘Year’ and metric ‘Profit’.

b. Dataset DS2 with attribute ‘Year’ and metrics ‘Revenue’, ‘Profit’, ‘Profit Margin’ (this is a compound smart metric calculated from metrics Revenue and Profit).

2. Create a document based on the above datasets and create a grid object on the document with the following objects: ‘Year’, ‘Revenue’, ‘Profit’ and ‘Profit Margin’. The source of this grid object is set to DS1.

3. In the executed document, no data is displayed for the metric ‘Profit Margin’, as shown below.

Part 8 - 4

In the above example, since the source of the grid is set to ‘DS1’, the ‘Profit’ metric is sourced from this dataset and the metric ‘Revenue’ is sourced from the dataset ‘DS2’ (as this is the ONLY dataset with this metric). However, for the metric ‘Profit Margin’, the component metric ‘Profit’ also exists on dataset ‘DS1’, so this becomes a conflict metric and is not displayed. If the source of the grid is changed to ‘DS2’, the data is displayed correctly as shown below.

Part 8 - 5

References:

[1] MicroStrategy Knowledgebase, Why are some metric values blank in documents using multiple datasets in MicroStrategy Analytics Enterprise 9.4.1, TN Key: 44517, 12/16/2013, https://resource.microstrategy.com/support/mainsearch.aspx.

[2] MicroStrategy Knowledgebase, Multiple datasets in a single grid/graph/widget object in MicroStrategy Web 9.4, TN Key: 44944, 09/30/2013, https://resource.microstrategy.com/support/mainsearch.aspx.

NOTE: You may need to register to view MicroStrategy’s Knowledgebase.

How Data Blending Affects the Analytical Engine’s Behavior in MicroStrategy (Part 7)

With the release of MicroStrategy Analytics Enterprise 9.4.1, the Analytical Engine logic has been enhanced with respect to joining data from multiple datasets in a Report Services Document. One of the features available with this release is the ability to use objects (e.g., attributes, metrics) from multiple datasets in a single grid in a document.

If an attribute on a grid has elements that can be obtained from multiple datasets used in the document, the elements displayed will be from the global lookup table. Additionally, if one or more of the datasets containing the attribute has missing attribute form data or has different attribute forms from the other datasets, the Analytical Engine will follow the rules noted below to compose the final output:

Rule 1:

If an attribute form has a null value, the Analytical Engine will use the non-null form value from the other datasets instead of the null form.

Rule 2:

If several datasets have different attribute form information for the attribute element, the Analytical Engine will use the attribute form from the largest dataset.

Rule 3:

If several datasets have different attribute form information for the attribute element, and those datasets have the same number of rows, the Analytical Engine will use the first dataset in the document for the attribute form value (according to the sequence in which the datasets were added).

NOTE: Users should note that the rules are applied for each individual attribute element in the result at the row level rather than at the dataset level.
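Although MicroStrategy’s engine is not scriptable this way, a small JavaScript sketch may help make the precedence of the three rules concrete; the data shape here is an assumption for illustration only:

// Conceptual sketch of Rules 1-3 for one attribute element (not MicroStrategy code).
// Each candidate: {value: form value or null, rowCount: dataset size, order: add sequence}.
function resolveFormValue(candidates) {
  var nonNull = candidates.filter(function (c) { return c.value !== null; });  // Rule 1
  if (nonNull.length === 0) return null;
  nonNull.sort(function (a, b) {
    return (b.rowCount - a.rowCount)   // Rule 2: larger dataset wins
        || (a.order - b.order);        // Rule 3: tie broken by dataset add sequence
  });
  return nonNull[0].value;
}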

Example 1:

Users may consider the following datasets – C01 is a dataset with Customer City, Customer and Order:

Part 7 - 1a

 

C02 is a dataset with Customer, Order and a profit metric. Users may note that the Customer attribute is missing the DESC form in the second dataset:

Part 7 - 2a

If a Report Services Document is built with both these datasets, and the attributes are placed on a grid, the following results may be seen. As noted in Rule 1, the Analytical Engine will display the non-Null values from C01 for the Customer attribute elements:

Part 7 - 3a

Example 2:

Now users may consider a different dataset as C02 – similar to the initial dataset, but here the Customer Name (DESC) form contains values instead of NULLs. This time the values for the attribute are not consistent – note that Customer ID ‘1’ has different values for the DESC form for different Orders (1 & 6).

Customer Name | Customer ID | Order | Profit
Customer D    | 1           | 1     | 100
Customer B    | 2           | 2     | 200
Customer C    | 3           | 3     | 300
Xia D         | 4           | 4     | 400
Kris Du       | 5           | 5     | 500
Customer A    | 1           | 6     | 610
Customer E    | 2           | 7     | 720
Customer F    | 6           | 8     | 860
Customer G    | 7           | 9     | 970
Customer H    | 8           | 10    | 1080

If a report is built for this dataset, users will observe that the first attribute element value in the dataset is used as the DESC form for Orders 1 & 6, even if the value is different in subsequent rows (this is the same as previous Analytical Engine behavior).

Part 7 - 5

When these datasets are used in a grid in a Report Services Document, the Analytical Engine will choose the attribute element values from dataset C02 to display. This is because of Rule 2 explained above.

Part 7 - 6a

Example 3:

Consider the following dataset:

Customer Name | Customer ID | Order | Profit
Customer D    | 1           | 1     | 100
Customer E    | 2           | 7     | 720
Xia D         | 4           | 4     | 400
Kris Du       | 5           | 5     | 500
Customer G    | 7           | 9     | 970

A report built off this dataset appears as follows:

Part 7 - 8a

After replacing the dataset ‘C02‘ from the previous example with the new dataset, the following results are seen. As noted in Rule 3, because both C01 and C02 have the same number of rows, the elements displayed for the Customer attribute will be filled from the first dataset added to the document – in this case C01. However, for the first row in the results, where there is no corresponding customer in the dataset C01, Rule 1 is applied and, instead of a NULL value, the non-null Customer Name field ‘Customer G’ is picked from C02. (Rules are applied at the individual element level.)

Part 7 - 9a

Next: Why are some metric values blank in documents using multiple datasets in MicroStrategy Analytics Enterprise 9.4.1

———————————————————————————-

References:

[1] MicroStrategy Knowledgebase, Engine behavior for grids on a Report Services Document or dashboard with multiple datasets where some attribute forms are missing or have different values across the datasets in MicroStrategy Analytics Enterprise 9.4.1 and newer releases, TN Key: 45463, 03/13/2014, https://resource.microstrategy.com/support/mainsearch.aspx.

NOTE: You may need to register to view MicroStrategy’s Knowledgebase.

Forbes: Data Visualization Is The Future – Here’s Why

Readers:

I read this blog post from Dorie Clark back in March. I keep notes on interesting blogs and articles I come across and wanted to share this one with you regarding the importance of data visualization.

Dorie Clark is a marketing strategist and professional speaker who teaches at Duke University’s Fuqua School of Business. Learn more about her new book Reinventing You: Define Your Brand, Imagine Your Future (Harvard Business Review Press) and follow her on Twitter.

I hope you find this helpful in your data visualization endeavors.

Best regards,

Michael

Data Visualization Is The Future – Here’s Why

We’ve all heard that Big Data is the future. But according to Phil Simon’s new book The Visual Organization: Data Visualization, Big Data, and the Quest for Better Decisions, that may not be quite right. Big Data is a powerful discovery tool for companies seeking to glean new insights. But without the right framework for understanding it, much of that knowledge may go unrecognized. Oftentimes, it’s data visualization that allows Big Data to unleash its true impact.

“The Visual Organization is fundamentally about how progressive organizations today are using a wide array of data visualization (dataviz) tools to ask better questions of their data – and make better business decisions,” says Simon, citing the example of companies such as Amazon, Apple, Facebook, Google, Twitter, and Netflix, among others.

Phil Simon
Data visualization allows Big Data to unleash its true impact, as author Phil Simon explains.

Two recent factors have conspired to make this the moment for data visualization. First, says Simon, is the rise of Big Data and the growing public awareness of its power. “Today more than ever, professionals are being asked to argue their cases and make their decisions based on data,” he says. “A new, data-oriented mind-set is permeating the business world.”

But that push outside IT circles means that many non-technical professionals must now produce and comprehend insights from Big Data. Visualization can help, and a raft of new tools makes that possible. “IBM, Cognos, SAS, and other enterprise BI (business intelligence) stalwarts are still around, but they are no longer the only game in town,” he says. “Today, an organization need not spend hundreds of thousands or millions of dollars to get going with dataviz. These new tools have become progressively more powerful and democratic over the last decade. Long gone are the days in which IT needed to generate reports for non-technical employees. They have made it easier than ever for employees to quickly discover new things in increasingly large datasets. Examples include Visual.ly, Tableau, Vizify, D3.js, R, and myriad others.”

———————————————————————–
Source: Dorie Clark, Data Visualization Is The Future – Here’s Why, Forbes, March 10, 2014, http://www.forbes.com/sites/dorieclark/2014/03/10/data-visualization-is-the-future-heres-why/.

An Introduction to Data Blending – Part 6 (Data Blending using MicroStrategy)

Readers:

In Part 5 of this series on data blending, we reviewed Tableau’s Data Blending Architecture. With Part 5, I have wrapped up the Tableau portion of this series.

I am now going to post, over the next week or so, several parts discussing how we do data blending using MicroStrategy. Fortunately, MicroStrategy just published a nice technical note on their Knowledgebase (TN Key: 46940) [1] discussing this. Most of what I am sharing today is derived from that technical note.

I probably will have 2-4 parts for this topic in my Data Blending series including how the MicroStrategy Analytical Engine deals with multiple datasets.

I want to thank Kristi Morton (et al) for the wonderful research paper she wrote at The University of Washington [2]. It helped me provide some real insight into the topic and mechanics of data blending, particularly with Tableau. You can learn more about Ms. Morton’s research as well as other resources used to create this blog post by referring to the References at the end of the blog post.

So, let’s now dig into how MicroStrategy provides us data blending capabilities.

Best Regards,

Michael

Data Blending using MicroStrategy

In Part 6, we will begin examining using data blending in MicroStrategy. We will first look at how to use attributes from multiple datasets in the same Visual Insight dashboard and link them to existing attributes using the Data Blend feature in MicroStrategy Analytics Enterprise Web 9.4.1.

Prior to v9.4.1 of MicroStrategy, data blending was referred to as Cube Joining.

In MicroStrategy Analytics Enterprise Web 9.4.1, the new Report Services Documents Engine automatically links common attributes using the modeled schema whenever possible. Manual linking between different modeled attributes is not allowed; if different attributes need to be linked, this can be done using MicroStrategy Architect at the schema level. By default, related attributes are linked using a full outer join; if there is no relationship between the attributes, a cross join is used.

The manual attribute linking can be done as shown in the images below.

Part 6 - 1

 

2. Browse the file to match the existing data and select Continue.

Part 6 - 2

 

3. Set the attribute forms if needed. MicroStrategy will automatically assign the detected ones.

Part 6 - 3

4. The attributes can be mapped manually by selecting Link to Project Attribute.

Part 6 - 4

5. Select the attribute form that matches the desired join:

Part 6 - 5

6. The attribute should appear similar to the ones existing in the schema as shown below.

Part 6 - 6

 

7. Save the recently created dataset.

Part 6 - 7

8. Now there are two cubes used as datasets in the same Visual Insight dashboard, as shown below.

Part 6 - 7a

Automatic Linking

The attributes icons now have a blue link, as shown below. This indicates that MicroStrategy has automatically linked them to elements in the Information dataset.

Part 6 - 8

Next: How Data Blending Affects the Analytical Engine’s Behavior in MicroStrategy

———————————————————————————-

References:

[1] MicroStrategy Knowledgebase, How to use attributes from multiple datasets in the same Visual Insight dashboard and link them to existing attributes using the Data Blend feature in MicroStrategy Analytics Enterprise Web 9.4.1, TN Key: 46940, 04/24/2014, https://resource.microstrategy.com/support/mainsearch.aspx.

NOTE: You may need to register to view MicroStrategy’s Knowledgebase.

[2] Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau, University of Washington and Tableau Software, Seattle, Washington, March 2012, http://homes.cs.washington.edu/~kmorton/modi221-mortonA.pdf.

An Introduction to Data Blending – Part 5 (Tableau’s Data Blending Architecture)

Readers:

In Part 4 of this series on data blending, we reviewed Tableau’s Data Blending Principles. We also reviewed an example of data blending in Jonathan Drummey’s Patient Falls Dashboard. [3]

Today, in Part 5 of this series, we will peel the onion a bit more and look at Tableau’s Data Blending Architecture.

Again, much of Parts 1 – 5 is based on a research paper written by Kristi Morton from The University of Washington (and others) [1].

You can learn more about Ms. Morton’s research as well as other resources used to create this blog post by referring to the References at the end of the blog post.

Best Regards,

Michael

Integrating Data in Tableau

In Part 5, we discuss in greater detail how data blending works. Then we discuss how a user builds visualizations with data blending, using several large datasets of airline statistics.

Data Blending Architecture

Part 5 - Figure 1

The data blending system, shown in Figure 1 above, takes as input the VizQL query workload generated by the user’s GUI actions and data source schemas, and automatically infers how to query the data sources remotely and combine their results on-the-fly. The system features a two-tier mediator-based architecture in which the VizQL query workload is analyzed and partitioned at runtime based on the corresponding data source fields being used. The primary mediator initiates this process by removing the visual encodings from the VizQL query workload to yield an abstract query. The abstract query is partitioned for further processing by the primary mediator and one or more secondary mediators. The primary mediator creates the mediated schema for the given query workload. It then federates the abstract queries to the primary data source as well as the secondary mediators and their respective data sources. The wrappers compile the abstract queries into concrete SQL or MDX queries and instantiate the semantic mappings between the data sources and the mediated schema for each query. The primary mediator joins all the result sets returned from all data sources to produce the mediated result set used by the rendering system. [1]

Part 5 - Figure 2

Post-aggregate Join

A visualization is organized by its discrete fields into pages, partitions, colors, etc., and like a GROUP BY clause in SQL, these grouping fields comprise the primary key of the visualization. In a blended visualization, the grouping fields from the primary data source become the primary key of the mediated schema. In Figure 2 above, these are shown as the dark-green fields in the primary data source, and the light-green fields represent the aggregated data. Each secondary data source must contain at least one field that matches a visualization grouping field in order to blend into the mediated schema. The matching fields in a secondary data source comprise its join key, and these fields appear in the GROUP BY clause issued by the secondary mediator wrappers. The aggregated data from the secondary data source, shown in light purple, is then left-joined along its join key into the mediated result set. Morton (et al) refer to this left-join of aggregated result sets as a post-aggregate join. [1]
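The mechanics of a post-aggregate join can be sketched in a few lines of JavaScript. This is purely conceptual – Tableau’s mediator is not written this way – and the field names are invented:

// Conceptual post-aggregate left join: both inputs are already-aggregated result sets.
function postAggregateJoin(primaryRows, secondaryRows, joinKey) {
  var index = {};                        // index the secondary result set by its join key
  secondaryRows.forEach(function (row) { index[row[joinKey]] = row; });
  return primaryRows.map(function (row) {
    var match = index[row[joinKey]] || {};   // unmatched keys contribute nothing
    return Object.assign({}, match, row);    // left join; primary fields win on collision
  });
}

// Example: blend average airfare (primary) with average fuel cost (secondary) by year.
// postAggregateJoin([{year: 2010, avgAirfare: 320}, {year: 2011, avgAirfare: 340}],
//                   [{year: 2010, avgFuelCost: 2.1}], "year")
// -> [{year: 2010, avgFuelCost: 2.1, avgAirfare: 320}, {year: 2011, avgAirfare: 340}]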

Primary Key Cardinality

Data blending supports a one-to-many mapping between the domain values of the primary key and those of the secondary join key, because the secondary join key is a subset of the primary key and contains only unique values in the aggregated secondary result set. Morton (et al) find that this approach is the most natural for augmenting a visualization with secondary data sources of uncertain value or quality, which is a common scenario for Tableau users.

Data blending supports many-to-one relationships between the primary and each secondary. This can occur when the secondary data source contains coarser-grained data than the mediated result set, as discussed in Part 3 of this series.

Since the join key in a secondary result set may match a subset of the blended result set’s primary key, portions of the secondary result set may be duplicated across repeated values in the mediated result set. This does not pose a risk of double-counting measure values, because all aggregation is performed prior to the join. When a blended visualization uses multiple secondary data sources, each secondary join key may match any subset of the primary key. The primary mediator handles duplicating each secondary result set as needed to join with the mediated result set.

Finally, a secondary dimension which is not part of the join key (and thus not a grouping field in the secondary query) can still be used in the visualization. If it is functionally dependent on the join key, a secondary dimension can be used without affecting the result set cardinality. Tableau references this kind of non-grouping dimension using both MIN and MAX aggregations in the query issued to the secondary data source, which allows Tableau to determine if the dimension is functionally dependent on the join key. For each row in the secondary result set, if the two aggregated values are the same then the value is used as-is, reflecting the functional dependence on the grouping fields. If the aggregated values differ, Tableau represents the value using a special form of NULL called ManyValues. This is represented in the visualization as a ‘*’, but retains the behavior of NULL when used in calculated fields or other computations. The visual feedback allows a user to distinguish this lack of data from the NULLs which occur due to missing or mismatched data.

Inferring Join Keys

Tableau uses very simple rules for automatically detecting candidate join keys:

  1. The secondary data source field name must match a field with the same name in the primary data source.
  2. The data types must match.
  3. If they are date/time fields, they must represent the same granularity date bin in the date/time hierarchy, e.g. both are MONTH.

A user can intervene to force a match either by providing field captions to rename fields within the Tableau data model, or by explicitly defining a link between fields using a simple user interface.
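These rules are simple enough to express as a predicate; the following JavaScript sketch is illustrative only, with an assumed field shape rather than Tableau’s internal representation:

// Conceptual sketch of the candidate join key rules (not Tableau's actual code).
function isCandidateJoinKey(primaryField, secondaryField) {
  return secondaryField.name === primaryField.name &&        // rule 1: names match
         secondaryField.type === primaryField.type &&        // rule 2: data types match
         (primaryField.type !== "datetime" ||                // rule 3: date/time fields must
          secondaryField.dateBin === primaryField.dateBin);  //   share granularity, e.g. MONTH
}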

Part 5 - Figure 3

Another Simple Blending Example

A Tableau data blending scenario is shown in Figure 3 above, which includes multiple views that were composed in minutes by uniquely mashing up four different airline datasets, the largest of which include a 324 million row ticket pricing database and a 140 million row on-time performance database. A user starts by dragging fields from any dataset on to a blank visual canvas, iteratively building a VizQL statement which ultimately produces a visualization. In this example, the user first drags the VizQL fields, YEAR(Flight Date) and AVG(Airfare), from the pricing dataset onto the visual canvas.

Data blending occurs when the user adds fields from a separate dataset to an existing VizQL statement in order to augment their analysis. Tableau assigns the existing dataset to the primary mediator and uses secondary mediators to manage each subsequent dataset added to the VizQL. The mediated schema has a primary key composed of the grouping VizQL fields from the primary dataset (e.g. YEAR(Flight Date)); the remaining fields in the mediated schema are the aggregated VizQL fields from the primary dataset along with the VizQL fields from each secondary dataset.

Continuing our example, the user wishes to drag AVG(Total Cost per Gallon) from the fuel cost dataset to the visualization. The schema matching algorithm examines the secondary dataset for one or more fields whose name exactly matches a field in the primary key of the mediated schema. While the proposed matches are often sufficient and acceptable, the user can specify an override. Since the fuel cost dataset has a field named Date, the user provides a caption of Flight Date to resolve the schema discrepancy. At this point the mediated schema is created and the VizQL workload is then federated to the wrappers for each dataset. Each wrapper compiles VizQL to SQL or MDX for the given workload, executes the query, and maps the result set into the intermediate form expected by the primary mediator.

The mapping is performed dynamically, since both the VizQL and the data model evolve during a user’s iterative analytical workflow. Finally, the primary mediator performs a left-join of each secondary result set along the primary key of the mediated schema. In this example, the mediated result set is rendered to produce the visualization shown in Figure 3(a).

Evolved Blending Example

Figure 3(b) above shows further evolution of the analysis of airline datasets, and demonstrates several key points of data blending. First, the user adds a unique ID field named uniquecarrier from the primary dataset to the VizQL to visualize results for each airline ID over time. The mediated schema adapts by adding this field to its primary key, and the secondary mediator automatically queries the fuel cost dataset at this finer granularity since it too has a field named uniquecarrier. Next, the user decorates the visualization with descriptive airline names for each airline ID by dragging a field named Carrier Name from a lookup table.

This dataset is at a coarser granularity than the existing mediated schema, since it does not represent changes to the carrier name over time. Morton’s (et al) system automatically handles this challenge by allowing the left-join to use a subset of the mediated result set primary key, and replicating the carrier name across the mediated result set. Figure 4 below demonstrates this effect using a tabular view of a portion of the mediated result set, along with portions of the primary and secondary result sets.

The figure also demonstrates how the left-join preserves data for years which have no fuel cost records. Last, the user adds average airline delays from a 140 million row dataset which matches on Flight Date and uniquecarrier. This is a fast operation, since the wrapper performs mapping operations on the relatively small, aggregated result set produced by the remote database. Note that none of these additional analytical tasks required the user to intervene in data integration tasks, allowing their focus to remain on finding insight in the data.

Part 5 - Figure 4

Filtering

Tableau provides several options for filtering data. Data may be filtered based on aggregate conditions, such as excluding airlines having a low total count of flights. A user can filter aggregate data from the primary and secondary data sources in this fashion, which results in rows being removed from the mediated result set. In contrast, row-level filters are only allowed for the primary data source. To improve performance of queries sent to the secondary data sources, Tableau will filter the join keys to exclude values which are not present in the domain of the primary data source result set, since these values would be discarded by the left-join.

Data Cleaning Capabilities

As mentioned in the Inferring Join Keys section above, Tableau supports user intervention in resolving field names when schema matching fails. And once the schemas match and data is blended, the visualization can help provide feedback regarding the validity of the underlying data values and domains. If there are any data inconsistencies, users can provide aliases for a field’s data values which will override the original values in any query results involving that field. The primary mediator performs a left-join using the aliases of the data values, allowing users to blend data despite discrepancies from data entry errors and spelling variations. Tableau provides a simple user interface for editing field aliases. Calculated fields are another aspect of Tableau’s data model which support data cleaning. Calculated fields support arbitrary transformations of original data values into new data values, such as trimming whitespace from a string or constructing a date from an epoch-based integer timestamp.

As with database fields, calculated fields can be used as primary keys or join keys.

Finally, Tableau allows users to organize a field’s related data values into groups. These ad-hoc groups can be used for entity resolution, such as binding multiple variations of business names to a canonical form. Ad-hoc groups also allow constructing coarser-grained structures, such as grouping states into regions. Data blending supports joins between two ad-hoc groups, as well as joins between an ad-hoc group and a string field.

 Next: Data Blending Using MicroStrategy

———————————————————————————-

References:

[1] Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau, University of Washington and Tableau Software, Seattle, Washington, March 2012, http://homes.cs.washington.edu/~kmorton/modi221-mortonA.pdf.

[2] Hans Rosling, Wealth & Health of Nations, Gapminder.org, http://www.gapminder.org/world/.

[3] Jonathan Drummey, Tableau Data Blending, Sparse Data, Multiple Levels of Granularity, and Improvements in Version 8, Drawing with Numbers, March 11, 2013, http://drawingwithnumbers.artisart.org/tableau-data-blending-sparse-data-multiple-levels-of-granularity-and-improvements-in-version-8/.

 

An Introduction to Data Blending – Part 4 (Data Blending Design Principles)

Readers:

In Part 3 of this series on data blending, we examined the benefits of blending data. We also reviewed an example of data blending that illustrated the possible outcomes of an election for the District 2 Supervisor of San Francisco.

Today, in Part 4 of this series, we will discuss data blending design principles and show another illustrative example of data blending using Tableau.

Again, much of Parts 1, 2, 3 and 4 is based on a research paper written by Kristi Morton from The University of Washington (and others) [1].

You can learn more about Ms. Morton’s research as well as other resources used to create this blog post by referring to the References at the end of the blog post.

Best Regards,

Michael

Data Blending Design Principles

In Part 4, we describe the primary design principles upon which Tableau’s data blending feature was based. These principles were influenced by the application needs of Tableau’s end-users. In particular, Morton (et al) designed the blending system to integrate datasets on-the-fly, be responsive to change, and be driven by the visualization. Additionally, they assumed that the user may not know exactly what she is looking for initially, and needs a flexible, interactive system that can handle exploratory visual analysis.

Push Computation to Data and Minimize Data Movement

Tableau’s approach to data visualization allows users to leverage the power of a fast database system. Tableau’s VizQL algebra is a declarative language for succinctly describing visual representations of data and analytics operations on the data. Tableau compiles the VizQL declarative formalism representing a visual specification into SQL or MDX and pushes this computation close to the data, where the fast database system handles computationally intensive aggregation and filtering operations. In response, the database provides a relatively small result set for Tableau to render. This is an important factor in Tableau’s choice of post-aggregate data integration across disparate data sources – since the integrated result sets must represent a cognitively manageable amount of information, the data integration process operates on small amounts of aggregated, filtered data from each data source. This approach avoids the costly migration effort to collocate massive data sets in a single warehouse, and continues to leverage fast databases for performing expensive queries close to the data.

Automate as Much as Possible, but Keep User in Loop

Tableau’s primary focus has been on ease of use, since most of Tableau’s end-users are not database experts but come from a variety of domains and disciplines: business analysts, journalists, scientists, students, etc. This led the designers to take a simple, pay-as-you-go integration approach in which the user invests minimal upfront effort or time to receive the benefits of the system. For example, the data blending system does not require the user to specify schemas for their data sets; rather, the system tries to infer this information as well as how to apply schema matching techniques to blend them for a given visualization. Furthermore, the system provides a simple drag-and-drop interface for the user to specify the fields for a visualization, and if there are fields from multiple data sources in play at the same time, the blending system infers how to join them to satisfy the needs of the visualization.

In the case that something goes wrong, for example, if the schema matching could not succeed, the blending system provides a simple interface for specifying data source relationships and how blending should proceed. Additionally, the system provides several techniques for managing the impact of dirty data on blending, which we discuss more in Part 5 of this series.

Another Example: Patient Falls Dashboard [3]

NOTE: The following example is from Jonathan Drummey via the Drawing with Numbers blog site. The example uses Tableau v7, but at the end of his instructions on how he creates this dashboard in Tableau v7, Mr. Drummey explains how the steps became simpler in Tableau v8. I have included a reference to this blog post on his site in the reference section of my blog entry. The “I”, “me” voice you read in this example is that of Mr. Drummey.

As part of improving patient safety, we track all patient falls in our healthcare system, along with the number of patient days – the total number of days of inpatient stays at the hospital. Every month we report to the state our “fall rate,” a metric of the number of falls with injury for certain units in the hospital per 1,000 patient days, i.e. days that patients are at the hospital. Our annualized target is to have less than 0.7 falls with injury per 1,000 patient days.

A goal for our internal dashboard is to show the last 13 months of fall rates as a line chart, with the most recent fall events as a bar chart, in a combined chart, along with a separate text table showing some details of each fall event. Here’s the desired chart, with mocked-up data:

 

[Image: combined bar and line chart with mocked-up data]

On the surface, blending this data seems really straightforward. We generate a fall rate every month for every reporting unit, so use that as the primary, then blend in the falls as they happen. However, this approach has the following issues:

  • Sparse Data – As I’m writing this, it’s March 7th. We usually don’t get the denominator of the patient days for the prior month (February) for a few more days yet, so there won’t be any February row of measure data to use as the primary to get the February fall events to show on the dashboard. In addition, there still wouldn’t be any March data to get the March fall events. Sometimes when working with a blend, the solution is to flip our choices for the primary and secondary data source. However, that doesn’t work here either, because a unit might go for months or years without a patient fall, so there wouldn’t be any fall events to blend the measure data into.
  • Falls With and Without Injury – In the bar chart, we don’t just want to show the number of patient falls, we want to break down the falls by whether or not they were falls with injury – the numerator for the fall rate metric – and all other falls. The goal of displaying that data is to help the user keep in mind that as important as it is to reduce the number of falls with injury, we also need to keep the overall number of falls down as well. No fall = no chance of fall with injury.
  • Unit Level of Detail – Because the blend needs to work at the per-unit level of detail as well as across all reporting units, that means (in version 7 at least) that the Unit needs to be in the view for the blend to work. But we want to display a single falls rate no matter how many units are selected.

Sparse Data

To deal with issue of sparse data, there are a few possible solutions:

  • Change the combined line and bar chart into separate charts. This would perhaps be the easiest, though it would require some messing about with filters, hidden reference lines, and continuous date axes to ensure that the two charts had similar axis ranges no matter what. However, that would miss out on the key capability of the combined chart to directly see how a fall contributes to the fall rate. In addition, there would be no reason to write this blog post. :)
  • Perform padding in the data source, either via a query/view or Custom SQL. In an earlier version of this project I’d built this, and maintaining a bunch of queries with Cartesian joins isn’t my favorite cup of tea.
  • Building a scaffold data source with all combinations of the month and unit and using the scaffold as the primary data source. While possible, this introduces maintenance issues when there’s a need for additional fields at a finer level of detail. For example, the falls measure actually has three separate fall rates – monthly, quarterly, and annual. These are generated as separate rows in our measures data and the particular duration is indicated by the Period field. So the scaffold source would have to include the Period field to get the data, but then that could be too much detail for the blended fall event data, and make for more complexity in the calculations to make sure the aggregations worked properly.
  • Do a tiny bit of padding in the query, then do the rest in Tableau via Show Missing Values aka domain padding. As I’d noted in an earlier post on blending, domain padding occurs before data is blended so we can pad out the measure data through the current date and then include all the falls. This is the technique I chose, for the reason that padding one row to the data is trivial and turning on Show Missing Values is a couple of mouse clicks. Here’s how I did that:

In my case, the primary data source is a Microsoft Access query that gets the falls measure results from a table that also holds results for hundreds of other metrics that we track. I created a second query with the same number of columns that returns Null for every field except the Measure Date, which has a value of 1/1/1900. Then a third query UNION’s those two queries together, and that’s what is used as the data source in Tableau.

Then, in Tableau, I added a calculated field called Date with the following formula:

//used for padding out display to today
IF [Measure Date] == #1/1/1900# THEN 
    TODAY() 
ELSE 
    [Measure Date] 
END

The measure results data contains a row per measure, reporting unit, and the period. These are pre-calculated because the data is used in a variety of different outputs. Since in this dashboard we are combining the results across units, we can’t just use the rate, we need to go back to the original numerator and denominator. So, I also created a new field for the Calculated Rate:

SUM([Numerator])/SUM([Denominator])
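To see why we can’t just average the pre-calculated per-unit rates, take two hypothetical units (made-up numbers, purely for illustration): Unit A with a numerator of 2 and a denominator of 500, and Unit B with a numerator of 1 and a denominator of 1,500. Averaging the two unit rates gives (0.0040 + 0.00067) / 2 ≈ 0.0023, while the correct combined rate is (2 + 1) / (500 + 1,500) = 0.0015. The SUM()/SUM() calculation above returns the latter.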

Now it’s possible to start building the line chart view:

  1. Put Month(Date) – the full month/year version as a discrete – on Columns, Calculated Rate on Rows, and Period on the Color Shelf. This shows only the data that exists in the data source, including the empty value for the current month (March in this case):

 

[Screenshot: the line chart showing only the months present in the data – February is missing]

 

  2. Turn on Show Missing Values for Month(Date) to trigger domain padding. Now we can see the additional column(s) Tableau has added in for the missing month(s) between January and the current month – February in this case:

 

[Screenshot: the line chart after turning on Show Missing Values, with February padded in]

 

With a continuous (green pill) date, this particular set-up won’t work in version 8: Tableau’s domain padding is not triggered when the last value of the measure is Null. I’m hoping this is just an issue with the beta; I’ll revise this section with an update once I find out what’s going on.

Even though the measure data only has end-of-month dates, I used Month(Date) instead of Exact Date because of two combined factors: first, the default import of most date fields from MS Jet sources turns them into DateTime fields; second, Show Missing Values won’t work on an Exact Date for a DateTime field – you have to assign an aggregation to a DateTime (even Second will work). This is because domain padding at that precision can create an immense number of new rows and cause Tableau to run out of memory, so Tableau keeps the option off unless you ask for it. Note that you can turn on Show Missing Values for an Exact Date on a Date field.
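If you really did want an Exact Date here, one possible workaround (a sketch only – not what I used) would be a calculated field that casts the DateTime down to a Date, since, as noted above, Show Missing Values is available for an Exact Date on a Date field:

//cast the DateTime down to a Date so Exact Date supports Show Missing Values
DATE([Date])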

  3. Now for some cleanup: for the purposes of this dashboard, filter Period to remove Monthly (we do quarterly reporting), but leave in Null because that’s needed for the domain padding.
  4. Right-click Null on the Color Legend and Hide it. We Hide rather than Exclude because excluding it would remove the extra row that drives the domain padding.
  5. Set up a relative date filter on the Date field for the last 13 months. This filter works just fine with the domain padding.

Filtering on Unit

Here’s a complicating factor: if we add a filter on Unit, there’s a Null listed in it:

 

[Screenshot: the Unit quick filter, with a Null entry listed]

I’d like to see just the list of units. But if we filter that Null out, we lose the domain padding – the last date is now January 2013:

 

[Screenshot: the chart with the Null filtered out – domain padding is lost and the last date is January 2013]

 

One solution would be to alter the padding to add a padding row for every unit instead of just one. However, Tableau doesn’t let us simply hide elements in a quick filter, and there are more reporting units in our production data than we display on the dashboards, yet the all-unit rate needs to include all of the data. So instead I chose a parameter-based filter: a parameter with All and each of the units, and a calculated field called “Chosen Unit Filter”, set to Filter on True, with the following formula:

[Choose Unit] == "All" OR [Choose Unit] == [Unit]

Falls With and Without Injury

In a fantasy world, to create the desired stacked bars I’d be able to drag Number of Records from the secondary data source (i.e. the number of fall events) onto the view, drag an Injury indicator onto the Color Shelf, and be done. However, that runs into the issue of having a finer level of detail in the secondary than in the primary, which I’ll walk through solutions for in the next section. In this case, since there are only two different numbers, the easy way is to generate two separate measures, then use Measure Names/Measure Values to create the stacked bars – Measure Values on Rows, and Measure Names on the Color Shelf. Here’s the basic calculation for Falls with Injury:

SUM(IF [Injury] != "None" THEN 1 ELSE 0 END)

We’re using a row-level calculated field to generate the measure, and a slightly different calc for Falls w/out Injury (a sketch follows below).
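Assuming falls without injury are simply the fall events where the Injury field is “None”, the companion calc would presumably just flip the test:

//Falls w/out Injury – assumes Injury = "None" marks a no-injury fall
SUM(IF [Injury] == "None" THEN 1 ELSE 0 END)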

Unit Level of Detail

When we want to blend in Tableau at a finer level of detail and aggregate to a higher level, historically there have been three options:

  • Don’t use blending at all; instead, use a query to perform the “blend” outside of Tableau. When the data lives in totally different data sources this can be more difficult, but not impossible – use one of the systems (or a different system) to create a federated data source, for example by adding your Oracle table as an ODBC connection to your Excel data, then building the query on that. In this case, we don’t have to do that.
  • Use Tableau’s Primary Groups feature to “push” the detail from the secondary into the primary data source. This is a really helpful feature; the one drawback is that it’s not dynamic, so any time there are new groupings in the secondary it has to be re-run. Personally, I prefer automating as much as possible, so I tend not to use this technique.
  • Set up the view with the needed dimensions in it – on the Level of Detail Shelf, for example – and then use table calculations to do the aggregation. This is how I’ve typically built this kind of view.

Tableau version 8 adds a fourth option:

  • Tell Tableau what fields to blend on, then bring in your measures from the secondary.

I’ll walk through the table calculation technique, which works the same in versions 7 and 8, and then how to take advantage of v8’s new feature.

Using Table Calculations to Aggregate Blended Data

In order to blend the falls data at the hospital unit level – to make sure we’re only showing falls for the selected unit(s) – the Unit has to be in the view (on the Rows, Columns, or Pages Shelves, or on the Marks Card). Since we don’t actually need to display the Unit, the Level of Detail Shelf is where we’ll put that dimension. However, just adding it to the view leads to a bar for each unit; for example, in April 2012 one unit had one fall with injury and another had two, and two units each had two falls without injury:

 

[Screenshot: the stacked bars broken into a separate bar per unit]

 

To control things like tooltips (and performance, in some cases), it’s a lot easier to have a single bar for each month/measure. To do that, we turn to a table calculation. Here’s the Falls w/Injury for v7 Blend calculated field, set up in the secondary data source:

IF FIRST()==0 THEN
	TOTAL([Falls w/Injury])
END

This table calculation has a Compute Using of Unit, so it partitions on the Month of Date. The IF FIRST()==0 part ensures that there is only one mark per partition. I’m using the TOTAL() aggregation here because it’s easier to set up and maintain. The alternative is to use WINDOW_SUM(), but prior to version 8 that has some performance issues, so the calc would be:

IF FIRST()==0 THEN
	WINDOW_SUM(SUM([Falls w/Injury]), 0, IIF(FIRST()==0, LAST(), 0))
END

The , 0, IIF(FIRST()==0, LAST(), 0) part is necessary in version 7 to optimize performance; you can get rid of it in version 8.

You can also do a table calculation in the primary that accesses fields in the secondary; however, TOTAL() can’t be used across blended data sources, so you’d have to use the WINDOW_SUM() version.

With a second table calculation for Falls w/out Injury, the view can now be built, starting with the line chart from above:

  1. Add Measure Names (from the Primary) to the Filters Shelf and filter it for a couple of random measures (temporary placeholders that we’ll remove in step 8).
  2. Put Measure Values on the Rows Shelf.
  3. Click on the Measure Values pill on Rows to set the Mark Type to Bar.
  4. Drag Measure Names onto the Color Shelf (for the Measure Values marks).
  5. Drag Unit onto the Level of Detail Shelf (for the Measure Values marks).
  6. Switch to the Secondary to put the two Falls for v7 Blend calcs onto the Measure Values Shelf.
  7. Set their Compute Usings to Unit.
  8. Remove the 2 measures chosen in step 1.
  9. Clean up the view – turn on dual axes, move the secondary axis marks to the back, change the axis tick marks to integers, set axis titles, etc.

This is pretty cool: we’re using domain padding to fill in for non-existent data, and then blending at one level of detail while aggregating to another, just for the second axis. Here’s the v7 workbook on Tableau Public:

Patient Falls Dashboard – Click on image above to go to Tableau Public

Tableau Version 8 Blending – Faster, Easier, Better

For version 8, Tableau made it possible to blend data without requiring the linking fields to be in the view. Here’s how I built the above v7 view in v8:

  1. Add Measure Names (from the Primary) to the Filters Shelf and filter it for a couple of random measures (temporary placeholders that we’ll remove in step 7).
  2. Put Measure Values on the Rows Shelf.
  3. Click on the Measure Values pill on Rows to set the Mark Type to Bar.
  4. Drag Measure Names onto the Color Shelf (for the Measure Values marks).
  5. Switch to the Secondary and click the chain link icon next to Unit to turn on blending on Unit.
  6. Drag the Falls w/Injury and Falls w/out Injury calcs onto the Measure Values Shelf.
  7. Remove the 2 measures chosen in step 1.
  8. Clean up the view – turn on dual axes, move the secondary axis marks to the back, change the axis tick marks to integers, set axis titles, etc.

The results are the same as in v7.

Next: Tableau’s Data Blending Architecture

———————————————————————————-

References:

[1] Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau, University of Washington and Tableau Software, Seattle, Washington, March 2012, http://homes.cs.washington.edu/~kmorton/modi221-mortonA.pdf.

[2] Hans Rosling, Wealth & Health of Nations, Gapminder.org, http://www.gapminder.org/world/.

[3] Jonathan Drummey, Tableau Data Blending, Sparse Data, Multiple Levels of Granularity, and Improvements in Version 8, Drawing with Numbers, March 11, 2013, http://drawingwithnumbers.artisart.org/tableau-data-blending-sparse-data-multiple-levels-of-granularity-and-improvements-in-version-8/.

 

An Introduction to Data Blending – Part 3 (Benefits of Blending Data)

Readers:

In Part 2 of this series on data blending, we delved deeper into understanding what data blending is. We also examined how data blending is used in Hans Rosling’s well-known Gapminder application.

Today, in Part 3 of this series, we will dig even deeper by examining the benefits of blending data.

Again, much of Parts 1, 2, and 3 is based on a research paper written by Kristi Morton from The University of Washington (and others) [1].

You can learn more about Ms. Morton’s research as well as other resources used to create this blog post by referring to the References at the end of the blog post.

Best Regards,

Michael

Benefits of Blending Data

In this section, we will examine the advantages of using the data blending feature for integrating datasets. Additionally, we will review another illustrative example of data blending using Tableau.

Integrating Data Using Tableau

In Ms. Morton’s research, Tableau is described as having two ways of integrating data. First, in the case where the data sets are collocated (or can be collocated), Tableau formulates a query that joins them to produce a visualization. Second, in the case where the data sets are not (and cannot be) collocated, Tableau federates queries to each data source and creates a dynamic, blended view consisting of the joined result sets of those queries. For the purpose of exploratory visual analytics, Ms. Morton et al. found that data blending is a complementary technology to the standard collocated approach, with the following benefits:

  • Resolves many data granularity problems
  • Resolves collocation problems
  • Adapts to needs of exploratory visual analytics

Figure 1 - Company Tables

Image: Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau. [1]

Resolves Data Granularity Problems

Oftentimes a user wants to combine data that is not at the same granularity (i.e., the datasets have different primary keys). For example, let’s say an employee at company A wants to compare the yearly growth of sales against competitor company B. Company B’s dataset (see Figure 1 above) contains detailed quarterly sales (quarter, year is the primary key), while company A’s dataset only includes yearly sales (year is the primary key). If the employee simply joins these two datasets on year, then each row from A is duplicated for each quarter in B for that year, resulting in an inaccurate overestimate of A’s yearly earnings.

This duplication problem can be avoided if, for example, company B’s sales dataset were first aggregated to the level of year and then joined with company A’s dataset. Data blending detects that the datasets are at different granularities by examining their primary keys, and notes that the common field for joining them is year. To join on year, an aggregation query is issued to company B’s dataset, which returns B’s sales aggregated to the yearly level, as shown in Figure 1. This result is blended with company A’s dataset to produce the desired visualization of yearly sales for companies A and B.

The blending feature does all of this on-the-fly without user-intervention.
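In relational terms, the on-the-fly aggregation amounts to something like the following (a sketch in generic SQL; the CompanyA/CompanyB table and column names are illustrative, not taken from the paper):

SELECT a.[year],
       a.sales AS sales_a,
       b.sales AS sales_b
FROM CompanyA AS a
LEFT JOIN (
    -- roll B's quarterly rows up to the yearly grain before joining,
    -- so no row from A is ever duplicated
    SELECT [year], SUM(sales) AS sales
    FROM CompanyB
    GROUP BY [year]
) AS b
ON b.[year] = a.[year];

The left join also mirrors blending’s behavior of keeping the primary (company A) complete even for years where B has no data.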

Resolves Collocation Problems

As mentioned in Part 1, a managed repository is expensive and untenable. In other cases, the data repository may have a rigid structure, as with cubes, to ensure performance, support security, or protect data quality. Furthermore, it is often unclear whether an external data set of uncertain value is worth the effort of integrating. The user may not know until she has started exploring the data whether it has enough value to justify spending the time to integrate and load it into her repository.

Thus, one of the paramount benefits of data blending is that it allows the user to quickly start exploring their data; as they explore, the integration happens automatically as a natural part of the analysis cycle.

An interesting final benefit of the blending approach is that it enables users to seamlessly integrate across different types of data (which usually exist in separate repositories) such as relational, cubes, text files, spreadsheets, etc.

Adapts to Needs of Exploratory Visual Analytics

A key benefit of data blending is its flexibility; it gives the user the freedom to view their blended data at different granularities and control how data is integrated on-the-fly. The blended views are dynamically created as the user is visually exploring the datasets. For example, the user can drill-down, roll-up, pivot, or filter any blended view as needed during her exploratory analysis. This feature is useful for data exploration and what-if analysis.

Another Illustrative Example of Data Blending

Figure 2 (below) illustrates the possible outcomes of an election for District 2 Supervisor of San Francisco. With this type of visualization, the user can select different election styles and see how their choice affects the outcome of the election.

What’s interesting from a blending standpoint is that this is an example of a many-to-one relationship between the primary and secondary datasets. This means that the fields being left-joined in from the secondary data source match multiple rows in the primary dataset, so those values get duplicated; any subsequent aggregation would then reflect the duplicate data, resulting in overestimates. For example, if one secondary row matched ten primary rows, a naive join would copy its measure ten times and a SUM would come out ten times too large. The blending feature prevents this scenario by performing all aggregation before the left-join duplicates any data.

Figure 2 - San Francisco Election

 Image: Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau. [1]

Next: Data Blending Design Principles

——————————————————————————————————–

References:

[1] Kristi Morton, Ross Bunker, Jock Mackinlay, Robert Morton, and Chris Stolte, Dynamic Workload Driven Data Integration in Tableau, University of Washington and Tableau Software, Seattle, Washington, March 2012, http://homes.cs.washington.edu/~kmorton/modi221-mortonA.pdf.

[2] Hans Rosling, Wealth & Health of Nations, Gapminder.org, http://www.gapminder.org/world/.