Using the R Integration functionality, how to perform Text Mining on a MicroStrategy report and display the result
Here is another great post in the MicroStrategy Community from Jaime Perez and his team. A lot of work went into the preparation of this post, and it shows some great ways to use the “R” integration with MicroStrategy.
Contributors from Jaime’s team include:
Text Mining Using R Integration in MicroStrategy
Users may wish to perform text mining using R on the result of any arbitrary MicroStrategy report and display the result. One obstacle is that the number of output elements does not always match the number of input rows. For example, a report may have three attributes named ‘Age groups’, ‘Reviewer’, and ‘Survey feedback’ and the report might display four rows of feedback as follows:
If the above report result is sent to R as an input and the R script breaks down each sentence of the feedback into the term frequency that is grouped by the age groups, it will have 18 rows.
Since the number of output elements is greater than the number of the MicroStrategy report rows, the report execution will fail. Using the objects in the Tutorial project, this technical note (TN207734) describes one way to display the result of text mining on a MicroStrategy report, using the R integration functionality.
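To make the row-count mismatch concrete, here is a minimal sketch in Python (used only for illustration; the technical note itself works in R). The four feedback rows are invented and are not the rows from the original report, and `term_frequency_by_group` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical report rows: (age group, reviewer, survey feedback).
# These values are invented for illustration only.
report_rows = [
    ("20-29", "Reviewer A", "great product great price"),
    ("20-29", "Reviewer B", "fast delivery"),
    ("30-39", "Reviewer C", "great service"),
    ("30-39", "Reviewer D", "slow delivery poor service"),
]


def term_frequency_by_group(rows):
    """Return one (group, term, count) row per distinct term in each group."""
    counts = {}
    for group, _reviewer, feedback in rows:
        counts.setdefault(group, Counter()).update(feedback.lower().split())
    return [
        (group, term, n)
        for group, counter in counts.items()
        for term, n in sorted(counter.items())
    ]


term_rows = term_frequency_by_group(report_rows)
print(len(report_rows), "input rows ->", len(term_rows), "output rows")
```

With this invented data, four input rows become ten output rows, which is the kind of mismatch that causes the report execution to fail.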
– Following the instructions in TN43665, the MicroStrategy R Integration Pack has already been installed on the Intelligence Server.
The Steps Involved
STEP 1: Decide on the input values that need to be sent to R via R metrics
The first step is to decide which data to perform text mining on. In this technical note, the sample report lets users select one year element and an arbitrary number of category elements, and specify the Revenue amount in prompts. The report will then display the value of the normalized TF-IDF (term frequency-inverse document frequency) for every word showing up in the qualified Item attribute elements, grouped by the Category elements.
A user may select the following values for each prompt and the report may look as shown below.
- Year: 2012
- Category: Books, Movies, and Music
- Revenue: greater than $15,000
Eventually, the user may want to see the normalized TF-IDF for every word showing up in the Item attribute elements as shown below:
Since the final output displays each word from the Item attribute, grouped by the Category elements, the necessary input values to R are as follows:
- The elements of the Category attribute.
- The elements of the Item attribute.
STEP 2: Create metrics to pass the input values to R
The input values to R from MicroStrategy must be passed via metrics. Hence, on top of the current grid objects, additional metrics need to be created. For this sample report, since the inputs are the elements of two attributes, create two metrics with the following definitions so that the elements are displayed as metrics.
STEP 3: R script – Phase 1: Define input and output variables and write R script to obtain what you wish to display in a MicroStrategy report
In the R script, define (1) a variable that receives the inputs from MicroStrategy and (2) a variable that will be sent back to MicroStrategy as the output, as depicted below. Since the number of output elements must match the number of input elements, the output is defined as “output = mstrInput2” to avoid errors. In other words, this script executes R functions to obtain the data that you wish to display in a MicroStrategy report, but the output is the same as the input. How to display the result in a MicroStrategy report is covered later in this technical note.
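The pass-through idea can be sketched as follows, in Python purely for illustration (the note's script is written in R). The function name `run_text_mining` and the `side_effect` callback are hypothetical. The point is only that the real work happens as a side effect while the returned value has exactly one element per input row.

```python
def run_text_mining(categories, items, side_effect):
    """Do the real work as a side effect and return the second input unchanged."""
    if len(categories) != len(items):
        raise ValueError("inputs must have one value per report row")
    # The side effect stands in for the text mining work,
    # e.g. computing TF-IDF and storing it somewhere else.
    side_effect(categories, items)
    # Same length as the input, so the row counts always match.
    return list(items)


captured = []
categories = ["Books", "Books", "Movies"]
items = ["Item one", "Item two", "Item three"]
output = run_text_mining(categories, items, lambda c, i: captured.append(len(i)))
```

Here `output` equals `items`, so the caller always receives as many values as it sent, regardless of how many rows the side effect produced.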
In this technical note, after manipulating the input value, we assume that the variable named ‘norm.TF.IDF’ in the R script holds the values of the TF-IDF for each term.
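The note does not spell out how the normalized TF-IDF is calculated, so the following Python sketch is only one plausible variant, under stated assumptions: each Category is treated as one document made of its Item names, term frequency is the share of a word within the category, inverse document frequency is log(N / df), and each category's weights are scaled to unit (L2) length. The sample item names and the function `normalized_tf_idf` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def normalized_tf_idf(categories, items):
    """Return {(word, category): weight}, each category scaled to unit length."""
    term_counts = defaultdict(Counter)
    for category, item in zip(categories, items):
        term_counts[category].update(item.lower().split())

    n_docs = len(term_counts)
    doc_freq = Counter()
    for counter in term_counts.values():
        doc_freq.update(counter.keys())  # count each word once per category

    weights = {}
    for category, counter in term_counts.items():
        total = sum(counter.values())
        raw = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counter.items()
        }
        norm = math.sqrt(sum(v * v for v in raw.values()))
        for word, value in raw.items():
            weights[(word, category)] = value / norm if norm else 0.0
    return weights


# Invented sample inputs, one (category, item) pair per report row.
sample_categories = ["Books", "Books", "Movies", "Music"]
sample_items = ["The Art of War", "War and Peace", "The Movie", "Art Songs"]
weights = normalized_tf_idf(sample_categories, sample_items)
```

In this sample, "war" appears only in Books and appears twice there, so it outweighs "the", which is shared with Movies.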
STEP 4: Create tables in the data warehouse to store the value of your R output
In order to display the values of ‘norm.TF.IDF’ in a MicroStrategy report, tables to hold the result need to be created in the data warehouse. In other words, an additional report will later have to be created in MicroStrategy, and it will extract the data from the database tables created in this section.
In this specific example, the variable ‘norm.TF.IDF’ has the elements of words (terms) and categories and the values of the normalized TF-IDF. Considering the types of data, the first two should be displayed as attributes and the values of the normalized TF-IDF should be presented in a metric. Hence, two lookup tables to hold the term and category elements and one fact table need to be created to store all the data. On top of these tables, one relationship table is also required since the relationship between words and categories is many-to-many.
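As a rough sketch of that layout, the snippet below creates the four tables in an in-memory SQLite database from Python. The table names and the fact name weight come from this note; the column names, the surrogate ID columns, and the data types are assumptions made for illustration, and a real warehouse (SQL Server in this note) would use its own DDL.

```python
import sqlite3

# Column names and types are hypothetical; only the table names and the
# "weight" fact come from the technical note.
SCHEMA = """
CREATE TABLE tm_Category (category_id INTEGER PRIMARY KEY, category_desc TEXT NOT NULL);
CREATE TABLE tm_Word     (word_id INTEGER PRIMARY KEY, word_desc TEXT NOT NULL);
CREATE TABLE tm_Word_Cat (word_id INTEGER NOT NULL, category_id INTEGER NOT NULL,
                          PRIMARY KEY (word_id, category_id));
CREATE TABLE tm_Fact     (word_id INTEGER NOT NULL, category_id INTEGER NOT NULL,
                          weight REAL NOT NULL,
                          PRIMARY KEY (word_id, category_id));
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
print(tables)
```

The composite key on tm_Word_Cat is what expresses the many-to-many relationship: a word may appear under several categories and a category contains many words.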
STEP 5: R script – Phase 2: Populate the tables in your R script
As previously mentioned, the variable named ‘norm.TF.IDF’ contains the values, which a user wishes to display in a MicroStrategy report as shown below.
In this R script, four more variables are defined from ‘norm.TF.IDF’, each of which contains the subset of data that will be inserted into the database tables.
tm_Category holds the unique elements of the Category.
tm_Word holds the unique elements of the Word (Term).
tm_Word_Cat stores the values of the many-to-many relationship.
tm_Fact contains the values of TF-IDF for every Word-Category combination.
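The split into four subsets might look like the following Python sketch. It assumes ‘norm.TF.IDF’ can be viewed as rows of (word, category, weight) and that surrogate integer IDs are assigned to words and categories. Both are assumptions, since the note does not show the structure of the variable, and the sample weights are invented.

```python
# Hypothetical stand-in for norm.TF.IDF: one (word, category, weight) row per pair.
norm_tf_idf = [
    ("war", "Books", 0.81),
    ("art", "Books", 0.12),
    ("art", "Music", 0.38),
    ("songs", "Music", 0.92),
]


def split_into_tables(rows):
    """Derive the lookup, relationship, and fact rows from (word, category, weight)."""
    categories = sorted({category for _word, category, _weight in rows})
    words = sorted({word for word, _category, _weight in rows})
    category_id = {name: i for i, name in enumerate(categories, start=1)}
    word_id = {name: i for i, name in enumerate(words, start=1)}

    tm_category = [(category_id[c], c) for c in categories]
    tm_word = [(word_id[w], w) for w in words]
    tm_fact = [(word_id[w], category_id[c], weight) for w, c, weight in rows]
    tm_word_cat = sorted({(w_id, c_id) for w_id, c_id, _weight in tm_fact})
    return tm_category, tm_word, tm_word_cat, tm_fact


tm_category, tm_word, tm_word_cat, tm_fact = split_into_tables(norm_tf_idf)
```

Note that "art" appears under both Books and Music, so tm_Word holds it once while tm_Word_Cat and tm_Fact hold one row per category.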
In the R script, populate the database tables with the above four subsets of ‘norm.TF.IDF’.
# Load RODBC
library(RODBC)

# RODBC package: assign ch the connectivity information
ch <- odbcConnect("DSN_name")

# Delete all the rows of the tables
sqlClear(ch, "tm_Category", errors = TRUE)
sqlClear(ch, "tm_Word", errors = TRUE)
sqlClear(ch, "tm_Word_Cat", errors = TRUE)
sqlClear(ch, "tm_Fact", errors = TRUE)

# SQL: insert the data into tables; use parameterized query
sqlSave(ch, tm_Category, tablename = "tm_Category", rownames=FALSE, append=TRUE, fast = TRUE)
sqlSave(ch, tm_Word, tablename = "tm_Word", rownames=FALSE, append=TRUE, fast = TRUE)
sqlSave(ch, tm_Word_Cat, tablename = "tm_Word_Cat", rownames=FALSE, append=TRUE, fast = TRUE)
sqlSave(ch, tm_Fact, tablename = "tm_Fact", rownames=FALSE, append=TRUE, fast = TRUE)

# Close the channel
odbcClose(ch)
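One design point worth considering with a clear-then-insert approach is what happens if an insert fails after the rows have been deleted. The sketch below is not part of the original note: it shows, in Python with an in-memory SQLite database, how the delete and the insert could be wrapped in one transaction so that a failed reload leaves the previous contents in place. The helper `reload_table` and the column names are hypothetical.

```python
import sqlite3


def reload_table(conn, table, columns, rows):
    """Delete all rows from `table` and insert `rows` in a single transaction."""
    # Table and column names cannot be bound as parameters, so they are
    # interpolated here; only pass trusted, hard-coded identifiers.
    placeholders = ", ".join("?" for _ in columns)
    column_list = ", ".join(columns)
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(
            f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})", rows
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tm_Word (word_id INTEGER PRIMARY KEY, word_desc TEXT NOT NULL)")
reload_table(conn, "tm_Word", ["word_id", "word_desc"], [(1, "old")])
reload_table(conn, "tm_Word", ["word_id", "word_desc"], [(1, "art"), (2, "war")])

# A failing reload (duplicate primary key) leaves the previous contents in place.
try:
    reload_table(conn, "tm_Word", ["word_id", "word_desc"], [(1, "x"), (1, "y")])
except sqlite3.IntegrityError:
    pass

current = conn.execute(
    "SELECT word_id, word_desc FROM tm_Word ORDER BY word_id"
).fetchall()
```

Whether the same protection is needed in the R script depends on the database and driver settings in use.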
STEP 6: Create and add an R metric, which implements the R script
The R script is done. It is time to implement this R script from MicroStrategy by creating an R metric. In the deployR interface, open the R script and define the input and output that you specified in Step 3 as follows. Since the elements of the Category and Item attributes are characters, choose “String” as their data type. Likewise, since the output is the same as mstrInput2, its data type is also set to String.
Create a stand-alone metric and paste the metric definition from the deployR utility. Then, replace the last parameters with the Category and Item metrics that you created in Step 2.
Add the R metric to the report.
The report and R will perform the following actions after the R metric is added:
i. The report lets users select the prompt answers
ii. MicroStrategy sends the Category and Item elements to R via the R metric
iii. R performs text mining to calculate the TF-IDF based on the inputs
iv. R generates subsets of the TF-IDF
v. R truncates the database tables and populates them with the subset of the TF-IDF
vi. R sends the output (which is actually the input) to MicroStrategy
vii. The report displays the values of all objects, including the R metric
STEP 7: Create MicroStrategy objects to display the data
From the tables created in Step 4, create the Word and Category attributes and the fact named weight. The object relationship is as depicted below.
Now, create a new report with these objects. This report will obtain and display the data from the database tables.
STEP 8: Utilize the report level VLDB properties to manipulate the order of the report execution jobs
There are currently two reports, named R1 and R2 as described below:
- R1: A report which prompts users to specify the report requirements and implements the R script executing text mining
- R2: A report which obtains the result of text mining from the database and displays it
If the two reports are placed in a document as datasets as shown below, there is one problem: R2 may start its execution before R1 populates the database tables with the result of text mining.
In order to force R2 to execute its job after the completion of R1, the VLDB properties PRE/POST statements along with an additional database table may be used. The table tm_Flag contains a completeFlag value of 0 or 1. R2 is triggered when R1 sets the value of completeFlag to 1. The detailed steps are described below with the script for SQL Server.
i. Create another table in the database, which holds the value of 1 or 0
CREATE TABLE tm_Flag (
    completeFlag int
)
INSERT INTO tm_Flag VALUES(0)
ii. In the VLDB property ‘Report Post Statement 1’ of the R1 report, define a Transact-SQL statement that changes the value of completeFlag to 1.
DECLARE @query as nvarchar(100)
SET @query = 'UPDATE tm_Flag SET completeFlag = 1'
EXEC sp_executesql @query
iii. Define the VLDB property ‘Report Pre Statement 1’ in R2 so that it will check the value of completeFlag every second and loop until it turns to 1. After the loop, it will revert the value of completeFlag back to 0. After this Report Pre Statement, R2 will obtain data from the database, which has been populated by R1.
DECLARE @intFlag INT
SET @intFlag = (select max(completeFlag) from tm_Flag)
WHILE(@intFlag = 0)
BEGIN
    WAITFOR DELAY '00:00:01'
    SET @intFlag = (select max(completeFlag) from tm_Flag)
END
DECLARE @query as nvarchar(100)
SET @query = 'UPDATE tm_Flag SET completeFlag = 0'
EXEC sp_executesql @query
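The flag handshake between R1 and R2 can be sketched outside the database as well. The Python snippet below mimics it against an in-memory SQLite table, using a single connection where the real setup has two separate report jobs. The function names are hypothetical, and the timeout is an addition that the Transact-SQL loop above does not have; it is included only so that the sketch cannot wait forever if R1 never completes.

```python
import sqlite3
import time


def signal_complete(conn):
    """What R1's post statement does: mark the text mining tables as ready."""
    with conn:
        conn.execute("UPDATE tm_Flag SET completeFlag = 1")


def wait_until_complete(conn, poll_seconds=1.0, timeout_seconds=60.0):
    """What R2's pre statement does: poll until the flag is 1, then reset it to 0.

    Returns True if the flag was seen, False if the (hypothetical) timeout expired.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        (flag,) = conn.execute("SELECT MAX(completeFlag) FROM tm_Flag").fetchone()
        if flag == 1:
            with conn:
                conn.execute("UPDATE tm_Flag SET completeFlag = 0")
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tm_Flag (completeFlag INTEGER)")
conn.execute("INSERT INTO tm_Flag VALUES (0)")
conn.commit()

# Before R1 signals, the wait gives up; after R1 signals, it succeeds and resets.
timed_out = wait_until_complete(conn, poll_seconds=0.01, timeout_seconds=0.05)
signal_complete(conn)
ready = wait_until_complete(conn, poll_seconds=0.01, timeout_seconds=0.05)
flag_after = conn.execute("SELECT MAX(completeFlag) FROM tm_Flag").fetchone()[0]
```

Resetting the flag to 0 after a successful wait is what allows the same pair of reports to be run again later.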
Overall execution flow
1. Answer prompts
2. Only the text mining result is displayed to users
Third Party Software Installation:
WARNING: The third-party product(s) discussed in this technical note is manufactured by vendors independent of MicroStrategy. MicroStrategy makes no warranty, express, implied or otherwise, regarding this product, including its performance or reliability.
In my last blog post, I blogged about the new MicroStrategy Community. Jaime Perez, VP of Worldwide Customer Services, and his crew have come up with a better way for us to engage with MicroStrategy as well as his team.
Speaking of Jaime, last June, he posted this great tip on the MicroStrategy Knowledgebase site as a TechNote. I am reblogging it since it is one of the most frequent questions I get asked and I find it an extremely useful Tip & Trick. Also, this will give you an idea of the great stuff being posted in the MicroStrategy Community.
MicroStrategy and Cross Joins
In some scenarios, one may encounter cross joins in the SQL View of a standard SQL report in MicroStrategy. Cross joins appear when two tables do not have any common key attributes on which they can inner join. As a result, the two tables are combined into one result that contains every combination of rows from both tables, which degrades performance and commonly increases execution times. Sometimes the increase in execution time can be very severe. Therefore, it is important to understand some simple steps that can be performed to resolve a cross join, as well as some steps to understand why it may be appearing in the SQL View of the report.
One common occurrence of a cross join is when a report contains at least two unrelated attributes in the grid, and no metrics are present to relate the unrelated attributes via a fact table. Such an occurrence can be resolved in a few ways:
1. Create a relationship filter, set the output level as the unrelated attributes (or the entire report level), and then relate these by a Logical Table object
2. Create a relationship filter, set the output level as the unrelated attributes (or the entire report level), and then relate these by a Fact object
3. Add a metric to the report that uses a fact from a table to which both attributes can inner join
This provides a pathway from the fact table to the lookup tables in which the unrelated attributes are sourced from. The result is an inner join between the fact table and the lookup tables, which resolves the cross join between the two unrelated lookup tables.
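The effect can be seen with plain SQL, independent of MicroStrategy. The following Python sketch uses an in-memory SQLite database with invented tables (`lu_region`, `lu_category`, `fact_sales`) and invented rows. It is not the SQL that the MicroStrategy SQL Engine generates; it only contrasts a cross join of two unrelated lookup tables with inner joins routed through a fact table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lu_region   (region_id INTEGER PRIMARY KEY, region_name TEXT);
CREATE TABLE lu_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE fact_sales  (region_id INTEGER, category_id INTEGER, revenue REAL);
INSERT INTO lu_region   VALUES (1, 'North'), (2, 'South'), (3, 'West');
INSERT INTO lu_category VALUES (10, 'Books'), (20, 'Movies');
INSERT INTO fact_sales  VALUES (1, 10, 100.0), (1, 10, 50.0), (2, 20, 75.0);
""")

# No common key: every region is paired with every category.
cross_rows = conn.execute("""
    SELECT r.region_name, c.category_name
    FROM lu_region r CROSS JOIN lu_category c
""").fetchall()

# The fact table supplies the path, so only related pairs come back.
joined_rows = conn.execute("""
    SELECT r.region_name, c.category_name, SUM(f.revenue)
    FROM fact_sales f
    JOIN lu_region r   ON r.region_id = f.region_id
    JOIN lu_category c ON c.category_id = f.category_id
    GROUP BY r.region_name, c.category_name
    ORDER BY r.region_name, c.category_name
""").fetchall()

print(len(cross_rows), "cross join rows;", len(joined_rows), "rows via the fact table")
```

With three regions and two categories the cross join returns six rows, while the join through the fact table returns only the two combinations that actually have sales.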
Options 1 and 2 allow the report template to remain attributes only, whereas Option 3 places a metric on the report. Option 3 may not be desired if a metric should not appear on the report. Keep in mind that other techniques can also be employed to keep the metric on the report but format it to be hidden from display.
More common scenarios include cross joins between a fact table and a lookup table, and are typically surprising to a developer. These situations can be a bit more tricky to troubleshoot and resolve, but here are a few techniques that can be employed to try to resolve the issue:
- In SQL View look at where the cross join appears, and between which tables the cross join appears
- Open up those tables in the Table Editor by navigating to the Schema Objects\Tables folder, and double-clicking the tables
- Select the Logical View Tab of both tables to see all the logical objects mapped to the table
- Take note of which attributes have a key icon beside them
- These key attributes denote attributes at the lowest level of their hierarchy presently mapped to the table and/or attributes that are in their own hierarchy (meaning they have no parents or children)
- The SQL Engine will join 2 tables on common key attributes only, so if none of the key attributes on either table exist on both tables, then a cross join should appear
This means that just because a Region attribute exists on Table_A and a Region attribute exists on Table_B does not necessarily mean that the SQL Engine will join on Region. If Region has its child attribute on the table, then that attribute should be the key as it is the lowest level attribute of its particular hierarchy mapped to the table. If Region exists on both tables, and is also a key attribute on both tables, then an inner join should take place on Region.
This essentially means that one can find a cross join, investigate the tables in which it appears, and verify if at least 1 common key attribute exists between the tables. If not, then that should be the first path to investigate because a cross join is correct in that scenario.
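That check can be expressed as a simple set intersection. The Python sketch below is only a simplified model of the rule described above, not the SQL Engine's actual logic. The function `join_type` and the example key sets are hypothetical, standing in for the key attributes you would note from the Logical View tab of each table.

```python
def join_type(keys_a, keys_b):
    """Classify the join the rule above implies from each table's key attributes."""
    common = set(keys_a) & set(keys_b)
    return ("inner join", sorted(common)) if common else ("cross join", [])


# Hypothetical logical tables. Region is mapped to Table_B as well, but its
# child attribute (Call Center) is the key there, so Region is not a common key.
table_a_keys = {"Region"}
table_b_keys = {"Call Center", "Day"}
table_c_keys = {"Region", "Day"}

a_vs_b = join_type(table_a_keys, table_b_keys)
a_vs_c = join_type(table_a_keys, table_c_keys)
print(a_vs_b, a_vs_c)
```

In this model Table_A and Table_B produce a cross join even though both contain Region, while Table_A and Table_C inner join on Region because it is a key attribute on both.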
You can find a detailed video on how this issue is reproduced and resolved here: Tech Note 71019, Steps to reproduce and resolve.
MicroStrategy Technical Support can assist with resolving cross joins in a specific report; however, caution should be taken when resolving such issues. In some scenarios, the cross join is resolved through modifications to the schema objects, which can have a ripple effect on all other reports in an environment. For example, if a relationship is changed in the Region attribute to resolve a cross join in one report, this change will be reflected in all other reports that use Region, and potentially the hierarchy to which Region belongs. As a result, the SQL View of one report will have the cross join resolved, but the SQL may have changed in other reports using Region or its related attributes. This may or may not be desired. MicroStrategy Technical Support may not be able to fully understand the impact of such a schema change on the data model, so before a change is made to the data model, the developer should fully understand its consequences, and any changes made to the schema should be recorded.
 Jaime Perez, TN47356: How to troubleshoot cross joins in SQL Reports for the SQL Generation Engine 9.x, MicroStrategy Community, 06/24/2014, http://community.microstrategy.com/t5/Architect/TN47356-How-to-troubleshoot-cross-joins-in-SQL-Reports-for-the/ta-p/196989.
 MicroStrategy Knowledgebase, Tech Note 71019, Steps to reproduce and resolve,