This article explains how to make averages work with Mondrian aggregated tables. To understand this article, basic knowledge on Mondrian aggregate tables is required.
Average Rollup Types for Aggregate Tables
“Mondrian can compute averages just fine from a SUM
and FACT_COUNT
column - we never need AVG
columns at all. Calculating SUM
from AVG
seems backwards to me, and may result in rounding errors.” (Julian Hyde, 2007)
While this statement still holds true for a lot of use cases, over the years there have been refinements to the original approach.
Following methods / rollup types are available:
Average From Sum
AvgFromSum
: Mondrian will use this formula as part of the SQL query:
SUM(measure_sum) / SUM(fact_count)
To populate the aggregate table, we can use a SQL query similar to this one:
SELECT
...
, SUM(`measure`) AS `measure_sum`,
, COUNT(*) AS fact_count
FROM ..
Average From Average
AvgFromAvg
: Mondrian will use this formula as part of the SQL query:
SUM('measure_avg' * 'fact_count') / SUM('fact_count')
Sum From Average
SumFromAvg
: Mondrian will use this formula as part of the SQL query:
SUM(measure_sum * fact_count)
The DIY Approach
You could always also create your own aggregated average by defining a calculated measure and defining your formula of choice, e.g. dividing the sum of the measure by a non-null count of the same measure.
Aggregate Measure Fact Count and RollupType
Version 7.1 of the Pentaho Server introduced a new feature to support averages in explicitly defined aggregate tables.
Pentaho Server v7.1 ships with a new Mondrian version (3.14.0.0-12) that supports a new feature called AggMeasureFactCount
, which enables you to specify an additional measure count column per measure in the aggregate table definition. The count represents the count of non-null values. Moreover, the attribute rollupType
enables you to specify the rollup type.
AggMeasureFactCount
: The number of non-null fact table rows associated with a specific fact column. Mondrian can use this information when calculating averages from aggregate table values.
rollupType
attribute forAggMeasure
: Enables you to explicitly define the rollup type. Available types:AvgFromSum
,AvgFromAvg
,SumFromAvg
.
Example:
...
<AggName name="agg_fact_sales_by_department">
<AggFactCount column="fact_count"/>
<AggMeasureFactCount column="total_sales_fact_count" factColumn="total_sales" />
<AggMeasure column="total_sales" name="[Measures].[Total Sales]" rollupType="AvgFromSum"></AggMeasure>
...
So let’s go through the options:
- If your measures don’t have any
NULL
values, you might just want to use thefactCount
together with summed up measures. - If your measures do have
NULL
values, specify additionally for each measure aAggMeasureFactCount
, storing the non-null count for each measure.
Important:
AggMeasureFactCount
has to be specified after theAggFactCount
and before theAggForeignKey
element in the Mondrian Schema.
Solutions
Prior to Pentaho Server v7.1
DIY solution: Works in all scenarios
Gain total control by defining a calculated measure for averages: But this is also the most complex solution - so only use it if you absolutely have no other way.
Imagine you have a fact table with a total_sales
column.
When defining the aggregate table explicitly in the Mondrian Schema:
- To the fact table add:
total_sales_fact_count
- To the cube add an invisible measure
[Total Sales Fact Count]
with the aggregation typeSUM
. Then add a calculated member[Avg Sales]
with the formula[Measures].[Total Sales] / [Measures].[Total Sales Fact Count]
. - The aggregation table remains the same, so add:
total_sales
total_sales_fact_count
- The SQL to populate the aggregate table should sum up
total_sales
andtotal_sales_fact_count
(using the these columns from the fact table). - In the cube, add the aggregate table definition. Define both
total_sales
andtotal_sales_fact_count
asAggMeasure
.
Non Explicitly Defined Aggregate Tables
If you are just after the simple average, you can just skip defining the aggregate table explicitly in the Mondrian Schema and create a table following Mondrian’s naming conventions, something like this:
CREATE TABLE test.agg_time_department_fact_sales AS
SELECT
time_tk
, department_tk
, SUM(total_sales) AS total_sales
, COUNT(*) AS fact_count
FROM test.fact_sales
GROUP BY 1,2
;
And when we query the cube, Mondrian will create a query similar to this one:
select department_name as c0, sum(total_sales) / sum(fact_count) as m0 from dim_x, agg_time_department_fact_sales where department_tk = department_tk and department_name = 'a' group by department_name
As of Pentaho Server v7.1: A full example
You define the relevant measures with aggregation type average:
...
<Measure name="Total Sales" column="total_sales" visible="true" aggregator="avg" formatString="#,##0; -#,##0"/>
<Measure name="Total Quantity" column="total_quantity" visible="true" aggregator="avg" formatString="#,##0; -#,##0"/>
...
The aggregate table definition looks like this (note the addition on measure fact count columns):
...
<AggName name="agg_fact_sales_by_department">
<AggFactCount column="fact_count"/>
<AggMeasureFactCount column="total_sales_fact_count" factColumn="total_sales" />
<AggMeasureFactCount column="total_quantity_fact_count" factColumn="total_quantity" />
<AggForeignKey factColumn="department_id" aggColumn="department_id" />
<AggMeasure column="total_sales" name="[Measures].[Total Sales]" rollupType="AvgFromSum"></AggMeasure>
<AggMeasure column="total_quantity" name="[Measures].[Total Quantity]" rollupType="AvgFromSum"></AggMeasure>
...
Note: In the example above we expect the data to have
NULL
values for the measures, so we make use ofAggMeasureFactCount
and create a dedicated column to hold the non-null count for each relevant measure. We also set to therollupType
explicitly toAvgFromSum
.
Populating the aggregate table:
INSERT INTO test.agg_fact_sales_by_department
SELECT
department_id
, COUNT(total_sales) AS total_sales_fact_count
, COUNT(total_quantity) AS total_quantity_fact_count
, SUM(total_sales) AS total_sales
, SUM(total_quantity) AS total_quantity
, COUNT(*) AS fact_count
FROM test.fact_sales
GROUP BY 1
;
When you query the cube, the pentaho.log
or mondrian_sql.log
should show something like this for the average then:
SUM(total_sales) / SUM(total_sales_fact_count)
A special note when using AvgFromAvg
. The SQL will look like this:
2017-09-30 22:39:36,515 DEBUG [mondrian.rolap.RolapUtil] Segment.load: done executing sql [select sum("agg_fact_sales_by_department"."total_sales" * "agg_fact_sales_by_department"."total_sales_fact_count") / sum("agg_fact_sales_by_department"."total_sales_fact_count") as "m0" from "test"."agg_fact_sales_by_department" as "agg_fact_sales_by_department"], exec+fetch 35 ms, 1 rows, ex=11, close=11, open=[]
I just want to highlight again how the AvgFromAvg
is calculated by Mondrian internally:
SELECT
SUM(total_sales * total_sales_fact_count) / SUM(total_sales_fact_count)
FROM agg_fact_sales_by_department
The important thing here is that total_sales
and total_sales_fact_count
get first multiplied on a row level and then summed up, the result of which then gets divided by the sum of total_sales_fact_count
. Let’s illustrate this with an example:
total_sales | total_sales_fact_count | row level multiplication | final average result |
---|---|---|---|
2000 | 100 | 200000 | |
3000 | 150 | 450000 | |
Total | 250 | 650000 | 2600 |
So when would you want to use AvgFromAvg
? If you want to show the average based on the currently selected dimensions as opposed to the average based on the lowest granularity.
Related Jira Cases
- Incorrect avg calculation in aggregates
- Wrong results when aggregate tables used to calculate avg value which rolls up one or more NULLs
Problem fixed in 7.1: Github Merge Request
Some Background Info
This Jira case from 2007 provides us quite some interesting information: Apart from telling us that Mondrian can handle aggregate columns differently if they follow the naming pattern outlined in the Mondrian documentation (e.g. measure_sum
, measure_avg
, you can use this approach when you don’t want to explicitly define the aggregate in the Mondrian schema), we also learn that there are at least three aggregation methods for averages:
“Mondrian can compute averages just fine from a SUM
and FACT_COUNT
column - we never need AVG columns at all. Calculating SUM from AVG seems backwards to me, and may result in rounding errors.” (Julian Hyde)
Methods:
AvgFromSum
:sum(measure_sum) / sum(fact_count)
AvgFromAvg
, which usessum('measure_sum' * 'fact_count') / sum('fact_count')
SumFromAvg
:sum(measure_sum * fact_count)
A problem highlighted further down this Jira case is the handling of NULL
values:
”” Similar issue if some of the measures are NULL
avg(1,2,null,3,4) = 2.5
but
sum(1,2,null,3,4) / sum(1,1,1,1,1) = 2.0
As fact_count
will still be 1 (or other positive value) if other measures in data table row is not null.
“”
MONDRIAN-2398 goes a bit more into detail on this subject:
”” When calculating an average from a summed rollup in the aggregate table, Mondrian will compute
sum(factvalue) / fact_count
I.e. the summed fact value over the count of the fact rows in that rollup. Unfortunately if any of the individual fact values is equal to NULL this calculation will differ from a SQL avg, which excludes NULLs.
When this was discussed on the Mondrian dev group, the suggestion for a fix was to include new fact_count
columns in the aggregate table that are specific to the measure, such that we have a count of non-NULLs to use for the avg calc.
“”
No way to use an AvgFromSum calculation with explicitly defined aggregate tables: Now this one is quite interesting …
””
Using an explicitly defined aggregate table which includes an average measure (i.e. defined in the Mondrian schema), there is no way to get the calculation to use AvgFromSum
. It will attempt to rollup with AvgFromAvg
, which assumes the rolled up avg value is captured in the aggregate table.That method can result in rounding errors and is not always desired.
Ideally there would be some way to specify the exact rollup method used when averages are involved: AvgFromSum
, AvgFromAvg
, SumFromAvg
“”
Note: The problem seems to only ever have existed with explicitly defined aggregate tables.