Mondrian: Averages in Aggregate Tables

01 Oct 2017

This article explains how to make averages work with Mondrian aggregated tables. To understand this article, basic knowledge on Mondrian aggregate tables is required.

## Average Rollup Types for Aggregate Tables

“Mondrian can compute averages just fine from a `SUM` and `FACT_COUNT` column - we never need `AVG` columns at all. Calculating `SUM` from `AVG` seems backwards to me, and may result in rounding errors.” (Julian Hyde, 2007)

While this statement still holds true for a lot of use cases, over the years there have been refinements to the original approach.

Following methods / rollup types are available:

### Average From Sum

`AvgFromSum`: Mondrian will use this formula as part of the SQL query:

``````SUM(measure_sum) / SUM(fact_count)
``````

To populate the aggregate table, we can use a SQL query similar to this one:

``````SELECT
...
, SUM(`measure`) AS `measure_sum`,
, COUNT(*) AS fact_count
FROM ..
``````

### Average From Average

`AvgFromAvg`: Mondrian will use this formula as part of the SQL query:

``````SUM('measure_avg' * 'fact_count') / SUM('fact_count')
``````

### Sum From Average

`SumFromAvg`: Mondrian will use this formula as part of the SQL query:

``````SUM(measure_sum * fact_count)
``````

### The DIY Approach

You could always also create your own aggregated average by defining a calculated measure and defining your formula of choice, e.g. dividing the sum of the measure by a non-null count of the same measure.

## Aggregate Measure Fact Count and RollupType

Version 7.1 of the Pentaho Server introduced a new feature to support averages in explicitly defined aggregate tables.

Pentaho Server v7.1 ships with a new Mondrian version (3.14.0.0-12) that supports a new feature called `AggMeasureFactCount`, which enables you to specify an additional measure count column per measure in the aggregate table definition. The count represents the count of non-null values. Moreover, the attribute `rollupType` enables you to specify the rollup type.

`AggMeasureFactCount`: The number of non-null fact table rows associated with a specific fact column. Mondrian can use this information when calculating averages from aggregate table values.

`rollupType` attribute for `AggMeasure`: Enables you to explicitly define the rollup type. Available types: `AvgFromSum`, `AvgFromAvg`, `SumFromAvg`.

Example:

``````...
<AggName name="agg_fact_sales_by_department">
<AggFactCount column="fact_count"/>
<AggMeasureFactCount column="total_sales_fact_count" factColumn="total_sales" />
<AggMeasure column="total_sales" name="[Measures].[Total Sales]" rollupType="AvgFromSum"></AggMeasure>
...
``````

So let’s go through the options:

• If your measures don’t have any `NULL` values, you might just want to use the `factCount` together with summed up measures.
• If your measures do have `NULL` values, specify additionally for each measure a `AggMeasureFactCount`, storing the non-null count for each measure.

Important: `AggMeasureFactCount` has to be specified after the `AggFactCount` and before the `AggForeignKey` element in the Mondrian Schema.

## Solutions

### Prior to Pentaho Server v7.1

#### DIY solution: Works in all scenarios

Gain total control by defining a calculated measure for averages: But this is also the most complex solution - so only use it if you absolutely have no other way.

Imagine you have a fact table with a `total_sales` column.

When defining the aggregate table explicitly in the Mondrian Schema:

1. To the fact table add: `total_sales_fact_count`
2. To the cube add an invisible measure `[Total Sales Fact Count]` with the aggregation type `SUM`. Then add a calculated member `[Avg Sales]` with the formula `[Measures].[Total Sales] / [Measures].[Total Sales Fact Count]`.
3. The aggregation table remains the same, so add:
• `total_sales`
• `total_sales_fact_count`
1. The SQL to populate the aggregate table should sum up `total_sales` and `total_sales_fact_count` (using the these columns from the fact table).
2. In the cube, add the aggregate table definition. Define both `total_sales` and `total_sales_fact_count` as `AggMeasure`.

#### Non Explicitly Defined Aggregate Tables

If you are just after the simple average, you can just skip defining the aggregate table explicitly in the Mondrian Schema and create a table following Mondrian’s naming conventions, something like this:

``````CREATE TABLE test.agg_time_department_fact_sales AS
SELECT
time_tk
, department_tk
, SUM(total_sales) AS total_sales
, COUNT(*) AS fact_count
FROM test.fact_sales
GROUP BY 1,2
;
``````

And when we query the cube, Mondrian will create a query similar to this one:

``````select department_name as c0, sum(total_sales) / sum(fact_count) as m0 from dim_x, agg_time_department_fact_sales where department_tk = department_tk and department_name = 'a' group by department_name
``````

### As of Pentaho Server v7.1: A full example

You define the relevant measures with aggregation type average:

``````...
<Measure name="Total Sales" column="total_sales" visible="true" aggregator="avg" formatString="#,##0; -#,##0"/>
<Measure name="Total Quantity" column="total_quantity" visible="true" aggregator="avg" formatString="#,##0; -#,##0"/>
...
``````

The aggregate table definition looks like this (note the addition on measure fact count columns):

``````...
<AggName name="agg_fact_sales_by_department">
<AggFactCount column="fact_count"/>
<AggMeasureFactCount column="total_sales_fact_count" factColumn="total_sales" />
<AggMeasureFactCount column="total_quantity_fact_count" factColumn="total_quantity" />
<AggForeignKey factColumn="department_id" aggColumn="department_id" />
<AggMeasure column="total_sales" name="[Measures].[Total Sales]" rollupType="AvgFromSum"></AggMeasure>
<AggMeasure column="total_quantity" name="[Measures].[Total Quantity]" rollupType="AvgFromSum"></AggMeasure>
...
``````

Note: In the example above we expect the data to have `NULL` values for the measures, so we make use of `AggMeasureFactCount` and create a dedicated column to hold the non-null count for each relevant measure. We also set to the `rollupType` explicitly to `AvgFromSum`.

Populating the aggregate table:

``````INSERT INTO test.agg_fact_sales_by_department
SELECT
department_id
, COUNT(total_sales) AS total_sales_fact_count
, COUNT(total_quantity) AS total_quantity_fact_count
, SUM(total_sales) AS total_sales
, SUM(total_quantity) AS total_quantity
, COUNT(*) AS fact_count
FROM test.fact_sales
GROUP BY 1
;
``````

When you query the cube, the `pentaho.log` or `mondrian_sql.log` should show something like this for the average then:

``````SUM(total_sales) / SUM(total_sales_fact_count)
``````

A special note when using `AvgFromAvg`. The SQL will look like this:

``````2017-09-30 22:39:36,515 DEBUG [mondrian.rolap.RolapUtil] Segment.load: done executing sql [select sum("agg_fact_sales_by_department"."total_sales" * "agg_fact_sales_by_department"."total_sales_fact_count") / sum("agg_fact_sales_by_department"."total_sales_fact_count") as "m0" from "test"."agg_fact_sales_by_department" as "agg_fact_sales_by_department"], exec+fetch 35 ms, 1 rows, ex=11, close=11, open=[]
``````

I just want to highlight again how the `AvgFromAvg` is calculated by Mondrian internally:

``````SELECT
SUM(total_sales * total_sales_fact_count) / SUM(total_sales_fact_count)
FROM agg_fact_sales_by_department
``````

The important thing here is that `total_sales` and `total_sales_fact_count` get first multiplied on a row level and then summed up, the result of which then gets divided by the sum of `total_sales_fact_count`. Let’s illustrate this with an example:

total_sales total_sales_fact_count row level multiplication final average result
2000 100 200000
3000 150 450000
Total 250 650000 2600

So when would you want to use `AvgFromAvg`? If you want to show the average based on the currently selected dimensions as opposed to the average based on the lowest granularity.

Problem fixed in 7.1: Github Merge Request

## Some Background Info

This Jira case from 2007 provides us quite some interesting information: Apart from telling us that Mondrian can handle aggregate columns differently if they follow the naming pattern outlined in the Mondrian documentation (e.g. `measure_sum`, `measure_avg`, you can use this approach when you don’t want to explicitly define the aggregate in the Mondrian schema), we also learn that there are at least three aggregation methods for averages:

“Mondrian can compute averages just fine from a `SUM` and `FACT_COUNT` column - we never need AVG columns at all. Calculating SUM from AVG seems backwards to me, and may result in rounding errors.” (Julian Hyde)

Methods:

• `AvgFromSum`: `sum(measure_sum) / sum(fact_count)`
• `AvgFromAvg`, which uses `sum('measure_sum' * 'fact_count') / sum('fact_count')`
• `SumFromAvg`: `sum(measure_sum * fact_count)`

A problem highlighted further down this Jira case is the handling of `NULL` values:

”” Similar issue if some of the measures are NULL

``````avg(1,2,null,3,4) = 2.5
``````

but

``````sum(1,2,null,3,4) / sum(1,1,1,1,1) = 2.0
``````

As `fact_count` will still be 1 (or other positive value) if other measures in data table row is not null. “”

MONDRIAN-2398 goes a bit more into detail on this subject:

”” When calculating an average from a summed rollup in the aggregate table, Mondrian will compute

``````sum(factvalue) / fact_count
``````

I.e. the summed fact value over the count of the fact rows in that rollup. Unfortunately if any of the individual fact values is equal to NULL this calculation will differ from a SQL avg, which excludes NULLs.

When this was discussed on the Mondrian dev group, the suggestion for a fix was to include new `fact_count` columns in the aggregate table that are specific to the measure, such that we have a count of non-NULLs to use for the avg calc. “”

No way to use an AvgFromSum calculation with explicitly defined aggregate tables: Now this one is quite interesting …

”” Using an explicitly defined aggregate table which includes an average measure (i.e. defined in the Mondrian schema), there is no way to get the calculation to use `AvgFromSum`. It will attempt to rollup with `AvgFromAvg`, which assumes the rolled up avg value is captured in the aggregate table.That method can result in rounding errors and is not always desired.

Ideally there would be some way to specify the exact rollup method used when averages are involved: `AvgFromSum`, `AvgFromAvg`, `SumFromAvg` “”

Note: The problem seems to only ever have existed with explicitly defined aggregate tables.