Every now and then I am thinking about the snapshot strategy I described about a year ago. I am still not quite happy about it. It’s a huge new data generating monster (well it actually creates a massive amount of output data based on the relatively compact input dataset). It sometimes puzzles me that OLAP tools don’t have a better way to handle snapshots. Or maybe there is a better strategy to do this?!
Just to recap: If you are asked to provide a daily snapshot, your ETL has to generate a row for every day for every little thing that you are trying to keep track of. This is kind of the consequence of providing this data in a cube: You have to have a (fact/snapshot) row available for a particular day in order to join it to any other dimension.
The problem of having to produce these artificial records is not going away for now (unless there is better support for this by OLAP tools), so we have to find better ways to generate the snapshot. However, every now and then I ask myself: What if we have to generate a snapshot on a by second grain (as opposed daily), how will we cope with generating so much data? There must be a better approach to do this. This is certainly an extreme situation, however, it pronounces the problem we have with current approach.
So how could we possibly make things a bit easier? What if we tried to implement everything in SQL, so that we can i.e. run it easily on Hive? The main challenge here is actually how to create the artificial clones?
For this example we assume that we are tracking the status of cases. Say we have a case which got
2016-02-01 and changed its status to
Coming back to our example: To create a daily snapshot, we have to clone the record from
2016-02-01 3 times: For
2016-02-04. The key here is to treat this data as a Slowly Changing Dimension Type 2, so assign a
valid_to to each event. We store this data in a table called
dim_events. It’s not really necessary to make this table a fully qualified dimension - the idea is just to treat the data in a similar fashion.
Showing state changes: If you want to show state changes in your reports, e.g. The case moving from status
closed, you can introduce a
prev_statuscolumn for this table. Why
next_status? This is a design decision and might vary on your use case: The advantage of
prev_statusis that it makes to record immutable, which is ideal for Hadoop. However, if you already have a process in place to maintain a
valid_tofor the same table, then you might as well instead maintain a
So what about creating the clones then for the actual snapshot table? Actually, this can be very easily achieved by joining to the Date Dimension. We just have to use the
valid_to from our
dim_events table and retrieve the respective days from the
dim_date table using
WHERE clause. Based on the example above, we have one record from
dim_event being joined to 4 records from
dim_date, so in the end we will get 4 records returned. We enrich the result by adding a constant of
1 for aggregation purposes (not strictly required).
I prototyped the solution quickly with PostgreSQL:
DROP SCHEMA IF EXISTS snapshot_example CASCADE; CREATE SCHEMA snapshot_example; CREATE TABLE snapshot_example.dim_date AS WITH RECURSIVE date_generator(date) AS ( VALUES (DATE '2016-01-01') UNION ALL SELECT date+1 FROM date_generator WHERE date < DATE '2017-01-01' ) SELECT CAST(TO_CHAR(date,'YYYYMMDD') AS INT) AS date_tk , date AS the_date , EXTRACT(DAY FROM date) AS the_day , EXTRACT(MONTH FROM date) AS the_month , EXTRACT(QUARTER FROM date) AS the_quarter , EXTRACT(YEAR FROM date) AS the_year FROM date_generator ; SELECT * FROM snapshot_example.dim_date LIMIT 20; -- treat events as slowly changing dimension CREATE TABLE snapshot_example.dim_events ( event_tk BIGINT , case_id BIGINT , valid_from INT , valid_to INT , event_status VARCHAR(20) ) ; INSERT INTO snapshot_example.dim_events VALUES (1,2321, 20160201, 20160205, 'open') , (2,2321, 20160205, 20160207, 'assigned') , (3,2321, 20160207, 20160208, 'closed') ; -- create snapshot by exploding the events CREATE TABLE snapshot_example.snapshot AS SELECT case_id , event_tk , date_tk AS snapshot_date_tk , valid_from , valid_to , event_status , 1 AS quantity FROM snapshot_example.dim_events f INNER JOIN snapshot_example.dim_date d ON d.date_tk >= f.valid_from AND d.date_tk < f.valid_to ; SELECT * FROM snapshot_example.snapshot LIMIT 20;
An the result looks like this (
valid_to are only part of the result to facilitate QA and should be removed thereafter):
Keep in mind that with a snapshot you only look at one day at a time, we do not aggregate the final figures across days! The main question that this output answers is the following: “How many cases were in status X on day Y”?
Note: If for some reason you also have to count the days and/or working days that a certain case stayed in a given status, you can just simply add a binary flag (using an
1) to the date dimension and enrich the join of
dim_datewith these flags, which can then just be aggregated in some form (so perform some additional aggregation across days upfront for the final output … you could use a windowed function in example):
SUM(is_working_day) OVER (PARTITION BY ... ORDER BY ...)
The final join helps us to materialise the view which the OLAP/Cube tool requires.
Note: When implementing the solution on HDFS there is currently no way to update records, so keeping the
valid_toupdated causes an issue. You could opt for Kudu, which allows updates. However, there is still a way to achieve this on Hadoop: 1) Either you keep the
valid_fromonly and generate a materialised view using a windowing function which provides the
valid_toas well. This works for smaller tables only. 2) Or you use the same approach as in 1), but store any properly closed records in a dedicated partition (let’s call the partition
closed) and the other records (the ones with a
valid_toin the future) in another dedicated partition (let’s call it
open). If you get an update to an open record, you move the old record with the “updated”
static/closedpartition and insert the new record into the
openpartition. This way you only have to run an “update” against a smaller batch of data.
The output of the query above might be in some cases a bit too granular, but performing an aggregation of top of this is very easy.
Pentaho Metadata Layer
Just as a side node, if you were ever to implement a similar model with the good old Pentaho Metadata Editor, you could define the two tables in the model and a complex join between them which could look something like this:
AND(["BT_DIM_DATE_DIM_DATE.BC_DIM_DATE_DATE_TK"] >= ["BT_DIM_EVENTS_DIM_EVENTS.BC_DIM_EVENTS_VALID_FROM"] ; ["BT_DIM_DATE_DIM_DATE.BC_DIM_DATE_DATE_TK"] < ["BT_DIM_EVENTS_DIM_EVENTS.BC_DIM_EVENTS_VALID_TO"])
This replicates the same logic we applied in the very last logic above.
A screenshot of this complex join is shown below:
This approach puts a lot of weight onto the database, so if you are dealing with a lot of data, it might be better to materialise the result of this join upfront.
Let’s extend our sample data a bit (complete data set shown, not just additional records):
INSERT INTO snapshot_example.dim_events VALUES (1,2321, 20160201, 20160205, 'open') , (2,2321, 20160205, 20160207, 'assigned') , (3,2321, 20160207, 20160208, 'closed') , (4,2322, 20160202, 20160204, 'open') , (5,2322, 20160204, 20160207, 'assigned') , (6,2322, 20160208, 20160208, 'closed') ;
If we were just interested in understanding, how many cases/events there were within a certain period and event status, then we can add a new field in the metadata model at the table level called
Count Events and define as Aggregation Type
Count and as Formula
Alternatively we could also just add a counter to our base table directly.
We can test if we generate the correct figures by using the built-in query builder:
And the result looks like this:
Which is in line with our expectations.
The sample files are available here.