Database Version Management (DVM) Process Powered By Pentaho Data Integration
When creating a data integration process, we quite often have to create some database tables as well to store intermediate or final data. We want to easily deploy the data integration process (including the tables) into a new environment (test, production etc.) and also apply any new changes in an easy fashion once the project is live.
When it comes to table definitions, there are a few open source database version management tools out there which get the job done; however, quite a lot of them only support specific databases. This is a bit of a problem if you have to stay flexible, as with every new project you might be dealing with a different database. I have covered a few of these tools in the past, but let’s see if we can come up with something else:
So what do I want this tool to primarily do:
- Run DDL statements in a given order: Usually you start off with various CREATE TABLE statements, but throughout the project you might also have to apply changes, e.g. ALTER TABLE, DROP TABLE etc.
- Keep a log of the executed SQL scripts so that only new scripts are run on next execution.
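Before building anything, here is the core idea boiled down to a few lines of bash. This is purely a conceptual sketch (assuming a local PostgreSQL target and the ddl folder and processed.csv log used later in this post); the actual solution below is implemented entirely with PDI:
#!/bin/bash
# Conceptual sketch only - the real process is built as PDI jobs/transformations.
# Assumption: DDL files live in ./ddl and are prefixed with an incremental number,
# e.g. 0_create_dma_schema.sql, 1_create_dim_date.sql.
DDL_DIR=ddl
LOG=processed/processed.csv
mkdir -p processed
touch "$LOG"
# sort numerically on the prefix before the first underscore,
# so that 10_... runs after 2_... (a plain lexical sort would not)
for f in $(ls "$DDL_DIR" | sort -t_ -k1,1n); do
  # skip scripts that are already recorded in the DVM log
  grep -qxF "$f" "$LOG" && continue
  # execute the DDL script and stop at the first failure
  psql -U postgres -d test -v ON_ERROR_STOP=1 -f "$DDL_DIR/$f" || exit 1
  # only successfully executed scripts make it into the log
  echo "$f" >> "$LOG"
done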
Well, this isn’t actually that complicated, is it? Our good old trusted Pentaho Kettle (aka PDI, Pentaho Data Integration) comes to mind to handle this challenge. We can use any JDBC driver with it, which basically allows us to connect to any database out there! And there is another big benefit straight away: It is the same tool that we use to create our data integration processes, so there is nothing new to be learnt! So how do we create this process?
I will briefly sketch the process out here (without going into much detail). You can download the files from here to understand the process even better.
jb_dvm_master.kjb:
The master job jb_dvm_master.kjb creates the DVM log file (which will store a list of successfully executed DDL SQL files) and starts the main transformation tr_dvm_execute_new_ddl_scripts:
This transformation checks a given folder for SQL DDL files and compares them to the already processed files from the DVM log. Only the new files remain in the stream, and the list of filenames gets sorted by the incremental file number (version). The process writes the names of all these new files to the log and then loops over this list using the Job Executor step, which in turn executes jb_dvm_execute_new_ddl_script:
For each new DDL SQL file the jb_dvm_execute_new_ddl_script job executes the SQL script and, upon successful execution, writes the filename to the DVM log via the tr_dvm_write_to_log transformation:
Now that we have implemented most features, it is time to test the process. Let’s fire up our local PostgreSQL server and populate the shell file run_jb_dvm_master.sh with the required parameters:
cd /Applications/Development/pdi-ce-6.0
sh ./kitchen.sh -file=/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/database-versioning-tool/jb_dvm_master.kjb \
-param:VAR_DB_DVM_JDBC_DRIVER_CLASS_NAME=org.postgresql.Driver \
-param:VAR_DB_DVM_JDBC_URL=jdbc:postgresql://localhost:5432/test \
-param:VAR_DB_DVM_PW=postgres \
-param:VAR_DB_DVM_USER=postgres \
-param:VAR_DDL_FILES_DIR=/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/database-versioning-tool/ddl \
-param:VAR_PROCESSED_DDL_FILES_LOG=/Users/diethardsteiner/Dropbox/Pentaho/Examples/PDI/database-versioning-tool/processed/processed.csv \
> /tmp/jb_dvm_master.err.log
Let’s create a few simple DDL SQL files:
0_create_dma_schema.sql:
CREATE SCHEMA dma;
1_create_dim_date.sql:
CREATE TABLE dma.dim_date (
date_tk INT
, the_year INT
, the_month INT
, the_date DATE
)
;
And then let’s run it:
$ chmod 700 run_jb_dvm_master.sh
$ ./run_jb_dvm_master.sh
Let’s follow the log:
$ tail -fn 200 /tmp/jb_dvm_master.err.log
Once the process has finished successfully, we should have this in the DVM log:
$ cat processed/processed.csv
0_create_dma_schema.sql
1_create_dim_date.sql
This shows you the list of SQL DDL scripts which have been applied successfully to the specified database.
Ok, you might still be doubtful, so let’s check what psql has to tell us:
$ psql -Upostgres -dtest
psql (9.4.1)
Type "help" for help.
test=# \dt dma.*
        List of relations
 Schema |   Name   | Type  |  Owner
--------+----------+-------+----------
 dma    | dim_date | table | postgres
(1 row)
test=# \dt dma.dim_date
        List of relations
 Schema |   Name   | Type  |  Owner
--------+----------+-------+----------
 dma    | dim_date | table | postgres
(1 row)
test=# \d dma.dim_date
 Table "dma.dim_date"
   Column  |  Type   | Modifiers
-----------+---------+-----------
 date_tk   | integer |
 the_year  | integer |
 the_month | integer |
 the_date  | date    |
Ok, this looks pretty good. Now let’s assume we want to add a new column to the existing table. We create a new SQL file and prefix the name with the sequence number:
2_dim_date_add_quarter.sql:
ALTER TABLE dma.dim_date ADD COLUMN the_quarter INT;
Let’s run the process again:
$ ./run_jb_dvm_master.sh
$ cat processed/processed.csv
0_create_dma_schema.sql
1_create_dim_date.sql
2_dim_date_add_quarter.sql
Note: Our process only executed the new file 2_dim_date_add_quarter.sql and not all of the files.
This simple test demonstrates quite nicely that our process is indeed working as intended. As highlighted before, we can run this process for any database which offers a JDBC driver. We only have to adjust the connection details and add the JDBC driver to PDI’s lib folder.
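To illustrate, switching the process to, say, MySQL would roughly boil down to the following (the driver jar name and connection details are just placeholders for whatever your own setup looks like):
# copy the MySQL JDBC driver (Connector/J) into PDI's lib folder
# (the jar name below is only an example - use the version you downloaded)
cp ~/Downloads/mysql-connector-java-5.1.38-bin.jar /Applications/Development/pdi-ce-6.0/lib/
# then adjust the connection parameters of the master job accordingly, e.g.:
# -param:VAR_DB_DVM_JDBC_DRIVER_CLASS_NAME=com.mysql.jdbc.Driver
# -param:VAR_DB_DVM_JDBC_URL=jdbc:mysql://localhost:3306/test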
When you deploy your project to a new environment, you can run this process upfront, so that all the required tables are in place for your data integration process. Should you need to apply changes over time, you can test them first in your test environment and then deploy them to production with minimal effort. And of course you store all your files in Git or a similar version control system, so you can easily sync files between environments.
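Just as a sketch of one possible workflow (the repository is of course whichever one you use for your project), rolling out the change from above to production could look like this:
# on your workstation: version the new DDL script together with the PDI files
git add ddl/2_dim_date_add_quarter.sql
git commit -m "Add the_quarter column to dma.dim_date"
git push
# on the production server: pull the latest scripts and let the DVM process
# apply whatever has not been executed there yet
git pull
./run_jb_dvm_master.sh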