Slimming down Apache Hop

  28 Jan 2021


Pentaho Data Integration (PDI), the predecessor of Hop, had grown in size substantially over years, nearly amounting to 2GB in size. This is of course undesirable for a lightweight deployments, like using it for a AWS Lamba function or creating a super small container.

Luckily for us, things have quite changed over the last few years. About 2 years ago Matt Casters, Bart Maertens and a few others founded Project Hop based on the open source code of PDI and then started cleaning up and improving the code base. A big focus has been set on making each step/transform an actual plugin, so that in case you don’t really require a certain step/transform, you could just remove this plugin from the package. The great news is that recently Hop was accepted into the Apache Incubator programme.

Our first task will be to create a simple Hop pipeline and then to slim down the Hop package so that it only has the plugins and core functionality that we require to run this very pipeline.

Our simple pipeline will:

  • Read Excel file from S3
  • Convert it to CSV and write it back to S3

How to slim down Hop

Download Apache Hop from here.

Remove non required plugins

I will explain how to remove folders and files based on the example pipeline given above. You will have to adjust these steps for your setup (which should be fairly straight forward after reading this article).

From within the Hop install directory run the command shown below. This will provide you a good starting point to decide which plugins to remove:

% tree -L 2 plugins
plugins
├── actions
│   ├── abort
│   ├── addresultfilenames
│   ├── checkdbconnection
│   ├── checkfilelocked
│   ├── columnsexist
...
├── databases
│   ├── as400
│   ├── cache
│   ├── db2
│   ├── derby
│   ├── exasol4
...
├── engines
│   └── beam
├── misc
│   ├── debug
│   ├── git
│   ├── projects
│   └── testing
├── transforms
│   ├── abort
│   ├── addsequence
│   ├── analyticquery
...
└── vfs
    ├── azure
    ├── googledrive
    └── s3

225 directories, 0 files

Plugins:

  • actions: remove entire folder as we are not using any workflow actions
  • databases: remove all folders apart from postgresql
  • engines: remove entire folder
  • misc: remove everything apart from projects
  • transforms: delete everything apart from tableouput, textfile, writetolog
  • vfs: remove everything apart from S3

Remove not required core compenents

We don’t need the GUI, the server, …

rm -rf demo
rm -rf docs
rm hop-conf*
rm hop-encrypt*
rm hop-gui*
rm hop-run.bat
rm hop-server*
rm hop-translator*
rm -rf libswt
rm LICENSE.txt
rm -rf pwd
rm README.md
rm -rf static
rm translator.xml
rm -rf ui

lib folder: Again, here it might vary what you need depending on the workflows and pipelines you created. A primitive way of finding out what is needed is just to remove the libs one by one and making sure the code runs each time. In our case we can remove these libraries:

rm asm*
rm batik * #svg library
rm blueprints-core*
rm dom4j*
rm encoder*
rm enunciate-core-annotations-*
rm enunciate-jersey-rt-*
rm flexjson-*
rm hibernate*
rm http*
rm jackson-jaxrs-base-*
rm jackson-jaxrs-json-provider-*
rm janino-*
rm javax.servlet-api-*
rm jaxen*
rm jcommon*
rm jersey*
rm jetty*
rm jfreechart*
rm jsch*
rm json-simple*
rm jsr311-api-*
rm jug-lgpl-*
rm jzlib-*
rm mimepull-*
rm ognl-*
rm org.eclipse.*
rm rhino-*
rm slf4j-log4j*
rm snappy-java-*
rm spark-avro*
rm tyrus-standalone-client-*
rm webservices*
rm xercesImpl-2.12.0
rm xml*

So what we are left with is this:

apache-log4j-extras-1.2.17.jar
commons-beanutils-1.9.4.jar
commons-cli-1.2.jar
commons-codec-1.10.jar
commons-collections-3.2.2.jar
commons-compiler-3.0.8.jar
commons-compress-1.19.jar
commons-configuration-1.9.jar
commons-dbcp-1.4.jar
commons-io-2.2.jar
commons-lang-2.6.jar
commons-logging-1.1.3.jar
commons-math3-3.6.1.jar
commons-pool-1.5.7.jar
commons-validator-1.3.1.jar
commons-vfs2-2.4.1.jar
geronimo-servlet_3.0_spec-1.0.jar
gson-2.8.5.jar
guava-27.0-jre.jar
hop-core-0.40-SNAPSHOT.jar
hop-engine-0.40-SNAPSHOT.jar
hop-ui-swt-0.40-SNAPSHOT.jar
jackson-annotations-2.10.2.jar
jackson-core-2.10.2.jar
jackson-databind-2.10.2.jar
javassist-3.20.0-GA.jar
log4j-1.2.17.jar
picocli-3.7.0.jar
postgresql-42.2.13.jar
scannotation-1.0.2.jar
slf4j-api-1.7.7.jar

In my case, the Hop package is 23.8 MB in size now.

I created this Jira request suggesting for a new option to be added to the GUI which would check the dependencies of your project and build the smallest possible Hop package for you that you can then use for deployment.

comments powered by Disqus