This is a follow-up article to Hop on Docker. In this article we will explore how to create a simple Kubernetes Job for short-lived data processes and a Kubernetes Deployment to run an always-on server that can execute data processes on request. Our tool of choice for data processing in this case is Hop.
Short-lived processes: Kubernetes Job
Sources:
- “Running pods that perform a single completable task”, Kubernetes in Action.
- “Using a Git repository as a starting point for a volume”, Kubernetes in Action.
- Jobs - Run to Completion
Our main aim here is to only consume resources while the process is running, hence we are opting for a Kubernetes Job, which will start the container on an available node and remove it once the data processing is finished.
I assume you are somewhat familiar with Kubernetes, so I will only provide a brief overview of the job definition:
apiVersion: batch/v1
kind: Job
metadata:
  name: hop-job
spec:
  template:
    metadata:
      labels:
        app: hop-job
    spec:
      restartPolicy: OnFailure
      initContainers:
        - name: clone-git-repo
          image: alpine/git
          volumeMounts:
            - name: git-repo-volume
              mountPath: /tmp/git-repo
              readOnly: false
          command: ['sh', '-c',
            'cd /tmp/git-repo; git clone https://github.com/diethardsteiner/project-hop-in-the-cloud.git; mv project-hop-in-the-cloud/project-a/.hop .; chmod -R 777 .hop']
      containers:
        - name: hop
          image: diethardsteiner/project-hop:0.20-20200429.230019-56
          volumeMounts:
            - name: git-repo-volume
              mountPath: /home/hop
              readOnly: false
          env:
            - name: HOP_LOG_LEVEL
              value: "Basic"
            - name: HOP_FILE_PATH
              value: "/home/hop/project-hop-in-the-cloud/project-a/pipelines-and-workflows/main.hwf"
            - name: HOP_RUN_CONFIG
              value: "classic"
            - name: HOP_RUN_PARAMETERS
              value: "PARAM_LOG_MESSAGE=Hello,PARAM_WAIT_FOR_X_MINUTES=2"
      volumes:
        - name: git-repo-volume
          emptyDir: {}
There isn’t anything ground-breaking going on really:
- We use an initContainer to clone a git repo to a volume that we later mount into the main container hosting Hop. The git repo contains our project-specific Hop configuration as well as the Hop workflows and pipelines.
- For the main container we use the Docker image we created in the previous article and define a few environment variables that will be used to run the Hop data process.
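To make the role of those environment variables concrete, here is a minimal sketch of how an entrypoint script could translate them into a hop-run call. The flag names (`--file`, `--level`, `--runconfig`, `--parameters`) and the `hop-run.sh` path are assumptions for illustration — check the entrypoint of the actual image from the previous article for the exact mapping.

```shell
#!/bin/sh
# Sketch of an entrypoint that maps the env variables from the job spec to a
# hop-run invocation. Defaults mirror the values used in the job definition.
# NOTE: flag names and the hop-run.sh path are assumptions, not taken from
# the actual image.
HOP_LOG_LEVEL="${HOP_LOG_LEVEL:-Basic}"
HOP_FILE_PATH="${HOP_FILE_PATH:-/home/hop/project-hop-in-the-cloud/project-a/pipelines-and-workflows/main.hwf}"
HOP_RUN_CONFIG="${HOP_RUN_CONFIG:-classic}"

CMD="./hop-run.sh --file ${HOP_FILE_PATH} --level ${HOP_LOG_LEVEL} --runconfig ${HOP_RUN_CONFIG}"
# HOP_RUN_PARAMETERS is optional, so only append it when provided
if [ -n "${HOP_RUN_PARAMETERS}" ]; then
  CMD="${CMD} --parameters ${HOP_RUN_PARAMETERS}"
fi

echo "${CMD}"
```

Since the values come from the pod spec, changing what the job runs is just a matter of editing the env section — no image rebuild required.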
The beauty about Kubernetes is that you can run it on various cloud offerings - in other words you are not locked into a specific cloud ecosystem. In my case I’ll run it on GCP. Let’s first create our Kubernetes cluster:
# create project
gcloud projects create k8s-project-hop
# OR if the project already exists, set the project id
gcloud config set project k8s-project-hop
# set compute zone
# list of available zones
# https://cloud.google.com/compute/docs/regions-zones/#available
gcloud config set compute/zone us-west1-a
# create kubernetes engine cluster
# you may need to enable the Kubernetes Engine API first and run this command again
gcloud container clusters create project-hop-cluster \
--machine-type=n1-standard-2 \
--num-nodes=1
# get authentication credentials to interact with cluster
gcloud container clusters get-credentials project-hop-cluster
gcloud container clusters list
kubectl get nodes
Now that the cluster is in place, we can run our Kubernetes job:
kubectl apply -f hop-job.yaml
Let’s get some info about the job and the pod:
kubectl get jobs
kubectl get po
kubectl describe job hop-job
Your job might finish rather quickly. Pods aren’t deleted when the job finishes, so you can still examine their logs. Note that with older kubectl versions completed pods aren’t shown when running kubectl get po, but this can be changed by adding the -a flag (newer kubectl releases have removed -a/--show-all and list completed pods by default):
kubectl get po -a
Let’s take a look at the pod and its logs:
% kubectl get po
NAME                  READY   STATUS      RESTARTS   AGE
hop-batch-job-7lvqx   0/1     Completed   0          118s
% kubectl logs hop-batch-job-7lvqx
Error found during execution!
picocli.CommandLine$ExecutionException: There was an error during execution of file '/home/hop/pipelines-and-workflows/main.hwf'
at org.apache.hop.cli.HopRun.run(HopRun.java:121)
at org.apache.hop.cli.HopRun.main(HopRun.java:642)
Caused by: picocli.CommandLine$ExecutionException: There was a problem during the initialization of the Hop environment
at org.apache.hop.cli.HopRun.initialize(HopRun.java:150)
at org.apache.hop.cli.HopRun.run(HopRun.java:109)
... 1 more
Caused by: java.io.FileNotFoundException: /home/hop/.hop/hop.properties (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.<init>(FileInputStream.java:138)
at java.io.FileInputStream.<init>(FileInputStream.java:93)
at org.apache.hop.cli.HopRun.buildVariableSpace(HopRun.java:159)
at org.apache.hop.cli.HopRun.initialize(HopRun.java:142)
... 2 more
Clean up
Delete the job:
kubectl delete job <JOBNAME>
# or
kubectl delete -f ./hop-job.yaml
When you delete the job, all pods it created are deleted too.
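If you would rather have the cluster clean up after itself, newer Kubernetes versions (the TTL-after-finished feature, stable since v1.23) let the Job delete itself and its pods automatically; the value below is illustrative:

```yaml
# Addition to the job spec: delete the Job (and its pods) 5 minutes after it
# finishes, instead of cleaning up manually with kubectl delete.
spec:
  ttlSecondsAfterFinished: 300
```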
We keep the cluster running for the next exercise.
Long-lived process: Kubernetes Deployment
For running the hop-server, we are using a Kubernetes Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hop-deployment
  labels:
    app: hop
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hop
  template:
    metadata:
      labels:
        app: hop
    spec:
      initContainers:
        - name: clone-git-repo
          image: alpine/git
          volumeMounts:
            - name: git-repo-volume
              mountPath: /tmp/git-repo
              readOnly: false
          command: ['sh', '-c',
            'cd /tmp/git-repo; git clone https://github.com/diethardsteiner/project-hop-in-the-cloud.git; mv project-hop-in-the-cloud/project-a/.hop .; chmod -R 777 .hop']
      containers:
        - name: hop-server
          image: diethardsteiner/project-hop:0.20-20200429.230019-56
          volumeMounts:
            - name: git-repo-volume
              mountPath: /home/hop
              readOnly: false
          env:
            - name: HOP_LOG_LEVEL
              value: "Basic"
          resources:
            requests:
              memory: "4Gi"
              cpu: "1"
      volumes:
        - name: git-repo-volume
          emptyDir: {}
As it turns out, the K8s Deployment isn’t all that different from the K8s Job, so I won’t repeat the explanation here. The main difference is that we now request a specific amount of memory and CPU to be available.
To determine how much memory our container needs, let’s have a look at how much memory our app, in this case hop-run, requires:
HOP_OPTIONS="-Xmx2048m"
Now that we know it needs a 2GB heap, we should set the memory request for our container a bit higher, e.g. 4GB.
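Requests only tell the scheduler what to reserve; if you also want to cap how much the container may consume, you can add limits next to the requests. The figures below are illustrative, sized off the 2GB heap plus headroom for non-heap JVM memory:

```yaml
resources:
  requests:
    memory: "4Gi"
    cpu: "1"
  limits:
    memory: "4Gi"
    cpu: "2"
```

Keep in mind that a container exceeding its memory limit gets OOM-killed, so leave a comfortable margin above the JVM heap size.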
To start our deployment run:
kubectl apply -f hop-deployment.yaml
And then we can get some info to understand what’s going on:
% kubectl get deployments
NAME         READY   UP-TO-DATE   AVAILABLE   AGE
hop-server   0/1     1            0           1m01s
% kubectl get pod
NAME                          READY   STATUS    RESTARTS   AGE
hop-server-7f65f9bdd4-rjfg8   0/1     Pending   0          9m52s
# debugging in case of pod not starting up
kubectl describe pod hop-server-7f65f9bdd4-rjfg8
kubectl get events
# get the logs
% kubectl logs hop-server-7f65f9bdd4-rjfg8
# container name only required if we are running more than one container on the given pod
% kubectl logs hop-server-7f65f9bdd4-rjfg8 -c hop-server
If you are really super curious, you can log onto the running container:
kubectl exec -it hop-server-7f65f9bdd4-rjfg8 -c hop-server -- /bin/bash
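Once inside, a quick way to check whether the init container put the configuration where hop-run expects it — the FileNotFoundException in the job logs earlier was exactly this kind of problem. A small sketch (the path is the one hop-run complained about):

```shell
#!/bin/sh
# Check that the cloned Hop config landed on the shared volume.
# CONFIG_FILE is the path from the FileNotFoundException in the job logs.
CONFIG_FILE="/home/hop/.hop/hop.properties"
if [ -f "${CONFIG_FILE}" ]; then
  echo "config found: ${CONFIG_FILE}"
else
  echo "config missing: ${CONFIG_FILE}"
fi
```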
Clean up
Don’t forget to remove your deployment and the cluster (either via the GCP GUI or the command line).
Conclusion
And that’s it really. With a small amount of effort we’ve created short-lived and long-lived Hop deployments that are easily repeatable (and easy to automate further). Certainly this is only a starting point, but I hope this article gave you a good idea of how the setup works and sparked your motivation to explore this interesting topic further.