Resilient Jenkins Agents Provisioning with HashiCorp's Nomad
2019-10-20 / modified at 2023-07-30 / 796 words / 4 mins

Nomad is an easy-to-use workload orchestrator that is more lightweight and simpler to operate than Kubernetes. With Nomad, we can create a scalable Jenkins cluster running up to 1,000 jobs on bare-metal machines.

What we have achieved

  • Running raw_exec and docker workloads with Jenkins/Nomad on self-hosted machines at scale, which is a cheaper solution than IBM LSF.
  • A declarative agent-pool provisioner (based on HCL) that lets users allocate CPU and memory on their side; see the sketch after this list.
  • OpenLDAP/AutoFS work fine with Nomad.
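
For reference, here is a minimal sketch of the kind of job such a provisioner submits. All names and values are illustrative assumptions; the Jenkins Nomad plugin generates a similar spec for each agent:

# Hypothetical agent job; the real spec is generated per agent.
job "jenkins-agent-example" {
  datacenters = ["dc1"]
  type        = "batch"

  group "jenkins-agent-taskgroup" {
    task "agent" {
      driver = "raw_exec" # or "docker"

      config {
        command = "java"
        # agent.jar plus the JNLP URL and secret issued by the master
        args = ["-jar", "agent.jar"]
      }

      resources {
        cpu    = 500  # MHz reserved; what the bin-packing scheduler counts
        memory = 1024 # MB reserved
      }
    }
  }
}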

How the Agent Works

A Jenkins agent is not a jenkinsfile-runner

It is intuitive to assume that a Jenkins agent is a jenkinsfile-runner. In fact, the agent (TCP/WebSocket based) is only an asynchronous, event-driven application that intercepts bytecode commands.

(Diagram: agent.jar internals. An NIO Dispatcher and IO queue exchange bytecode commands with the master; a CommandReceiver hands them to an InterceptingExecutorService (from 0 to N threads) backed by RemotingClassLoaders.)

The Jenkins master sends and receives commands to/from the agent via JNLP (TCP). In most situations, a fixed TCP port is selected on startup.

Agent internals

  • Dispatcher: networking and dispatching on a single thread via NIO.
  • InterceptingExecutorService: runs commands on a cached thread pool (corePoolSize is 0, idle timeout is 60s).

Resilient Agent Pool

Here is an example using HashiCorp's Nomad for scheduling.

Provision Flow

Nomad uses cgroups (Linux Control Groups) and chroot for resource isolation (based on runc). Nomad could be replaced with Kubernetes; however, a single binary is easier to install and maintain.
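
Note that raw_exec bypasses this isolation, which is why Nomad disables it by default; a minimal client-config sketch to turn it on (values otherwise assumed to be defaults):

# Nomad client config. raw_exec runs tasks without cgroup/chroot
# isolation, so it must be enabled explicitly.
client {
  enabled = true
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}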

(Diagram: provision flow. A timer (PeriodicWork, hudson.model.LoadStatistics.clock, default 10s) drives the NodeProvisioner; if any job is starving for executors and no existing agent fits, org.jenkinsci.plugins.nomad.NomadCloud submits a job named jenkins-agent-xxx.nomad. The Nomad server bin-packs it onto a host, runc (cgroup/chroot) runs agent.jar, the agent connects back over JNLP, and the agent goes from pending to busy to idle/release once all pending jobs are done.)

When your resources (CPU/memory) are exhausted, Nomad will refuse to create agents and shows:

Placement Failures
jenkins-slave-taskgroup 1 unplaced
Resources exhausted on 1 node
Dimension memory exhausted on 1 node

This happens even if actual CPU usage is not full: the scheduler bin-packs on reserved resources, not on real usage.
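
One mitigation, if your Nomad version supports it (memory oversubscription shipped in Nomad 1.1 and must be enabled in the scheduler config, so treat this as an assumption about your cluster), is to reserve a low baseline and allow bursting:

# Sketch: reserve little, allow bursting up to a hard cap.
resources {
  cpu        = 500
  memory     = 512  # reserved amount the bin-packing scheduler counts
  memory_max = 2048 # hard limit the task may actually use
}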

Networking

Nomad uses the Serf protocol (gossip) to communicate among nodes.

(Diagram: networking. Jenkins sends resource requests over HTTP to the Nomad server, which schedules via Serf onto the host OS; a 2 GHz/4 GB host runs four runc-based containers (Agent1 to Agent4, each 500 MHz/1 GB RAM) that connect back to Jenkins over JNLP.)
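
The three arrows in the diagram map onto Nomad's default ports; a sketch of the relevant agent-config block, assuming the defaults are kept:

# Default Nomad ports, written out explicitly.
ports {
  http = 4646 # Jenkins (NomadCloud plugin) calls this REST API
  rpc  = 4647 # client-to-server RPC
  serf = 4648 # gossip between servers
}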

Jenkins to Nomad (HTTP)

  • Requests Nomad to provision containers that will run agent.jar.
  • Health: Nomad checks the JVM process and recreates it if required; see the restart sketch below.
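
The recreate behavior comes from Nomad's task supervision; a sketch of a group-level restart stanza, with illustrative values:

# Restart a crashed agent.jar locally before failing the allocation.
restart {
  attempts = 2     # local restarts allowed
  interval = "30m" # window in which the attempts are counted
  delay    = "15s"
  mode     = "fail" # give up once the attempts are exhausted
}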

Jenkins to Agent

  • The two ends of the channel communicate via TCP/WebSocket.
  • Persistence: for DurableTask, there is a file-cookie based solution.

Agent-side performance

Cold startup performance

  • Use Java 11+ with AOT and CDS.
  • Use environment modules or a Docker image for binary consistency; see the sketch after this list.
  • Use a Cloud Storage Gateway for storage performance.
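
For the Docker route, pinning an exact image tag is what buys the consistency; a sketch (jenkins/inbound-agent is the real upstream image, but the tag here is an illustrative assumption):

task "agent" {
  driver = "docker"

  config {
    # Pin an exact tag so every agent gets the same JVM and toolchain.
    image = "jenkins/inbound-agent:4.13-2-jdk11"
  }
}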

Scheduling policy

  • Nomad side: you may use affinity to place jobs depending on free disk and other attributes; see the sketch after this list.
  • Jenkins side: add an idle timeout so agents are reused.
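
For example, a hedged affinity sketch that prefers (but does not require) nodes with more free disk:

# Soft preference for nodes with roomier disks.
affinity {
  attribute = "${attr.unique.storage.bytesfree}"
  operator  = ">="
  value     = "107374182400" # ~100 GiB free
  weight    = 50             # preference, not a hard constraint
}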

Troubleshooting

What happens when agent.jar goes down?

You will see errors such as

“hudson.remoting.RequestAbortedException: java.nio.channels.AsynchronousCloseException”.

The job will die even if the agent has restarted.

What will happen when the master restarts?

For DurableTask, there is a file-cookie solution to monitor the bash status. However, agent.jar may lose its connection.

Please avoid restarting the master while any agent is running.

Parked thread leaking problem

Normally, a thread stops when its code fragment ends, but a thread pool can park threads so they can be reused. In Jenkins, a parallel step can create significant thread overhead.

For example, we submit a pipeline job as a stress test, creating a lot of threads in parallel.

pipeline {
    agent {
        label "test"
    }
    stages {
        stage('init') {
            steps {
                script {
                    def jobs = [:]
                    100.times {
                        jobs["task" + it] = { echo "1" }
                    }
                    parallel(jobs)
                }
            }
        }
    }
}

You will see the thread count increase during the workload, but when all jobs are done, the parked threads are not released. Using top, you may find that the native threads are not released either, which means:

  • You will leak native stack space (Xss) on the master.
  • You will exhaust /proc/sys/kernel/threads-max.
  • You will face "unable to create new native thread" when you deploy too many agents.

What you need to do

  • Increase heap memory in case of OOM.
  • Increase ulimit, decrease Xss (e.g. -Xss256K), or restart the agent periodically in case parked threads are not released.
  • Modify the thread pool code (only if you really have too many agents).

Preserve images while agents are being stopped

By default, Nomad will remove your agent's Docker image 5 minutes after it stops being used. To avoid re-downloading images, see the config below.

# in the Nomad client config: plugin "docker" -> config -> gc
plugin "docker" {
  config {
    gc {
      image = false
    }
  }
}
