It has been over 2 years since this article was last updated.
Nomad is an easy-to-use workload orchestrator that is more lightweight and easier to operate than Kubernetes. With Nomad, we can create a scalable Jenkins cluster running up to 1k jobs on bare-metal machines.
What have we achieved?
- Running raw_exec and docker tasks with Jenkins/Nomad on self-hosted machines at scale, which is a cheaper solution than IBM LSF.
- A declarative agent pool provisioner (based on HCL) that lets users allocate CPU and memory on their side.
- OpenLDAP/AutoFS works fine with Nomad.
How the Agent Works
The Jenkins agent is not a jenkinsfile-runner
It is intuitive to think of the Jenkins agent as a jenkinsfile-runner. However, the agent (TCP/WebSocket based) is only an asynchronous, event-driven application for intercepting bytecode commands.
The Jenkins master exchanges commands with the agent via JNLP (TCP). In most situations, a fixed TCP port is selected on startup.
Agent internals
- Dispatcher: handles networking and dispatching in a single thread using NIO.
- InterceptingExecutorService: runs commands on a cached thread pool (corePoolSize is 0, idle time is 60s).
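The pool parameters listed above match those of the JDK's `Executors.newCachedThreadPool()`. The sketch below builds such a pool by hand to make the parameters visible. It is an illustration only: the class name `CachedPoolSketch` and the method `newCommandPool` are hypothetical, and this is not the agent's actual code.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolSketch {
    // Same parameters as Executors.newCachedThreadPool(): no core threads,
    // an unbounded maximum, and idle threads retired after 60 seconds.
    public static ThreadPoolExecutor newCommandPool() {
        return new ThreadPoolExecutor(
                0, Integer.MAX_VALUE,
                60L, TimeUnit.SECONDS,
                new SynchronousQueue<>());
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = newCommandPool();
        // A command is handed to an idle thread if one exists, otherwise a new thread is created.
        pool.submit(() -> System.out.println("command ran on " + Thread.currentThread().getName())).get();
        System.out.println("corePoolSize=" + pool.getCorePoolSize()
                + " keepAliveSeconds=" + pool.getKeepAliveTime(TimeUnit.SECONDS));
        pool.shutdown();
    }
}
```

Because the hand-off queue has no capacity, every command that arrives while all threads are busy starts a new thread, which is why bursts of parallel work translate directly into thread count.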
Resilient Agent Pool
Here is an example that uses HashiCorp's Nomad for scheduling.
Provision Flow
Nomad uses cgroups (Linux Control Groups) and chroot for resource isolation (based on runc). Nomad can be replaced with Kubernetes. However, a single binary is easier to install and maintain.
When your resources (CPU/memory) are exhausted, Nomad will refuse to create the agent and show:
Placement Failures
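This happens even when the real CPU usage is below capacity, because placement is decided by the resources each job reserves, not by measured usage.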
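The behavior described here can be pictured as simple bookkeeping over reservations. The following is a minimal sketch under that assumption, not Nomad's scheduler: the class `PlacementSketch`, its fields, and the capacity numbers are all hypothetical.

```java
public class PlacementSketch {
    private final int totalCpuMhz;
    private final int totalMemoryMb;
    private int reservedCpuMhz;
    private int reservedMemoryMb;

    public PlacementSketch(int totalCpuMhz, int totalMemoryMb) {
        this.totalCpuMhz = totalCpuMhz;
        this.totalMemoryMb = totalMemoryMb;
    }

    // Records the reservation and returns true if the request fits.
    // Returns false otherwise, which corresponds to a placement failure.
    // Measured CPU or memory usage never enters the decision.
    public boolean tryPlace(int cpuMhz, int memoryMb) {
        if (reservedCpuMhz + cpuMhz > totalCpuMhz
                || reservedMemoryMb + memoryMb > totalMemoryMb) {
            return false;
        }
        reservedCpuMhz += cpuMhz;
        reservedMemoryMb += memoryMb;
        return true;
    }

    public static void main(String[] args) {
        PlacementSketch node = new PlacementSketch(4000, 8192); // hypothetical node capacity
        System.out.println(node.tryPlace(2000, 4096)); // first agent fits
        System.out.println(node.tryPlace(2000, 4096)); // second agent fits, node is now fully reserved
        System.out.println(node.tryPlace(500, 256));   // rejected even if both agents sit idle
    }
}
```

In practice this means that oversized reservations in the agent template reduce how many agents a node can host, regardless of how busy those agents are.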
Networking
Nomad uses the Serf protocol for communication between nodes.
Jenkins to Nomad (HTTP)
- Requests to provision the containers that will run agent.jar.
- Health: Nomad will check the JVM process and recreate it if required.
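As a rough illustration of the HTTP side, the sketch below builds (but does not send) two requests against Nomad's HTTP API: one to register a job and one to list a job's allocations. The paths `/v1/jobs` and `/v1/job/<id>/allocations` are part of Nomad's API, but everything else is an assumption: the class `NomadRequestSketch`, the address, the job ID, and the JSON payload, which is a placeholder and not a complete job specification. This is not how the Jenkins plugin is necessarily implemented.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class NomadRequestSketch {
    // Builds a job registration request; jobJson is expected to wrap the job as {"Job": {...}}.
    public static HttpRequest registerJob(String nomadAddr, String jobJson) {
        return HttpRequest.newBuilder(URI.create(nomadAddr + "/v1/jobs"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jobJson))
                .build();
    }

    // Builds a request that lists the allocations of a job, useful for checking agent health.
    public static HttpRequest listAllocations(String nomadAddr, String jobId) {
        return HttpRequest.newBuilder(URI.create(nomadAddr + "/v1/job/" + jobId + "/allocations"))
                .GET()
                .build();
    }

    public static void main(String[] args) {
        String addr = "http://127.0.0.1:4646"; // assumed local Nomad address
        // Placeholder payload: a real job also needs datacenters, task groups, and tasks.
        HttpRequest register = registerJob(addr, "{\"Job\": {\"ID\": \"jenkins-agent-example\"}}");
        HttpRequest allocations = listAllocations(addr, "jenkins-agent-example");
        System.out.println(register.method() + " " + register.uri());
        System.out.println(allocations.method() + " " + allocations.uri());
    }
}
```

Sending either request would use `java.net.http.HttpClient` (Java 11+), omitted here so the sketch runs without a Nomad server.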
Jenkins to Agent
- A two-way channel over TCP/WebSocket.
- Persistence: for DurableTask, there is a file-cookie based solution.
Agent-side performance
Cold startup performance
- Using Java 11+ with AOT and CDS
- Using environment modules or a Docker image for binary consistency.
- Using Cloud Storage Gateway for storage performance.
Scheduling-policy
- Nomad side: you can use affinity to place jobs based on free disk space and other attributes.
- Jenkins side: set an idle timeout so that agents can be reused.
Troubleshooting
What happens when agent.jar is down?
You will see errors such as
“hudson.remoting.RequestAbortedException: java.nio.channels.AsynchronousCloseException”.
The job dies even if the agent has restarted.
What will happen when the master restarts?
For DurableTask, there is a file-cookie solution to monitor the bash status. However, agent.jar may lose the connection.
Avoid restarting the master while any agent is running.
Parked thread leaking problem
A thread normally terminates when its code finishes, but a thread pool can pause (park) the thread so that it can be reused. In Jenkins, a parallel step can create significant thread overhead.
For example, we submit a pipeline job for a stress test that creates a lot of threads in parallel.
1 | pipeline{ |
You will see the thread count increase during the workload, but when all jobs are done, the parked threads are not released. Using top, you may find that the native threads are not released either, which means:
- You will leak native stack memory (one Xss-sized stack per thread) on the master.
- You will eat into the /proc/sys/kernel/threads-max limit.
- You will face "Unable to create new native thread" when you deploy too many agents.
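To see what parking looks like in isolation, the sketch below runs a burst of concurrent tasks on a cached-style pool and counts the pool threads that are still alive right after the burst ends. It only illustrates parking inside a plain JDK pool. It does not reproduce the Jenkins leak described above, and the class `ParkedThreadsSketch` and its method name are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ParkedThreadsSketch {
    // Runs `branches` tasks at the same time and returns the number of pool
    // threads still alive immediately after all of them have finished.
    public static int threadsAliveAfterBurst(int branches) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS, new SynchronousQueue<>());
        CountDownLatch allStarted = new CountDownLatch(branches);
        CountDownLatch done = new CountDownLatch(branches);
        for (int i = 0; i < branches; i++) {
            pool.execute(() -> {
                allStarted.countDown();
                try {
                    allStarted.await(); // hold this thread until every branch is running
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                done.countDown();
            });
        }
        done.await();
        int alive = pool.getPoolSize(); // idle workers are parked waiting for work, not terminated
        pool.shutdownNow();
        return alive;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("threads still alive after burst: " + threadsAliveAfterBurst(50));
    }
}
```

In this plain pool the parked threads would exit after the 60-second keep-alive. The problem reported above is that, on the master, they stay around.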
What you need to do
- Increase the heap size in case of OOM.
- Increase the ulimit, decrease Xss (e.g. -Xss256K), or restart the agent periodically in case the parked threads are not released.
- Modify the thread pool code (only if you really have too many agents).
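A quick back-of-the-envelope calculation shows why lowering Xss helps. The sketch assumes a 1 MiB default stack size, which is typical for 64-bit Linux JVMs but should be checked for your JVM. The thread count of 10,000 is purely illustrative, and the class `StackBudgetSketch` is hypothetical.

```java
public class StackBudgetSketch {
    // Upper bound on stack address space reserved by `threads` threads of size xssBytes each.
    public static long reservedStackBytes(int threads, long xssBytes) {
        return threads * xssBytes;
    }

    public static void main(String[] args) {
        int threads = 10_000;             // illustrative number of leaked threads
        long defaultXss = 1024L * 1024L;  // assumed default: 1 MiB per thread
        long reducedXss = 256L * 1024L;   // -Xss256K
        long mib = 1024L * 1024L;
        System.out.println("default Xss: " + reservedStackBytes(threads, defaultXss) / mib + " MiB");
        System.out.println("-Xss256K:    " + reservedStackBytes(threads, reducedXss) / mib + " MiB");
    }
}
```

The figure is an upper bound on reserved address space, since stack pages are generally committed only as they are touched. Lowering Xss does not reduce the thread count, so it has no effect on the threads-max limit.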
Preserve images while agents are getting stopped
By default, Nomad will remove your agent's Docker image after 5 minutes. To avoid re-downloading images, see the following config:
1 | # on nomad's client->plugin "docker"->config |