Nomad is an easy-to-use workload orchestrator that is lighter-weight and simpler to operate than Kubernetes. With Nomad, we can create a scalable Jenkins cluster running up to 1k jobs on bare-metal machines.
How the Agent Works
The agent (JNLP-based) is an asynchronous, event-driven application that intercepts and runs bytecode commands. The agent has no concept of a Jenkinsfile; everything it receives is Java bytecode. Most commands can be considered pipeline steps (eg:
Master and Agent
The master exchanges commands with the agent over JNLP (TCP). In most situations, a fixed TCP port is recommended.
- Dispatcher: handles networking and dispatching on a single thread using NIO.
- InterceptingExecutorService: runs commands on a cached thread pool (corePoolSize is 0, idle time is 60s).
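The pool settings listed above (corePoolSize 0, 60s idle time) are the same ones the JDK's `Executors.newCachedThreadPool()` uses. Here is a minimal sketch, assuming a plain JDK executor stands in for InterceptingExecutorService. The class name `CachedPoolSketch` and the method `newCommandPool` are hypothetical, and this is not the agent's actual code:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the agent's command executor; not the real agent code.
class CachedPoolSketch {
    // Same parameters as Executors.newCachedThreadPool():
    // no core threads, unbounded maximum, idle threads expire after 60s.
    static ThreadPoolExecutor newCommandPool() {
        return new ThreadPoolExecutor(
                0, Integer.MAX_VALUE,
                60L, TimeUnit.SECONDS,
                new SynchronousQueue<>());
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = newCommandPool();
        // A new thread is created whenever no idle thread is free to take the command.
        for (int i = 0; i < 4; i++) {
            final int id = i;
            pool.execute(() -> System.out.println(
                    "command " + id + " on " + Thread.currentThread().getName()));
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("largest pool size: " + pool.getLargestPoolSize());
    }
}
```

Because the maximum is unbounded, a burst of commands can create one thread per command, which matters for the thread issues discussed later.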
Resilient Agent Pool
Here is an example using HashiCorp’s Nomad for scheduling.
Nomad uses cgroups (Linux Control Groups) and chroot for resource isolation (based on runc). Nomad can be replaced with Kubernetes, but Nomad is easier to use.
When your resources (CPU/memory) are exhausted, Nomad will refuse to create the agent, and shows
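This happens even if the real CPU usage is not at capacity, because Nomad schedules against the resources each job reserves rather than what is actually in use.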
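The refusal is easier to see with numbers. Below is a small worked example, assuming a hypothetical node with 4000 MHz of CPU and 16384 MB of memory, and agents that each reserve 1000 MHz and 2048 MB. All figures and the class name `ReservationSketch` are illustrative assumptions, not values from a real cluster:

```java
// Illustrative arithmetic only; the capacities below are assumptions, not measurements.
class ReservationSketch {
    // Number of agents that fit when placement is decided by reserved capacity.
    static int agentsThatFit(int nodeCpuMhz, int nodeMemMb, int agentCpuMhz, int agentMemMb) {
        return Math.min(nodeCpuMhz / agentCpuMhz, nodeMemMb / agentMemMb);
    }

    public static void main(String[] args) {
        int fit = agentsThatFit(4000, 16384, 1000, 2048);
        // CPU is the limiting resource here: 4000 / 1000 = 4, while memory would allow 8.
        System.out.println("agents that fit: " + fit);
        // A fifth agent would be refused even if the four running agents were idle.
    }
}
```

Lowering the per-agent reservation is one way to fit more agents on a node, at the cost of weaker guarantees for each one.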
Nomad uses the Serf protocol to communicate with the nodes.
Jenkins to Nomad
- Requests Nomad to provision containers that run agent.jar.
- Health: Nomad will check the JVM process and recreate it if required.
Jenkins to Agent
- The two sides communicate over a channel established via JNLP.
- Persistence: for DurableTask, there is a file-cookie-based solution.
Cold startup performance
- Using Java 11 with AOT and CDS
- Using shared/cached directory
- Nomad: checks the JVM process by heartbeat and recreates it if required.
- Agent.jar: checked by heartbeat and by free disk/RAM.
- Q: What happens when agent.jar or the container fails?
- A: The error shows “hudson.remoting.RequestAbortedException: java.nio.channels.AsynchronousCloseException”. The job will always terminate, even if the agent restarts.
- Q: What happens when the master is forced to restart?
- A: For DurableTask, there is a file-cookie-based solution, so jobs will be recovered.
What you may need to do with the agent
- Add an idle timeout in case the Jenkins master fails over.
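One possible shape for such an idle timeout is a small watchdog that records the last time a command arrived and reports when the agent has been idle too long. This is only a sketch under assumptions: the class `IdleWatchdog` and its methods are hypothetical, and how it would be wired into a real agent is left open:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical watchdog; not part of agent.jar.
class IdleWatchdog {
    private final long timeoutMillis;
    private final AtomicLong lastActivity;

    IdleWatchdog(long timeoutMillis, long nowMillis) {
        this.timeoutMillis = timeoutMillis;
        this.lastActivity = new AtomicLong(nowMillis);
    }

    // Call whenever a command arrives from the master.
    void touch(long nowMillis) {
        lastActivity.set(nowMillis);
    }

    // True once no command has been seen for longer than the timeout.
    boolean isExpired(long nowMillis) {
        return nowMillis - lastActivity.get() > timeoutMillis;
    }

    public static void main(String[] args) throws Exception {
        IdleWatchdog watchdog = new IdleWatchdog(200, System.currentTimeMillis());
        ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
        // Poll periodically; a real agent might exit here so the scheduler can replace it.
        timer.scheduleAtFixedRate(() -> {
            if (watchdog.isExpired(System.currentTimeMillis())) {
                System.out.println("idle timeout reached, shutting down");
                timer.shutdown();
            }
        }, 50, 50, TimeUnit.MILLISECONDS);
        timer.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Passing the clock value in as a parameter keeps the expiry logic easy to test without sleeping.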
Parking thread leaks
First, we submit a stress-test pipeline job that creates a lot of threads.
You will see the thread count increase during the workload, but when all jobs are done, the parking threads are not released. Using top, you may find that the native threads are not released either, which means
- You will leak native stack memory (one Xss-sized stack per thread).
- You will leak
- You will face "Unable to create new native thread" when you deploy too many agents.
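To get a feel for the cost of leaked threads, the stack reservation can be estimated as thread count times stack size. The sketch below assumes 2000 leaked threads and compares a 1024K stack (a common default on 64-bit Linux, but check your JVM) with 256K. The numbers and the class name `StackBudgetSketch` are illustrative, and the result is reserved address space rather than memory that is necessarily resident:

```java
// Back-of-the-envelope estimate; thread count and stack sizes are assumptions.
class StackBudgetSketch {
    // Native stack space reserved by the given number of threads, in bytes.
    static long reservedStackBytes(int threads, long stackSizeKb) {
        return threads * stackSizeKb * 1024L;
    }

    public static void main(String[] args) {
        int leakedThreads = 2000;
        long mb = 1024L * 1024L;
        System.out.println("Xss=1024K: " + reservedStackBytes(leakedThreads, 1024) / mb + " MB");
        System.out.println("Xss=256K:  " + reservedStackBytes(leakedThreads, 256) / mb + " MB");
    }
}
```

With these assumptions the estimate drops from 2000 MB to 500 MB, which is why a smaller stack size buys headroom without fixing the leak itself.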
What you need to do
- Add more heap memory in case of OOM.
- Raise the ulimit, decrease Xss (eg: -Xss256K), or restart the agent periodically if the parking threads are not released.
- Modify the thread pool code. (Only if you really have too many agents.)
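As one illustration of what a thread pool change could look like, the sketch below caps the number of threads and lets all of them expire when idle. It is an assumption about a possible direction, not the actual agent code or a recommended patch. The names `BoundedPoolSketch` and `newBoundedPool` are hypothetical:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical bounded replacement for an unbounded cached pool.
class BoundedPoolSketch {
    // idleSeconds must be greater than 0, otherwise allowCoreThreadTimeOut throws.
    static ThreadPoolExecutor newBoundedPool(int maxThreads, long idleSeconds) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                idleSeconds, TimeUnit.SECONDS,
                new LinkedBlockingQueue<>());
        // Let even core threads exit when idle, so the pool can shrink back to zero.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = newBoundedPool(8, 60);
        // Extra commands wait in the queue instead of spawning more threads.
        for (int i = 0; i < 100; i++) {
            pool.execute(() -> { });
        }
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("largest pool size: " + pool.getLargestPoolSize());
    }
}
```

The trade-off is that a bounded pool queues commands once all threads are busy, which could stall commands that wait on each other, so treat this as a starting point for experiments.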