The Jenkins master exchanges commands with the agent over JNLP (TCP). In most setups, a fixed TCP port is selected on startup.
Agent internals
Dispatcher: handles networking and dispatching in a single thread using NIO.
InterceptingExecutorService: runs commands on a cached thread pool (corePoolSize is 0, idle timeout is 60s).
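To make the pool parameters concrete, here is a minimal sketch of a pool with the same shape as Java's standard cached thread pool: zero core threads, an effectively unbounded maximum, and a 60 second idle timeout. The class and method names are hypothetical, and this is not the remoting library's actual code, only an illustration of the settings described above.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not taken from the Jenkins remoting sources.
class CachedPoolSketch {

    // Same parameters as Executors.newCachedThreadPool():
    // corePoolSize 0, unbounded maximum, idle threads kept for 60 seconds.
    static ThreadPoolExecutor newCachedPool() {
        return new ThreadPoolExecutor(
                0, Integer.MAX_VALUE,
                60L, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>());
    }

    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = newCachedPool();
        Runnable command = () ->
                System.out.println("command ran on " + Thread.currentThread().getName());
        // Wait for the command to finish before inspecting the pool.
        pool.submit(command).get();
        System.out.println("core=" + pool.getCorePoolSize()
                + " keepAliveSeconds=" + pool.getKeepAliveTime(TimeUnit.SECONDS)
                + " liveThreads=" + pool.getPoolSize());
        pool.shutdown();
    }
}
```

Because the queue is a SynchronousQueue, every command that arrives while all threads are busy starts a new thread, which is why the thread count can grow quickly under parallel load.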
Resilient Agent Pool
Here is an example that uses HashiCorp’s Nomad for scheduling.
Provision Flow
Nomad uses cgroups (Linux Control Groups) and chroot for resource isolation (based on runc). Nomad can be replaced with Kubernetes; however, a single binary is easier to install and maintain.
When your resources (CPU/memory) are exhausted, Nomad will refuse to create the agent and shows:
    Placement Failures
    jenkins-slave-taskgroup 1 unplaced
    Resources exhausted on 1 node
    Dimension memory exhausted on 1 node
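This can happen even when the real CPU or memory usage is far from full, because Nomad places tasks according to the resources each job reserves, not according to live usage.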
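The placement failure can be pictured with a small sketch. It assumes the scheduler compares the requested memory against what is already reserved on the node and ignores live usage; the class, method, and numbers are hypothetical and are not Nomad's API.

```java
// Hypothetical illustration of reservation-based placement; not Nomad code.
class PlacementSketch {

    // A task fits only if the reservations already on the node plus the new
    // request stay within the node's capacity. Live usage plays no role here.
    static boolean canPlace(long nodeMemoryMb, long reservedMemoryMb, long requestedMemoryMb) {
        return reservedMemoryMb + requestedMemoryMb <= nodeMemoryMb;
    }

    public static void main(String[] args) {
        long nodeMemoryMb = 8192;     // assumed node capacity
        long reservedMemoryMb = 7168; // assumed sum of existing reservations
        long agentRequestMb = 2048;   // assumed memory reserved per agent task
        System.out.println("agent placeable: "
                + canPlace(nodeMemoryMb, reservedMemoryMb, agentRequestMb));
    }
}
```

Under this assumption, lowering the memory reserved per agent is what lets more agents fit on a node; idle agents still hold their reservation.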
Networking
Nomad uses the Serf protocol for communication between nodes.
Jenkins to Nomad(HTTP)
Provisioning: Jenkins asks Nomad to provision the containers that will run agent.jar.
Health: Nomad checks the JVM process and recreates it if required.
Jenkins to Agent
The two sides of the channel communicate over TCP or WebSocket.
Persistence: for DurableTask, there is a file-cookie based solution.
The job will still die, even after the agent has restarted.
What will happen when the master restarts?
For DurableTask, there is a file-cookie solution that monitors the status of the bash process. However, agent.jar may lose its connection.
Avoid restarting the master while any agent is running.
Parked thread leaking problem
Normally, a thread stops when the code it runs finishes. A thread pool, however, can park (pause) an idle thread so that it can be reused. In Jenkins, a parallel step can create a large number of such threads.
For example, suppose we submit a pipeline job as a stress test that creates many threads in parallel.
You will see the thread count rise during the workload, but when all jobs are done the parked threads are not released. Using top, you may find that the native threads are not released either, which means:
You will leak native stack memory (one stack of Xss size per thread) on the master.
You will use up the budget set by /proc/sys/kernel/threads-max.
You will hit "Unable to create new native thread" when you deploy too many agents.
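The parking behaviour can be reproduced outside Jenkins with a minimal sketch. It uses a plain cached thread pool (an assumption, not the Jenkins code path) and hypothetical names; it runs a burst of overlapping tasks and then reports how many pool threads are still alive once every task has finished.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not taken from Jenkins.
class ParkedThreadsSketch {

    // Runs `tasks` jobs that all overlap, waits until every job is done,
    // and returns the number of threads the pool is still holding.
    static int idleThreadsAfterBurst(int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                0, Integer.MAX_VALUE, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<Runnable>());
        CountDownLatch allStarted = new CountDownLatch(tasks);
        CountDownLatch allDone = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                allStarted.countDown();
                try {
                    // Block until every task has started, so no thread is reused.
                    allStarted.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                allDone.countDown();
            });
        }
        try {
            allDone.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        // All work is finished, yet the workers stay parked until they time out.
        int remaining = pool.getPoolSize();
        pool.shutdownNow();
        return remaining;
    }

    public static void main(String[] args) {
        System.out.println("threads still alive after the burst: "
                + idleThreadsAfterBurst(50));
    }
}
```

In this plain pool the parked threads would expire after the 60 second idle timeout. The leak described above is the case where they do not go away, so each one keeps holding its native stack.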
What you need to do
Increase the heap size if you hit an OOM.
Raise the ulimit, decrease Xss (e.g. -Xss256K), or restart the agent periodically if the parked threads are not released.
Modify the thread pool code (only if you really have too many agents).
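As one possible direction for the last option, here is a minimal sketch of a pool that caps the thread count and lets every idle thread time out. This is an assumed design, not the change the author made or a patch to Jenkins; the names and parameter values are hypothetical.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; not a patch for Jenkins or remoting.
class BoundedPoolSketch {

    // At most maxThreads threads exist; extra work waits in the queue.
    // idleSeconds must be greater than 0, otherwise allowCoreThreadTimeOut
    // throws IllegalArgumentException.
    static ThreadPoolExecutor newBoundedPool(int maxThreads, long idleSeconds) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                idleSeconds, TimeUnit.SECONDS,
                new LinkedBlockingQueue<Runnable>());
        // Let core threads exit when idle, so the pool shrinks back to zero.
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newBoundedPool(32, 10);
        pool.execute(() -> System.out.println("ran on " + Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```

The trade-off is that a bounded pool queues work instead of starting a new thread, so commands that block on each other can stall if the cap is set too low.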
Preserve images while agents are getting stopped
By default, Nomad removes your agent’s Docker image 5 minutes after the agent stops. To avoid re-downloading images, see the image cleanup settings in Nomad's Docker driver configuration.