Resilient Jenkins Agents Provisioning with HashiCorp's Nomad
Posted on October 20, 2019

Nomad is an easy-to-use workload orchestrator that is lighter and simpler to operate than Kubernetes. With Nomad, we can create a scalable Jenkins cluster running up to 1k jobs on bare-metal machines.

How Agent Works

Bytecodes Interceptor

The agent (JNLP-based) is an asynchronous, event-driven application that intercepts bytecode commands. The agent has no concept of a Jenkinsfile; everything it receives is Java bytecode. Most commands can be considered pipeline steps (e.g. sh, echo)…

Master and Agent

The master and the agent exchange commands over JNLP (TCP). In most situations, a fixed TCP port is recommended.

Agent internal

  • Dispatcher: handles networking and dispatching in a single thread using NIO.
  • InterceptingExecutorService: runs commands on a cached thread pool (corePoolSize is 0, idle timeout is 60s).
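To make the pool settings above concrete, here is a minimal sketch of a thread pool with the same parameters (zero core threads, 60-second idle timeout), which is what the JDK's Executors.newCachedThreadPool() builds. The class and method names are hypothetical and this is not the agent's actual code; it only illustrates why such a pool grows by one thread per concurrent task.

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CachedPoolSketch {
    // Hypothetical helper: no core threads, unbounded maximum, and a
    // 60-second idle timeout per worker. The SynchronousQueue holds no
    // tasks, so a new thread is started whenever no idle worker is free.
    static ThreadPoolExecutor newAgentStylePool() {
        return new ThreadPoolExecutor(
                0, Integer.MAX_VALUE,
                60L, TimeUnit.SECONDS,
                new SynchronousQueue<>());
    }

    public static void main(String[] args) throws InterruptedException {
        ThreadPoolExecutor pool = newAgentStylePool();
        for (int i = 0; i < 10; i++) {
            pool.execute(() -> {
                try {
                    Thread.sleep(100); // simulate a short command
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        // While the tasks overlap, each one occupies its own worker thread.
        System.out.println("workers during burst: " + pool.getPoolSize());
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

Because the maximum is unbounded, a burst of parallel commands translates directly into a burst of native threads, and idle workers linger until the idle timeout expires.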

Resilient Agent Pool

Here is an example that uses HashiCorp's Nomad for scheduling.

Provision Flow

Nomad uses cgroups (Linux control groups) and chroot for resource isolation (based on runc). Nomad could be replaced with Kubernetes, but Nomad is easier to use.

When your resources (CPU/memory) are exhausted, Nomad refuses to create the agent and shows:

Placement Failures
jenkins-slave-taskgroup 1 unplaced
Resources exhausted on 1 node
Dimension memory exhausted on 1 node

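This can happen even when the machine's actual CPU usage is far from full, because placement is based on the resources each task reserves rather than on live usage.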
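The placement failure above can be illustrated with a simplified model of reservation-based scheduling. This is only a sketch and not Nomad's implementation; the Node class, its fields, and the capacity numbers are hypothetical. It shows how a request is rejected on the memory dimension once reservations add up, regardless of how busy the CPU really is.

```java
import java.util.ArrayList;
import java.util.List;

class PlacementSketch {
    /** Hypothetical, simplified node that tracks reserved resources only. */
    static final class Node {
        final int cpuMhz;
        final int memoryMb;
        int reservedCpuMhz;
        int reservedMemoryMb;

        Node(int cpuMhz, int memoryMb) {
            this.cpuMhz = cpuMhz;
            this.memoryMb = memoryMb;
        }

        /** Returns the dimensions that cannot fit the request; empty means it fits. */
        List<String> exhaustedDimensions(int askCpuMhz, int askMemoryMb) {
            List<String> exhausted = new ArrayList<>();
            if (reservedCpuMhz + askCpuMhz > cpuMhz) exhausted.add("cpu");
            if (reservedMemoryMb + askMemoryMb > memoryMb) exhausted.add("memory");
            return exhausted;
        }

        /** Reserves the request if every dimension fits; live usage is never consulted. */
        boolean place(int askCpuMhz, int askMemoryMb) {
            if (!exhaustedDimensions(askCpuMhz, askMemoryMb).isEmpty()) return false;
            reservedCpuMhz += askCpuMhz;
            reservedMemoryMb += askMemoryMb;
            return true;
        }
    }

    public static void main(String[] args) {
        Node node = new Node(8000, 4096);   // illustrative capacity
        node.place(500, 2048);              // first agent fits
        node.place(500, 2048);              // second agent fits, memory now fully reserved
        System.out.println(node.exhaustedDimensions(500, 2048)); // prints [memory]
    }
}
```

In this model the third agent is refused although only 1000 of 8000 MHz are reserved, which mirrors the "Dimension memory exhausted" message.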

Networking

Nomad uses the Serf protocol to communicate with the nodes.

Jenkins to Nomad

  • Requests the provisioning of containers that run agent.jar.
  • Health: Nomad checks the JVM process and recreates it if required.

Jenkins to Agent

  • The two sides of a channel, connected via JNLP.
  • Persistence: for DurableTask, there is a file-cookie-based solution.

Troubleshooting

Cold startup performance

  • Use Java 11 with AOT and CDS.
  • Use a shared/cached directory.

Health Check

  • Nomad: checks the JVM process by heartbeat and recreates it if required.
  • agent.jar: checks by heartbeat, plus free disk/RAM.

Self-Healing

  • Q: What happens when agent.jar or its container fails?
  • A: The error shows “hudson.remoting.RequestAbortedException: java.nio.channels.AsynchronousCloseException”. The job always terminates, even if the agent restarts.
  • Q: What happens when the master is forced to restart?
  • A: For DurableTask, there is a file-cookie-based solution, so jobs will be recovered.

What you may need to do with the agent

  • Add an idle timeout in case the Jenkins master fails.
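One way to read the idle-timeout advice is a small watchdog that notices when no work has arrived for a while, so the agent process can exit and let the scheduler reclaim its resources. The sketch below is an assumption about how that could look, not something the agent ships with; the class name, the touch() hook, and the timeout value are all hypothetical. The clock is injected so the logic can be exercised without sleeping.

```java
import java.util.function.LongSupplier;

class IdleWatchdogSketch {
    private final long timeoutMillis;
    private final LongSupplier clock;          // injected time source (hypothetical design choice)
    private volatile long lastActivityMillis;

    IdleWatchdogSketch(long timeoutMillis, LongSupplier clock) {
        this.timeoutMillis = timeoutMillis;
        this.clock = clock;
        this.lastActivityMillis = clock.getAsLong();
    }

    /** Call whenever the agent receives work from the master. */
    void touch() {
        lastActivityMillis = clock.getAsLong();
    }

    /** True once no activity has been recorded for at least the timeout. */
    boolean isExpired() {
        return clock.getAsLong() - lastActivityMillis >= timeoutMillis;
    }

    public static void main(String[] args) {
        long[] now = {0L};                      // fake clock for the demo
        IdleWatchdogSketch watchdog = new IdleWatchdogSketch(60_000L, () -> now[0]);
        now[0] = 30_000L;
        System.out.println(watchdog.isExpired()); // prints false
        now[0] = 60_000L;
        System.out.println(watchdog.isExpired()); // prints true
    }
}
```

In a real deployment a periodic task would poll isExpired() against System.currentTimeMillis() and stop the process when it returns true; how the agent detects activity from the master is left out here.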

Parking thread leaks

First, we submit a stress-test pipeline job that creates a lot of threads.

pipeline {
    agent {
        label "test"
    }
    stages {
        stage('init') {
            steps {
                script {
                    // Build 100 trivial branches and run them in parallel.
                    def jobs = [:]
                    100.times {
                        jobs["task" + it] = { echo "1" }
                    }
                    parallel(jobs)
                }
            }
        }
    }
}

You will see the thread count increase during the workload, but when all jobs are done, the parked threads are not released. Using top, you may find that the native threads are not released either, which means:

  • You will leak native stack memory (sized by Xss).
  • You will use up the limit in /proc/sys/kernel/threads-max.
  • You will hit "unable to create new native thread" when you deploy too many agents.

What you need to do

  • Add more heap memory in case of OOM.
  • Raise ulimit, decrease Xss (e.g. -Xss256k), or restart the agent periodically if the parked threads are not released.
  • Modify the thread pool code (only if you really have too many agents).
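As a rough idea of what modifying the thread pool could mean, the sketch below caps the number of workers and lets every worker time out quickly when idle. It is an assumption about one possible change, not a patch to the agent; the class, the helper names, and the numbers are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class BoundedPoolSketch {
    // Hypothetical replacement pool: at most maxThreads workers, extra tasks
    // wait in the queue, and idle workers (including core ones) are released
    // after keepAliveMillis.
    static ThreadPoolExecutor newBoundedPool(int maxThreads, long keepAliveMillis) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                maxThreads, maxThreads,
                keepAliveMillis, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>());
        pool.allowCoreThreadTimeOut(true);
        return pool;
    }

    // Runs a burst of trivial tasks and returns
    // {largest pool size seen, pool size after the idle period}.
    static int[] runBurst(int tasks, int maxThreads, long keepAliveMillis) {
        ThreadPoolExecutor pool = newBoundedPool(maxThreads, keepAliveMillis);
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(done::countDown);
        }
        try {
            done.await();
            Thread.sleep(keepAliveMillis * 10); // give idle workers time to expire
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        int largest = pool.getLargestPoolSize();
        int afterIdle = pool.getPoolSize();
        pool.shutdown();
        return new int[] {largest, afterIdle};
    }

    public static void main(String[] args) {
        int[] result = runBurst(100, 4, 100L);
        System.out.println("largest pool size: " + result[0]);
        System.out.println("pool size after idle: " + result[1]);
    }
}
```

The trade-off is that a bounded pool queues commands instead of running them all at once, so the cap should be chosen with the expected parallelism of the pipelines in mind.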

Reference