Resilient Jenkins Agents Provisioning with HashiCorp's Nomad
2019-10-20 / modified at 2023-07-30 / 796 words / 4 mins

Nomad is an easy-to-use workload orchestrator that is more lightweight and simpler to operate than Kubernetes. With Nomad, we can create a scalable Jenkins cluster running up to 1,000 jobs on bare-metal machines.

What we have achieved

  • Running raw_exec and docker workloads with Jenkins/Nomad on self-hosted machines at scale, which is a cheaper solution than IBM LSF.
  • A declarative agent-pool provisioner (based on HCL) that lets users allocate CPU and memory on their side; see the sketch after this list.
  • OpenLDAP/AutoFS work fine with Nomad.
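
For reference, here is a minimal sketch of the kind of job such a provisioner submits. All names and values are illustrative assumptions; the Jenkins Nomad plugin generates a similar spec for each agent:

# Hypothetical agent job; the real spec is generated per agent.
job "jenkins-agent-example" {
  datacenters = ["dc1"]
  type        = "batch"

  group "jenkins-agent-taskgroup" {
    task "agent" {
      driver = "raw_exec" # or "docker"

      config {
        command = "java"
        # agent.jar plus the JNLP URL and secret issued by the master
        args = ["-jar", "agent.jar"]
      }

      resources {
        cpu    = 500  # MHz reserved; what the bin-packing scheduler counts
        memory = 1024 # MB reserved
      }
    }
  }
}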

How the Agent Works

A Jenkins agent is not a jenkinsfile-runner

It is intuitive to assume that a Jenkins agent is a jenkinsfile-runner. In fact, the agent (TCP/WebSocket based) is only an asynchronous, event-driven application that intercepts bytecode commands.

(Diagram: agent.jar internals. An NIO Dispatcher and IO queue exchange bytecode commands with the master; a CommandReceiver hands them to an InterceptingExecutorService (from 0 to N threads) backed by RemotingClassLoaders.)

The Jenkins master sends and receives commands to/from the agent via JNLP (TCP). In most situations, a fixed TCP port is selected on startup.

Agent internals

  • Dispatcher: networking and dispatching on a single thread via NIO.
  • InterceptingExecutorService: runs commands on a cached thread pool (corePoolSize is 0, idle timeout is 60s).

Resilient Agent Pool

Here is an example using HashiCorp's Nomad for scheduling.

Provision Flow

Nomad uses cgroups (Linux Control Groups) and chroot for resource isolation (based on runc). Nomad could be replaced with Kubernetes; however, a single binary is easier to install and maintain.
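
Note that raw_exec bypasses this isolation, which is why Nomad disables it by default; a minimal client-config sketch to turn it on (values otherwise assumed to be defaults):

# Nomad client config. raw_exec runs tasks without cgroup/chroot
# isolation, so it must be enabled explicitly.
client {
  enabled = true
}

plugin "raw_exec" {
  config {
    enabled = true
  }
}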

(Diagram: provision flow. A timer (PeriodicWork, hudson.model.LoadStatistics.clock, default 10s) drives the NodeProvisioner; if any job is starving for executors and no existing agent fits, org.jenkinsci.plugins.nomad.NomadCloud submits a job named jenkins-agent-xxx.nomad. The Nomad server bin-packs it onto a host, runc (cgroup/chroot) runs agent.jar, the agent connects back over JNLP, and the agent goes from pending to busy to idle/release once all pending jobs are done.)

When your resources (CPU/memory) are exhausted, Nomad will refuse to create agents and shows:

Placement Failures
jenkins-slave-taskgroup 1 unplaced
Resources exhausted on 1 node
Dimension memory exhausted on 1 node

This happens even if actual CPU usage is not full: the scheduler bin-packs on reserved resources, not on real usage.
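
One mitigation, if your Nomad version supports it (memory oversubscription shipped in Nomad 1.1 and must be enabled in the scheduler config, so treat this as an assumption about your cluster), is to reserve a low baseline and allow bursting:

# Sketch: reserve little, allow bursting up to a hard cap.
resources {
  cpu        = 500
  memory     = 512  # reserved amount the bin-packing scheduler counts
  memory_max = 2048 # hard limit the task may actually use
}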

Networking

Nomad uses the Serf protocol (gossip) to communicate among nodes.

(Diagram: networking. Jenkins sends resource requests over HTTP to the Nomad server, which schedules via Serf onto the host OS; a 2 GHz/4 GB host runs four runc-based containers (Agent1 to Agent4, each 500 MHz/1 GB RAM) that connect back to Jenkins over JNLP.)
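
The three arrows in the diagram map onto Nomad's default ports; a sketch of the relevant agent-config block, assuming the defaults are kept:

# Default Nomad ports, written out explicitly.
ports {
  http = 4646 # Jenkins (NomadCloud plugin) calls this REST API
  rpc  = 4647 # client-to-server RPC
  serf = 4648 # gossip between servers
}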

Jenkins to Nomad (HTTP)

  • Requests Nomad to provision containers that will run agent.jar.
  • Health: Nomad checks the JVM process and recreates it if required; see the restart sketch below.
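
The recreate behavior comes from Nomad's task supervision; a sketch of a group-level restart stanza, with illustrative values:

# Restart a crashed agent.jar locally before failing the allocation.
restart {
  attempts = 2     # local restarts allowed
  interval = "30m" # window in which the attempts are counted
  delay    = "15s"
  mode     = "fail" # give up once the attempts are exhausted
}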

Jenkins to Agent

  • The two ends of the channel communicate via TCP/WebSocket.
  • Persistence: for DurableTask, there is a file-cookie based solution.

Agent-side performance

Cold startup performance

  • Use Java 11+ with AOT and CDS.
  • Use environment modules or a Docker image for binary consistency; see the sketch after this list.
  • Use a Cloud Storage Gateway for storage performance.
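
For the Docker route, pinning an exact image tag is what buys the consistency; a sketch (jenkins/inbound-agent is the real upstream image, but the tag here is an illustrative assumption):

task "agent" {
  driver = "docker"

  config {
    # Pin an exact tag so every agent gets the same JVM and toolchain.
    image = "jenkins/inbound-agent:4.13-2-jdk11"
  }
}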

Scheduling policy

  • Nomad side: you may use affinity to place jobs depending on free disk and other attributes; see the sketch after this list.
  • Jenkins side: add an idle timeout so agents are reused.
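
For example, a hedged affinity sketch that prefers (but does not require) nodes with more free disk:

# Soft preference for nodes with roomier disks.
affinity {
  attribute = "${attr.unique.storage.bytesfree}"
  operator  = ">="
  value     = "107374182400" # ~100 GiB free
  weight    = 50             # preference, not a hard constraint
}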

Troubleshooting

What happens when agent.jar goes down?

You will see errors such as

“hudson.remoting.RequestAbortedException: java.nio.channels.AsynchronousCloseException”.

The job will die even if the agent has restarted.

What will happen when the master restarts?

For DurableTask, there is a file-cookie solution to monitor the bash status. However, agent.jar may lose its connection.

Please avoid restarting the master while any agent is running.

Parked thread leaking problem

Normally, a thread stops when its code fragment ends, but a thread pool can park threads so they can be reused. In Jenkins, a parallel step can create significant thread overhead.

For example, we submit a pipeline job as a stress test, creating a lot of threads in parallel.

pipeline {
    agent {
        label "test"
    }
    stages {
        stage('init') {
            steps {
                script {
                    def jobs = [:]
                    100.times {
                        jobs["task" + it] = { echo "1" }
                    }
                    parallel(jobs)
                }
            }
        }
    }
}

You will see the thread count increase during the workload, but when all jobs are done, the parked threads are not released. Using top, you may find that the native threads are not released either, which means:

  • You will leak native stack space (Xss) on the master.
  • You will exhaust /proc/sys/kernel/threads-max.
  • You will face "unable to create new native thread" when you deploy too many agents.

What you need to do

  • Increase heap memory in case of OOM.
  • Increase ulimit, decrease Xss (e.g. -Xss256K), or restart the agent periodically in case parked threads are not released.
  • Modify the thread pool code (only if you really have too many agents).

Preserve images while agents are being stopped

By default, Nomad will remove your agent's Docker image 5 minutes after it stops being used. To avoid re-downloading images, see the config below.

# in the Nomad client config: plugin "docker" -> config -> gc
plugin "docker" {
  config {
    gc {
      image = false
    }
  }
}
