Jenkins Agent Scaling And Distributed Tuning
2019-09-05 / modified at 2022-04-04 / 713 words / 4 mins

As the number of compile jobs on our Jenkins grows, we are running into agent scaling issues.

Audience: readers who are concerned with scaling Jenkins. If you just want a cookbook, please refer to [Jenkins 2: Up and Running: Evolve Your Deployment Pipeline for Next Generation Automation].

How many machines/slots are required?

There is no easy way to estimate the number of machines up front, because that requires domain-specific knowledge of your workloads. Just try it out first with a conservative number.

  • Register existing machines to the master.
  • Set the slots per agent to the number of logical CPU cores.
  • Use time-series monitoring tools to track agent workload, and analyze it with a BI tool (Grafana).
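The "slots per agent = logical cores" rule above can be sketched in shell. This is a minimal sketch: executors_for_cores is a hypothetical helper name, not a Jenkins command, and falling back to a single slot when the core count is unknown is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: map a logical core count to a slot (executor) count.
executors_for_cores() {
    cores="$1"
    # Fall back to 1 slot if the value is empty or not a plain number.
    case "$cores" in
        ''|*[!0-9]*) echo 1; return ;;
    esac
    if [ "$cores" -lt 1 ]; then
        echo 1
    else
        echo "$cores"
    fi
}

# getconf _NPROCESSORS_ONLN reports the online logical cores on Linux.
executors_for_cores "$(getconf _NPROCESSORS_ONLN 2>/dev/null)"
```

Treat the printed number as a starting point and adjust it once the monitoring data shows how loaded the agents really are.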

Choose a connection protocol

SSH

The master logs in to an agent over SSH using its IP, user, and key, and registration info is sent back to the master. This approach is not dynamic, and we found the “cold wake-up” time too slow, so we gave it up.

As of 2020, we have implemented an SSH cloud provisioning interface to support our self-hosted agent service.

Remoting Kafka Plugin

This plugin uses a message queue (Kafka) to persist commands and job info over four channels. However, there is little real-world practice documented on the internet and it is strongly coupled with K8s, so we skipped it.

swarm-plugin (preferred)

swarm-plugin is not Docker’s swarm overlay network; it is just a wrapper around JNLP (jenkins-remoting) with a retry policy. We put a custom shell script in userContent so that an agent can connect to the master easily. Whatever machine you use (physical/K8s/HPC), it works.

curl http://<jenkins_url>/userContent/swarmcli.sh | sh
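The article does not show the contents of swarmcli.sh, so the following is only a guess at what such a script could look like. The placeholder URL, the SWARM_RUN switch, and the swarm_command helper are all hypothetical. The download path and the -url, -executors, and -name flags are written as I understand the Swarm client; older releases used -master instead of -url, so check `java -jar swarm-client.jar -help` for your version.

```shell
#!/bin/sh
# Hypothetical sketch of a swarmcli.sh; not the author's actual script.
JENKINS_URL="${JENKINS_URL:-http://jenkins.example.com}"   # placeholder (assumption)
cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"

# Print the command line that would start the Swarm client with $1 slots.
swarm_command() {
    echo "java -jar swarm-client.jar -url $JENKINS_URL -executors $1 -name $(uname -n)"
}

# Dry run by default; set SWARM_RUN=1 to download the client jar and connect.
if [ "${SWARM_RUN:-0}" = "1" ]; then
    curl -fsSL -o swarm-client.jar "$JENKINS_URL/swarm/swarm-client.jar"
    exec java -jar swarm-client.jar -url "$JENKINS_URL" \
        -executors "$cores" -name "$(uname -n)"
else
    swarm_command "$cores"
fi
```

Keeping a dry-run mode makes it possible to inspect the generated command on a new machine type before it registers itself with the master.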

Pros:

  • no password/key required
  • third party machine supported
  • clean code, easy to customize.

Cons:

  • computing and networking live in the same jar, so it is not highly available (HA).

Jenkins Agent Scheduler

Scheduling is a complex problem in general. In Jenkins, however, the default approach is very simple: find any feasible node.

Jenkins internal LoadBalancer

In this situation, we assume that every agent and job has the same specification, and we cannot allocate jobs based on real-time CPU usage. I do not recommend customizing the load balancer in Jenkins.

DefaultLoadbalancer

The default algorithm is consistent hashing, a form of key-based routing, so there is no concept of priority or order. You may find one agent fully loaded while another is idle.

Least/Scores Loadbalancer (score-based)

The Least algorithm can be found in LeastLoadBalancer.java; it sorts agents by their number of idle slots.

The Scores algorithm can be found in NodeLoadScoringRule.java; it sorts agents by (idle - busy).

The following shows the result

                                          Agent1   Agent2
All                                       4        10
Idle                                      3        5
Busy                                      1        5
Least Loadbalancer (higher is better)     3        5
Scores Loadbalancer (higher is better)    20       0
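The two rankings can be reproduced with a few lines of shell arithmetic for Agent1 (3 idle, 1 busy) and Agent2 (5 idle, 5 busy). This is a sketch of the formulas as described here, not the plugins' code: least_score and load_score are hypothetical names, and the scale factor of 10 is an assumption inferred from the scores in the table.

```shell
#!/bin/sh
# Least: rank agents by the number of idle slots.
least_score() {
    echo "$1"    # $1 = idle slots
}

# Scores: (idle - busy) * scale. scale defaults to 10 (assumption).
load_score() {
    scale="${3:-10}"
    echo $(( ($1 - $2) * $scale ))    # $1 = idle, $2 = busy
}

echo "Agent1 least=$(least_score 3) scores=$(load_score 3 1)"
echo "Agent2 least=$(least_score 5) scores=$(load_score 5 5)"
```

Note that the two rules disagree: Least prefers Agent2 because it has more idle slots, while Scores prefers Agent1 because a smaller share of its slots is busy.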

Jenkins external scheduling

You can pass the buck to a hudson.slaves.Cloud plugin, such as the EC2 or K8s plugin (the PaaS solution), which creates and destroys agents remotely on demand, so you pay only for running time.

  • Kubernetes (preferred for SaaS)
  • Nomad (preferred for self-hosted, pluggable driver)
  • Mesos (preferred for self-hosted)

When jobs need access to hardware board resources, you can also use the Cloud interface to abstract agents and slots, and leave the scheduling algorithm to your self-hosted hardware PaaS platform.


Troubleshooting

Process limit per user

When a Jenkins job starts, it may spawn other processes such as shell or Python, and each of them counts against the per-user process limit. Because the agent user runs all of the work on the machine, the default limit is usually too small.

On 64-bit Linux, the max user processes limit can be checked (and changed, by passing a new value) with the following command:

ulimit -u
# my result is 1418, which is not enough; it could be set to 4096
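One way to apply the change is to raise the soft limit in the shell that launches the agent. The sketch below only raises the limit and never lowers it. needs_raise is a hypothetical helper, and `ulimit -u` is a bash/ksh extension, so under other shells the script simply reports "unknown".

```shell
#!/bin/sh
# needs_raise CURRENT WANTED: succeeds when CURRENT is a number below WANTED.
needs_raise() {
    [ "$1" != "unlimited" ] && [ "$1" -lt "$2" ]
}

wanted=4096
# Treat shells without "ulimit -u" as having nothing to raise.
current="$(ulimit -S -u 2>/dev/null || echo unlimited)"

if needs_raise "$current" "$wanted"; then
    # Raising the soft limit fails if the hard limit is lower than $wanted.
    ulimit -S -u "$wanted" 2>/dev/null || echo "could not raise the soft limit" >&2
fi
echo "max user processes (soft): $(ulimit -S -u 2>/dev/null || echo unknown)"
```

A limit set this way only lasts for that shell and its children. To make it permanent on systems using pam_limits, a line such as `jenkins soft nproc 4096` in /etc/security/limits.conf is the usual place, where `jenkins` stands for whatever user runs the agent.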

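JVM ThreadStackSize (-Xss)

On 64-bit Linux, the JVM's default thread stack size is 1M (used to hold stack frames). Each new Thread() in Java is still backed by a native thread, created through the clone() syscall on Linux, so it counts against the user process limit and reserves its own stack.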

# check default Xss
java -XX:+PrintFlagsFinal -version | grep Stack

In most cases there are no deeply recursive algorithms in a pipeline, so we tried decreasing Xss to 256K, and it works.

# JVM options such as -Xss must come before -jar
java -Xss256K -jar agent.jar
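A rough back-of-the-envelope calculation shows why a smaller -Xss helps. It assumes each thread reserves about one stack of -Xss size. The figure of 1000 threads is purely illustrative and not a measurement from our agents, and stack_reserve_mb is a hypothetical helper.

```shell
#!/bin/sh
# stack_reserve_mb THREADS XSS_KB: approximate total stack reservation in MiB.
stack_reserve_mb() {
    echo $(( $1 * $2 / 1024 ))
}

echo "1000 threads at 1024K: $(stack_reserve_mb 1000 1024) MiB"
echo "1000 threads at 256K: $(stack_reserve_mb 1000 256) MiB"
```

The reservation is mostly virtual address space rather than resident memory, so this is an upper bound on the footprint, not what the agent actually uses.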

Other solutions

Train our users

  • Do not use the parallel DSL step inside a for loop, because it acts like a fork bomb and may block the heartbeat thread.
  • Replace Python/shell calls with DSL steps for simple tasks such as fileExists or creating a file, to avoid process creation.