As the number of compile jobs running on Jenkins grows, we are running into agent scaling issues.
This post is for readers who are concerned about scaling Jenkins. If you just want a cookbook, please refer to Jenkins 2: Up and Running: Evolve Your Deployment Pipeline for Next Generation Automation.
How many machines/slots are required?
There is no easy way to estimate the number of machines up front, because that requires domain-specific knowledge of your workloads. Start with a conservative number and measure:
- Register existing machines with the master.
- Set the number of slots (executors) per agent to the number of logical CPU cores.
- Use a time-series monitoring tool to record agent workload, and analyze it with a BI tool such as Grafana.
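As a rough illustration of the last two steps, the sketch below reads the logical core count of the current machine and turns sampled busy-executor counts into mean and peak utilization. The function names and the sample numbers are hypothetical; in practice the samples would come from whatever monitoring system you use.

```python
import os


def suggested_executors() -> int:
    """Number of logical CPU cores on this machine (falls back to 1 if unknown)."""
    return os.cpu_count() or 1


def utilization(busy_samples, total_slots):
    """Return (mean, peak) utilization in [0, 1].

    busy_samples: busy-executor counts sampled at regular intervals.
    total_slots:  total number of executors across the measured agents.
    """
    if total_slots <= 0 or not busy_samples:
        raise ValueError("need at least one sample and one slot")
    mean = sum(busy_samples) / (len(busy_samples) * total_slots)
    peak = max(busy_samples) / total_slots
    return mean, peak


if __name__ == "__main__":
    print("suggested executors:", suggested_executors())
    # Hypothetical samples: 8 slots in total, four measurements.
    print("mean/peak utilization:", utilization([2, 4, 6, 8], 8))
```

A peak close to 1.0 with a low mean suggests bursty load, where on-demand agents may fit better than adding permanent machines.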
Choose connection protocol
SSH
The master logs in to an agent over SSH using its IP, user, and key, and the registration info is sent back to the master. This approach is not dynamic, and we found the “cold wake up” time too slow, so we gave it up.
As of 2020, we have implemented an SSH cloud provisioning interface to support our self-hosted agent service.
Remoting Kafka Plugin
This plugin uses a message queue (Kafka) to persist commands and job information over four channels. However, there are very few real-world examples of it online and it is tightly coupled with K8s, so we skipped it.
swarm-plugin(preferred)
swarm-plugin has nothing to do with Docker's swarm overlay network; it is just a wrapper around JNLP (jenkins-remoting) with a retry policy. We keep a custom shell script in userContent to make it easy for an agent to connect to the master. Whatever kind of machine you use (physical, K8s, HPC), it works.
    curl http://<jenkins_url>/userContent/swarmcli.sh | sh
Pros:
- no password/key required
- third party machine supported
- clean code, easy to customize.
Cons:
- computing and networking live in the same jar, so it is not highly available (HA).
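The "retry policy" idea mentioned above can be pictured with a small, generic sketch. This is not the plugin's actual implementation: the function name, the exponential backoff, and the default delays are assumptions chosen for illustration only.

```python
import time


def connect_with_retry(connect, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call connect() until it succeeds, doubling the wait after each failure.

    connect: zero-argument callable that raises ConnectionError on failure.
    sleep:   injectable so the wait can be observed or skipped in tests.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(delay)
            delay = min(delay * 2, max_delay)


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky_connect():
        # Hypothetical connection that fails twice before succeeding.
        state["calls"] += 1
        if state["calls"] < 3:
            raise ConnectionError("master not reachable yet")
        return "connected"

    print(connect_with_retry(flaky_connect, base_delay=0.1))
```

The point of such a wrapper is that an agent started before the master is reachable eventually registers on its own, without manual intervention.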
Jenkins Agent Scheduler
Scheduling is, in general, a complex problem. In Jenkins, however, the default strategy is very simple: it just finds a feasible node.
Jenkins internal LoadBalancer
With the internal load balancer, we have to assume that every agent and every job has the same specification, and we cannot allocate jobs based on real-time CPU usage. I do not recommend customizing the load balancer inside Jenkins.
DefaultLoadbalancer
The default algorithm is consistent hashing, a key-based routing scheme, so it has no concept of priority or ordering. You may find that one agent is fully loaded while another is idle.
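To see why key-based routing ignores load, here is a minimal consistent-hash ring. It is an illustrative sketch, not Jenkins's own implementation: the class name, the MD5 hash, and the number of replicas are assumptions.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class ConsistentHashRing:
    def __init__(self, nodes, replicas=100):
        if not nodes:
            raise ValueError("at least one node is required")
        # Each node is placed on the ring many times to even out the distribution.
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def lookup(self, key: str) -> str:
        """Return the node owning the first ring point after the key's hash."""
        index = bisect.bisect(self._points, _hash(key)) % len(self._ring)
        return self._ring[index][1]


if __name__ == "__main__":
    ring = ConsistentHashRing(["agent1", "agent2", "agent3"])
    for job in ["build-kernel", "build-docs", "unit-tests"]:
        # The choice depends only on the job name, never on how busy a node is.
        print(job, "->", ring.lookup(job))
```

The same job name always lands on the same node, which is useful for workspace reuse but says nothing about how many executors on that node are currently busy.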
Least/Scores Loadbalancer(Score-based)
The Least algorithm can be found in LeastLoadBalancer.java; it sorts agents by their number of idle slots.
The Scores algorithm can be found in NodeLoadScoringRule.java; it scores each agent by (idle - busy) multiplied by a scale factor, which is 10 in the table below.
The following table shows the resulting scores:
Metric | Agent1 | Agent2 |
---|---|---|
All slots | 4 | 10 |
Idle | 3 | 5 |
Busy | 1 | 5 |
Least Loadbalancer (higher is better) | 3 | 5 |
Scores Loadbalancer (higher is better) | 20 | 0 |
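The numbers in the table can be reproduced with a few lines of code. This is a sketch of the two scoring ideas, not the plugins' source; the `Agent` class and function names are hypothetical, and the scale factor of 10 is an assumption inferred from the table, where (3 - 1) yields 20.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    idle: int
    busy: int


def least_score(agent: Agent) -> int:
    """Least-load idea: the score is simply the number of idle slots."""
    return agent.idle


def load_score(agent: Agent, scale: int = 10) -> int:
    """Scoring idea: (idle - busy) times a scale factor (assumed to be 10)."""
    return (agent.idle - agent.busy) * scale


def pick(agents, score) -> str:
    """Return the name of the agent with the highest score."""
    return max(agents, key=score).name


if __name__ == "__main__":
    agents = [Agent("Agent1", idle=3, busy=1), Agent("Agent2", idle=5, busy=5)]
    for agent in agents:
        print(agent.name, least_score(agent), load_score(agent))
    print("least picks:", pick(agents, least_score))
    print("scores picks:", pick(agents, load_score))
```

The two rules disagree here: counting idle slots favors the larger Agent2, while the (idle - busy) rule favors Agent1 because it is proportionally less loaded.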
Jenkins external scheduling
You can delegate scheduling to a hudson.slaves.Cloud plugin, such as the EC2 or K8s plugin. These PaaS-style solutions create and destroy agents remotely on demand, so you pay only for running time.
- Kubernetes (preferred for SaaS)
- Nomad (preferred for self-hosted, pluggable drivers)
- Mesos (preferred for self-hosted)
When you need access to hardware board resources, you can also use the Cloud interface to abstract agents and slots, and leave the scheduling algorithm to your self-hosted hardware PaaS platform.
Troubleshooting
Process limit per user
When a Jenkins job starts, it may spawn other processes such as shell or Python scripts, each of which counts against the per-user process limit. The default limit is often too small, because every job on the machine runs under the single agent user.
On 64-bit Linux, the maximum number of user processes can be shown with the following command (pass a number as an argument to change it):
    ulimit -u
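The same limit can be inspected from a script, for example as a pre-flight check on an agent. The sketch below uses Python's standard `resource` module on Linux; the helper names are hypothetical.

```python
import resource


def nproc_limits():
    """Return the (soft, hard) per-user process limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NPROC)


def describe(value: int) -> str:
    """Render a limit the way `ulimit -u` does."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def raise_soft_to_hard() -> None:
    """Raise the soft limit to the hard limit, which needs no extra privileges."""
    _, hard = nproc_limits()
    resource.setrlimit(resource.RLIMIT_NPROC, (hard, hard))


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print("soft:", describe(soft), "hard:", describe(hard))
```

Raising the hard limit itself is an administrative change made outside the process, so it is not covered here.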
JVM ThreadStackSize(-Xss)
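On 64-bit Linux, the JVM's default thread stack size is 1M (used to hold stack frames). Each new Thread() in Java is backed by a native OS thread, which on Linux is created with the clone() syscall rather than a plain fork(), so every thread reserves its own stack and consumes a lot of OS stack space.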
    # check default Xss
In most cases a pipeline contains no deeply recursive algorithms, so we tried decreasing Xss to 256K, and it works.
    java -Xss256K -jar agent.jar
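To get a feel for the effect, the stack space reserved by a JVM is roughly the number of threads multiplied by Xss. The sketch below does that arithmetic; the helper names and the figure of 2000 threads are hypothetical, and the result is an upper bound on reserved address space, not on memory actually touched.

```python
_UNITS = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def parse_jvm_size(text: str) -> int:
    """Convert a JVM-style size such as '256K' or '1M' into bytes."""
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    suffix = text[-1].lower()
    if suffix in _UNITS:
        return int(text[:-1]) * _UNITS[suffix]
    return int(text)  # a plain number is taken as bytes


def stack_reservation_mib(threads: int, xss: str) -> float:
    """Stack space reserved by `threads` threads, in MiB."""
    return threads * parse_jvm_size(xss) / 1024 ** 2


if __name__ == "__main__":
    for xss in ("1M", "256K"):
        print(f"2000 threads at -Xss{xss}: {stack_reservation_mib(2000, xss)} MiB")
```

With these assumed numbers, dropping from 1M to 256K cuts the reservation to a quarter.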
Other solutions
Train our users
- Do not use the parallel DSL step inside a for loop, because it behaves like a fork bomb and may block the heartbeat thread.
- Replace Python/shell scripts with DSL steps for simple tasks such as fileExists or creating a file, to avoid spawning processes.