As the number of compile jobs on Jenkins keeps growing, we are running into agent scaling issues.
For the reader: this is for anyone concerned about scaling Jenkins. If you just want a cookbook, please refer to the book Jenkins 2: Up and Running: Evolve Your Deployment Pipeline for Next Generation Automation.
How many machines/slots are required?
There is no easy way to estimate the number of machines, because that requires domain-specific knowledge of your workload. Just try it out first with a conservative number.
- Register existing machines with the master.
- Set the slots per agent to the number of logical CPU cores.
- Use time-series monitoring tools to check agent workload, and analyze the data with ETL tools.
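As a starting point for the second step, the logical core count can be read on the agent itself. This is a minimal sketch; the helper name `logical_cores` is made up for illustration.

```shell
# Count the logical CPU cores on this machine.
# getconf is POSIX; nproc (GNU coreutils) is the fallback.
logical_cores() {
  getconf _NPROCESSORS_ONLN 2>/dev/null || nproc
}

# Starting point suggested above: one slot (executor) per logical core.
echo "suggested executors: $(logical_cores)"
```

Treat the number as an initial value and adjust it once the monitoring data shows how busy the agents really are.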
Which connection protocols?
The master uses SSH to log in to an agent with its IP, user, and key, after which the agent sends its registration info back to the master. This approach is not dynamic, and we found the “cold wake-up” time too slow, so we gave it up.
Remoting Kafka Plugin
This plugin uses a message queue to persist commands and job info over four channels. However, there is very little real-world practice documented online, and it is strongly coupled with K8s, so we skipped it.
swarm-plugin is not Docker’s Swarm overlay network; it is just a wrapper around JNLP (jenkins-remoting) with a retry policy. We created a custom shell script in `userContent` to make it easy for an agent to connect to the master. Whatever machine you use (physical/K8s/HPC), it always works.
curl http://<jenkins_url>/userContent/swarmcli.sh | sh
- no password/key required
- third-party machines supported
- clean code, easy to customize
- computing and networking live in the same jar, so it is not HA
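The script itself is not shown here. A minimal sketch of what such a `swarmcli.sh` might look like follows. It assumes the Swarm plugin is installed (it serves the client jar under `/swarm/swarm-client.jar`), that anonymous users are allowed to connect agents, and that the client version accepts `-url` (older releases called it `-master`). The default `JENKINS_URL` and the `RUN` switch are placeholders.

```shell
#!/bin/sh
# Hypothetical swarmcli.sh. JENKINS_URL and the RUN switch are
# assumptions for illustration, not the original script.
JENKINS_URL="${JENKINS_URL:-http://jenkins.example.com}"

# Print the Swarm client command line for this machine:
# agent name = host name, executors = logical CPU cores.
swarm_command() {
  cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || nproc)
  echo "java -jar swarm-client.jar -url $JENKINS_URL -name $(uname -n) -executors $cores"
}

# Dry run by default; set RUN=1 to download the client and connect.
if [ "${RUN:-0}" = 1 ]; then
  curl -fsSL -o swarm-client.jar "$JENKINS_URL/swarm/swarm-client.jar"
  sh -c "$(swarm_command)"
else
  swarm_command
fi
```

Keeping the dry run as the default makes it easy to check the generated command on a new machine before it actually registers with the master.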
Jenkins Agent Scheduler
Scheduling is a complex problem in general. In Jenkins, however, the default approach is really simple: it just finds a feasible node.
Jenkins internal LoadBalancer
In this situation, we assume every agent and job has the same specification, and we cannot allocate jobs based on real-time CPU usage. I do not recommend customizing the load balancer in Jenkins.
The default algorithm is consistent hashing, a key-based routing scheme, so there is no concept of priority or order. You may find one agent fully loaded while another is idle.
The least-load algorithm can be found in `LeastLoadBalancer.java`; it sorts agents by their count of idle slots.
The scoring algorithm can be found in `NodeLoadScoringRule.java`; it sorts agents by (idle - busy).
The following shows the result
|Least Loadbalancer(Higher is better)||3||5|
|Scores Loadbalancer(Higher is better)||20||0|
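To make the difference between the two orderings concrete, here is a small sketch that ranks two made-up agents both ways. It only mirrors the descriptions above (idle count versus idle minus busy). The agent data is hypothetical and unrelated to the figures in the table, and the real plugins may weight their scores differently.

```shell
# Hypothetical agents: name, idle slots, busy slots.
agents='agent-a 3 1
agent-b 5 5'

# Least-load style: pick the agent with the most idle slots.
pick_by_idle() {
  printf '%s\n' "$agents" | sort -k2,2nr | head -n 1 | cut -d' ' -f1
}

# Score style: pick the agent with the highest (idle - busy).
pick_by_score() {
  printf '%s\n' "$agents" | awk '{ print $1, $2 - $3 }' \
    | sort -k2,2nr | head -n 1 | cut -d' ' -f1
}

echo "by idle count: $(pick_by_idle)"
echo "by score:      $(pick_by_score)"
```

With this data the idle-count ordering prefers `agent-b` (5 idle slots), while the score ordering prefers `agent-a` (score 2 versus 0), because it penalizes agents that are already busy.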
Jenkins external scheduling
You can pass the buck to a `hudson.slaves.Cloud` plugin, such as the EC2 or K8s plugin (the PaaS solution), to create and destroy agents remotely on demand, paying only for running time.
- Kubernetes (preferred for SaaS)
- Nomad (preferred for self-hosted, pluggable drivers)
- Mesos (preferred for self-hosted)
When you need access to hardware board resources, you can also use the `Cloud` interface to abstract agents and slots, leaving the scheduling algorithm to your self-hosted hardware PaaS platform.
Process limit per user
When a Jenkins job starts, it may trigger other processes such as shell or Python, which count against the user's process limit. The default limit is usually too small, because the agent is the only working user on the machine.
On 64-bit Linux, the max user processes limit can be checked and changed with `ulimit -u`.
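A minimal sketch for inspecting the limit on an agent is shown below. It reads `/proc/self/limits`, so it is Linux-only, and the helper name `nproc_limits` is made up for illustration. The user name and value in the commented `limits.conf` entry are placeholders as well.

```shell
# Print the soft and hard "Max processes" limits of the current session.
# Reads /proc/self/limits (Linux only), so it works in any POSIX shell;
# in bash, `ulimit -Su` and `ulimit -Hu` report the same two values.
nproc_limits() {
  awk '/^Max processes/ { print "soft=" $3, "hard=" $4 }' /proc/self/limits
}
nproc_limits

# To raise the limit permanently, an entry like the following could go in
# /etc/security/limits.conf ("jenkins" and 65535 are placeholder values):
#   jenkins  soft  nproc  65535
#   jenkins  hard  nproc  65535
```

Run it as the same user that runs the agent, since the limit is applied per user.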
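On 64-bit Linux, the default JVM thread stack size is 1M (to hold frames), and `new Thread()` in Java still creates a native OS thread (through the `clone()` syscall rather than `fork()`), so every thread reserves its own stack and counts toward the user process limit.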
# check default Xss
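One way to check the default, assuming a HotSpot-based JVM on the `PATH`, is to print the final flag values and look for `ThreadStackSize`, which is the flag behind `-Xss`. The helper name `default_xss` is made up for illustration.

```shell
# Print the JVM's default thread stack size (the value behind -Xss).
# HotSpot reports ThreadStackSize in KB.
default_xss() {
  java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -w ThreadStackSize
}

# Only run the check when a java binary is available.
if command -v java >/dev/null 2>&1; then
  default_xss
fi
```

The `-w` option keeps related flags such as `CompilerThreadStackSize` out of the output.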
In most cases there are no recursive algorithms in a pipeline, so we tried decreasing `Xss` to 256K, and it works.
java -Xss256k -jar agent.jar
Train our users
- Do not use the `parallel` DSL in a for loop, because it is a fork bomb and may block the heartbeat thread.
- Replace Python/shell scripts with DSL steps for simple tasks such as `fileExists` or creating a file, to avoid process creation.