Setting up an active-passive GitLab instance
2024-11-11 / modified at 2025-03-27 / 1.7k words / 10 mins

GitLab is an open-source DevOps platform widely adopted across many companies. This guide outlines the setup of a hot standby (active-passive) GitLab instance.

Key Considerations

  • To maintain segregation of duties and adhere to a vendor-neutral technology stack, we use a reduced-feature (“dumbed-down”) GitLab configuration, focusing on core repository-hosting functionality.
  • We’re introducing an active-passive deployment that eliminates the need for NFS.

GitLab: More Than Just a Repository Host

GitLab Inc ($GTLB) is under pressure from stockholders to demonstrate revenue growth, leading to a rapid release cycle of new features. While these additions can be beneficial, they also increase the complexity of a GitLab instance, requiring platform engineers to manage a broader range of components, disable potentially vulnerable features, and maintain high availability. This guide details how to create a hot standby Git hosting solution using GitLab OSS.

Choosing the Right Hardware for GitLab Self-Hosting

We support approximately 3,000 monthly active users performing code commits and CI/CD builds (via Jenkins) on self-hosted GitLab instances. We continuously evaluate GitLab performance across various cloud and bare metal environments.

Cloud VM Specs

According to our statistics, the following specifications are recommended for GitLab hosting.

  • CPU: GitLab is not inherently CPU-intensive, with peak load occurring during merge request processing. However, avoid oversold CPUs, as a high-speed I/O network adapter demands better CPU specs.
  • Memory: GitLab requires between 8 GB and 128 GB of RAM. In addition to the application memory, the operating system caches Git repository files in memory pages.
  • Networking: GitLab requires high-speed I/O during pull and push operations.
    • Without LFS: We’ve observed peak traffic of 5 Gb/s during pipeline builds.
    • With LFS: If LFS storage is proxied or hosted by a GitLab instance, a 10 Gb/s network card is recommended.
    • Local storage: High-performance storage (e.g., cloud SSD disks) is crucial for optimal performance during merge request processing.

Bare Metal Considerations

While bare metal deployments offer potential benefits, our testing indicates that they are not always as efficient as virtualized environments.

Storage Scalability

Managing RAID 5 disks at a low level presents a barrier for non-IDC engineers. Furthermore, the limited disk slots (typically 8) on existing compute nodes, combined with our existing dedicated enterprise storage, make bare metal expansion less attractive.

Memory Utilization

Significant RAM (e.g., 1 TB or 512 GB installed) may be underutilized, as only about 100 GB of RAM (including OS page cache) is consumed by the application.

Network inefficiency

A 25 Gb/s fiber network adapter may be underutilized if peak traffic remains below 5 Gb/s.

In summary, we recommend deploying GitLab on instances with 4 vCPU/32 GB or 8 vCPU/64 GB configurations and large local cloud SSD disks.

GitLab Architecture

GitLab is a Ruby on Rails application built on top of Postgres, Redis, and cloud storage. Compared with Subversion, which focuses solely on repository hosting and requires only an Apache httpd load balancer and a shared disk, GitLab demands more resources.

Component Overview

The diagram below shows the minimum components required for repository hosting.

[Diagram] Minimum GitLab components:
  • Stateful: Gitaly (the FS layer), Sidekiq (Redis-based job queue)
  • Stateless: Rails (the RESTful API), Nginx/Workhorse
  • Database: Postgres/Redis
  • Storage: S3 (for LFS/uploads), disk (EBS/NVMe)

NFS: End of support

Prior to GitLab 15, we implemented GitLab availability using shared NFS, despite occasional latency errors while accessing the filesystem.

Since GitLab 16, Gitaly, the filesystem layer of Git, has officially ended support for shared filesystems, including NFS, Lustre, GlusterFS, multi-attach EBS, and EFS. The only supported storage is block storage, such as Cinder, EBS, or VMDK.
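As a sketch of what this means in practice, the configuration fragment below formats an attached block device and points Gitaly at it in /etc/gitlab/gitlab.rb. The device name and paths are illustrative, and the `gitaly['configuration']` key assumes an Omnibus GitLab 16+ install (older releases used `git_data_dirs`); verify against your version before applying.

```shell
# Format and mount the attached block volume (EBS/Cinder); device name is an example.
sudo mkfs.xfs /dev/nvme1n1
sudo mkdir -p /var/opt/gitlab/git-data
sudo mount /dev/nvme1n1 /var/opt/gitlab/git-data

# Point Gitaly at the block-backed path; Gitaly no longer supports NFS/EFS.
sudo tee -a /etc/gitlab/gitlab.rb <<'EOF'
gitaly['configuration'] = {
  storage: [{ name: 'default', path: '/var/opt/gitlab/git-data/repositories' }],
}
EOF
sudo gitlab-ctl reconfigure
```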

Which components could be disabled?

Due to segregation of duties (SoD) and vulnerability concerns, we opt to keep our GitLab deployment simple. Features such as CI/CD, GitOps, KAS (Kubernetes management), and artifacts are disabled in our self-hosted GitLab. Instead, we rely on other well-known alternatives (JFrog, Jenkins, Gerrit) to reduce the workload on GitLab.
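A hedged sketch of such a trimmed-down setup as a /etc/gitlab/gitlab.rb configuration fragment (Omnibus). These keys exist in recent Omnibus releases, but check each against your GitLab version before applying:

```shell
sudo tee -a /etc/gitlab/gitlab.rb <<'EOF'
gitlab_kas['enable'] = false      # Kubernetes agent server (KAS)
registry['enable'] = false        # container registry (we use JFrog instead)
gitlab_pages['enable'] = false    # static Pages hosting
# Disable CI/CD by default for new projects (builds are handled by Jenkins):
gitlab_rails['gitlab_default_projects_features_builds'] = false
EOF
sudo gitlab-ctl reconfigure
```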

Git high availability hosting solutions

Here are some Git hosting solutions.

|                        | GitLab Enterprise    | Wandisco’s Gerrit             | Hot standby GitLab                |
|------------------------|----------------------|-------------------------------|-----------------------------------|
| License                | Commercial           | Commercial                    | OSS                               |
| Docs                   | GitLab Geo-replica   | Wandisco’s Multisite solution | Here                              |
| Minimum Nodes          | 5                    | 3                             | 2                                 |
| Local Storage Support  | Yes                  | Yes                           | No (centralized storage required) |
| Zone Replication       | Yes (Gitaly Cluster) | Yes (PAXOS)                   | Partial                           |
| Regional Replication   | Yes (GitLab Geo)     | Yes (WAN replication)         | No                                |
| Comments               | Gitaly Cluster requires an additional Postgres | Requires ecosystem migration | Switchover is not automatic |

Implement replication through GitLab Enterprise

You need to purchase enterprise licenses to access these premium services.

  • GitLab Geo (Push-through Proxy) facilitates communication between regional datacenters.
  • Gitaly Cluster (sharding and replication between nodes) leverages GitLab’s Gitaly cluster with an additional Postgres and three dedicated storage nodes; technical support is only available to enterprise users.

It works fine on both bare metal and virtual machines, even with local storage.

Implement with Wandisco’s Gerrit

Wandisco brought PAXOS, a consensus algorithm for distributed systems, to Git in order to replicate repositories across multiple continents. Its performance has been proven over decades of use in their products.

To use Gerrit by Wandisco, consider the following:

  • Licensing fees: you need to purchase a license to receive their support.
  • Ecosystem change: you have to switch from GitLab to Gerrit. This may be acceptable if your team is migrating from Subversion to Git.

Implement centralized storage with the hot standby node

To avoid disruptions to compute nodes, we believe each node should access dedicated storage nodes through fiber connections, such as NetApp/OceanStor hardware or a Cinder virtualization platform.

The diagram shows a hot standby architecture along with disaster tolerance in another region.

  • In Region 1, there are two nodes. The secondary node connects to the primary VM directly through the Gitaly client over gRPC, rather than mounting a disk.
  • For backup in Region 2, storage-level replication is sufficient. Tools like AWS Backup or Restic are acceptable. There’s a minor risk of losing transactional data uploaded while a snapshot is in progress, but using GitLab’s tar backup is more complex.
[Diagram] Hot standby architecture:
  • Region 1: users reach an ELB in front of the primary VM (Rails/Nginx, local Gitaly, Sidekiq) and the secondary VM(s) (Rails/Nginx, Gitaly as client via RPC), backed by Redis (HA), Postgres (HA), a virtualized disk, and S3.
  • Region 2 (disaster tolerance): a backup disk and backup S3, populated by regional backup jobs and periodic snapshots.
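As one possible sketch of the Region 2 backup job, the fragment below uses Restic to push the repository data from the mounted disk to an S3 bucket in the other region. The repository URL, password file, and retention policy are illustrative assumptions, not a prescribed setup:

```shell
# Restic repository in the disaster-tolerance region (URL is a placeholder).
export RESTIC_REPOSITORY="s3:https://s3.region2.example.com/gitlab-backup"
export RESTIC_PASSWORD_FILE=/etc/restic/password

# Back up the repository data living on the block-backed mount.
restic backup /var/opt/gitlab/git-data

# Illustrative retention: keep 7 daily and 4 weekly snapshots, prune the rest.
restic forget --keep-daily 7 --keep-weekly 4 --prune
```

As noted above, files uploaded mid-snapshot may be lost; schedule the job during a low-traffic window to shrink that risk.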

Here is a more detailed explanation:

  • The Sidekiq component, a distributed job executor, is disabled on the secondary node, because we found that when users attempt to retrieve a temporary file generated by a background job, Nginx cannot reliably route the request to the same node.
  • All services should be protected within private subnets.
  • You can even create more secondary machines to minimize the downtime risk.
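A minimal sketch of the secondary node’s /etc/gitlab/gitlab.rb configuration fragment, assuming an Omnibus install: Sidekiq and the local Gitaly are disabled, and repository access is redirected to the primary’s Gitaly over gRPC. The hostname and port are placeholders, and the exact key names (`git_data_dirs` vs. the newer `gitaly` settings) vary by GitLab version:

```shell
sudo tee -a /etc/gitlab/gitlab.rb <<'EOF'
sidekiq['enable'] = false    # background jobs run on the primary only
gitaly['enable'] = false     # no local repository storage on this node
# Reach the primary's Gitaly over gRPC (placeholder hostname):
git_data_dirs({
  'default' => { 'gitaly_address' => 'tcp://primary.gitlab.internal:8075' },
})
EOF
sudo gitlab-ctl reconfigure
```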

Compared with a single-node deployment, when the primary VM is unavailable, the Rails application on the secondary node, which is database-backed, continues to run.

  • The administrator can send broadcast messages to inform users about the outage in their terminals or on web pages.
  • Third-party integration jobs such as group and user synchronization remain functional.
  • Instead of waiting for the infrastructure provider, you can manually switch over from the primary VM to another VM by remounting the disk (GitLab reconfiguration and root access are required).
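The manual switchover might look like the following runbook sketch. The broadcast-message call uses GitLab’s documented REST endpoint; the device name, hostnames, and token are placeholders:

```shell
# 1. Announce the outage via GitLab's broadcast-message API.
curl --request POST --header "PRIVATE-TOKEN: $ADMIN_TOKEN" \
  --data "message=Maintenance: switching to the standby node" \
  "https://gitlab.example.com/api/v4/broadcast_messages"

# 2. Detach the data volume from the failed primary, attach it to the
#    standby (Cinder/EBS specifics differ), then mount it:
sudo mount /dev/nvme1n1 /var/opt/gitlab/git-data

# 3. Re-enable local Gitaly/Sidekiq in /etc/gitlab/gitlab.rb on the
#    standby, then reconfigure and verify:
sudo gitlab-ctl reconfigure
sudo gitlab-ctl status
```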

Product reviews on GitLab

The Weakness

As a platform engineer with extensive hands-on experience across a wide range of DevOps products, I believe that GitLab’s “All-in-one” strategy is a significant design flaw. An overly broad scope leads to immature features.

Competitiveness

  • Task management: Compared with Azure DevOps, GitLab lacks nested item management, time estimation, and advanced dashboards/queries.
  • Git and merge: Compared with Azure DevOps, which offers comprehensive features for free users, especially regarding complex merge policies, GitLab provides only limited functionality. Compared with Wandisco’s Gerrit, GitLab lacks the near-100% uptime of PAXOS-based replication.
  • Packages and registries: Compared with JFrog Artifactory, the virtual repositories and UI seem unfinished, and GitLab lacks the path-based permissions that large enterprises require.
  • CI/CD: I believe the two best commercial CI/CD platforms are Azure DevOps and Codefresh, both of which have GUI-based YAML editors built on Monaco. For free solutions, I prefer Jenkins due to its numerous open-source plugins.

Serviceability

Managing a GitLab instance is complex due to its underlying RPC-based architecture.

  • The structure inside GitLab’s Docker container is disorganized; I have never seen another product use Chef automation on a single node.

  • Unlike Jenkins or SonarQube, there is no LTS version of GitLab for security and bug fixes, so we must keep the instance up-to-date every three months. However, according to GitLab’s version policy, new releases may include new features, and no one can guarantee that new features are free from vulnerabilities. In other words, we are locked into a frequent upgrade cycle.

  • New features (e.g., AI chat, Kubernetes deployment) can remain in an ongoing, experimental state for an extended period; you might have to wait through several iterations.

  • The upgrade requires extensive preparation every three months.

    • It requires thorough testing of any platforms that integrate with the GitLab API, in case of significant API changes.
    • It requires a downtime window for backing up databases and repositories.
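A sketch of the quarterly upgrade window, assuming the Omnibus deb package (`gitlab-ce`); adjust the package manager and backup paths for your distro:

```shell
# Dump database and repositories before touching packages.
sudo gitlab-backup create
# Keep the config and secrets alongside the backup (destination is illustrative).
sudo cp /etc/gitlab/gitlab.rb /etc/gitlab/gitlab-secrets.json /backup/

# Upgrade one supported version step at a time per GitLab's upgrade path.
sudo apt-get update
sudo apt-get install gitlab-ce
sudo gitlab-ctl reconfigure

# Post-upgrade sanity check.
sudo gitlab-rake gitlab:check
```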

Security design

The following issues are unrelated to typical technical vulnerabilities (such as XSS or CSRF); they concern the underlying design of roles and permissions.

  • Ambiguous roles: It’s difficult, if not impossible, to understand the difference between Guest and Reporter without a cheat sheet.
  • Coarse-grained ACL policy: The :admin_project role grants extensive permission checkpoints.
  • Complex group permission hierarchy: Users can be assigned roles multiple times within a group, subgroup, or project. However, it’s unclear which assignment becomes the final merged role inside a repository.

Some design issues have been discussed in another article.

The Goodness

We are still using GitLab rather than Gerrit because we have accumulated a significant investment in the GitLab API, especially around the merge request workflow. Beyond that investment, there are other benefits:

  • Free for self-hosting.
  • Seamless integration with DevOps ecosystems.

Summary

While no solution eliminates all single points of failure, the hot standby solution offers a license-free option with tradeoffs. I hope you find this design useful when deploying GitLab services.