Setting up an active-passive GitLab instance
2024-11-11 / modified at 2025-01-27 / 1.7k words / 10 mins

GitLab is an open-source DevOps platform used in many companies. This guide sets up a hot-standby OSS GitLab instance in a self-hosted environment.

Takeaways

  • To maintain segregation of duties and adhere to a vendor-neutral policy, we opt to use only the repository-hosting feature of GitLab. We disable unnecessary features such as CI/CD and package management.
  • We introduce a new active-passive deployment that does not require NFS.

GitLab is more than a repository hosting platform

GitLab Inc. ($GTLB) is under pressure from stockholders to grow revenue and is striving to become a platform company. It releases new features every month, increasing the complexity of GitLab instances.

As a consequence, managing this all-in-one platform is demanding for platform engineers: they have to learn new components, disable vulnerable features, and maintain the high availability of GitLab instances. This guide helps engineers build hot-standby Git hosting on top of GitLab OSS.

Which machine is best for GitLab self-hosting?

Our self-hosted GitLab instances serve about 3K monthly active users for source code management and CI/CD builds (driven by Jenkins). We continuously monitor GitLab instances on both cloud VMs and bare metal to find performance differences.

Cloud VM Specs

According to our monitoring statistics, GitLab hosting has the following requirements.

  • CPU: GitLab is not a CPU-intensive application; computation peaks only during merge requests. However, avoid oversold CPUs, as a high-speed network adapter requires better CPU performance.

  • Memory: GitLab requires between 8 GB and 128 GB of memory. In addition to the application memory, the operating system may cache Git repository files in memory pages.

  • Networking: GitLab requires high-speed network I/O during code pulls and pushes:

    • Without LFS: When Git LFS is not enabled, or LFS traffic is redirected to third-party S3 storage, we observed 5 Gb/s peak traffic during pipeline builds (a configuration sketch for the S3 offload follows this list).
    • With LFS: If LFS storage is proxied or hosted by the GitLab instance itself, a 10 Gb/s adapter is required.
  • Local storage: GitLab requires high-performance storage (for the database) during merge requests; cloud SSD disks are recommended.
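
For the LFS offload mentioned above, a minimal /etc/gitlab/gitlab.rb sketch is shown below. The bucket name, region, credentials, and S3 endpoint are placeholders, and these storage-specific keys are the classic form; recent GitLab versions prefer the consolidated object storage settings, so check the documentation for your release.

```ruby
# /etc/gitlab/gitlab.rb -- redirect LFS objects to S3-compatible storage
# so the GitLab VM does not proxy LFS traffic itself (placeholders throughout).
gitlab_rails['lfs_enabled'] = true
gitlab_rails['lfs_object_store_enabled'] = true
gitlab_rails['lfs_object_store_proxy_download'] = false   # clients download from S3 directly
gitlab_rails['lfs_object_store_remote_directory'] = 'gitlab-lfs'
gitlab_rails['lfs_object_store_connection'] = {
  'provider'              => 'AWS',
  'region'                => 'us-east-1',
  'aws_access_key_id'     => 'REPLACE_ME',
  'aws_secret_access_key' => 'REPLACE_ME',
  'endpoint'              => 'https://s3.example.internal'  # only for S3-compatible appliances
}
```

Run `gitlab-ctl reconfigure` after editing to apply the change.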

Bare Metal Specs

Besides virtual machines, we also tested deploying instances on bare metal. Running on bare metal was not as efficient as we expected, since the hardware resources were not utilized effectively.

Local storage inefficiency and low density

  • We had to operate and manage RAID 5 arrays at the hardware level.
  • Existing compute nodes have only 8 disk slots and are not designed for storage expansion; furthermore, we had already bought dedicated enterprise storage.

Memory inefficiency

Only about 100 GB of RAM (including OS page cache) was in use, while 512 GB or 1 TB of RAM was installed on each machine.

Network inefficiency

The 25 Gb/s fiber network adapter was wasted, as our peak traffic is only 5 Gb/s.

In summary, I’d recommend deploying GitLab on instances with 4 vCPUs/32 GB or 8 vCPUs/64 GB and large cloud SSD disks.

GitLab Architecture

GitLab is a Ruby on Rails application built on top of Postgres, Redis, and cloud storage. Compared with Subversion, which focuses on repository hosting and only requires an Apache httpd load balancer and a shared disk, GitLab needs far more resources to host repositories.

Component Overview

The diagram shows the minimum components for repository hosting.

[Diagram: GitLab components — Stateful: Gitaly (the FS layer), Sidekiq (Redis-based job queue); Stateless: Rails (the RESTful API), Nginx/Workhorse; Database: Postgres/Redis; S3 for LFS/uploads; Disk (EBS/NVMe)]
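
As a rough illustration of the stateless/stateful split above, the Rails layer is pointed at external (HA) Postgres and Redis instead of the bundled services. The sketch below uses placeholder hostnames and credentials; exact keys can vary between GitLab versions.

```ruby
# /etc/gitlab/gitlab.rb -- use external (HA) Postgres and Redis instead of the bundled ones
postgresql['enable'] = false
gitlab_rails['db_host']     = 'pg.example.internal'    # placeholder hostname
gitlab_rails['db_port']     = 5432
gitlab_rails['db_username'] = 'gitlab'
gitlab_rails['db_password'] = 'REPLACE_ME'

redis['enable'] = false
gitlab_rails['redis_host']     = 'redis.example.internal'  # placeholder hostname
gitlab_rails['redis_port']     = 6379
gitlab_rails['redis_password'] = 'REPLACE_ME'
```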

NFS: End of support

Before GitLab 15, we implemented GitLab availability via shared NFS, despite occasional latency errors when accessing the filesystem.

Since GitLab 16, Gitaly, the filesystem layer of Git, has officially ended support for shared filesystems, including NFS, Lustre, GlusterFS, multi-attach EBS, and EFS. The only supported storage is block storage, such as Cinder volumes, EBS, or VMDK.
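
In practice this means the repository storage path simply lives on a block device attached to the VM. Below is a minimal sketch, assuming a volume mounted at the hypothetical path /mnt/gitaly; newer releases express the same setting under gitaly['configuration'] instead of git_data_dirs.

```ruby
# /etc/gitlab/gitlab.rb -- keep repositories on an attached block device
# (EBS/Cinder/VMDK volume) mounted at a placeholder path, not on NFS.
git_data_dirs({
  'default' => { 'path' => '/mnt/gitaly/git-data' }
})
```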

Which components could be disabled?

Due to segregation of duties (SoD) and vulnerability concerns, we opt to keep GitLab simple. Features such as CI/CD, GitOps, KAS (Kubernetes management), and artifacts are disabled in our self-hosted GitLab, as sketched below. Instead, we opt for other well-known alternatives (JFrog, Jenkins, Gerrit) to reduce the workload on GitLab.
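
The sketch below shows what this looks like in /etc/gitlab/gitlab.rb for the Omnibus package. Key names can differ between GitLab versions, so treat it as a starting point rather than a definitive list.

```ruby
# /etc/gitlab/gitlab.rb -- trim GitLab down to repository hosting
gitlab_rails['gitlab_default_projects_features_builds'] = false  # CI/CD off by default for new projects
gitlab_rails['packages_enabled'] = false           # package registry
gitlab_rails['dependency_proxy_enabled'] = false   # dependency proxy
gitlab_rails['registry_enabled'] = false           # container registry (Rails side)
registry['enable'] = false                         # bundled registry service
gitlab_kas['enable'] = false                       # KAS / Kubernetes agent server
gitlab_pages['enable'] = false                     # GitLab Pages
```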

Git high availability hosting solutions

Here are some Git hosting solutions.

|                        | GitLab Enterprise                              | Wandisco’s Gerrit             | Hot standby GitLab                |
|------------------------|------------------------------------------------|-------------------------------|-----------------------------------|
| License                | Commercial                                     | Commercial                    | OSS                               |
| Docs                   | GitLab Geo-replica                             | Wandisco’s Multisite solution | This article                      |
| Minimum nodes          | 5                                              | 3                             | 2                                 |
| Local storage support  | Yes                                            | Yes                           | No (centralized storage required) |
| High availability      | Yes (Gitaly cluster)                           | Yes (PAXOS)                   | Partial                           |
| Disaster tolerance     | Yes (GitLab Geo)                               | Yes (WAN replication)         | No                                |
| Comments               | Gitaly cluster requires an additional Postgres | Requires ecosystem migration  | Switchover is not automatic       |

Implement replication through GitLab Enterprise

You need to pay for enterprise licenses to get the premium features:

  • GitLab Geo (a push-through proxy) that replicates between datacenters.
  • Gitaly Cluster (sharding and replication between nodes), which requires an additional Postgres and three dedicated storage nodes; technical support is available only to enterprise users.

It works fine on both bare metal and virtual machines, even with local storage.

Implement with Wandisco’s Gerrit

Wandisco introduced PAXOS, a consensus algorithm for distributed systems, to Git in order to replicate repositories across multiple continents. The performance has been proven in their products for decades.

To use Gerrit by Wandisco, the following should be taken into account:

  • Licensing fees: you need to buy a license to get support.
  • Ecosystem change: you have to switch the ecosystem from GitLab to Gerrit. I believe it’s acceptable to use Gerrit if your team is switching from Subversion to Git anyway.

Implement centralized storage with the hot standby node

To avoid disruptions on compute nodes, we believe each node should never rely on its local storage; instead, it should access dedicated storage nodes over fiber connections, such as NetApp/OceanStor hardware or a Cinder virtualization platform.

The diagram shows the hot-standby architecture: the secondary VM connects directly to the primary VM’s Gitaly through gRPC, rather than attaching a disk.

[Diagram: Region 1 — users reach an ELB in front of the primary VM (Rails/Nginx, local Gitaly, Sidekiq, virtualized disk) and secondary VM(s) (Rails/Nginx, Gitaly as client via RPC), both backed by HA Redis and HA Postgres; Region 2 (disaster tolerance) — a backup disk fed by periodic snapshots]

Here is an explanation:

  • Sidekiq, the distributed job executor, is disabled on the secondary node: when users retrieve a temporary file generated by a background job, Nginx cannot guarantee the request is routed to the node that produced it (see the configuration sketch after this list).
  • All services should be protected inside private subnets.
  • You can even create more secondary machines to minimize the downtime risk.
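
Here is a sketch of the secondary VM’s /etc/gitlab/gitlab.rb, under these assumptions: the primary’s Gitaly listens on TCP port 8075 inside the private subnet, the hostname and token are placeholders, and external Postgres/Redis are configured as in the earlier sketch. Newer GitLab versions nest the Gitaly server settings under gitaly['configuration'].

```ruby
# /etc/gitlab/gitlab.rb on the secondary VM -- run Rails/Nginx only and
# reach the primary's Gitaly over gRPC instead of a local disk.
gitaly['enable']  = false   # no local Gitaly; repositories live on the primary
sidekiq['enable'] = false   # background jobs run on the primary only
git_data_dirs({
  'default' => { 'gitaly_address' => 'tcp://gitlab-primary.internal:8075' }  # placeholder host/port
})
gitlab_rails['gitaly_token'] = 'REPLACE_WITH_SHARED_TOKEN'

# On the primary, Gitaly must listen on the private network with the same token:
#   gitaly['listen_addr'] = '0.0.0.0:8075'
#   gitaly['auth_token']  = 'REPLACE_WITH_SHARED_TOKEN'
```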

Compared with a single-node deployment, when the primary VM goes down, the Rails application on the secondary node, which is database-backed, keeps running even though the code itself becomes unavailable.

  • The administrator can send broadcast messages so users see what just happened in their terminals or on web pages.
  • Third-party integration jobs such as group and user synchronization remain functional.
  • Rather than waiting for the infrastructure provider, you can manually switch the primary role over to another VM by remounting the disk (GitLab reconfiguration and root access are required); a sketch follows.
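
Below is a sketch of the manual switchover on the promoted secondary, assuming the primary’s virtualized disk can be detached and remounted at the hypothetical path /mnt/gitaly; this is the manual procedure described above, not an official GitLab failover mechanism.

```ruby
# /etc/gitlab/gitlab.rb on the promoted secondary -- after remounting the
# primary's disk at the placeholder mount point /mnt/gitaly.
gitaly['enable']  = true    # serve repositories locally again
sidekiq['enable'] = true    # take over background jobs
git_data_dirs({
  'default' => { 'path' => '/mnt/gitaly/git-data' }
})
# Then run `gitlab-ctl reconfigure` as root and point the load balancer at this VM.
```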

Product reviews on GitLab

The Weakness

As a platform engineer with hands-on experience across a wide range of DevOps products, I believe that the “All-in-one” strategy is the worst choice GitLab Inc. has ever made, as an overly broad scope leads to immature features.

Competitiveness

  • Task management: Compared with Azure DevOps, GitLab lacks nested item management, time estimation, and advanced dashboards/queries.
  • Git and merge: Azure DevOps offers its full feature set, especially complex merge policies, to free users, while GitLab provides only limited features. Compared with Wandisco’s Gerrit, GitLab also lacks the 100% uptime that PAXOS-based replication provides.
  • Packages and registries: Compared with JFrog Artifactory, GitLab’s virtual repositories and UI seem unfinished, and path-based permissions, which large enterprises require, are missing.
  • CI/CD: I believe the two best commercial CI/CD platforms are Azure DevOps and Codefresh, both of which have a GUI-based YAML editor built on Monaco, while GitLab users must edit scripts manually. As for free solutions, I prefer Jenkins because of its numerous open-source plugins.

Serviceability

To manage the GitLab instance, I studied the underlying RPC-based architecture and found it complicated.

  • The structure inside GitLab’s Docker container is disorganized; I have never seen another product run Chef automation on a single node.

  • Unlike Jenkins or SonarQube, GitLab has no LTS version that receives only security and bug fixes, so we have to bring the instance up to date every three months. Moreover, according to GitLab’s version policy, new releases can include new features, and, as we all know, no one can ensure new features are free from vulnerabilities. That is to say, we are locked into frequent upgrade cycles.

  • New features can remain in an ongoing, unfinished state (e.g., AI chat, Kubernetes deployment) for a long period, so you might have to wait for further iterations.

  • The upgrade requires extensive preparations every three months.

    • It requires full testing of every platform that integrates with the GitLab API, in case of significant API changes.
    • It requires a downtime window for backing up the databases and repositories.

Security design

The following issues are not typical technical vulnerabilities (such as XSS or CSRF); they concern the underlying design of roles and permissions.

  • Ambiguous roles: It is nearly impossible to tell the difference between Guest and Reporter without a cheat sheet.
  • Coarse-grained ACL policy: The capability :admin_project covers too many checkpoints.
  • Complex group permission hierarchy: A user can be assigned roles multiple times across a group, subgroup, and project, yet no one knows which role ends up as the final merged role inside a repository.

Some of these design issues have been discussed in another article.

The Goodness

We still use GitLab rather than Gerrit because we have accumulated a large investment in the GitLab API, especially around the merge request workflow. Besides that investment, there are other good points:

  • Free for self-hosting.
  • Seamless integration with DevOps ecosystems.

Summary

While no solution eliminates all single points of failure, the hot-standby approach offers a license-free option with tradeoffs. I hope you find the design useful when deploying GitLab services.