I’ve been running Kubernetes in production for years now, and there’s a specific kind of pain that only hits you once you cross the threshold from “a couple of clusters” to “wait, how many do we have again?” That threshold, for me, was eight clusters. Eight clusters across three cloud providers and two on-prem data centers. And every single one of them had drifted into its own little snowflake.

This isn’t a theoretical post. I’m going to walk through how I used Fleet and Rancher to wrangle that mess back into something manageable, and why I think GitOps-driven multi-cluster management is the only sane approach once you’re past three or four clusters.


The Drift Problem Nobody Warns You About

Here’s how it starts. You spin up your first cluster, get everything dialed in perfectly. Namespace policies, RBAC rules, network policies, monitoring agents — all configured by hand with love and care. Then you need a second cluster. You copy the manifests over, tweak a few things for the new environment. No big deal.

By cluster number four, someone on the team has made a “quick fix” directly on cluster three that never made it back to the repo. By cluster six, you’ve got two different versions of your ingress controller running across the fleet. By cluster eight, I couldn’t tell you with confidence what was actually running where.

I’d already been using ArgoCD for GitOps on individual clusters, and it worked brilliantly for single-cluster deployments. But ArgoCD at the time wasn’t built to think in terms of fleet-wide operations. I needed something that understood the concept of “apply this to all production clusters” or “roll this change out to staging first, then prod.”

That’s where Fleet and Rancher entered the picture.


Why Rancher and Fleet, Not Just More ArgoCD

I want to be upfront about this choice. ArgoCD is fantastic. I still use it. But when you’re managing multiple clusters, you need a layer above the individual cluster GitOps tooling. Rancher gives you that unified control plane — a single pane of glass where you can see every cluster, its health, its workloads, and its configuration. Fleet, which ships as part of Rancher, is the GitOps engine that handles multi-cluster delivery.

The key difference is that Fleet thinks in terms of cluster groups and bundles. You define what should run where using labels and selectors, not by maintaining separate config per cluster. That’s a fundamental shift in how you think about multi-cluster config.

I’d tried the “one ArgoCD app per cluster” approach. It works until it doesn’t. The moment you need to roll out a policy change across eight clusters, you’re editing eight different Application manifests. With Fleet, you edit one GitRepo resource and let the label selectors do the work.


Setting Up Rancher as Your Control Plane

I won’t rehash the full Rancher installation — their docs cover that well. What I will share is the architecture that worked for me. I run Rancher on a dedicated management cluster that does nothing else. It’s a small three-node RKE2 cluster. Don’t run your workloads on the same cluster as Rancher. I learned that the hard way when a misbehaving workload took down the management plane and I lost visibility into everything simultaneously.

Once Rancher is up, you import your existing clusters. For each cluster, Rancher deploys a lightweight agent that phones home. The import process is straightforward — Rancher gives you a kubectl command to run on each downstream cluster, and within a minute or two, it shows up in the dashboard.

The first thing I did after importing all eight clusters was establish a labeling convention. This is critical and I can’t stress it enough. Every cluster got labels for environment, region, provider, and tier:

labels:
  env: production
  region: eu-west-1
  provider: aws
  tier: platform

These labels become the foundation for everything Fleet does. Get them wrong or inconsistent, and you’ll be fighting the tooling instead of benefiting from it.
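
To see where these labels actually live: Rancher creates a Fleet Cluster object in the fleet-default namespace for each imported cluster, and those objects are what the selectors later match against. A sketch of what one looks like (the cluster name prod-eu-1 is illustrative):

apiVersion: fleet.cattle.io/v1alpha1
kind: Cluster
metadata:
  name: prod-eu-1
  namespace: fleet-default
  labels:
    env: production
    region: eu-west-1
    provider: aws
    tier: platform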


Fleet GitOps: Bundles, GitRepos, and Cluster Groups

Fleet’s mental model is simple once it clicks. You have a Git repository containing your Kubernetes manifests. You create a GitRepo resource that tells Fleet where that repo is and which clusters should receive its contents. Fleet handles the rest — cloning, rendering, deploying, and monitoring drift.

Here’s a real GitRepo resource from my setup that deploys monitoring agents to every production cluster:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: monitoring-stack
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/fleet-monitoring
  branch: main
  paths:
    - /prometheus
    - /grafana-agent
  targets:
    - clusterSelector:
        matchLabels:
          env: production

That’s it. Every cluster labeled env: production gets the Prometheus and Grafana agent configs from that repo. When I add a ninth production cluster and label it correctly, it automatically gets the monitoring stack. No manual intervention, no forgetting to update a list somewhere.

But here’s where it gets powerful. You can use targetCustomizations to overlay cluster-specific Helm values without maintaining separate directories. One subtlety: these customizations live in a fleet.yaml file checked into the repo next to the manifests, not on the GitRepo resource itself — the GitRepo only selects which clusters receive the bundle:

# fleet.yaml, checked into /nginx-ingress in the fleet-ingress repo
defaultNamespace: ingress-nginx
helm:
  values:
    service:
      type: LoadBalancer
targetCustomizations:
  - name: aws-clusters
    clusterSelector:
      matchLabels:
        provider: aws
    helm:
      values:
        service:
          annotations:
            service.beta.kubernetes.io/aws-load-balancer-type: nlb
  - name: onprem-clusters
    clusterSelector:
      matchLabels:
        provider: onprem
    helm:
      values:
        service:
          type: NodePort
          nodePorts:
            http: 30080
            https: 30443
This was the moment I realized Fleet was solving a problem I’d been hacking around for years. Same ingress controller, same base config, but with provider-specific customizations declared in one place. I’d previously been maintaining separate Helm values files per cluster and it was a nightmare to keep in sync.


Cluster Groups and Staged Rollouts

One of the patterns I’ve come to rely on heavily is staged rollouts using cluster groups. I don’t want a config change hitting all eight clusters simultaneously. That’s how you turn a small mistake into a fleet-wide outage.

I set up cluster groups that mirror my deployment stages:

apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: canary
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      rollout-stage: canary
---
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: production-wave-1
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      rollout-stage: wave-1
---
apiVersion: fleet.cattle.io/v1alpha1
kind: ClusterGroup
metadata:
  name: production-wave-2
  namespace: fleet-default
spec:
  selector:
    matchLabels:
      rollout-stage: wave-2

My canary cluster is a small production cluster that gets changes first. If nothing breaks after a defined soak period, wave-1 picks it up, then wave-2. This is the same progressive delivery concept from multi-environment GitOps, but applied at the cluster infrastructure level rather than the application level.
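
These groups are referenced by name from a GitRepo’s targets. A minimal sketch, assuming a hypothetical fleet-platform repo; promotion then means extending the targets list (or relabeling clusters into the next wave) once the canary has soaked:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: platform-config
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/fleet-platform
  branch: main
  targets:
    - name: canary
      clusterGroup: canary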


Policy Enforcement Across the Fleet

The drift problem I mentioned earlier wasn’t just about workloads. It was about policies. Cluster three had a relaxed pod security policy because someone needed to debug something six months ago and never reverted it. Cluster five was missing network policies entirely in two namespaces. This is the kind of thing that keeps you up at night.

With Fleet, I created a dedicated policy repo that gets applied to every cluster, no exceptions:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: cluster-policies
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/fleet-policies
  branch: main
  paths:
    - /network-policies
    - /pod-security
    - /resource-quotas
  targets:
    - clusterSelector:
        matchExpressions:
          - key: env
            operator: Exists

That matchExpressions with Exists means every cluster with an env label gets these policies. No opt-out. The repo contains baseline network policies that deny all ingress by default, pod security standards that prevent privileged containers, and resource quotas per namespace.
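
To make the baseline concrete, the deny-all-ingress policy in that repo is just a standard NetworkPolicy with an empty pod selector. An illustrative version, not the exact file from my repo:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
    - Ingress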

The beauty is that Fleet continuously reconciles. If someone manually deletes a network policy on a cluster, Fleet puts it back within minutes. That’s the drift correction I’d been missing. I wrote about namespace isolation patterns before, but enforcing those patterns consistently across a fleet was always the hard part. Fleet makes it automatic.


RBAC at Scale: Rancher’s Secret Weapon

Managing RBAC across multiple clusters was one of my biggest headaches before Rancher. Each cluster had its own set of ClusterRoleBindings, and keeping them synchronized was a manual process that nobody enjoyed.

Rancher has a concept of global roles and project-level permissions that propagate across clusters. When I onboard a new team, I create their permissions once in Rancher, assign them to the relevant clusters, and it’s done. No kubectl commands on eight different clusters. No YAML files to copy around.

But I still use Fleet for the RBAC resources that need to be version-controlled and auditable. Things like service account bindings for CI/CD pipelines, or custom roles for specific workload types. Those live in Git and get deployed via Fleet just like everything else.
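
As an example of what falls into that category, a CI/CD service account binding might look like this (all names here are illustrative):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ci-deployer
  namespace: ci
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ci-deployer-edit
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit
subjects:
  - kind: ServiceAccount
    name: ci-deployer
    namespace: ci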

The combination works well: Rancher handles the human-facing RBAC (who can access what), and Fleet handles the machine-facing RBAC (what service accounts exist with what permissions). Trying to do everything through one mechanism always felt like forcing a square peg into a round hole.


Operators and Custom Controllers in a Multi-Cluster World

I’ve written about building Kubernetes operators before, and multi-cluster management adds an interesting wrinkle. You need your operators deployed consistently, but you also need to think about which clusters actually need which operators.

Not every cluster needs every operator. My platform clusters run the full suite — cert-manager, external-dns, the works. My edge clusters run a stripped-down set. Since paths is a repo-level field on GitRepo (you can’t vary it per target), two GitRepo resources pointing at different paths of the same repo handle this cleanly:

apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: operators-platform
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/fleet-operators
  branch: main
  paths:
    - /cert-manager
    - /external-dns
    - /sealed-secrets
    - /kyverno
  targets:
    - clusterSelector:
        matchLabels:
          tier: platform
---
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: operators-edge
  namespace: fleet-default
spec:
  repo: https://github.com/myorg/fleet-operators
  branch: main
  paths:
    - /cert-manager
    - /sealed-secrets
  targets:
    - clusterSelector:
        matchLabels:
          tier: edge
Same repo, different subsets deployed to different cluster tiers. When I need to upgrade cert-manager, I update it once in the repo and it rolls out everywhere that uses it.


When Things Go Wrong: Debugging Fleet Deployments

Fleet isn’t magic, and I’ve had my share of head-scratching moments. The most common issue is bundles stuck in a “Not Ready” state. Nine times out of ten, it’s a YAML syntax error or a Helm values mismatch. Fleet’s status reporting has gotten better over time, but you still need to dig into the bundle’s status conditions to find the actual error.

The command I run most often when debugging:

kubectl -n fleet-default get bundles -o wide
kubectl -n fleet-default describe bundle <bundle-name>

The describe output shows you per-cluster deployment status, which is invaluable when a change works on seven clusters but fails on one. Usually it’s a cluster-specific issue — a missing CRD, a different Kubernetes version, or a node that’s run out of resources.

One gotcha that bit me hard: Fleet’s default sync interval is 15 minutes. When you’re actively developing and testing, that feels like an eternity. You can tune it per GitRepo:

spec:
  pollingInterval: 30s

Just remember to set it back to something reasonable for production. Polling every 30 seconds across eight clusters generates meaningful API load on your Git server.


The Repo Structure That Actually Works

After several iterations, I settled on a repo structure that separates concerns cleanly. I have three Fleet repos:

  1. fleet-platform — cluster-level infrastructure: CNI config, storage classes, monitoring, logging
  2. fleet-policies — security policies, resource quotas, network policies, pod security
  3. fleet-apps — application workloads and their supporting resources

Each repo has its own GitRepo resource in Fleet with different target selectors. Platform and policy repos target all clusters. The apps repo uses more granular targeting based on what each cluster is supposed to run.

I tried a monorepo approach first. It works for small setups, but once you have multiple teams contributing, the merge conflicts and review bottlenecks become painful. Separate repos with clear ownership boundaries scaled much better for us.
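
For orientation, each repo follows Fleet’s one-directory-per-bundle convention, with a fleet.yaml at the root of each bundle directory. An illustrative sketch of the platform repo, not my exact tree:

fleet-platform/
├── monitoring/
│   ├── fleet.yaml
│   └── manifests…
├── logging/
│   ├── fleet.yaml
│   └── manifests…
└── storage-classes/
    ├── fleet.yaml
    └── manifests…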


What I’d Do Differently

If I were starting from scratch today, I’d establish the labeling convention before importing a single cluster. I spent a solid week relabeling clusters and updating selectors because my initial labels were inconsistent. environment vs env, cloud vs provider — these inconsistencies cascade through every Fleet target selector.

I’d also invest in a fleet-wide dashboard from day one. Rancher’s built-in monitoring gives you per-cluster views, but I wanted a single Grafana dashboard showing the health of all clusters side by side. Building that after the fact meant retrofitting metric labels and datasources across the fleet. Not fun.

And I’d set up Fleet’s drift detection alerts immediately. Fleet corrects drift automatically, which is great, but you also want to know when drift is happening. Frequent drift on a specific cluster usually means someone is making manual changes, and that’s a process problem you need to address, not just a technical one.


Is It Worth the Complexity?

Absolutely, but with a caveat. If you’re running two or three clusters, Rancher and Fleet might be overkill. You can get by with separate ArgoCD instances and some discipline. But once you cross that threshold — and you’ll know when you do because you’ll start losing track of what’s where — a proper multi-cluster management layer pays for itself almost immediately.

My eight clusters now behave as a coherent fleet. Policy changes propagate in minutes. New clusters come online with the full stack pre-configured. Drift gets corrected automatically. And I can actually sleep at night knowing that the security policies I defined are running everywhere, not just on the clusters I remembered to update.

The combination of Rancher for visibility and cluster lifecycle management, plus Fleet for GitOps-driven configuration delivery, has been the most impactful infrastructure investment I’ve made in the last two years. It’s not perfect — the learning curve is real, and Fleet’s documentation could be better — but the alternative of managing each cluster as an island simply doesn’t scale.

If you’re feeling that multi-cluster pain, start with Rancher on a dedicated management cluster, import your existing clusters, establish your labels, and let Fleet bring order to the chaos. Your future self will thank you.