k3s HA vs Single Node: Best Architecture for Your Homelab?

June 4, 2026

Some of the links in this post may be affiliate links. If you click through and make a purchase, I may earn a commission at no extra cost to you.

You want Kubernetes at home. Maybe you've watched some videos, read some posts, and now you're staring at three browser tabs of mini PCs on Amazon wondering if you really need all of them.

Good. That's the right question.

Most "get started with Kubernetes" content skips the question entirely. It jumps straight to curl -sfL https://get.k3s.io | sh - without answering the thing you're actually asking: What am I building, why am I building it, and what happens when it breaks?

You can't choose an architecture until you know what you need it to survive.

By the end of this post, you'll know which k3s architecture fits your situation — single node, three nodes with embedded etcd, or three nodes with kube-vip for full HA. You'll know what each one costs, what it survives, and whether you even need Kubernetes in the first place. No commands. No YAML. This is the decision post. The building starts in the next one.

Scope: We're comparing k3s architectures on real hardware — mini PCs running Ubuntu Server. Everything here is backed by a live cluster at github.com/Taegost/homelab-k8s with 18+ running applications. I'm not teaching theory. I'm showing you what I built, what broke, and what I'd do differently.

Key Takeaways

  • k3s runs on three architectures: single node (cheap, fragile, perfect for learning), 3-node embedded etcd (data replicated, but kubectl breaks when the one node your kubeconfig points at dies), and 3-node + kube-vip full HA (any single node can die and everything keeps working, including kubectl)
  • A 2-node embedded etcd cluster offers ZERO control plane fault tolerance — 2/2 quorum means both nodes must be available. Never start with two nodes. Start with one (learning) or three (production).
  • Embedded etcd is the correct default for homelab — external databases create a bootstrap paradox where k3s needs Postgres but Postgres runs on k3s. Embedded etcd avoids this completely.
  • etcd is brutally latency-sensitive — sub-10ms disk fsync writes required — but even the bundled NVMe in a $170 mini PC handles this comfortably. The real risk isn't drive speed; it's stacking too many write-heavy workloads on the same physical disk as etcd.
  • If you're running five Docker containers with no uptime requirement, Kubernetes might not be right for you. There's an entire section on when to walk away. If that's you, that's a win — you just saved yourself months of operational complexity.


The Three k3s Architectures — What They Are and What They Cost

k3s is the most-downloaded Kubernetes distribution. In 2026, it reached 33,100+ GitHub stars (GitHub, May 2026). In 2024, 2,200 U.S. Home Depot stores ran production edge workloads on k3s (SUSE, 2024). This isn't a toy distribution. It's a production-grade platform that happens to fit on a $170 mini PC.

There are three ways to run k3s in a homelab, and they exist on a spectrum from "cheap and fragile" to "resilient and complex." Single node: one machine, everything on it, ~$170. Three nodes with embedded etcd: data replicated across nodes, control plane HA on paper but no stable endpoint for kubectl, ~$510. Three nodes with kube-vip full HA: any node can die and kubectl still works, ~$510 — same hardware, smarter configuration. The hardware for options 2 and 3 is identical. The difference is one extra component and the decisions you make at install time.

As an example, my own cluster currently runs on three BeeLink S12 Pro mini PCs (Intel N100, 16GB RAM, 512GB NVMe).

Architecture 1 — Single Node

One machine. k3s runs the control plane, the worker, and all your workloads on it. If it goes down, everything goes down. No etcd replication. No node redundancy.

Best for: Learning Kubernetes, running non-critical services where downtime is an inconvenience not a disaster, or experimenting before committing to a multi-node build.

Hardware: One BeeLink S12 Pro or equivalent (Intel N100, 16GB RAM, 512GB NVMe — same hardware listed above). You can also use an old laptop, a single VM on Proxmox, or whatever x86 machine you have lying around.

What you get: A fully functional Kubernetes cluster. kubectl works. You can deploy apps, experiment with operators, and learn the API. Everything the multi-node architectures can do, a single node can do — it just can't survive the node going down.

What you don't get: Any form of fault tolerance. Node goes down, cluster goes down. Power supply dies, cluster dies. You accidentally sudo reboot at the wrong time, your family asks why Plex is broken.

Architecture 2 — Three Nodes, Embedded etcd (Data HA, Fragile Ops)

Three control plane nodes with embedded etcd. Cluster state is replicated across all three — if one node dies, the data survives on the remaining two. BUT: without kube-vip, kubectl points at ONE node's IP address. That node goes down, kubectl stops working. You can't deploy, can't scale, can't troubleshoot until you manually update your kubeconfig to point at a surviving node.

Workloads keep running — the kubelet on each node doesn't need the API server to keep pods alive. Your apps stay up. You just can't manage them until you fix your kubeconfig.

This is the architecture most "k3s HA" tutorials produce. It's HA for your data, not for your operations.

Hardware: Three BeeLink S12 Pro mini PCs (or three VMs on separate physical hosts if you care about physical fault isolation — more on that later). ~$510 total ($170 × 3). Or three VMs on a single Proxmox host if you're learning — the install commands are identical.

Architecture 3 — Three Nodes, kube-vip + Embedded etcd (Full HA)

Same as Architecture 2, plus kube-vip providing a virtual IP (VIP) that floats between control plane nodes. kubectl always connects to the VIP, and whichever node currently holds the VIP answers the request. One node goes down, the VIP moves to another node, kubectl still works, workloads still run, new deployments still succeed. This is production HA on a homelab budget.

Hardware: Identical to Architecture 2 — three mini PCs or VMs, ~$510. The difference is configuration, not cost. kube-vip is open source and runs as a DaemonSet inside the cluster. No extra hardware required.
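
If the difference between Architectures 2 and 3 still feels abstract, here is a toy Python model of it. This is only a sketch of the idea, not how kube-vip works internally: the node names, the IP addresses, and the `api_reachable` helper are all made up for illustration.

```python
# Toy model: can kubectl still reach a working API server after a failure?
# Node names and addresses below are hypothetical examples.
NODES = {
    "node1": "192.168.1.11",
    "node2": "192.168.1.12",
    "node3": "192.168.1.13",
}
VIP = "192.168.1.10"  # hypothetical floating address managed by kube-vip


def has_quorum(alive: set) -> bool:
    """etcd needs a majority of members to accept any request."""
    return len(alive) >= len(NODES) // 2 + 1


def api_reachable(kubeconfig_server: str, alive: set) -> bool:
    """Return True if the address in kubeconfig leads to a usable API server."""
    if not has_quorum(alive):
        return False  # no majority: the API cannot serve requests at all
    if kubeconfig_server == VIP:
        return True   # the VIP moves to one of the surviving nodes
    # A fixed node IP only works while that specific node is up.
    return any(NODES[name] == kubeconfig_server for name in alive)


survivors = {"node2", "node3"}  # node1 has just died

print(api_reachable(NODES["node1"], survivors))  # Architecture 2 -> False
print(api_reachable(VIP, survivors))             # Architecture 3 -> True
```

Same hardware, same survivors. The only thing that changed is the address kubectl dials.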

| Architecture | Nodes | Hardware Cost | Data Redundancy | Survives Node Failure? | kubectl Survives? | Complexity | Best For |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Single Node | 1 | ~$170 | None | No | No | Low | Learning, experimenting |
| Embedded etcd (3-node) | 3 | ~$510 | Yes (etcd quorum) | Workloads yes, kubectl no | No | Medium | HA-curious with tolerance for manual failover |
| kube-vip Full HA (3-node) | 3 | ~$510 | Yes (etcd quorum) | Yes | Yes | Medium-High | Production homelab: the architecture this series builds |

Architectural Decisions Nobody Explains — etcd, Colocation, Latency

Three decisions shape your k3s architecture more than any others, and they're the three decisions most guides skip entirely. First: embedded etcd vs external database — the bootstrap paradox makes external DB a trap for homelab. Second: colocating control plane and workloads on the same three nodes — it's the k3s default, it works, but the tradeoffs are real. Third: etcd is brutally latency-sensitive — a disk write stall on one node can destabilize your entire cluster — and your cheap NVMe drive handles it fine IF you know what to watch for.

Embedded etcd vs External Database — The Bootstrap Paradox

k3s supports external datastores — PostgreSQL, MySQL, or an existing etcd cluster — instead of embedded etcd. For enterprise, this makes sense. You have a dedicated database team. HA is their problem. You point k3s at their HA Postgres cluster and move on.

For homelab, it's a trap. Here's why.

If k3s depends on an external Postgres database, and if that Postgres runs ON k3s (via CloudNativePG, which we'll deploy in Section 5), you create a circular dependency: k3s needs Postgres to start, and Postgres needs k3s to run. Neither can boot without the other. This is the bootstrap paradox, and it's exactly the kind of architecture trap you only recognize after you've fallen into it. Or the database is hosted somewhere else entirely, and now you're managing Kubernetes AND another setup.

Embedded etcd avoids this entirely — etcd is self-contained inside k3s, no external dependencies, no circular startup. It's what k3s was designed for. It's one less database to manage. And it's the path every other post in this series assumes you took.

The ONE valid homelab use case for an external database: you already run a dedicated HA Postgres instance on separate hardware — not on k3s, not in a VM on the same Proxmox host — and you want to reduce etcd write load. For everyone else: embedded etcd is the correct default.
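
The bootstrap paradox is just a cycle in a dependency graph, and you can check for one in a few lines. The sketch below uses Python's standard `graphlib` module; the `boot_order` helper and the two dictionaries are illustrative names I made up, not anything k3s ships.

```python
from graphlib import CycleError, TopologicalSorter


def boot_order(depends_on):
    """Return a valid start order, or None if the dependencies are circular."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


# External datastore hosted on the cluster it serves: each needs the other.
external_db_on_cluster = {"k3s": {"postgres"}, "postgres": {"k3s"}}

# Embedded etcd lives inside k3s, so k3s depends on nothing else.
# Postgres is just another workload that starts after the cluster is up.
embedded_etcd = {"k3s": set(), "postgres": {"k3s"}}

print(boot_order(external_db_on_cluster))  # None
print(boot_order(embedded_etcd))           # ['k3s', 'postgres']
```

If a start order doesn't exist on paper, it won't exist after a power outage either.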

Why 3 Nodes, All Control Plane + Worker

k3s default behavior: every server node is both control plane AND worker — no taints by default. This is different from upstream Kubernetes best practice, where you dedicate control plane nodes and taint them to prevent workload scheduling.

For homelab, the math is simple: you have three nodes. You cannot afford to waste any of them running ONLY the control plane.

The k3s docs (2025) put the control plane overhead of an idle k3s server node at roughly 1–1.5GB RAM and 5–10% CPU for the full plane (etcd + API server + controller manager + scheduler). The rest is available for your workloads. My cluster runs 18+ applications on three nodes with colocated control plane and workloads, with no issues once I started paying attention to IO utilization (more on that later).

The tradeoff: if you run a CPU or memory-hungry workload without resource limits on a control plane node, it CAN starve etcd of resources and destabilize the cluster. Mitigation is straightforward: always set resource limits on workload pods. Never run unchecked batch processing on control plane nodes. For most homelab workloads — Plex, Home Assistant, Mealie, n8n — the resource footprint is negligible. The control plane co-exists peacefully with them.
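
To put rough numbers on "the rest is available for your workloads", here is the back-of-the-envelope arithmetic as a Python sketch. It assumes the 16GB nodes from this post and the 1–1.5GB overhead range quoted above, and it ignores what the OS itself uses, so treat the result as an upper bound.

```python
NODES = 3
RAM_PER_NODE_GB = 16        # per-node RAM of the mini PCs in this post
OVERHEAD_GB = (1.0, 1.5)    # control plane overhead range quoted above


def workload_ram_gb(nodes, ram_per_node_gb, overhead_gb):
    """RAM left across the cluster after each node pays the control plane tax."""
    return nodes * (ram_per_node_gb - overhead_gb)


best_case = workload_ram_gb(NODES, RAM_PER_NODE_GB, OVERHEAD_GB[0])
worst_case = workload_ram_gb(NODES, RAM_PER_NODE_GB, OVERHEAD_GB[1])

print(f"{worst_case:.1f} to {best_case:.1f} GB left for workloads")
# 43.5 to 45.0 GB left for workloads
```

Dedicating even one node purely to the control plane would throw away about a third of that.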

etcd Latency — The Hidden Requirement

etcd is the most latency-sensitive component in your entire cluster. The official etcd documentation specifies that disk fsync write latency (p99 of etcd_disk_wal_fsync_duration_seconds) should be under 10ms for production. Under 1ms is ideal, and a healthy SSD that isn't fighting other writers typically gets there. Network round-trip time between nodes should stay under 50ms to prevent heartbeat timeouts and spurious leader elections (etcd docs, 2025).

If etcd disk write latency stays above 10ms for sustained periods, the leader falls behind on heartbeats, followers time out and call new leader elections, and the cluster degrades. The API server becomes unresponsive, nodes flip to NotReady, and you're staring at a cluster that's technically running but functionally dead.
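
"p99 under 10ms" is easy to misread as "average under 10ms", so here is what the check actually looks like. This is a minimal Python sketch using the nearest-rank percentile method; the sample latencies are invented, and the `p99` and `etcd_disk_verdict` helpers are hypothetical names, not part of etcd or Prometheus.

```python
import math


def p99(samples_ms):
    """Nearest-rank 99th percentile of a list of latencies in milliseconds."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based nearest-rank position
    return ordered[rank - 1]


def etcd_disk_verdict(samples_ms, limit_ms=10.0):
    """Compare the p99 fsync latency against the 10ms guideline quoted above."""
    value = p99(samples_ms)
    return ("ok" if value < limit_ms else "too slow"), value


# Invented samples: 1000 fsyncs each, mostly fast, with a few stalls.
quiet_disk = [0.4] * 995 + [3.0] * 5
busy_disk = [0.4] * 980 + [25.0] * 20

print(etcd_disk_verdict(quiet_disk))  # ('ok', 0.4)
print(etcd_disk_verdict(busy_disk))   # ('too slow', 25.0)
```

Note that the busy disk is still fast 98% of the time. The average looks great. The p99 is what gets your leader voted out.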

What causes latency spikes: heavy write I/O on the same physical disk (Longhorn volume replication, database WAL writes, log shipping), SATA SSDs under sustained write load, or, the big one, HDDs. Both the etcd and k3s docs recommend SSD-backed storage, and etcd on spinning rust WILL fail in unpredictable ways once other workloads share the disk.

What fixes it: NVMe drives. Even the bundled 512GB NVMe in a $170 mini PC delivers sub-millisecond write latency under homelab workloads. My cluster (three mini PC nodes, bundled NVMe, 18+ apps including Postgres, MariaDB, MongoDB, and Longhorn) has had zero etcd latency incidents since I spread everything out across the three nodes.

For a 3-node homelab, any NVMe drive from the last five years is fine. If you're on SATA SSDs, pay attention and monitor disk latency once Prometheus is in place (coming in a future section). If you're on HDDs — seriously, a $30 NVMe drive saves you from a class of failures that are nearly impossible to diagnose without metrics.

The All-Control-Plane I/O Amplification Problem

Here's the part nobody talks about.

When ALL three of your nodes are control plane nodes, ALL three nodes run etcd. And all three nodes run your workloads. That means every single node's NVMe drive is handling etcd writes + API server reads + your database WAL writes + Longhorn volume replication I/O + Loki log ingestion + whatever else you deployed.

A single NVMe drive CAN handle this: modern consumer NVMe drives advertise 200,000+ random write IOPS. But you need to be AWARE of the combined load. The failure mode isn't hardware; it's the "oh, I deployed three different database providers and Longhorn volumes replicated to every node and now etcd is competing with all of them for disk time" moment. Ask me how I know.

This is the hidden cost of colocating control plane and workloads: your disk I/O budget is shared. etcd is a background service until it isn't — the moment you deploy three write-heavy apps on node 1, node 1's etcd latency spikes, leader election fires, and suddenly node 2 is the leader handling the same I/O load from its own three write-heavy apps. The problem just migrates.

The fix is NOT buying faster drives — though that helps. The fix is awareness:

  • Know what's writing to disk on each node
  • Deploy write-heavy workloads across nodes, not stacked on one
  • Use Longhorn storage policies (or node selectors) to prevent a database and its replicas from saturating the same physical disk
  • Set resource limits on write-heavy pods
  • Keep the number of replicas of disk-intensive processes (Longhorn volumes, DB replicas, etc.) below the number of nodes. That spreads the utilization out instead of piling it all on one node.

Think of it this way: etcd is the canary in the coal mine for your disk I/O. If etcd is unhappy, your disk is oversubscribed — and your databases and log systems are feeling it too. They just don't crash the entire cluster when they slow down.
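
"Deploy write-heavy workloads across nodes" boils down to a simple balancing exercise. The Python sketch below does it greedily: biggest writer first, always onto the node with the least write load so far. The write-intensity numbers are invented, and `spread` is a hypothetical helper for thinking it through on paper. In a real cluster you express this intent with scheduling constraints such as node selectors, not a script.

```python
import heapq


def spread(workloads, nodes):
    """Assign each workload to the currently least-loaded node, largest first."""
    heap = [(0, name) for name in nodes]  # (accumulated write load, node)
    heapq.heapify(heap)
    placement = {name: [] for name in nodes}
    by_load = sorted(workloads.items(), key=lambda item: item[1], reverse=True)
    for app, writes in by_load:
        load, node = heapq.heappop(heap)      # node with the least load so far
        placement[node].append(app)
        heapq.heappush(heap, (load + writes, node))
    return placement


# Relative write intensity per app; the numbers are made up for illustration.
write_heavy = {"postgres": 5, "mariadb": 4, "mongodb": 4, "loki": 3}

placement = spread(write_heavy, ["node1", "node2", "node3"])
for node, apps in placement.items():
    print(node, apps)
```

The point isn't the algorithm. The point is that nobody ends up with all three databases on node 1 by accident if they wrote the list down first.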

The latency mitigation checklist:

  1. NVMe storage on all control plane nodes — non-negotiable
  2. Separate etcd data from heavy-write workloads if possible — k3s stores etcd in /var/lib/rancher/k3s/server/db/; keeping this on the system NVMe is usually sufficient
  3. Set resource limits on write-heavy pods (databases, log collectors)
  4. Distribute write-heavy workloads across nodes — don't stack three databases on node 1 while nodes 2 and 3 idle
  5. Use Longhorn storage policies or node selectors to prevent a database and its replicas from saturating the same physical disk
  6. Monitor disk I/O with Prometheus + Grafana once you have it (a future section covers this)
  7. Run a sanity check on each node's drive, from a directory on the same disk as /var/lib/rancher/k3s: fio --name=etcd-sanity --rw=randwrite --ioengine=libaio --direct=1 --bs=4k --size=1G --runtime=30 --time_based. If your 99th percentile latency is under 5ms, you're fine.

What Does "High Availability" Actually Mean in a Homelab?

HA in a homelab is not the same as HA in a datacenter. You're not protecting against a rack losing power or a top-of-rack switch failure. You're protecting against: a mini PC power brick dying, an NVMe drive failing after 18 months of writes, a kernel panic during an unattended security update, or you accidentally kubectl delete namespace-ing the wrong thing at 11pm.

In a datacenter, HA means five nines. In a homelab, HA means "my family doesn't yell that the internet is broken while I'm at work."

The Fault Domains That Matter

Design for the failures that actually happen at home:

  • Single component failure — one node's power brick, one NVMe, one RAM stick. This is the threat model. Everything else is overbuilding.
  • NOT full rack failure — everything's on the same $30 gigabit switch.
  • NOT network partition — your entire cluster is on the same /24 subnet.
  • NOT region failure — unless your basement floods. In which case, you have bigger problems than kubectl.

What Fails in Each Architecture

Here's the concrete matrix. This isn't theoretical — these are the actual failure modes I've either experienced or tested in my cluster:

| Failure | Single Node | Embedded etcd (3-node) | kube-vip Full HA (3-node) |
| --- | --- | --- | --- |
| One node down | Everything stops | Workloads survive on remaining nodes; kubectl dead if your kubeconfig pointed at the failed node | Workloads survive; kubectl survives (VIP floats to another node) |
| Two nodes down | N/A (only one node exists) | etcd quorum lost; the API server stops serving requests, so nothing can be changed until a second node returns | Same: 3-node quorum requires 2 of 3 nodes available. etcd math, not a bug |
| Storage failure on one node | Data loss for everything on that node | Local data on that node is lost; Longhorn-replicated volumes survive on the other nodes | Same: Longhorn replicas + off-node backups survive |
| You delete the wrong thing | Hope you had a backup | Hope you had a backup; ArgoCD can redeploy from git, but data needs a separate backup | Same: GitOps re-deploys from git, data from backup |

The Honest 2-Node Reality

A 2-node embedded etcd cluster requires 2/2 nodes for quorum — zero control plane fault tolerance (per the k3s HA embedded datastore docs; Sidero Labs, September 2022).

You read that correctly. A 2-node "HA" cluster has NO improvement in fault tolerance over a single node for control plane availability. Both require the full set of nodes to be available. The difference is that with two nodes, you have replicated data — so when the failed node comes back, it can catch up — but while it's down, you're no better off than single-node for actually USING your cluster.
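
The quorum rule behind this is short enough to check yourself: a majority is more than half the members, and fault tolerance is whatever is left over. A minimal Python sketch (the function names are mine, not etcd's):

```python
def quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with this many members."""
    return members // 2 + 1


def fault_tolerance(members: int) -> int:
    """How many members can fail while a majority is still standing."""
    return members - quorum(members)


print("nodes  quorum  can lose")
for n in range(1, 6):
    print(f"{n:>5}  {quorum(n):>6}  {fault_tolerance(n):>8}")
# nodes  quorum  can lose
#     1       1         0
#     2       2         0
#     3       2         1
#     4       3         1
#     5       3         2
```

Two nodes buys you nothing over one, and four buys you nothing over three. Odd numbers only.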

Two nodes is the worst of both worlds: the complexity cost of HA with none of the resilience. The only valid reason for running two nodes is "I bought the second mini PC and the third one is in the mail."

If you're building HA, build three nodes from the start. Start with one if you're learning. Skip two entirely. (The pillar page's 2-Node HA Paradox has the full quorum math if you want the details.)


When NOT to Use Kubernetes

Kubernetes is not the answer to every homelab question.

If you're running Plex, Pi-hole, and a UniFi Controller on one machine with no uptime requirement, Docker Compose on Unraid is the correct tool. Kubernetes adds operational complexity — cert-manager renewals, etcd maintenance, CSI driver debugging, ArgoCD sync troubleshooting — that you will pay for EVERY single time something breaks. The question isn't "is Kubernetes cool?" (it is). The question is "do I need what it gives me enough to pay what it costs?"

In 2026, 82% of container users run Kubernetes in production, up from 66% in 2023 (CNCF Annual Cloud Native Survey, January 2026). It's the de facto operating system for cloud-native workloads. That's why learning it matters for your career. But your homelab doesn't need to be a production cluster unless you WANT it to be.

The Docker Compose Test

Can your current setup do everything you need? If yes — and you don't have a learning goal — don't migrate. Kubernetes solves orchestration problems: scheduling workloads across multiple nodes, declarative configuration, automated rollouts, self-healing. If you don't HAVE those problems, you don't need Kubernetes.

The Learning Exception

"I want to learn Kubernetes for my career" is a perfectly valid reason to run it even if Docker Compose works fine. But be honest with yourself: is this a learning cluster or a production cluster? A single-node learning cluster is the right place to start. You can add nodes later. The architecture is upgradeable — that's the entire point of the decision matrix in the next section.

Signs Kubernetes IS Right for You

  • You have services spread across multiple machines and want a unified control plane
  • You want GitOps — every config in git, every deploy via git push
  • You want self-healing — pod dies, it comes back automatically
  • You want TLS certificates managed automatically via cert-manager instead of manually renewing Let's Encrypt every 90 days
  • You want to learn a skill that's directly transferable to a job
  • You enjoy understanding how things work, not just that they work

Signs Kubernetes is NOT Right for You

  • You have five containers and one machine
  • You don't want to think about your infrastructure — you want it to Just Work
  • You're not interested in learning Kubernetes as a skill
  • Your uptime requirements are "meh, I'll fix it when I get home"
  • You're not willing to spend the first month debugging networking, storage, and certificates

If you read that second list and nodded along, that's not a failure. That's a SUCCESS. You just saved yourself months of operational overhead by recognizing that the right tool for your situation isn't Kubernetes. Docker Compose is excellent software. Unraid is excellent software. Use what fits.

Every Kubernetes tutorial should include this section. Most don't, because they're trying to sell you on Kubernetes. I'm not selling anything — the configs are public, the repo is open source, and I'd rather you make the right decision for YOUR homelab than cargo-cult mine.


The Decision Matrix — Picking Your k3s Architecture

Start with a single node if you're learning. Start with three nodes + kube-vip if you're building something you want to stop thinking about. Skip Kubernetes entirely if you don't need orchestration. And never, ever start with two nodes — it's the worst of both worlds.

The architecture is not a permanent decision. You can start with one node, put kube-vip in front of it early, add a second node when you're ready, and add a third for quorum and real HA. Each step is additive. You don't have to start at the destination.

"Just Do Something" — even a single node running k3s in your spare room is better than analysis paralysis and twelve open browser tabs.

The Decision Flowchart

  1. Do you need orchestration across multiple machines?

    • NO → Docker Compose / Unraid. Stop here. You're done.
    • YES → Go to question 2.
  2. Is this primarily for learning?

    • YES → Single node. You can upgrade later. The install commands are the same — you just add nodes and flags.
    • NO → Go to question 3.
  3. Do you need workloads to survive a node failure?

    • NO → Single node, or 3-node for future-proofing. Both work.
    • YES → Three nodes + kube-vip full HA. This is production on a homelab budget.
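
If you think better in code than in flowcharts, the same three questions fit in one small function. This is just the flowchart above transcribed into Python; `pick_architecture` and its argument names are made up for this post.

```python
def pick_architecture(needs_orchestration: bool,
                      primarily_learning: bool,
                      must_survive_node_failure: bool) -> str:
    """Walk the three flowchart questions in order and return the answer."""
    if not needs_orchestration:
        return "Docker Compose / Unraid"
    if primarily_learning:
        return "single node"
    if must_survive_node_failure:
        return "three nodes + kube-vip full HA"
    return "single node, or three nodes for future-proofing"


print(pick_architecture(False, False, False))  # Docker Compose / Unraid
print(pick_architecture(True, True, False))    # single node
print(pick_architecture(True, False, True))    # three nodes + kube-vip full HA
```

Notice there is no branch that returns "two nodes".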

Hardware Shopping List by Tier

Learning / Experimenting (Single Node):

  • 1× BeeLink S12 Pro or equivalent (Intel N100, 16GB DDR4, 512GB NVMe) — ~$170
  • An Ethernet cable you already own
  • Total: ~$170

Production HA (3-Node Full HA):

  • 3× BeeLink S12 Pro (Intel N100, 16GB DDR4, 512GB NVMe) — ~$510
  • A small UPS (~$60) so a power blip doesn't take down all three nodes simultaneously
  • A gigabit switch you almost certainly already own
  • Total: ~$570

Storage note: Storage is architecture-independent and covered in Section 4 of the curriculum. You can use the built-in NVMe for initial workloads, add an external USB drive for backups, or point Longhorn at an existing NAS later. Don't let storage planning become the reason you don't start.

The Upgrade Paths

Single node → add kube-vip (it's way easier to add now, and then you don't need to think about it again) → add Node 2 (now you have etcd replication, still no fault tolerance) → add Node 3 (now you have quorum, 1-node fault tolerance).

Each upgrade is additive. You never need to reinstall. The --cluster-init flag on your first node was forward-looking — it set up embedded etcd ready for additional control plane members. When you add nodes, they join the existing etcd cluster with --server https://<KUBE_VIP>:6443. The cluster grows around your existing config. No rebuild required.


Frequently Asked Questions

"Can I start with one node and add more later?"

Yes. k3s embedded etcd supports adding nodes to an existing cluster. Start with --cluster-init on Node 1, then add kube-vip. When you add Nodes 2 and 3, use --server https://<KUBE_VIP>:6443 with the shared token to join the control plane. The catch: a single-node etcd cluster has no replication until you add the second node, and the join process carries a small risk of quorum disruption. Take a VM snapshot or back up your data before adding nodes. The install tutorial (next post) walks through every flag.

"What's the difference between embedded etcd HA and kube-vip HA?"

Embedded etcd replicates the cluster state database across nodes — your data survives a node failure. kube-vip provides a virtual IP that floats between control plane nodes — kubectl survives a node failure. You need BOTH for actual HA. Without kube-vip, your data is safe on the surviving nodes but you can't run kubectl commands until you manually point your kubeconfig at a surviving node's IP. It's the difference between "the cluster still exists" and "I can still manage the cluster."

"Is a 2-node cluster worth it?"

No. A 2-node embedded etcd cluster requires 2/2 nodes for quorum — zero fault tolerance improvement over a single node, but with all the complexity of a multi-node cluster. Start with one (learning) or three (production). Skip two. The only exception is "I bought two mini PCs and the third arrives Thursday." In that case, build two and add the third when it shows up. That's fine. Just don't STOP at two.

"Do I really need three physical machines, or can I use VMs?"

VMs work identically for learning and architecture evaluation. The install commands are the same. The limitation: if the VM host goes down, ALL your "nodes" go down together, so you lose the physical fault isolation that real HA provides. For a production homelab, spread nodes across physical machines. For learning, a single Proxmox host with three VMs is a perfect sandbox. Start with what you have. If you only have two physical machines, then absolutely set up 3 VMs between them so you can start a 3-node cluster. Just remember that the host carrying two of the three VMs is still a single point of failure for quorum. Then when you add a third machine, you just migrate a VM over. (You ARE using a hypervisor like Proxmox, right?)
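
To see how VM placement interacts with quorum, here is a small Python sketch that asks one question: if any single physical host dies, do enough etcd members survive? The VM and host names are hypothetical, and `survives_any_host_failure` is an illustrative helper, not a real tool.

```python
from collections import Counter


def survives_any_host_failure(vm_to_host) -> bool:
    """True if losing any one physical host still leaves an etcd majority."""
    members = len(vm_to_host)
    needed = members // 2 + 1                 # majority of control plane VMs
    vms_per_host = Counter(vm_to_host.values())
    return all(members - lost >= needed for lost in vms_per_host.values())


# Hypothetical layouts: VM name -> physical host it runs on.
one_host = {"k3s-1": "pve-a", "k3s-2": "pve-a", "k3s-3": "pve-a"}
two_hosts = {"k3s-1": "pve-a", "k3s-2": "pve-a", "k3s-3": "pve-b"}
three_hosts = {"k3s-1": "pve-a", "k3s-2": "pve-b", "k3s-3": "pve-c"}

print(survives_any_host_failure(one_host))     # False
print(survives_any_host_failure(two_hosts))    # False (pve-a takes two members with it)
print(survives_any_host_failure(three_hosts))  # True
```

Two hosts is still a fine place to start. You just haven't bought physical fault tolerance until each member sits on its own machine.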

"Do I need special SSDs for etcd? What about my mini PC's NVMe?"

No. etcd's disk latency requirement is sub-10ms fsync writes (ideally sub-1ms on SSD). Even the bundled NVMe drive in a $170 mini PC delivers sub-millisecond write latency under typical homelab workloads. The latency panic you read about mostly applies at datacenter scale: hundreds of nodes, with etcd under a sustained heavy write load. For a 3-node homelab with under 50 pods, any NVMe drive from the last five years is fine. SATA SSDs work but monitor them once you have Prometheus. HDDs are a hard no: both the etcd and k3s docs recommend SSD-backed storage. Run the fio command from the latency section on your own drives if you want peace of mind.


What's Next?

You picked your architecture. Now let's build it.

The next post is the install tutorial: Why k3s? Choosing and Installing k3s for Your Homelab Cluster. It covers OS prep (including the multipathd gotcha that breaks Longhorn if you don't disable it NOW), the exact curl | sh commands with every flag explained, and the first kubectl get nodes showing all three nodes Ready. It's the hands-on companion to this decision post — no more theory, just terminal output.

After that: kube-vip setup (Spoke 3), IP planning (Spoke 4), and the bootstrap order deep-dive (Spoke 5). Each post builds on the last. By the end of Section 1, you'll have a running 3-node k3s HA cluster with a floating control plane VIP, MetalLB assigning LoadBalancer IPs from your LAN, and Traefik terminating TLS for every service — all deployed in the correct order, all backed by real working configs in the public repo.

The full architecture overview — what every component does and why the dependency order is non-negotiable — is in the pillar page (linked in the 2-Node Reality section above). Read it if you haven't. It's the map. The spokes are the turn-by-turn directions.

Don't be afraid to break things. Sometimes that's the best way to learn. Don't wait — just try it.

What architecture did you pick? Or are you sticking with Docker Compose? Let me know in the comments — especially if you chose "none of the above." I'm genuinely curious what people are running.

All configs referenced in this post are in the public homelab-k8s repo. Clone it. Compare it to what you're building. Submit a PR if you find a better way.
