Building a 130K Node Kubernetes Cluster

The relentless demand for artificial intelligence (AI) and machine learning (ML) workloads is pushing the boundaries of cloud infrastructure, requiring unprecedented compute resources. In a groundbreaking experiment, Google Cloud shattered Kubernetes scalability records by building and operating a 130,000-node cluster in Google Kubernetes Engine (GKE). This achievement, double the size of its previously announced 65,000-node capability, offers a compelling case study in the architectural innovations and engineering discipline required to run Kubernetes at extreme scale.

This article delves into the technical challenges and solutions involved in building and maintaining such an immense Kubernetes cluster, exploring the re-imagined control plane, advanced networking and storage paradigms, and the operational strategies necessary to orchestrate over a hundred thousand nodes.

The Hyperscale Challenge: Pushing Kubernetes Limits

Standard Kubernetes distributions are typically designed to support clusters up to around 5,000 nodes, with a maximum of 150,000 pods. While major cloud providers like Amazon EKS have pushed this limit to 100,000 nodes and GKE to 65,000 nodes for production, reaching 130,000 nodes in a single cluster is a monumental leap. This scale introduces a cascade of fundamental bottlenecks across the entire Kubernetes architecture, from the core control plane components to networking and persistent storage.

At this scale, every interaction with the cluster, every state change, and every resource allocation multiplies the load on the control plane. Traditional Kubernetes setups falter due to the immense volume of objects and the rate of operations required. Engineers must contend with issues like control plane saturation, API server queuing, and etcd database strain, where object counts can exceed 1.3 billion. Overcoming these challenges necessitates rethinking core components and introducing advanced optimization techniques.


Reimagining the Control Plane: API Server and etcd at Scale

The Kubernetes control plane, consisting of the API server, etcd, scheduler, and controller manager, acts as the brain of the cluster. At 130,000 nodes, these components require significant architectural overhauls to maintain responsiveness and stability.

The API Server: The Cluster’s Gateway Under Pressure

The kube-apiserver is the front-end for the Kubernetes control plane, handling all requests. In a hyperscale cluster, it becomes the most heavily utilized component. To mitigate saturation, horizontal scaling of the API server is crucial, often involving multiple instances behind a load balancer to distribute the request load.

Further optimizations include:

  • API Priority and Fairness (APF): This mechanism acts like an intelligent traffic controller, categorizing incoming requests into different priority levels to prevent any single user or misbehaving application from monopolizing the API server.
  • Caching Strategies: Leveraging client-side caches and optimizing how clients retrieve resources can significantly reduce the volume of direct API server queries.
  • Downward API: For applications needing metadata about themselves (e.g., pod labels, annotations), the Downward API allows direct injection of this information into pods, eliminating the need for frequent API server polling by sidecars or init containers.
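To make the traffic-controller analogy concrete, here is a minimal Python sketch of the idea behind APF: incoming requests are classified into priority levels, and each level receives a fixed share of the API server's total concurrency. The level names, shares, and classification rules below are illustrative simplifications, not Kubernetes' actual default FlowSchemas.

```python
# Toy model of API Priority and Fairness: classify requesters into priority
# levels, each holding a proportional share of total server concurrency.
# Level names and share values are hypothetical, not real APF defaults.
PRIORITY_LEVELS = {
    "system": {"concurrency_shares": 30},
    "workload-high": {"concurrency_shares": 40},
    "workload-low": {"concurrency_shares": 20},
    "catch-all": {"concurrency_shares": 10},
}

def classify(request_user: str) -> str:
    """Map a requester to a priority level (a simplified flow schema)."""
    if request_user.startswith("system:"):
        return "system"
    if request_user.startswith("controller:"):
        return "workload-high"
    return "catch-all"

def concurrency_limit(level: str, total_seats: int = 600) -> int:
    """A level's seat count is proportional to its concurrency shares."""
    total_shares = sum(l["concurrency_shares"] for l in PRIORITY_LEVELS.values())
    return total_seats * PRIORITY_LEVELS[level]["concurrency_shares"] // total_shares
```

Because each level's queue is isolated, a flood of low-priority requests can exhaust only its own seats; system traffic keeps its reserved share.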

etcd: The Stateful Bottleneck

etcd, Kubernetes’ distributed key-value store, is often the single most critical and limiting factor for cluster scalability. It stores the entire state of the cluster, and its performance directly impacts API server responsiveness and overall cluster health. At 130,000 nodes, a monolithic etcd deployment is simply unsustainable.

Google’s approach to the 130,000-node cluster involved sharded etcd deployments across multiple high-availability rings. This strategy distributes the cluster’s registry across numerous etcd instances, significantly reducing latency compared to monolithic designs. Other techniques include:

  • Dedicated etcd for Events: Storing event objects in a separate etcd instance offloads a significant amount of write traffic from the primary etcd, improving performance.
  • Alternative Backends: For extreme scale, some providers have moved beyond etcd itself: GKE has explored using the highly scalable Spanner database as a backend, and Amazon EKS has described an enhanced etcd architecture with consensus offloading and in-memory storage.
  • Proactive Maintenance: Regular compaction and defragmentation are essential to manage etcd’s data store size, which has a recommended maximum of 8GB to prevent instability.
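The sharding and dedicated-events ideas above can be sketched as a simple key router. This is a conceptual illustration only: the endpoint names are placeholders, and a real sharded deployment (such as Google's high-availability rings) involves far more than key routing.

```python
import hashlib

# Toy router for a sharded etcd layout: high-churn Event objects get a
# dedicated instance, and remaining registry keys are hashed across shards.
# Endpoint addresses below are hypothetical placeholders.
EVENTS_ETCD = "etcd-events:2379"
REGISTRY_SHARDS = ["etcd-0:2379", "etcd-1:2379", "etcd-2:2379"]

def shard_for(key: str) -> str:
    """Pick the etcd endpoint responsible for a given registry key."""
    if key.startswith("/registry/events/"):
        # Events are write-heavy and short-lived; isolating them offloads
        # a large fraction of write traffic from the primary store.
        return EVENTS_ETCD
    digest = hashlib.sha256(key.encode()).digest()
    return REGISTRY_SHARDS[digest[0] % len(REGISTRY_SHARDS)]
```

Hashing keeps each key's placement deterministic, so reads and writes for the same object always land on the same shard.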

Scheduler and Controller Manager Optimizations

The kube-scheduler is responsible for placing pods onto nodes, and the kube-controller-manager maintains the desired state of the cluster. With 130,000 nodes and potentially millions of pods, their efficiency is paramount. Google’s experimental cluster involved deploying over 10 million pods, necessitating advanced scheduling strategies, including proactive, AI-assisted placement. The sheer volume of Kubernetes objects directly drives the work these components must perform, requiring them to be highly optimized and potentially sharded or distributed themselves.

Architecting for Network and Storage Immensity

Beyond the control plane, the data plane—the nodes and their interactions—presents its own set of immense challenges.

Hyperscale Networking

Networking in a 130,000-node cluster is a complex domain, requiring efficient communication between pods, services, and external endpoints.

  • EndpointSlices: A critical innovation for scaling network endpoints. Traditional Kubernetes services struggled beyond 5,000 endpoints due to monolithic endpoint objects. EndpointSlices “slice” these objects into smaller, manageable pieces, dramatically improving network scalability and reducing control plane load during endpoint updates.
  • IP Address Management: With over a hundred thousand nodes and millions of pods, IP address exhaustion is a significant concern. Adopting IPv6 networking provides a vastly larger address space, eliminating complex NAT setups and future-proofing the cluster. Techniques like prefix delegation, where larger blocks of IP addresses are dynamically allocated to nodes, also streamline IP management.
  • Container Network Interface (CNI) Optimization: The choice and configuration of the CNI plugin are vital. Highly optimized CNIs like the Amazon VPC CNI plugin integrate tightly with the underlying cloud network, ensuring efficient IP allocation and routing. In some cases, teams like OpenAI found performance gains by removing certain CNI plugins at scale.
  • NodeLocal DNSCache: To prevent kube-dns from becoming a bottleneck on very large clusters, NodeLocal DNSCache provides a local DNS cache on each node, distributing the load and offering faster response times for DNS queries.
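The core mechanic behind EndpointSlices is simple chunking: instead of one monolithic Endpoints object per service, backends are split into fixed-size slices so that an update touches only one small object. The sketch below illustrates that mechanic (Kubernetes defaults to 100 endpoints per slice, configurable up to 1,000); the object shape is simplified, not the real EndpointSlice API schema.

```python
# Split a service's backends into EndpointSlice-sized chunks. Adding or
# removing one backend then rewrites a single small slice rather than
# one giant Endpoints object fanned out to every node's kube-proxy.
def build_slices(service: str, endpoints: list[str],
                 max_per_slice: int = 100) -> list[dict]:
    """Chunk `endpoints` into slices of at most `max_per_slice` entries."""
    return [
        {"name": f"{service}-{i}",
         "endpoints": endpoints[start:start + max_per_slice]}
        for i, start in enumerate(range(0, len(endpoints), max_per_slice))
    ]
```

For a service with 100,000 backends, this turns each endpoint change from a multi-megabyte object rewrite into an update of one ~100-entry slice.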


Persistent Storage for AI/ML Workloads

AI/ML workloads are often stateful, requiring robust, scalable, and high-performance persistent storage solutions. In a cluster of this magnitude, storage solutions must handle petabytes of data, high IOPS, and low latency across tens of thousands of nodes.

  • Distributed Storage Systems: Solutions like Ceph (often deployed with Rook), Portworx, and Longhorn are designed for distributed, highly available, and scalable storage in Kubernetes environments. They offer block, object, and file storage, with features like dynamic provisioning, data replication, snapshots, and disaster recovery.
  • Cloud-Native Integration: The chosen storage solution must seamlessly integrate with Kubernetes through Container Storage Interface (CSI) drivers, enabling efficient volume management and lifecycle operations.
  • Performance and Scalability: For the demanding nature of AI/ML, solutions leveraging modern hardware like NVMe over TCP (e.g., Simplyblock) can provide superior throughput and lower access latency.

Operationalizing the Colossus: Management and Automation

Managing a 130,000-node cluster requires advanced operational practices and automation beyond typical Kubernetes deployments.

  • Advanced Autoscaling: While Horizontal Pod Autoscaler (HPA) and standard Cluster Autoscaler are common, their limitations become apparent at extreme scale. For clusters beyond 5,000 or even 15,000 nodes, standard Cluster Autoscaler might not be supported, requiring direct API calls for node pool scaling. Custom auto-provisioners like Karpenter, which can launch right-sized compute resources in under a minute, become essential. For very large clusters, sharding the Cluster Autoscaler might be necessary, with each instance managing a subset of node groups.
  • Proactive Monitoring and Observability: With so many components, nodes, and pods, comprehensive monitoring is non-negotiable. Real-time telemetry, advanced logging, and AI-driven anomaly detection are critical to identify and address issues before they impact the colossal cluster.
  • Resource Quota Management: Cloud provider quotas for CPUs, VM instances, IP addresses, and other resources must be meticulously planned and significantly increased to accommodate such a large deployment. Batching node creation is often necessary to avoid hitting cloud provider rate limits.
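Batching node creation, as mentioned above, amounts to pacing bulk API calls so they stay under provider rate limits. Here is a minimal sketch of that pattern; the batch size, pacing, and `create` callback are hypothetical stand-ins, not any provider's actual API.

```python
import time

# Sketch of batched node-pool growth: request capacity in fixed-size
# batches with an optional pause, rather than one call per node.
# Batch size and pacing are illustrative; real quotas vary by provider.
def create_nodes_batched(total: int, batch_size: int = 50,
                         pause_s: float = 0.0,
                         create=lambda n: None) -> int:
    """Request `total` nodes via batched calls; return nodes requested."""
    created = 0
    while created < total:
        batch = min(batch_size, total - created)
        create(batch)            # one bulk API call per batch
        created += batch
        if created < total and pause_s:
            time.sleep(pause_s)  # back off between batches
    return created
```

In practice the pause would be replaced by a token-bucket rate limiter with retry-after handling, but the batching structure is the same.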

Conclusion

Building a Kubernetes cluster of 130,000 nodes is not merely an incremental scaling effort; it’s a testament to profound architectural redesign and engineering innovation. Google Cloud’s achievement, driven by the escalating demands of AI/ML workloads, highlights the transformative power of Kubernetes when pushed to its theoretical limits. By reinventing the control plane with sharded etcd and optimized API servers, implementing hyperscale networking with EndpointSlices and IPv6, and leveraging robust distributed storage solutions, the industry can now envision even larger, more powerful computational infrastructures. This experimental feat marks not an endpoint, but a crucial waypoint toward million-node realities, setting a new benchmark for cloud-native orchestration in the age of AI.


Thank you for reading! If you have any feedback or comments, please send them to [email protected].