There’s a big difference between reading about cloud platform design and actually building one. The whitepapers make it sound clean. The reality is a series of trade-offs, heated debates about abstraction layers, and the constant tension between what’s ideal and what ships.
I’ve spent the last couple of years working on exactly this — designing and building an internal cloud platform at enterprise scale. Here’s what I’ve learned about the decisions that actually matter.
Start with the Consumption Model, Not the Tech Stack
The first mistake most teams make is starting with technology. “We’ll use Kubernetes” or “We’ll build on VMware” or “Let’s go all-in on AWS.” That’s backwards.
The first question should be: who is consuming this platform, and what do they need from it? Are your developers deploying containers? Are your infrastructure teams provisioning VMs? Are you supporting legacy workloads that can’t be refactored?
The consumption model drives everything downstream — the abstraction layer, the API surface, the self-service capabilities, and ultimately the technology choices.
Automation Isn’t Optional — It’s the Product
At scale, the platform is the automation. Manual processes don’t just slow you down; they introduce inconsistency, and inconsistency at scale is how outages happen.
Every provisioning workflow, every network configuration, every storage allocation should be codified. Not because automation is trendy, but because repeatability is the only way to maintain quality when you’re managing thousands of workloads.
We use a combination of Terraform for infrastructure-as-code, Ansible for configuration management, and custom API layers that abstract the underlying providers. The key insight: the automation layer is what your consumers interact with, so treat it like a product.
The Reusable Design Pattern Approach
One of the things I’m most proud of in our current architecture is the pattern library. Instead of designing bespoke solutions for every team, we’ve built a catalog of validated, tested infrastructure patterns.
Need a three-tier application stack? There’s a pattern for that. Need a database cluster with specific replication requirements? Pattern. Need a dev environment that mirrors production? Pattern.
Each pattern encodes years of operational knowledge — the right sizing, the right network segmentation, the right backup policies. Teams consume patterns rather than making infrastructure decisions they’re not equipped to make.
What I’d Do Differently
If I started over, I’d invest even more heavily in observability from day one. We bolted it on later, and retrofitting monitoring across a platform that’s already serving production workloads is painful.
I’d also push harder for a single control plane earlier. We had too many management interfaces for too long — different dashboards for compute, storage, network, and security. The cognitive overhead for operators was significant.
This is the kind of work that doesn’t make for exciting conference talks but fundamentally determines whether your infrastructure is reliable or fragile. More posts coming on specific patterns we’ve built and lessons from operating the platform in production.