A few times a month, my phone lights up with the same kind of message. A friend, a former colleague, or a peer from another company is somewhere in the middle of a cloud journey and wants to compare notes.
They want to know what worked. They want to know what did not. Mostly, they want to know what they should be worried about before it becomes expensive.
Sometimes they are six months in and looking for validation. Sometimes they are three years in and looking for a sanity check. Sometimes they are ten years in and quietly wondering whether the whole thing needs to be rebuilt.
I love these conversations because I learn something every time. The shape of the journey is similar across firms, but the details are always different. The details are where the interesting failure modes live.
I also love them for a more selfish reason. Every conversation forces me to explain why we made the choices we made, and that explanation is its own form of pressure testing. Some decisions look sharper after I say them out loud. Others reveal a wobble I had not noticed, a piece of reasoning that no longer holds, or a trade-off I have been quietly paying for and stopped seeing.
The longer I do this, the more convinced I am that the small decisions in year one decide whether you have a platform or a pile in year five.
I have been in the cloud business since the days when EC2 was a curiosity and “the cloud” was a phrase you had to define for half the room. Two firms, 15 years, both in financial services. Both started before “cloud strategy” was a slide in everyone’s deck.
Financial services adds a useful constraint: security, auditability, resiliency, and cost discipline do not get to wait for a later maturity phase. They have to be part of the foundation.
The first build taught me what works. The second build taught me what compounds.
Most of what made these organizations successful, and most of what eventually made them painful, traces back to maybe eight choices made in the first 18 months. Five of those choices I would make again tomorrow. Three I would do very differently.
What follows is an honest accounting, written the way I would explain it to another technology leader over coffee. No victory lap, no Gartner quadrant, and no slide that says “transformation.”
The Five Decisions That Compounded
1. Centralized landing zones, before the first workload
The seductive thing about cloud in the early years was that any team with a corporate credit card could spin up infrastructure. We refused to let that be the on-ramp.
Both times, we built a centralized landing zone first. Shared services, identity, networking, logging, tagging, guardrails, and the rest of the foundation were in place before the first business workload landed.
The pain was real. Application teams complained about waiting. Leadership wondered out loud whether we were building a bottleneck. I had more than one meeting where I had to defend the timeline against a team that just wanted an account, a VPC, and the freedom to figure out the rest.
Five years later, the value of that decision was obvious. Every workload inherited identity, observability, network controls, and tagging by default. When auditors asked for a list of every workload in a given organizational unit and what it was costing us, we had the answer in minutes. The organizations that skipped this step are still building inventory tools and still finding shadow accounts.
The deeper lesson is that platforms only work when defaults are inherited, not chosen. If a team has to opt into security, observability, or cost controls, most teams will not. The ones that do will implement them slightly differently from each other.
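To make "inherited, not chosen" concrete, here is a minimal sketch of the account-vending pattern in Terraform. The module shape, variable names, and email scheme are hypothetical, not our actual code, but the resources are standard AWS-provider ones:

```hcl
# modules/account/main.tf, a hypothetical account-vending module.
# Accounts are created inside an OU, so they inherit the SCPs,
# tag policies, and log routing attached at that level. Teams
# never opt in to the guardrails; they arrive with the account.

variable "name"      { type = string }
variable "parent_ou" { type = string } # the OU carrying the guardrails

variable "cost_center" {
  type = string
  validation {
    condition     = length(var.cost_center) > 0
    error_message = "Every account must carry a cost-center tag."
  }
}

resource "aws_organizations_account" "this" {
  name      = var.name
  email     = "aws+${var.name}@example.com" # hypothetical email scheme
  parent_id = var.parent_ou

  # Tags applied at creation, never retrofitted.
  tags = {
    "cost-center" = var.cost_center
    "managed-by"  = "platform"
  }
}
```

The point is not the specific resources. It is that the only way to get an account is through a path that carries the guardrails with it.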
Centralized landing zones make the right thing the normal thing. That sounds less dramatic than “cloud transformation,” but it is also much harder to ruin in production.
2. Terraform over CloudFormation
Both tools worked. CloudFormation was the native AWS answer and had tighter service integration on day one. We chose Terraform anyway, twice, for two reasons that mattered more than either tool’s feature list.
The first was the provider ecosystem. The same workflow that managed our AWS accounts also managed our DNS, identity provider, secrets backend, observability platform, and source control. Engineers learned one mental model and applied it across the entire toolchain. CloudFormation, custom resources aside, only spoke AWS, and a real platform talks to a lot more than AWS.
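A minimal sketch of what that single mental model buys you. The specific providers here (AWS, GitHub, Vault) are stand-ins for whatever your toolchain actually is; the point is one plan-and-apply loop across all of it:

```hcl
terraform {
  required_providers {
    aws    = { source = "hashicorp/aws" }
    github = { source = "integrations/github" }
    vault  = { source = "hashicorp/vault" }
  }
}

# One workflow: the same plan/apply cycle manages cloud
# infrastructure, source control, and the secrets backend.
resource "aws_s3_bucket" "artifacts" {
  bucket = "platform-artifacts-example"
}

resource "github_repository" "service" {
  name       = "example-service"
  visibility = "private"
}

resource "vault_mount" "service_secrets" {
  path = "example-service"
  type = "kv-v2"
}
```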
The second reason was cognitive portability. Engineers who learned Terraform on our team could take that skill anywhere. Engineers joining us from another firm were productive in a week. Tool choices that double as recruiting and retention strategies are underrated, and the cost of a tool nobody else uses shows up on every offer letter and every onboarding plan.
That does not mean Terraform is the right answer forever, or that CDK and native provisioning tools have not improved. It means the best tool was the one that gave us the broadest platform model, the strongest talent market, and the least context-switching across the systems we actually had to manage.
I will note for the record that this decision aged well through the OpenTofu fork and the licensing turbulence. The skill investment held its value, which is the only test of a tool choice that ultimately matters.
I also wonder how generative AI will impact this specific decision for folks making it today. When code generation can produce CDK, Terraform, Pulumi, or Bicep with equal fluency, the cost of choosing the “wrong” IaC tool drops. The cost of fragmenting your team across three of them does not.
3. Infrastructure and policy as code, treated as a product
Most organizations say “infrastructure as code” and mean “we have a repository with some Terraform in it.” We treated the platform itself as a product. It had engineers, a roadmap, SLOs, a backlog, and real customers in the application teams.
Policy as code came in early as well, through service control policies, OPA, and the equivalent guardrails for every system that could be guarded.
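For flavor, here is the shape of a service control policy shipped through the same pipeline as any other change. The CloudTrail statement is a common illustrative guardrail, not a transcript of ours, and the OU variable is hypothetical:

```hcl
# A guardrail expressed as code: no account under this OU can
# stop or delete the audit trail, regardless of local IAM rights.
resource "aws_organizations_policy" "protect_cloudtrail" {
  name = "protect-cloudtrail"
  type = "SERVICE_CONTROL_POLICY"

  content = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid      = "DenyCloudTrailTampering"
      Effect   = "Deny"
      Action   = ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"]
      Resource = "*"
    }]
  })
}

resource "aws_organizations_policy_attachment" "workloads" {
  policy_id = aws_organizations_policy.protect_cloudtrail.id
  target_id = var.workloads_ou_id # hypothetical: the workloads OU
}
```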
The reason this compounds is subtle. If your platform team operates with lower engineering standards than the teams you serve, your platform becomes both the bottleneck and the blame magnet. Application teams will route around you, build their own glue, and quietly create a shadow platform inside your platform.
I have watched this happen. It is corrosive in a way that is hard to undo.
Treating the platform as a product means writing tests for your modules, running CI on your guardrails, versioning your golden paths, and shipping changes through the same pipeline you ask everyone else to use. It also means having an actual product manager who talks to internal customers, understands demand, and shapes a roadmap around what teams need rather than what the platform team finds interesting.
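"Tests for your modules" can be taken literally. Terraform has shipped a native test framework since 1.6; here is a sketch, assuming the hypothetical account module above with its validation rule on cost_center:

```hcl
# tags.tftest.hcl, run by `terraform test` in the module's CI.
run "rejects_missing_cost_center" {
  command = plan

  variables {
    name        = "sandbox-team-a"
    cost_center = "" # should trip the variable's validation rule
    parent_ou   = "ou-example"
  }

  # The plan is expected to fail on exactly this variable,
  # proving the guardrail works before the module ever ships.
  expect_failures = [var.cost_center]
}
```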
The leadership discipline here matters as much as the engineering discipline. A platform team has to be humble enough to listen, confident enough to set standards, and disciplined enough not to confuse “customer obsession” with “say yes to everything.”
4. T-shaped engineers, not specialists
In the 2014 to 2018 window, cloud organizations loved specialists. The AWS networking person. The IAM expert. The Kubernetes guru. We hired the opposite both times. We looked for generalists with one or two areas of depth and a willingness to learn the next thing.
The result was an organization that absorbed new services, new clouds, and new tools without restructuring. When containers became real, we did not need to hire a containers team. When serverless mattered, we did not need a serverless practice. When the AI infrastructure conversation arrived, the same engineers who built the landing zone could reason about GPU capacity, model serving, and inference cost.
Deep specialists are necessary. Hire them when the depth is genuinely irreducible. The failure mode is when the operating model depends on them, complexity hardens around them, and the organization starts protecting the dependency instead of reducing it.
T-shaped engineers create a different culture: a team where knowledge spreads, problems get owned by whoever is closest, and the next pivot does not require a reorg.
This is also where the dark factory idea starts to take shape. You cannot run a lights-out platform with a team built around heroes and gatekeepers. You need engineers who codify what they know and trust the system to run it.
5. The right level of abstraction
This is the most subtle of the five, and it is the one I think about most often. Too thin an abstraction and your platform is just raw AWS with extra steps and a logo. Too thick and you have built a homegrown PaaS that nobody can debug, no vendor will support, and the next platform leader will spend three years untangling.
The answer we converged on, both times, was golden paths. Opinionated, well-paved routes for the 80% of workloads that look the same as every other workload, with explicit escape hatches for the 20% that genuinely need to do something different.
A team building a standard web service should never need to think about the underlying primitives. A team building a low-latency trading bridge or a real-time risk engine should be able to reach past the abstraction without filing a ticket and waiting two quarters.
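In module terms, the golden path plus escape hatch looks roughly like this. The registry module, its name, and every variable here are hypothetical; the shape is what matters:

```hcl
# The 80% case: a standard web service. The team supplies intent,
# the platform supplies every primitive underneath.
module "orders_api" {
  source  = "platform-registry/web-service/aws" # hypothetical module
  version = "~> 4.0"

  name         = "orders-api"
  cost_center  = "payments"
  min_replicas = 2
}

# The 20% case: a documented escape hatch, not a ticket. The same
# module exposes lower-level knobs for workloads that genuinely
# need to reach past the abstraction.
module "risk_engine" {
  source  = "platform-registry/web-service/aws"
  version = "~> 4.0"

  name        = "risk-engine"
  cost_center = "risk"

  container_overrides = {
    cpu_pinning     = true
    placement_group = "cluster"
  }
}
```

Both teams go through the same module, inherit the same guardrails, and show up in the same inventory. Only one of them had to think about placement groups.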
The leadership mistake is believing that giving every team a choice is the same thing as empowering them. Sometimes the more empowering move is to remove low-value decisions so teams can spend their energy on the business problem.
The failure mode to avoid is the gilded cage: a platform so opinionated that it can only build one shape of application and dies the moment the business asks for a second shape.
Greenfield gives you the rare opportunity to design the abstraction layer with both the common case and the escape hatch in mind from day one. Brownfield platforms almost always pick one and bolt on the other later, and the bolt-on never quite fits.
The Three I Would Redo
If the five above are the decisions I would make again, these next three are the ones I think about late at night.
Not every night, to be clear. I do sleep. Occasionally.
1. I would kill the multi-cloud aspiration faster
For a few years at both firms, “multi-cloud” lived on the strategy slide. It sounded responsible. It sounded like a hedge against vendor lock-in. It sounded like the kind of thing a serious enterprise should be planning for.
The honest truth is that most companies do not need multi-cloud. They need a real disaster recovery story, genuine regional resilience, a defensible vendor exit plan, and clean abstractions over the handful of services that genuinely matter.
None of those require running production on two clouds.
Multi-cloud done seriously is a tax on your engineers, tooling, contracts, security model, and finance team. Multi-cloud done as a hedge is a fantasy that costs you depth in the cloud you actually use while delivering none of the resilience you imagined you were buying.
There are companies that genuinely need multi-cloud. Sovereignty requirements, regulatory geography, and real customer demand for a specific provider can all be valid reasons. If you are one of those companies, you already know it.
Everyone else is paying a complexity tax for a problem they do not have. I would have removed that line from the strategy slide much earlier.
2. I would put FinOps in the foundation, not the renovation
At both firms, we treated cost optimization as a phase. Build the platform. Get workloads on. Then do a FinOps push.
It worked, in the sense that we eventually saved a great deal of money. The most recent push at my current firm reached a 46% effective optimization rate across our cloud estate, which I am proud of.
But every dollar you save through retroactive FinOps is a dollar you spent badly first. The engineering hours you put into cleanup are hours you cannot put into the next thing.
If I started over tomorrow, FinOps would live in the landing zone. Tagging would be enforced at account creation, not retrofitted across years of workloads. Showback dashboards would be live before the first invoice. Reserved capacity, savings plans, and commitment strategy would be early architecture decisions, not later cleanup projects.
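"Enforced at account creation" can be literal. Here is a sketch using AWS Organizations tag policies, attached at the root so they apply to accounts that do not exist yet. The approved values and the root variable are hypothetical, and tag policies have to be enabled on the organization first:

```hcl
# Tag policy: cost-center must exist and must come from the
# approved list. Attached at the root, it covers every account
# ever created, including the ones created years from now.
resource "aws_organizations_policy" "cost_center_tag" {
  name = "require-cost-center"
  type = "TAG_POLICY"

  content = jsonencode({
    tags = {
      "cost-center" = {
        tag_key      = { "@@assign" = "cost-center" }
        tag_value    = { "@@assign" = ["payments", "risk", "platform"] }
        enforced_for = { "@@assign" = ["ec2:instance", "ec2:volume"] }
      }
    }
  })
}

resource "aws_organizations_policy_attachment" "root" {
  policy_id = aws_organizations_policy.cost_center_tag.id
  target_id = var.root_id # hypothetical: the organization root
}
```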
Cost anomaly detection would be wired into the same alerting backbone as security and reliability, because cost is a reliability concern. If a workload can accidentally burn through budget in a weekend, that is not just a finance problem. That is an engineering problem wearing a finance costume.
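Wiring cost into that backbone is a handful of resources if you do it on day one. A sketch using Cost Explorer anomaly detection, assuming an existing SNS topic (var.alerts_topic_arn, hypothetical) that security and reliability alerts already flow through:

```hcl
# Cost anomalies land on the same SNS topic as security and
# reliability alerts: cost treated as a reliability concern.
resource "aws_ce_anomaly_monitor" "services" {
  name              = "service-spend"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "platform_alerts" {
  name             = "platform-alerts"
  frequency        = "IMMEDIATE"
  monitor_arn_list = [aws_ce_anomaly_monitor.services.arn]

  subscriber {
    type    = "SNS"
    address = var.alerts_topic_arn # hypothetical shared topic
  }

  # Only page on anomalies worth at least $100 of impact.
  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      match_options = ["GREATER_THAN_OR_EQUAL"]
      values        = ["100"]
    }
  }
}
```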
The cultural piece matters as much as the tooling. If teams see their cost from day one, they design for cost. If they see it three years later, they argue about it.
3. I would standardize harder, sooner
This is the regret that hurts the most.
We let teams pick. CI tools. Secret managers. Observability stacks. Deployment patterns. Base images.
Each individual choice felt small at the time and was defensible in isolation. Each team had a reasonable story for why its pick was the right one for its context. We told ourselves variability was a feature, that meeting teams where they were would build trust, and that we could rationalize later.
Five years on, those small choices became a tax we paid every time we wanted to upgrade something, roll out a new control, or move a workload between teams.
Standardization feels heavy-handed in year one and saves you in year five. Variability is a debt that compounds faster than you think, and unlike financial debt, you cannot refinance it. You can only pay it down, slowly, while the interest keeps accruing.
If I were starting over, the platform would offer fewer choices and better defaults. Teams would still have escape hatches for genuine edge cases, but the default toolkit would be smaller, sharper, and non-negotiable.
The teams that pushed back hardest on standardization in year one would, I am certain, thank me in year five. They might not say it out loud, but I would know. Platform people are allowed a small amount of delusion. As a treat.
If I Started Tomorrow
If I had to compress the entire essay into a starting checklist, here is what I would protect:
- Build the landing zone before the first workload.
- Pick one infrastructure-as-code path and make it the paved road.
- Put FinOps, tagging, showback, and anomaly detection into the foundation.
- Define the golden paths before teams invent their own.
- Treat the pipeline as the product, not the plumbing.
That does not mean every edge case disappears. It means exceptions become explicit. The platform can support different shapes of work without letting every team quietly become its own platform team.
I would also make one leadership choice early: decide which decisions the platform owns and which decisions product teams own. Ambiguity here is expensive. If the platform owns too little, every team builds its own version of the same thing. If the platform owns too much, the platform becomes a ticket queue with a mission statement.
The goal is not control for its own sake. The goal is to remove low-value variability so teams can move faster, safer, and with less cognitive load.
What Greenfield Gets Right That Brownfield Never Can
This is where the cloud conversation becomes less about infrastructure and more about the delivery system itself.
In advanced manufacturing, the aspiration is a dark factory: a production line that runs lights-out, with no humans on the floor, because every step has been encoded into the machinery. Software has the same opportunity, and the cloud era is what makes it possible.
The path from a developer’s commit to production should run without human intervention, because every guardrail that matters lives in the pipeline rather than in someone’s head. Security, compliance, cost, reliability, and quality should not depend on a heroic person remembering the checklist. The checklist should be encoded into the system.
Pipeline-first delivery is the practical expression of that idea. It says the delivery path is the product. The application is the payload.
Most engineering organizations have this backward. They invest in the application and treat the pipeline as plumbing. The result is a beautifully crafted application moving through a pipeline that requires three approvals, two manual steps, and a meeting to release on a Friday.
No one should need a meeting to release on a Friday. Ideally, no one should release on a Friday, but that is a separate blog post and possibly a support group.
The reason brownfield organizations struggle to get there is not technical. It is human. Every existing exception is a person. Every team that does something differently has a reason that made sense at the time. Every legacy workaround was once a fix for a real problem.
You cannot encode what you cannot standardize, and you cannot standardize what people have already built workarounds for. Brownfield platforms spend most of their energy retrofitting around habits that calcified before the platform existed.
Greenfield gets the rare gift of writing the rules before anyone has a workaround. The pipeline exists before the workloads. The guardrails exist before the violations. The standards exist before the variability. Production-grade habits become normal because they are the only habits anyone has.
That is the real greenfield advantage. It is not the newer technology, the absence of legacy databases, or the chance to skip a generation of tooling. Those things matter, but they age out.
The advantage that compounds, the one I would protect above all others if I were starting again tomorrow, is the absence of bad habits. The chance to make the right thing the easy thing, the easy thing the default, and the default the only thing.
If you are about to start a greenfield cloud build, that is the gift you have been handed. Do not give it back by year three.
