‘Kubernetes is a platform for building platforms’[1]. This statement, while simple and powerful, has likely contributed to the panoply of over-engineered solutions.
Looking back, this phenomenon started to take off with the publication of the Google Site Reliability Engineering (SRE) book and gained significant momentum as Kubernetes became more stable and widely deployed in production. This is in no way an indictment of Kubernetes; it is a solid solution for the problems it was designed to address. The failure lies mostly in people not understanding the context of the solution or determining whether it aligns with their problems.
You get a platform and you get a platform and…
Developers love abstractions, which may explain their fascination with building platforms. If we go back to the days when object-oriented programming (OOP) took off, we can see similar patterns. What was intended to be a tool to help developers solve their problems soon became the main source of problems. People started to spend more time crafting object hierarchies and refactoring classes to better model their domain, and less time producing solutions. Over time, frameworks evolved to supposedly help developers save time in building applications. Arguments ensued over which framework had the best abstractions and was the most extensible. It continues to this day, with fresh arguments about React or whichever framework is the latest to challenge it.
The trend continued with meta-platforms like Adobe Flash. Users voted with their wallets on which computing platforms they preferred: Windows vs macOS, iOS vs Android. With the exception of multibillion-dollar software vendors, independent software vendors (ISVs) needed to decide how to allocate their development budgets to maximize their chances of reaching a broad base. Vendors like Adobe bet on providing a platform layer on top of these native platforms in the hope that they would become the platform. This concept of papering over the base platform in order to own the layer developers write to has a long history. Unfortunately, that history is littered with failures, including Adobe Flash.
The fundamental misunderstanding most people had was that although these platforms may have addressed 80% of boilerplate application development issues, the last 20% was different for everyone. This led to people either fighting against the frameworks or rolling their own solutions in a way that added more complexity. As web development became the dominant approach to building and deploying applications, this embarrassment of riches also created additional complexity in production. Developers wanted to use every tool at their disposal, and operations engineers wanted more standardization and predictability in their environments.
Enter Docker.
The rise of the platform team
Docker provided a nice abstraction (there’s that word again) over Linux containers (LXC), offering a technical solution to some of the challenges that DevOps aimed to solve. Developers were given a tool in the form of container images that packaged their applications and their dependencies into one atomic, deployable unit. Operations engineers were given a tool that provided a standardized contract for the applications they needed to deploy, as opposed to having to become experts in every language ecosystem that developers chose to use. The ability to ensure that the code that ran on a developer’s laptop behaved the same way when it ran on a server was beyond compelling. When people realized that Google used containers and had systems like Borg and Omega to orchestrate them, it seemed as if all of the pieces were falling into place. Although Docker Swarm and Apache Mesos tried to lay claim to being the standard container orchestration system, Kubernetes earned that role. Kubernetes originated as an evolution of the systems Google used to manage its own globally distributed systems, and it won over software and operations engineers alike, due in no small part to its fostering of an open source community from day one.
Meanwhile, another phenomenon was occurring in parallel. As more and more folks started reading and internalizing the Google SRE book, an unfortunate rebranding was taking place. In too many organizations, people who were hired to be system administrators were either being (or asking to be) retitled as “site reliability engineers”. In some cases, it was a way to erase a perceived stigma surrounding their role. In other cases, it was a path to increased compensation and privilege. No matter the reason, this led to a fundamental impedance mismatch. While a Venn diagram of the roles would show some overlap, there is a difference between software engineering and operations engineering. It mostly comes down to mindset: for operations engineers, code is a means to an end; for software engineers, code is often the end.
The original description of site reliability engineering is ‘what happens when software engineers are tasked with the operations function’[2]. In other words, people with a software engineering mindset were figuring out how to write software systems that did what operations engineers did. Yet in many organizations, teams of SREs without a software engineering mindset were attempting to do this work. Some managed to succeed, but many failed, largely because the management that enabled their rebranding did not understand the cultural underpinnings that led to SRE.
As Kubernetes adoption increased, so did the idea of platform teams. If Kubernetes was the platform for building platforms, then the mission for many SREs was to build a platform. So platform teams were dutifully formed to build their namesake on top of Kubernetes, upon which the organization’s applications would be deployed. If you were lucky, someone may have asked, ‘Why do we need to build a platform?’ And if you were even luckier, someone may have pondered whether the organization was equipped to build and manage a platform alongside its mission-critical applications. To riff on a classic Google SRE saying, ‘luck, much like hope, is not a strategy’[3].
It’s the culture, silly
Along my career journey I took a role as, you guessed it, VP of Platform Engineering at an edge cloud provider. The SVP sold me on being part of a transformation of the engineering organization. She had already sold the executive team on the need both for this transformation and for building a platform for the organization. While the latter appealed to me, I probably should have dug in more on the need for an “engineering transformation”.
After the first 90 days or so of observing and learning more about the organization, technology and processes, I was able to formulate a basic game plan for the platform initiative. The first step was not to assume Kubernetes from the start. Our system had two major components: a control plane and a data plane. The control plane was the most obvious fit for a Kubernetes-based solution. The data plane was not as obvious due in part to its bespoke nature and the high risk to the business if we messed up. So, we prioritized work on the control plane side in parallel with some long-standing and critical operations work.
The overall platform model was a simple CI/CD-oriented approach that laid out a “golden path” for internal teams. The idea was that if you followed the golden path, you’d get a lot of functionality for free, such as immutable infrastructure, observability, and so on. Still, we recognized that some teams might have special requirements, so we were going to make sure there were sufficient escape hatches where needed. We started prototyping the shell of the system along with a CLI that we’d provide to all engineering teams. We were off to an auspicious start, or so I thought.
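To make the golden-path idea a little more concrete, here is a rough, hypothetical sketch of the kind of workflow such a CLI might wrap; none of the names, registry addresses, or defaults below describe the system we actually built. It assumes Docker for image builds, kubectl for rollouts, and a `k8s/` directory as the escape hatch for teams that maintain their own manifests.

```python
#!/usr/bin/env python3
"""Hypothetical golden-path deploy: build, push, and roll out with sane
defaults, but honor a team's own manifests when they opt out of them.
Purely illustrative; not the CLI described in this article."""
import os
import subprocess
import sys

REGISTRY = "registry.example.com"  # assumed internal registry


def sh(*cmd: str) -> None:
    """Run a command and fail loudly, CI-style."""
    subprocess.run(cmd, check=True)


def deploy(app: str, version: str) -> None:
    image = f"{REGISTRY}/{app}:{version}"
    sh("docker", "build", "-t", image, ".")
    sh("docker", "push", image)
    if os.path.isdir("k8s"):
        # Escape hatch: the team maintains its own Kubernetes manifests.
        sh("kubectl", "apply", "-f", "k8s/")
    else:
        # Golden path: update the standard deployment provisioned for the team.
        sh("kubectl", "set", "image", f"deployment/{app}", f"{app}={image}")


if __name__ == "__main__":
    deploy(sys.argv[1], sys.argv[2])  # e.g. deploy.py checkout v1.4.2
```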
What became apparent rather quickly was that people were not as bought in on either the engineering transformation or the platform strategy as was originally communicated. The organization had as much management debt as it had technical debt. On one end of the spectrum, we had a nearly decade-old system that was driving the business and that almost no one could reason about on their own. New engineers were told that it could take them up to a year to understand how things worked. In addition, there were many “teams of one”, otherwise known as single points of failure (SPOFs). If one of these people was out sick or on vacation, you found yourself praying that nothing would go wrong with their subsystem.
The worst part was that while people may have complained about this situation, they were often resistant to any attempts to change it. Much of this was due to the aforementioned management debt. My predecessors had proposed similar initiatives to drive change, and those initiatives had failed. Even when small wins were achieved in order to build confidence in the platform strategy, the cynicism remained. This is apparently why an “engineering transformation” was needed. The culture was broken, and that is a huge impediment to any technology strategy, but especially to one involving a move to Kubernetes. In my previous role, we spent far more time talking to customers about their culture than about the technology. Moving from ticket-based processes to APIs and declarative infrastructure requires a rethinking of many things, including people’s roles and responsibilities.
Our SRE team was also an example of the rebranded system administrator syndrome. They were all good people who worked incredibly hard. However, some wanted to be SREs but lacked the skills and/or the mindset. Others just wanted to keep doing what they were doing. To make the slightest bit of progress, we had to reorganize so that the SREs with growth mindsets and solid software engineering skills could move the platform work forward. It wouldn’t be enough. Between the departure of the SVP, some other organizational issues, and impending burnout, I chose to leave before seeing the platform project through to completion. The company will be fine so long as it addresses its culture issues. The technology changes, while complex, are solvable problems.
git push heroku main
Do you need a team to build a platform for your applications? As with most things, the answer depends on context. Here are some scenarios where I do not believe you need a platform team to build a platform for you.
- You are an early stage startup that has yet to validate your idea with customers.
- Your application targets a niche domain, with a total addressable customer base that does not exceed tens of thousands.
- You can’t afford to fund a separate, full-time operations team.
However, this doesn’t mean you don’t need a platform. We all do at some level. It just seems that developers feel they need their own bespoke platform rather than using something off the shelf. This shouldn’t be too surprising, as there is a long history of not-invented-here (NIH) syndrome in our industry. This is a shame, as a large part of the solution for many has long been delivered by Platform as a Service (PaaS) implementations such as Heroku and Google App Engine.
Heroku has been a viable solution for deploying and managing web applications and services for a very long time. Between its unmatched developer experience and support for many different language ecosystems, Heroku is very attractive. Its biggest downsides are its cost at scale and the black-box nature of its implementation. The latter may be the most concerning to many developers, but I also think it’s the most solvable.
If you need to do things slightly differently than Heroku supports by default, you have little ability to make changes. You are also constrained to Heroku’s infrastructure. It is in these areas where Kubernetes being a platform for building platforms comes into play. It was designed for building a PaaS-like experience. If Heroku implemented their developer experience on top of Kubernetes, how powerful would that be? For the majority of current Heroku users, not much would change. For those who either abandoned or didn’t consider Heroku, this might be a compelling development. Building on top of a tested, extensible and open source infrastructure solution might provide the assurances developers need. If we could have the Heroku experience deployable to AWS, Azure, Google Cloud or our own self-hosted Kubernetes clusters, that could be a game changer.
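As a rough illustration of what a Heroku-style layer on Kubernetes would absorb, here is a hypothetical sketch of the boilerplate hiding behind a single `git push`: rendering and applying the Deployment and Service a developer would otherwise write by hand. The app name, image, ports, and replica count are illustrative assumptions, not any real product’s defaults.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of the Kubernetes boilerplate a Heroku-like
`git push` experience would hide. Illustrative only; a real PaaS layer
would also handle image builds, routing, TLS, logs, and scaling."""
import subprocess
import sys

MANIFESTS = """\
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {app}
spec:
  replicas: 2
  selector:
    matchLabels:
      app: {app}
  template:
    metadata:
      labels:
        app: {app}
    spec:
      containers:
      - name: {app}
        image: {image}
        ports:
        - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: {app}
spec:
  selector:
    app: {app}
  ports:
  - port: 80
    targetPort: 8080
"""


def release(app: str, image: str) -> None:
    """Render the manifests a developer would otherwise maintain by hand
    and stream them to `kubectl apply`."""
    manifests = MANIFESTS.format(app=app, image=image)
    subprocess.run(["kubectl", "apply", "-f", "-"],
                   input=manifests, text=True, check=True)


if __name__ == "__main__":
    release(sys.argv[1], sys.argv[2])  # e.g. release.py myapp ghcr.io/example/myapp:latest
```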
Now, those in the know would say that Pivotal Cloud Foundry addressed this ages ago to a large degree, though it was not built atop Kubernetes. With VMware’s acquisition of Pivotal, that is changing: work is underway to host Cloud Foundry on Kubernetes within the VMware Tanzu portfolio of solutions. This is all great work and I want to see it succeed. My only concern relates to why Cloud Foundry goes unrecognized. Enterprise solutions are often ignored or given short shrift by open source and startup developers. The solution that will likely take off will be the one that appeals to those constituencies. Can VMware Tanzu bridge that gap? Only time will tell.
Conclusion
Over the next decade, the true test of Kubernetes’ success will be a combination of how widely it is deployed and how much less we find ourselves talking about it. Kubernetes needs to drop beneath developers’ radars and almost be taken as a given. More importantly, for organizations that are focused on building applications for people to use, their development teams should be spending 80-90% of their time building the application, not tending to the infrastructure. Their platform team should be the cloud provider they chose, not the one they felt they had to hire.
Footnotes
[1] Kelsey Hightower on Twitter: ‘Kubernetes is a platform for building platforms. It’s a better place to start; not the endgame.’