As tech faces a downturn, platform engineering offers a way to decrease cognitive load, cut cloud costs, and boost delivery speed. What’s the catch?
The tech industry is in the midst of another round of seemingly senseless layoffs. In a backlash to a period of over-hiring, incredibly profitable companies are now cutting costs by cutting people. And because of its high salaries, engineering is often seen as a cost center rather than an opportunity for an organization to thrive.
With burnout in software engineering at a high, leaner teams make for a less-than-optimal solution. Fortunately, the emerging practice of platform engineering offers an opportunity for developer teams to reduce their cognitive load and do more with less.
What is platform engineering?
Platforms have existed as long as software has. But they tended more towards the command-and-control way of working.
The recent popularity of platform engineering is more of a sociotechnical response to cloud-native complexity and constantly changing security and reliability requirements. It aims to bring together the various tools and workflows that developers use to write, test, and deploy their code into a simple, self-service platform. The goal is to abstract away complexity, monotony, and repetition, freeing up developers to work on interesting business or technical problems.
In his newsletter, The Pragmatic Engineer, Gergely Orosz writes about “non-coding work,” arguing that the platform should take care of anything a developer needs to get up and running, specifically targeting faster onboarding.
For example, by building the open source Backstage developer portal and adopting a platform mindset, Swedish streaming giant Spotify was able to make newly onboarded engineers productive quickly, reducing the average time it took new devs to make their tenth pull request from 60 days to just 20.
Platform engineering specialist Syntasso’s Abigail Bangser echoes Orosz by saying that a platform should focus on “not unimportant but non-differential work,” like infrastructure, scaling, and security.
How to get started
Like most projects, success hinges on getting customer feedback early and often. Though, this time, your customers are also your colleagues. Sometimes that makes it harder.
If you adopt a top-down mindset of “this is the complex platform that app teams need”, you’re going to waste more time and money building another over-engineered, underutilized tool. By starting small, you’re able to iterate based on rapid developer feedback cycles, and pivot based on new demands.
The popular Team Topologies book advocates for kicking off with the Thinnest Viable Platform, which can even be a wiki or getting-started documentation. Essentially, it is any first step that reduces cognitive load on the stream-aligned teams, says co-author Manuel Pais.
Platform engineering should concentrate on your organization’s best practices and agreed-upon toolchain and workflow choices – this isn’t for your innovation sandbox, but for your tried and true norms. Without stifling innovation, platform engineering is all about creating golden pathways to guide the majority of teams – you can stray off the path, but at your own cost, both in terms of time and responsibility for the outcome.
To get started, check out Jira tickets or host a cross-team pizza party to uncover commonly shared bottlenecks. What processes can you automate to help app teams deliver business value faster? What are developers telling you is slowing them down? Look for answers that are repeated across multiple teams. As budgets tighten, you want everyone focused on value drivers, not pushing buttons.
Just make sure devs and ops aren’t the only ones involved. In order for your platform to be successful, it must factor in stakeholders across security, legal, governance and beyond.
“Stop thinking about the tooling we use as something we ‘give devs’,” said Mario Platt, Director of Information Security and Privacy at LastPass. “Then evolve it into a ‘code-scanning service’ that is provided at the platform level that comprises the tooling, best practices and training, clear policy exception processes, integration into other SDLC [software delivery lifecycle] activities, and code libraries. Anything to make its consumption easy and repeatable.”
Cutting cloud costs
Aparna Subramanian, Director of Production Engineering at Shopify, kicked off a recent KubeCon Europe panel by asking the audience who was experiencing “efficiency panic” at their organization.
Back in the days of on-premises data centers, infrastructure costs were highly predictable. The move to the cloud has turned them into a pay-as-you-go operational expense that’s impacting companies greatly, she said. In a world where an estimated third of enterprise cloud spend is simply waste, something needs to change.
Continuously monitoring usage and then rightsizing instances to workloads is an effective way to cut cloud costs – and your carbon footprint – but it is far from simple. Lengthy cloud bills break spend down by project and cluster, but that’s not really helpful on a multi-tenant platform, Subramanian said. Most of her audience knew their total cloud spend, but only a handful knew the cost per application or service.
Instead of increasing developer cognitive load by making developers responsible for tracking their own cloud spend, a platform can automatically generate a single source of truth on costs.
An internal developer portal that integrates with tooling like Kubecost can offer a common language between the business and dev teams, mapping cloud spend data to developers, teams, microservices, systems, and domains, giving insight at the team level of who is spending what, within different environments.
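As a rough illustration of that mapping, here is a minimal Python sketch that rolls allocated spend up to the team and environment level. The `CostRecord` shape, the team and service names, and the dollar figures are all hypothetical; this is not Kubecost’s data model or API, just one way a portal might aggregate whatever allocation data it receives.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRecord:
    """One line of allocated spend (hypothetical shape, not a Kubecost schema)."""
    team: str
    service: str
    environment: str
    usd: float


def spend_by_team_and_env(records):
    """Sum spend for each (team, environment) pair."""
    totals = defaultdict(float)
    for record in records:
        totals[(record.team, record.environment)] += record.usd
    return dict(totals)


# Illustrative data only: names and amounts are made up.
records = [
    CostRecord("checkout", "cart-api", "prod", 1200.0),
    CostRecord("checkout", "cart-api", "staging", 300.0),
    CostRecord("checkout", "payments", "prod", 800.0),
    CostRecord("search", "indexer", "prod", 500.0),
]

totals = spend_by_team_and_env(records)
for (team, env), usd in sorted(totals.items()):
    print(f"{team:<10} {env:<8} ${usd:,.2f}")
```

The same grouping could key on service, system, or domain instead; the point is that the platform does the roll-up once, rather than each team decoding the cloud bill on its own.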
“Right now it’s everybody’s problem, but it’s nobody’s problem, so having the central team is really important,” Subramanian said. She advocates for cross-organizational collaboration between engineering, finance, and procurement – the latter of which typically negotiates contracts with the cloud providers.
One of the first tactics to cut cloud costs is an autoscaler, but that isn’t the right fit for all use cases. Much of the KubeCon audience was using horizontal and cluster autoscalers, but few were leveraging vertical autoscaling. Several were still running in more predictable on-premises data centers.
“Shopify is an ecommerce platform and sometimes we need to reserve and scale up all the way because there’s a big flash sale coming. At that time you don’t want to be scaled all the way down,” Subramanian said, reflecting on her last 18 months concentrating on platform efficiency, when her team realized there are “times when you want to protect your reputation, [and it’s] not about efficiency.”
But during non-peak times – basically anything outside Black Friday through Cyber Monday – Shopify heavily uses horizontal pod and cluster autoscaling. She advocates for a central team to look at and react to cloud spend every day. At Shopify, that central team sends messages over Slack with suggestions for optimal CPU usage. “The teams always know more, so the final decision is up to them to make the appropriate changes,” she said.
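To make the idea of a CPU suggestion concrete, here is a minimal Python sketch of one possible rightsizing rule. It is not Shopify’s method: the function name, the 20% headroom, and the 25% minimum-saving threshold are all assumptions chosen for illustration, and the sketch only builds the message text rather than calling any chat API.

```python
def suggest_cpu_request(requested_m, p95_usage_m, headroom_pct=20, min_saving_pct=25):
    """Return a smaller CPU request in millicores, or None if no change is worth suggesting.

    Assumed rule (illustrative only): target the observed p95 usage plus some
    headroom, and only speak up when the saving is large enough to matter.
    """
    if requested_m <= 0:
        raise ValueError("requested_m must be positive")
    # Ceiling division on integers avoids floating-point rounding surprises.
    target_m = -(-p95_usage_m * (100 + headroom_pct) // 100)
    if target_m >= requested_m:
        return None  # this sketch only suggests downsizing
    saving_pct = (requested_m - target_m) * 100 // requested_m
    return target_m if saving_pct >= min_saving_pct else None


def format_suggestion(service, requested_m, p95_usage_m):
    """Build a human-readable nudge; the owning team makes the final call."""
    target_m = suggest_cpu_request(requested_m, p95_usage_m)
    if target_m is None:
        return None
    return (
        f"{service}: requests {requested_m}m CPU, p95 usage is {p95_usage_m}m. "
        f"Consider lowering the request to about {target_m}m."
    )


print(format_suggestion("cart-api", 2000, 400))
```

A real implementation would need usage data over a representative window and an exclusion list for planned peaks, since, as Subramanian notes, there are times when scaling down is the wrong call.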
Building a culture of cost optimization
Shopify still has the problem of sometimes fragmented clusters, with empty pockets of available resources that don’t get reaped. This isn’t unusual, as cloud cost optimization is an ongoing process that requires a cost-efficient and accountable culture.
Financial software maker Intuit also has underutilized pods, and found it wasn’t taking full advantage of its nodes. Todd Ekenstam, Principal Software Engineer on Intuit’s core systems team, was on the same KubeCon panel, talking about Intuit’s Descheduler, which terminates nodes roughly every 25 days, no matter what.
This open-source project was originally developed for the financial software company’s security and compliance requirements, but, he said, it has had a side effect of forcing applications to get rescheduled, which trains developers to not count on pods running forever.
“By doing that, we’ve sort of built this culture of understanding how Kubernetes works for the developers,” Ekenstam said.
Intuit has reasonably predictable workload spikes, such as for its TurboTax application around tax season. With this in mind, the company heavily uses autoscalers, but, he warned, this can lead to disruption, so apps need to be built to tolerate it.
“You can’t launch a pod in Kubernetes and expect that pod to live forever,” Ekenstam said. They follow the 12-factor methodology, building for scalability and disposability at any time.
Third panelist Phillip Wittrock, a software engineer at Apple working on Kubernetes, advises teams to “start out measuring where your big wins are, versus those that are going to move the needle a lot but could take a long time to do, versus that which isn’t going to move the needle as much but is very easy to get done.” To achieve this, he says, you have to engage with the right teams – and a lot of data. “What optimal efficiency looks like may be surprising,” he said.
Hurdles platform teams face
That may all sound great, but there are several hurdles to overcome before implementing an effective platform engineering layer at your organization.
The biggest risk for a platform is that when you build it, nobody comes to use it. Platform teams must treat their colleagues like customers and the platform as a product. This can be done by embedding within application teams to cultivate empathy, regularly consulting with an internal user group, or establishing someone as an internal developer advocate.
Platform teams should always be looking for ways to engage with their customers and measure the value they are adding for their users. You could create service-level objectives for your internal customers, or run regular net promoter score surveys. Involve your users in your retrospectives. Just always look to tighten that feedback loop.
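For teams that have not run such a survey before, a net promoter score is simple to compute: respondents rate from 0 to 10 how likely they are to recommend the platform, and the score is the percentage of promoters (9 or 10) minus the percentage of detractors (0 to 6). A minimal Python sketch, with made-up survey responses:

```python
def net_promoter_score(ratings):
    """Return the NPS (from -100 to 100) for a list of 0-10 survey ratings."""
    if not ratings:
        raise ValueError("at least one rating is required")
    promoters = sum(1 for rating in ratings if rating >= 9)
    detractors = sum(1 for rating in ratings if rating <= 6)
    return round(100 * (promoters - detractors) / len(ratings))


# Illustrative responses from ten internal developers (not real data).
survey = [10, 9, 8, 7, 6, 3, 10, 9, 5, 8]
print(net_promoter_score(survey))  # 4 promoters, 3 detractors out of 10 -> 10
```

With a small internal audience the number will swing a lot between surveys, so the trend and the free-text comments are likely to be more useful than any single score.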
“If you are running any software in production, you are running it on top of a platform,” said Sasha Rosenbaum, cofounder of tech consultancy Ergonautic. But what that includes will vary by organization.
“Your platform may be made of ad-hoc paperwork, people, and processes, or intentionally designed to provide services on top of your infrastructure,” Rosenbaum said. “The more unplanned your platform architecture is, the harder it will be to ensure reliability and provide a good customer experience while running on top of it.”
Finally, by using a single platform to bring all engineering teams together, you can start to show management more clearly how engineering work links to business value. In tighter times, platform engineers must be able to talk business and finance to secure their jobs.
Recession-proof platform engineering
Despite these opportunities to do more with less, we are still seeing platform teams on the chopping block as deep cuts are made across the industry. Any team that is not making measurable contributions toward increased revenue or decreased costs is at risk.
No team is 100% safe from this wave of layoffs. But there are ways for platform engineering teams to find more success, irrespective of the economy:
- Build the thinnest viable platform – Prioritize ways to ease engineer load.
- Treat the platform as a product – Treat internal developers like customers.
- Establish a common language – A platform should create transparency among all stakeholders.
- Focus on cloud cost efficiency – Help your organization cut cloud costs before people.
- Platforms include people and processes, too – Adoption hinges on involving your whole organization.
By focusing on these tangible goals, you can start to build a platform engineering organization for all the right reasons, and set your teams up for sustainable success.