
We ditched our vendor and built our own destiny

No vendor? No problem.
April 08, 2026

Estimated reading time: 6 minutes

Key takeaways:

  • Fight for operational thinkers. Technical skills are table stakes; engineers who understand how hardware actually runs at scale are the real differentiator.
  • Invest in telemetry early: you can’t fix what you can’t see, and generic signals will always miss the root cause.
  • Question the workflows nobody probes. The biggest wins are hiding in the processes everyone assumes are written in stone.

Faced with persistent delays and technical debt from vendor software, we made the strategic decision to design our own custom network interface card (NIC), with the aim of eventually deploying it to millions of servers.

Operating at hyperscale demands reliability and agility. Historically, our vendor-supplied NICs introduced persistent challenges. Critical software and firmware issues often took months to triage and resolve, resulting in operational delays and technical debt.

These delays forced us to implement workarounds across our software stack, including services, hosts, and switches. 

To control our destiny, we decided to build and deploy our own hardware. We assembled a cross-functional team from scratch, navigated organizational challenges, and developed the expertise needed to deliver on aggressive timelines.

By evaluating our approach to team composition, continuous testing, and deployment, senior engineering leaders can find actionable insights for driving technical transformation. 

Team composition and culture 

The first hurdle we faced was assembling a team with the right blend of skills and mindset. Technical expertise was essential, but operational experience – engineers who understood the realities of running hardware at hyperscale – was the true differentiator.

For example, during the project’s inception, multiple high-priority bets were happening in parallel across the organization, leading to a severe resource crunch. I identified a senior engineer whose approach to improving operations was exceptional.

However, other leaders contested this allocation because the engineer had recent experience in their respective domains. I had to build a strong case based on the candidate’s track record of improving systems operations that other teams had long overlooked. As this was a first-of-its-kind project for our organization, I argued that we desperately needed that detailed, operational perspective.

Once management approved, I still had to convince the engineer to leave their current, familiar domain. By being fully transparent about the pros and cons of joining this new initiative, I was able to motivate them to take the leap.

By building around core individuals like this, we fostered a culture of mentorship, enabling rapid onboarding and upskilling of new team members. Our team ultimately combined software and production engineers, each bringing unique perspectives. Success depended on breaking down silos and fostering a culture of shared ownership.

Our motto was: “nothing is somebody else’s problem.” We built strong relationships across hardware, software, and operations, adapting to existing processes but always pushing for improvement.

The experience transformed our team’s reputation, positioning us as trusted partners for future hardware initiatives. 

Infrastructure and continuous testing 

We developed robust CI/CD pipelines tailored for hardware-software integration, with reliability and rapid iteration as guiding principles.

Testing presented unique challenges. Our NIC had to interact seamlessly with other foundational hardware components, such as the baseboard management controller (BMC), the basic input/output system (BIOS), and rack switches (RSWs) from different vendors.

As failures could arise from any part of the ecosystem, pinpointing the exact issue was difficult. To solve this, we invested heavily in telemetry and dashboarding. Historically, our telemetry infrastructure was very basic, relying on generic signals that often required looping in domain experts for manual triage.

Building an in-house NIC presented a unique opportunity to expose new metrics and signals that generalist software engineers might not typically track, vastly improving our observability.

For instance, we encountered a highly elusive, hard-to-reproduce issue with bitflips. By leveraging our newly added, extensive telemetry for devlink health – which monitors the device’s operational status – we were able to directly correlate these bitflips to a memory corruption occurring within the NIC itself.

Without this granular data, the old system would have completely missed the root cause. This approach of building external checks and integrating data from the broader hardware ecosystem drastically reduced flakiness – a state where a test produces inconsistent results – and improved our overall test coverage.
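
To make the devlink health telemetry concrete, here is a minimal sketch of how those signals can be scraped on a Linux host: it shells out to `devlink -j health show`, walks every reporter, and flags anything that is not healthy. This is an illustration rather than our production collector, and the JSON field names and nesting are assumptions about typical iproute2 output that may vary between versions.

```python
#!/usr/bin/env python3
"""Sketch: surface devlink health reporter anomalies as telemetry signals."""
import json
import subprocess

def unhealthy_reporters():
    # Ask devlink (iproute2) for machine-readable health state across devices.
    out = subprocess.run(
        ["devlink", "-j", "health", "show"],
        capture_output=True, text=True, check=True,
    ).stdout
    # Assumed layout: {"health": {"pci/0000:03:00.0": [{"reporter": "fw",
    # "state": "healthy", "error": 0, ...}, ...]}} -- may vary by version.
    devices = json.loads(out).get("health", {})

    alerts = []
    for device, reporters in devices.items():
        for rep in reporters:
            # Flag reporters that are not healthy or have accumulated errors,
            # e.g. a reporter implicating memory corruption on a single NIC.
            if rep.get("state") != "healthy" or int(rep.get("error", 0)) > 0:
                alerts.append(
                    (device, rep.get("reporter"), rep.get("state"), rep.get("error"))
                )
    return alerts

if __name__ == "__main__":
    for device, reporter, state, errors in unhealthy_reporters():
        print(f"{device} reporter={reporter} state={state} errors={errors}")
```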

Continuous deployment and rollout

When we went deep into the existing tooling and infrastructure for upgrading, maintaining, and repairing the servers in our fleet, we identified several hidden inefficiencies.

As with most hyperscale systems, this tooling was complex, with multiple baked-in assumptions; from a high-level view, everything seemed stable until someone looked closely. We used our custom NIC project as an opportunity to introduce changes that not only supported our new hardware natively but also improved overall hardware management at hyperscale, from upgrades to repair workflows.

For instance, as software engineers, we find it easy to ignore what happens to a server when it fails, so long as another is available to run our service. However, by rolling up our sleeves and investigating exactly why our hardware components were being swapped, we discovered that the existing repair workflows were outdated and relied on highly generic signals.

Our NICs were frequently being physically swapped out before operators even checked basic cable integrity, leading to unnecessary replacement cycles, or “reswaps.” 

Many operators treated these legacy workflows as if they were written in stone. This presented us with another opportunity to drive change. We added NIC-specific signals to the workflow to bolster automated remediation decisions.
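
To illustrate the idea (the signal names and thresholds below are hypothetical placeholders, not our production workflow), the remediation logic boils down to consulting cheap, NIC-specific checks before a physical swap is ever recommended:

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    RESEAT_CABLE = "reseat_cable"
    REFLASH_FIRMWARE = "reflash_firmware"
    SWAP_NIC = "swap_nic"
    NO_ACTION = "no_action"

@dataclass
class NicSignals:
    link_detected: bool        # e.g. "Link detected" from `ethtool <iface>`
    cable_faults: int          # e.g. transceiver diagnostics via `ethtool -m`
    devlink_unhealthy: bool    # any devlink health reporter not "healthy"
    uncorrectable_errors: int  # device-level error counters

def recommend(s: NicSignals) -> Action:
    # Check cable integrity first: a down link with transceiver faults is
    # usually a reseat, not a dead NIC, and avoids an unnecessary "reswap".
    if not s.link_detected and s.cable_faults > 0:
        return Action.RESEAT_CABLE
    # A firing health reporter without hard errors points at firmware first.
    if s.devlink_unhealthy and s.uncorrectable_errors == 0:
        return Action.REFLASH_FIRMWARE
    # Only escalate to a physical swap when the card itself is implicated.
    if s.uncorrectable_errors > 0:
        return Action.SWAP_NIC
    return Action.NO_ACTION
```

The important property is the ordering: cheaper, reversible actions are attempted and verified before a technician is ever dispatched to swap the card.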

Ultimately, production environments can be highly unpredictable. Managing server infrastructure at hyperscale is a complex process, but taking the time to question established workflows constantly reveals new opportunities for improvement.

This mindset also applied to how we handled complex software upgrades for hardware – namely kernel and firmware. Because our vendors generally provided updates only around once a year, our legacy rollout systems treated NICs as third-tier citizens when it came to our Service Level Objectives (SLOs).

Our in-house NIC required a complete paradigm shift – frequent, targeted upgrades and a stricter SLO to support a higher rollout cadence. We anticipated that deploying custom hardware at this scale for the first time would inevitably reveal issues requiring rapid mitigation.

To achieve this, we collaborated closely with various cross-functional teams, including infrastructure and container teams, to build new targeting capabilities. We worked to overhaul the rollout process for disruptive hardware components and generalized these improvements across the ecosystem. This cross-functional effort improved our Service Level Agreements (SLAs) by 50%, drastically increasing the overall freshness of component software fleetwide.

Crucially, to optimize this new rollout velocity, we split firmware packages into disruptive and non-disruptive upgrades. In practice, disruptive upgrades are classified as those that incur substantial host downtime. These typically require draining the host of its active services, shifting workloads elsewhere, and initiating a hard or soft power cycle, such as a full reboot. 

On the other hand, non-disruptive upgrades must ensure minimal impact on running services, particularly from a networking perspective. In our hyperscale environment, even a network disruption on the order of tens of milliseconds is unacceptable for a non-disruptive classification.
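
A minimal sketch of that classification logic looks roughly like the following; the field names and the millisecond threshold are illustrative placeholders rather than our actual pipeline, since the real cutoff depends on service sensitivity.

```python
from dataclasses import dataclass

# Illustrative placeholder: anything in the tens-of-milliseconds range is
# already unacceptable for a non-disruptive rollout.
MAX_NONDISRUPTIVE_BLIP_MS = 10.0

@dataclass
class FirmwarePackage:
    name: str
    requires_power_cycle: bool  # needs a hard or soft reboot of the host
    est_network_blip_ms: float  # expected dataplane interruption while flashing

def is_disruptive(pkg: FirmwarePackage) -> bool:
    """Route a package to the disruptive pipeline (drain services, migrate
    workloads, power cycle) or the in-place, high-cadence non-disruptive one."""
    return pkg.requires_power_cycle or pkg.est_network_blip_ms > MAX_NONDISRUPTIVE_BLIP_MS

# Hypothetical example: a live-flashable patch with a ~2 ms blip stays
# on the non-disruptive path.
print(is_disruptive(FirmwarePackage("example-patch-1.2", False, 2.0)))  # False
```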

This strict separation allowed us to iterate quickly, minimize downtime, and respond rapidly to emerging issues at scale.


The shift from vendor dependency to in-house innovation

Ultimately, accelerating technical transformation – even in legacy environments – requires more than just engineering prowess; it requires strategic team composition, an operational mindset, and deep cross-functional collaboration.

By advocating for key talent, investing heavily in infrastructure, and building telemetry for reliability, you can empower a small, diverse team to deliver custom hardware under aggressive timelines. 

Whether you are building a custom NIC at hyperscale or driving another large-scale technical initiative, success hinges on navigating organizational resistance and adapting legacy processes.

By fostering a culture of shared ownership, you can build teams that thrive in ambiguous, high-stakes environments, successfully bridging the gap between technical excellence and lasting organizational impact.