What your engineering team really needs from an AI model

And how to choose the right one.
June 02, 2025

For many technical leaders, it would be easy to treat AI integration as a downstream concern. 

Embedding LLMs into developer workflows, whether through code assistants in the IDE, AI-generated documentation, or test automation in CI pipelines, often gets handed off to platform or developer experience (DevEx) teams. 

But framing it that way misses the strategic significance. These tools don’t just optimize workflows – they reshape how engineers think, collaborate, and ship software. As leaders, we need to guide not just how these systems are deployed, but why and to what end.

Large language models (LLMs) aren’t just another tool or framework. They represent a fundamental shift in how software is built. The choices made today will influence how our teams operate and grow tomorrow.

Navigating this shift starts by understanding what engineers truly need – not what vendor marketing materials suggest, or what benchmark numbers boast. Instead, look for tools that can meaningfully improve how teams work.

The real impact comes from deliberately choosing the right model for your team’s unique challenges, and just as critically, putting systems in place to validate and refine how those models are used.

Not AI for the sake of it

Engineers don’t want AI for the sake of it. They want tangible, meaningful assistance that removes friction from their workflow without introducing new complexities.

First, there’s code generation. Not just generating trivial boilerplate, but being able to scaffold out meaningful chunks of logic, propose alternate implementations, or handle mundane tasks like setting up routes, interfaces, or environment configurations. When done well, it saves time and shifts focus to higher-order architecture and design decisions.

Then there’s testing support. Writing good tests takes time that often gets sacrificed under pressure. LLMs can propose test cases, suggest edge conditions, and generate stubs that would otherwise get skipped. This doesn’t replace test-driven thinking, but it augments it in a powerful way.

Another big one: understanding existing code. Every engineer knows the pain of diving into a poorly documented function or an inherited service with no clear rationale for why it was built the way it was. When an LLM can step in and explain the logic, offer context, or even draft inline documentation, it smooths the path dramatically – especially for onboarding or cross-functional collaboration.

In short, engineers want AI assistants that feel like collaborators. Tools that make them faster and more confident – not tools that second-guess them or create more cognitive load.

Choosing the right model

Once you know what the team truly needs, the next step is choosing the right LLM to meet those needs. This isn’t a technical decision – it’s a strategic one.

Benchmarks like HumanEval (for code generation), MMLU (for broad knowledge and reasoning), and MATH (for competition-style mathematical problem-solving) are a good starting point for comparing models like Claude 3.5 Sonnet, GPT-4.1, and Code Llama side by side on measurable terms.

But while benchmarks are useful, they only tell part of the story. What really moves the needle is understanding how well each model aligns with your own needs. 

Code Llama, for instance, stood out in raw coding tasks thanks to its training on open-source codebases – a great match for syntax-heavy logic or low-level implementation work.

Claude 3.5, on the other hand, excelled at maintaining context across long spans of text and reasoning through complex scenarios – ideal for documentation, architectural discussions, or reviewing code across multiple files. 

GPT-4.1 showed its strength as a highly adaptable generalist – perfect when context switching across different media, documentation, and product logic was required.

The takeaway? There’s no universally superior LLM – only the right model for your specific use case. 

Our selection process now starts with one core question: What exact problem are we trying to solve? Only then do we begin matching the model to that challenge.

Integration is another critical pillar. A powerful model isn’t helpful if it doesn’t fit smoothly into your team’s workflow. We evaluated integration through a few lenses: IDE compatibility, API design, deployment flexibility (cloud vs. on-prem), and how well it meshed with our existing tooling. 

GPT-4.1 was a win in this category – it connected effortlessly with VS Code, and its availability through Azure meant we could enforce enterprise-grade data controls with confidence. Meanwhile, Claude 3.5 surprised us with its lightweight Slack integration. It turned out to be an excellent tool for fast, informal problem-solving during development sprints.
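One way to make the integration lenses above comparable across candidates is a rough weighted scorecard. The sketch below is only an illustration: the lens names mirror the ones listed above, but the weights, the 1–5 scale, and the placeholder candidates `model_a` and `model_b` are assumptions, not results from any evaluation.

```python
# Hypothetical weights for the four integration lenses; tune them to
# your own priorities. They do not need to sum to exactly 1.
WEIGHTS = {
    "ide_compatibility": 0.3,
    "api_design": 0.2,
    "deployment_flexibility": 0.3,
    "tooling_fit": 0.2,
}


def weighted_score(scores: dict[str, float],
                   weights: dict[str, float] = WEIGHTS) -> float:
    """Return the weighted average of per-lens scores (assumed 1-5 scale)."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(scores[lens] * w for lens, w in weights.items()) / total_weight


def rank_candidates(candidates: dict[str, dict[str, float]]) -> list[str]:
    """Return candidate names ordered from best to worst weighted score."""
    return sorted(candidates,
                  key=lambda name: weighted_score(candidates[name]),
                  reverse=True)


# Placeholder candidates with made-up scores, for illustration only.
candidates = {
    "model_a": {"ide_compatibility": 4, "api_design": 3,
                "deployment_flexibility": 2, "tooling_fit": 5},
    "model_b": {"ide_compatibility": 3, "api_design": 4,
                "deployment_flexibility": 5, "tooling_fit": 3},
}
ranking = rank_candidates(candidates)
```

A scorecard like this doesn’t replace judgment, but it forces the team to state its priorities before the comparison starts rather than after.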

Data privacy and compliance

Using AI at the enterprise level can feel like walking a tightrope. You’re constantly balancing the need for speed with the responsibility to manage risk, pushing innovation forward while staying aligned with regulatory and security standards. We needed clear guarantees that sensitive data wouldn’t be used to retrain models and that all interactions would remain encrypted end to end.

Some vendors helped make that leap easier. Anthropic, for example, offered zero-retention models along with SOC 2 compliance – giving us the assurance that our data wasn’t being logged or reused, and that strong operational controls were in place.

At the same time, we explored self-hosting options. For workflows involving sensitive customer information or proprietary intellectual property, spinning up private LLM instances behind our own firewall made the most sense. While this route came with added complexity – from infrastructure setup to ongoing maintenance – it gave us full control over data flow, model behavior, and access controls. In certain scenarios, especially those tied to compliance-heavy domains, that level of control is essential.

Verify everything

Even the smartest LLMs have a flaw: they can be confidently wrong. That’s why verification is foundational to AI rollouts.

We established one guiding principle early on: every AI-generated output must be easy to verify. This wasn’t about slowing things down – it was about creating trust in the system and building a safeguard against subtle, hard-to-catch errors.

To do this, we introduced human-in-the-loop reviews for any output tied to meaningful decisions – whether it was test cases, documentation, or code snippets. Nothing makes it into production unless it’s reviewed and approved by an engineer. We also log every AI interaction and capture real-time feedback from the team – a quick thumbs up/down and a note if needed – creating a lightweight audit trail.

Verification didn’t stop there. We regularly cross-check outputs against internal knowledge bases, open-source libraries, and historical implementations. This helped us flag hallucinations early and reduce downstream risk.
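One narrow, automatable form of this cross-check is to compare the modules a generated Python snippet imports against a list of packages you know exist in your environment. The sketch below uses the standard-library `ast` module; the function name `unknown_imports` and the allowlist are assumptions for illustration, and a flagged name is a prompt for a human to look, not proof of a hallucination.

```python
import ast


def unknown_imports(code: str, known_modules: set[str]) -> set[str]:
    """Return top-level module names imported by `code` that are not known.

    Relative imports (from . import x) are skipped because they refer to
    the surrounding package rather than an installable dependency.
    """
    tree = ast.parse(code)
    found: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found - known_modules


# "fastjsonx" is a made-up package name standing in for a hallucinated one.
generated = (
    "import os\n"
    "import fastjsonx\n"
    "from requests.adapters import HTTPAdapter\n"
)
suspicious = unknown_imports(generated, known_modules={"os", "requests"})
```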

Over time, this verification process evolved into a continuous improvement loop. We now use the insights to refine prompts, tune model behavior, and upskill the team on how to get the most from our AI tools.

Operationalizing LLMs

Adopting LLMs is less about deployment and more about operationalization. The biggest gains don’t come from the first prompt – they come from what happens after that initial deployment, when you start treating the model as part of your engineering system, not just an isolated tool.

That means investing in context. An LLM without the right inputs is like a senior engineer dropped into a project with no documentation. The better the prompt engineering – and the more relevant metadata you supply, such as module structure, commit history, or architectural notes – the smarter and more useful the model becomes.
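To make the idea of supplying context concrete, here is a minimal sketch of assembling a prompt from the kinds of metadata mentioned above. The function `build_prompt`, its parameters, and the section labels are hypothetical. How you gather the module list or commit messages, and how the assembled text is sent to a model, is left out on purpose.

```python
from typing import Sequence


def build_prompt(task: str,
                 module_paths: Sequence[str] = (),
                 recent_commits: Sequence[str] = (),
                 architecture_notes: str = "",
                 max_commits: int = 5) -> str:
    """Combine a task description with project context into one prompt.

    Empty context sections are omitted, and only the first `max_commits`
    commit messages are kept so the prompt stays short.
    """
    sections = [f"Task:\n{task.strip()}"]
    if module_paths:
        listing = "\n".join(f"- {path}" for path in module_paths)
        sections.append(f"Module structure:\n{listing}")
    if recent_commits:
        latest = list(recent_commits)[:max_commits]
        listing = "\n".join(f"- {message}" for message in latest)
        sections.append(f"Recent commits:\n{listing}")
    if architecture_notes.strip():
        sections.append(f"Architecture notes:\n{architecture_notes.strip()}")
    return "\n\n".join(sections)


# Illustrative inputs only; the paths and messages are invented.
prompt = build_prompt(
    task="Add retry logic to the billing client",
    module_paths=["billing/client.py", "billing/errors.py"],
    recent_commits=["fix: one", "fix: two", "fix: three"],
    architecture_notes="All outbound calls go through the gateway.",
    max_commits=2,
)
```

Capping the commit list is one simple way to keep the context relevant; a real setup would likely need a more careful selection than "the first few".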

It also means building pathways for feedback. Engineers need fast, low-friction ways to flag inaccurate or low-value outputs. Whether it’s through internal tooling, code review comments, or structured evaluations, feedback should be baked into the workflow. Over time, this becomes a form of continuous training for the model and the team using it.

Finally, it means taking ownership of outcomes. Just as we wouldn’t deploy an experimental database into production without monitoring and fallback plans, we shouldn’t deploy AI into engineering workflows without clear accountability. Define what success looks like. Measure quality over time. Establish guardrails and review protocols.
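“Measure quality over time” can start very small. As one possible starting point, the sketch below groups thumbs up/down feedback by ISO week and reports the share of positive ratings. The input shape (a list of date and boolean pairs) and the function name are assumptions, and acceptance rate is only one of several signals a team might track.

```python
from collections import defaultdict
from datetime import date


def weekly_acceptance_rate(
    feedback: list[tuple[date, bool]],
) -> dict[tuple[int, int], float]:
    """Return {(iso_year, iso_week): share of thumbs-up} in week order."""
    counts: dict[tuple[int, int], list[int]] = defaultdict(lambda: [0, 0])
    for day, accepted in feedback:
        iso = day.isocalendar()
        key = (iso[0], iso[1])          # ISO year and ISO week number
        counts[key][0] += int(accepted)  # thumbs-up count
        counts[key][1] += 1              # total ratings
    return {week: ups / total for week, (ups, total) in sorted(counts.items())}


# Invented feedback: two ratings in one week, one in the following week.
rates = weekly_acceptance_rate([
    (date(2025, 6, 2), True),
    (date(2025, 6, 3), False),
    (date(2025, 6, 9), True),
])
```

A falling rate doesn’t say why outputs got worse, but it tells you when to go and look.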

The technical leaders who get this right won’t be the ones who adopted LLMs first – they’ll be the ones who built a system around them that can scale, adapt, and improve continuously. That’s the difference between experimentation and transformation.

Final thoughts

Trust in these systems isn’t automatic. It must be earned through careful validation, human oversight, and real-world iteration. That means putting verification processes in place from day one, creating feedback loops that improve quality over time, and being relentless about measuring what matters.

But most importantly, we have to lead this transformation with empathy and clarity. Engineers don’t need a mandate – they need support. They need partners who understand the complexity of their work and are willing to invest in tools that make them better at it.

Engineers who know how to harness AI – and leaders who know how to guide that journey – will redefine the future of software.