Shipping your first LLM feature requires more than code. This talk explains how we navigated new strategic demands, from opt-in/opt-out stances to evaluation workflows, to find a pragmatic path to production.

Berlin • November 9 & 10, 2026
Even for mature engineering organisations, the first LLM-powered feature introduces a paradigm shift. The actual coding often feels like the easy part. The real challenge lies in a daunting list of new “supporting” activities. With everything from evaluation frameworks and prompt versioning to customer opt-in/opt-out stances and communication strategies competing for attention, how do you choose where to focus your limited energy?
In this session, I will share the “battle scars” from our journey to production. I’ll break down our approach to the 80/20 rule: identifying the activities that required deep investment, those we found a “light” way to handle, and those we intentionally parked for later.
We will explore:
- The Effort Flip: How we restructured our workflow to move from “getting it to work” to “ensuring it’s good enough,” reallocating engineering time from writing code to context engineering and evaluation.
- Strategic Stances: How we tackled non-code needs like opt-in/opt-out policies, customer FAQs, and internal upskilling.
- Triage in Action: What we over-invested in, what I wish we’d done sooner (like early internal documentation on feature capabilities), and what we successfully deferred (like automation of evals).
This isn’t a prescriptive guide, but a look behind the curtain at how we defined our own “production-ready” standard. You will leave with a framework for auditing your own AI roadmap to ensure you’re focused on the activities that truly drive reliability and trust.
Key takeaways
- Evaluation First: Robust evaluation matters more for reliable AI systems than perfecting the initial code.
- Prioritising based on a North Star: We treated security and trust as non-negotiables, and that shaped what we prioritised, including OWASP Top 10 reviews and data privacy decisions.
- Design the First Feature to Scale: The initial rollout can establish reusable patterns that accelerate every AI feature that follows.
- Manual Before Automated: Starting with human-in-the-loop evaluation saved time and helped us understand failure modes before investing in automated testing.