Chip Huyen
Every organisation is either building AI products, planning to, or wondering why their experiments haven't turned into production systems. The gap between "we built a demo" and "we run a reliable AI system at scale" is larger than most leaders expect - and vastly underestimated by most engineering teams.
Chip Huyen's AI Engineering is the most rigorous and practically useful book yet written on bridging that gap. Not on training models - on engineering systems that use them. It addresses the questions that don't appear in vendor presentations and rarely survive contact with reality: How do you evaluate whether your AI system is actually working? How do you construct the context that determines whether it gives useful answers? How do you run an AI product in production when its outputs are inherently non-deterministic?
If you're building AI products on top of foundation models - or responsible for teams doing so - this book will save you months of painful discovery.
The AI landscape is generating enormous quantities of content - tutorials, opinion pieces, conference talks, breathless announcements. Almost none of it addresses the unglamorous, difficult, enormously important engineering problems that determine whether AI systems work in production.
Huyen addresses those problems directly and without inflation. She doesn't promise that the next model release will solve your evaluation challenges, or that better prompts will substitute for better architecture. She asks the questions that matter: what are you actually measuring? What does "working" mean for your system? What happens when it fails?
For engineering leaders, this book reframes AI adoption as an engineering discipline - one with learnable principles, manageable tradeoffs, and clear failure modes. That reframing is worth more than any number of AI strategy presentations.
Huyen devotes more attention to evaluation than to any other topic, and for good reason: it is the hardest problem in AI engineering, the most commonly underinvested, and the one everything else depends on.
Traditional software can assert correctness with test cases that return pass or fail. AI system quality is probabilistic and context-dependent. The same prompt can produce excellent output on Monday and poor output on Tuesday. The same model can excel at one task and be subtly wrong on a closely related one. Without a robust evaluation infrastructure, you cannot know whether your system is improving, degrading, or just behaving unpredictably.
Huyen's framework layers evaluation from automated metrics (fast, cheap, imperfect) through model-based evaluation (using a judge model to assess outputs) to human review and A/B testing in production (slow, expensive, ground truth). The key discipline is knowing which layer is appropriate for which decision - and building the infrastructure to run all three.
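One way to picture the layering is as an escalation pipeline: cheap automated checks run on everything, a judge model is applied to what survives them, and anything that fails either layer is routed to human review. The sketch below is illustrative only - the names (`EvalResult`, `judge_with_model`, the keyword-overlap judge) are stand-ins, not the book's implementation, and a real judge layer would call a separate model with a rubric.

```python
# Layered evaluation sketch: automated checks -> model-based judge ->
# human review queue. All names and thresholds here are illustrative.

from dataclasses import dataclass

@dataclass
class EvalResult:
    layer: str       # "automated" | "model_judge" | "human"
    passed: bool
    detail: str = ""

def automated_checks(output: str) -> EvalResult:
    # Fast, cheap, imperfect: basic format and length assertions.
    ok = bool(output.strip()) and len(output) < 2000
    return EvalResult("automated", ok, "format/length check")

def judge_with_model(prompt: str, output: str) -> EvalResult:
    # Stand-in for a judge-model call. A real implementation would send
    # a rubric plus (prompt, output) to a model and parse its verdict;
    # here a crude keyword-overlap check plays that role.
    ok = any(word in output.lower() for word in prompt.lower().split())
    return EvalResult("model_judge", ok, "keyword-overlap stand-in")

human_review_queue: list = []   # slow, expensive, ground truth

def evaluate(prompt: str, output: str) -> list:
    results = [automated_checks(output)]
    if results[-1].passed:                     # only escalate survivors
        results.append(judge_with_model(prompt, output))
    if not all(r.passed for r in results):     # failures get human eyes
        human_review_queue.append((prompt, output, results))
    return results
```

The design choice worth copying is the escalation order: each layer filters what the next, more expensive layer has to look at.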
The teams who will build durable AI products are not those with the best models. They are those who build the best feedback loops between production behaviour and improvement. Evaluation is that feedback loop.
Most AI engineering effort is spent on model selection, prompt design, and application code. The research on what actually drives system quality points to something different: the quality of context you provide the model.
Retrieval-Augmented Generation (RAG) - retrieving relevant information from internal sources and injecting it into the model's context window - is the primary lever available to most teams. Huyen treats this as a full engineering discipline: embedding models, chunking strategies, retrieval relevance, reranking, context window management. Getting it right requires careful attention to data quality, retrieval architecture, and evaluation of retrieval performance separately from response quality.
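The pipeline stages above can be sketched end to end in a few functions. This is a toy, not a real RAG stack: the word-overlap `score` stands in for embedding similarity, the fixed-size `chunk` stands in for semantic chunking, and all names are assumptions for illustration.

```python
# Toy RAG pipeline: chunking, retrieval by similarity, and context
# assembly within a token budget. Word-overlap scoring is a stand-in
# for a real embedding model; names are illustrative.

def chunk(text: str, max_words: int = 40) -> list[str]:
    # Naive fixed-size chunking; production systems split on semantic
    # boundaries (headings, paragraphs) and overlap adjacent chunks.
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

def score(query: str, passage: str) -> float:
    # Word-overlap stand-in for cosine similarity between embeddings.
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / (len(q) or 1)

def retrieve(query: str, chunks: list[str], top_k: int = 3) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]

def build_context(query: str, chunks: list[str],
                  budget_words: int = 120) -> str:
    # Context-window management: pack the best chunks until the
    # budget runs out, then stop.
    picked, used = [], 0
    for c in retrieve(query, chunks):
        n = len(c.split())
        if used + n > budget_words:
            break
        picked.append(c)
        used += n
    return "\n---\n".join(picked)
```

Note that `retrieve` can be evaluated on its own, separately from response quality - exactly the separation Huyen recommends.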
The profound implication: your AI capability ceiling is set by your data quality. No amount of prompt engineering compensates for fragmented, undocumented, or unreliable internal knowledge. Teams investing in data governance are, whether they know it or not, investing in the upper limit of what their AI systems can achieve.
One of the book's most practically valuable contributions is a decision framework for the question that trips up most teams: when do you prompt, when do you use RAG, and when do you fine-tune?
Huyen's heuristic is disciplined and clear: start with prompting. Add retrieval when the context volume or freshness requirements exceed what prompts can handle. Consider fine-tuning only when you have clear evidence - measured, not intuited - that it will solve a specific problem that other approaches cannot.
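That heuristic is simple enough to write down as a checklist function. The inputs and thresholds below are my own framing of the decision, not the book's - treat it as a conversation starter for a design review, not a rule engine.

```python
# The prompt -> RAG -> fine-tune heuristic as a checklist function.
# Field names and the decision inputs are illustrative assumptions.

def choose_adaptation(context_fits_in_prompt: bool,
                      needs_fresh_or_private_data: bool,
                      measured_gap_other_methods_cant_close: bool) -> str:
    if context_fits_in_prompt and not needs_fresh_or_private_data:
        return "prompting"            # start here; cheapest to iterate on
    if not measured_gap_other_methods_cant_close:
        return "prompting + RAG"      # retrieval covers volume/freshness
    return "fine-tuning"              # only with measured evidence
```

The third argument is the important one: it must come from measurement, not intuition, or the function degenerates into "fine-tune because we want to".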
The case against premature fine-tuning is strong: it is expensive, requires labelled data you may not have, creates a model you now own and must maintain, and frequently produces results no better than well-constructed RAG. The teams who rush to fine-tuning because it sounds more sophisticated consistently find themselves maintaining a technical asset that underperforms a simpler approach.
Traditional observability assumes determinism: given the same inputs, the same outputs. AI systems break this assumption completely. Log-based monitoring catches infrastructure failures, not quality degradation. An uptime SLA says nothing about whether your system is hallucinating.
Huyen makes the case for a fundamentally different operational posture for AI systems - one that monitors output quality as a first-class signal alongside infrastructure health.
The organisations that run AI well are those that have extended their engineering maturity into the quality dimension - not just "is the service up?" but "is the service good?"
The book's treatment of agentic AI is measured in exactly the right way: genuinely enthusiastic about the potential, and unambiguous about the engineering challenges.
Agentic systems - where models take sequences of actions, invoke tools, and operate with greater autonomy - introduce compounding risk. A single model call with a 5% error rate is manageable. An agentic workflow with ten steps, each with a 5% error rate, has roughly a 40% chance of going wrong somewhere along the chain. The error calculus changes completely.
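The arithmetic behind that 40% figure is worth making explicit, because it assumes the steps fail independently - correlated failures can make things better or worse:

```python
# Compounding error in a multi-step agentic workflow, assuming
# independent failures at each step.
per_step_success = 0.95          # a 5% per-step error rate
steps = 10
p_all_succeed = per_step_success ** steps     # about 0.599
p_something_fails = 1 - p_all_succeed         # about 0.401
```

A 5% per-step error rate leaves only a ~60% chance that all ten steps succeed - which is why per-step reliability requirements tighten sharply as chains get longer.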
Huyen's principle: match the level of autonomy you grant to the maturity of your evaluation and oversight infrastructure. Build human-in-the-loop checkpoints for high-stakes actions. Introduce autonomy incrementally. Never grant full autonomy before you have the observability to justify it.
Agents are not a shortcut to capability. They are a commitment to a higher standard of engineering rigour.
If you had to demonstrate today that your AI system is improving week over week, what evidence would you produce? If the answer is "user sentiment" or "the model version went up," you don't have an evaluation framework - you have a hope.
Your AI capability is bounded by your data quality. What does your internal knowledge base actually look like? Is it structured, maintained, and trustworthy? Or is it a collection of outdated documents, inconsistently maintained wikis, and tribal knowledge that lives in people's heads?
Most organisations decide whether to fine-tune based on instinct, vendor recommendation, or the ambition of the use case. How would you make that decision based on measurement?
What would a quality degradation incident look like in your AI system? Would you detect it in seconds, hours, or days? Would you even know it had happened?
The teams winning at AI product development are not the teams with the most models. They are the teams with the most rigorous feedback loops. What does your feedback loop look like?
Define your evaluation strategy before your implementation strategy. For your most important AI use case: what does "working" mean across multiple dimensions - correctness, safety, tone, latency, cost? How will you measure it? Automate this measurement before you ship.
Assess your data foundations. Map the internal knowledge your AI systems will rely on. Identify gaps, staleness, and quality issues. Build a plan to address them. Your AI quality ceiling is set here.
Build a decision log for your model adaptation choices. For each AI use case, document what approach you're using (prompting, RAG, fine-tuning), what you measured to justify the choice, and what would cause you to reconsider.
Instrument your AI systems from day one. Log every model interaction - inputs, outputs, latency, token counts, user feedback. Build dashboards that track quality trends, not just uptime. Treat quality as a production concern.
If you're planning agentic systems, build the evaluation infrastructure for each component capability first. Know how well each tool call, retrieval step, and model decision performs in isolation before you chain them together.
"The teams who win are not those with the best models. They're those who build the best feedback loops."
- Chip Huyen