These key takeaways are distilled from an in-depth Productbulb Podcast interview with expert AI engineer Rajaswa Patil, who has been a key contributor to major projects like GitHub Copilot and Postman’s AI assistant, Postbot. This document serves as a practical guide for aspiring AI developers, offering insights into the real-world challenges and solutions involved in building production-grade AI assistants that are used by millions of people.
1. The Hybrid Architecture of Modern AI Assistants
The core architectural principle of production-grade AI is that it is rarely about using a single, massive language model. The most effective and scalable systems rely on a clever “mix of both” large and small models, combined with traditional software engineering practices.
This hybrid approach is critical for managing cost, performance, and user experience. Here are two powerful examples of this principle in action:
- GitHub Copilot’s Trigger Mechanism: To decide when to trigger an autocomplete suggestion, Copilot doesn’t call its main, expensive model. Instead, it uses a tiny, 7-parameter logistic regression model that runs directly on the user’s machine (on the edge). This simple classifier uses features like keystrokes and programming language to pre-filter requests, preventing server overload and dramatically reducing operational costs (a sketch of this kind of trigger filter follows this list).
- Postbot’s Routing System: Similarly, Postman’s Postbot has small, specialized classifiers “sprinkled” throughout its system. These models act as traffic controllers, routing user requests and managing complex workflows far more efficiently than a single, monolithic LLM ever could.
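To make the pre-filtering idea concrete, here is a minimal sketch of what an on-device trigger classifier can look like. The feature names, weights, and threshold are invented for illustration; Copilot’s actual model and feature set are not public.

```python
import math

# Illustrative weights only -- the real trigger model's features and parameters are not public.
WEIGHTS = {
    "bias": -1.2,
    "ms_since_last_keystroke": 0.004,   # longer pauses make a suggestion more welcome
    "ends_with_open_paren": 1.1,        # cursor right after "(" is a strong trigger
    "in_comment": -0.9,                 # suggestions inside comments are less useful
    "language_is_python": 0.4,          # per-language prior
}

def should_trigger(features: dict[str, float], threshold: float = 0.5) -> bool:
    """Cheap on-device pre-filter: only call the expensive completion model
    when the logistic score clears the threshold."""
    z = WEIGHTS["bias"] + sum(
        WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS if k != "bias"
    )
    probability = 1.0 / (1.0 + math.exp(-z))
    return probability >= threshold

# Example: the user paused after typing "(" in a Python file -> worth a server call.
print(should_trigger({
    "ms_since_last_keystroke": 350,
    "ends_with_open_paren": 1,
    "in_comment": 0,
    "language_is_python": 1,
}))
```

Because the score is computed locally in microseconds, the expensive completion request only leaves the machine when it is likely to be useful.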
Ultimately, this hybrid architecture—combining the generative power of large LLMs with the speed of smaller models and the reliability of workflow engineering (like state machines)—is essential for building AI products that can operate at scale.
This strategic choice of using different models for different tasks naturally leads to the question of when and how to customize them.
2. The Three Pillars of LLM Fine-Tuning (And When to Avoid It)
While fine-tuning is a powerful technique for adapting models, its practical application, especially in the code generation domain, is often misunderstood. In production, fine-tuning is less about teaching a model new general knowledge and more about achieving specific, practical engineering goals.
Here are the three primary use cases for fine-tuning in a production environment:
- Performance: Distill a large model into a smaller, faster one. Fine-tuning is often used to distill the capabilities of a very large, slow model into a smaller, faster one that is practical for production. For a feature like code completion, where speed is critical, you might fine-tune an 8-billion parameter model to achieve the performance needed for real-time suggestions.
- Enforcing Output Structure: Force an LLM to generate a specific, required format. When your application requires an LLM to generate output in a very specific, machine-readable format, fine-tuning can be the solution. For example, the code editor Cursor fine-tuned a model specifically to produce a diff format, which enables its “fast apply” code editing feature to work reliably.
- Inducing Vast, Unseen Knowledge: Teach a model a new domain or language too large for a prompt. This is useful when a domain’s required knowledge is too large to fit into a single prompt. A classic example is training a model on a massive, unique database schema. Fine-tuning allows the model to internalize the structure and relationships within that schema, enabling it to answer complex queries. Another key use case is teaching a model a novel, domain-specific language. A team at Postman, for instance, created a custom SQL fork called fql (Flows Query Language). Base models consistently hallucinated standard SQL operators, making them unusable. Only by fine-tuning GPT-3.5 Turbo on fql syntax could they get the model to understand and use this new language correctly.
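As a concrete illustration of the fql case, here is a minimal sketch of what supervised fine-tuning data for a custom query language might look like, using the JSONL chat format that hosted fine-tuning services commonly accept. The fql queries below are invented placeholders, not the real Flows Query Language syntax.

```python
import json

# Hypothetical natural-language -> fql pairs; the actual fql syntax is not shown
# in the interview, so these are illustrative placeholders only.
training_pairs = [
    ("Get the five most recent orders",
     'select $.orders | sort(createdAt, "desc") | limit(5)'),
    ("Count failed payments grouped by region",
     'select $.payments | filter(status = "failed") | group(region) | count()'),
]

# One JSON object per line (JSONL), in the common chat fine-tuning shape.
with open("fql_finetune.jsonl", "w") as f:
    for question, fql in training_pairs:
        record = {
            "messages": [
                {"role": "system", "content": "You translate natural language into fql."},
                {"role": "user", "content": question},
                {"role": "assistant", "content": fql},
            ]
        }
        f.write(json.dumps(record) + "\n")
```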
Critical Caveats
Despite its power, fine-tuning should be approached with caution for two major reasons:
- Pace of Innovation: The process of data curation, training, and deployment for a fine-tuned model can take weeks or even months. There is a significant risk that by the time you deploy your model, a new, more capable foundation model is released, making your investment of time and resources obsolete.
- Data Security Risks: Fine-tuning on user data is a “big no” from security teams. There is a high risk of the model memorizing and later leaking sensitive information, such as API keys or Personally Identifiable Information (PII), which can lead to major security breaches.
Once you’ve chosen your models, the next challenge is ensuring they perform efficiently and effectively for the end-user.
3. Optimizing for Performance and User Experience
For features like code autocomplete, inference speed is critical, but raw performance should not be the only goal. The ultimate benchmark for any AI feature is its impact on the user’s workflow and overall experience.
| Optimization Tactic | Key Insight |
| --- | --- |
| Avoid the LLM Call | The most effective optimization is to avoid calling the expensive model altogether. Use small, on-device classifiers or simple heuristics to pre-filter requests, as seen with Copilot’s trigger model. |
| Implement Caching | Caching common requests and responses can save enormous costs and dramatically improve speed. At the scale of a product like Copilot, caching can handle as much as 70-80% of traffic. |
| Benchmark Against Manual Effort | Don’t just optimize for the fastest possible response. A multi-step, slightly slower process that saves a user a full day of work is far more valuable than an instantaneous but confusing output. The user experience is the ultimate benchmark. |
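The first two tactics lend themselves to very simple implementations. Below is a minimal sketch of an exact-match response cache keyed on a hash of the normalized request; production systems typically use more sophisticated keys, TTLs, and invalidation, but the cost-saving principle is the same.

```python
import hashlib
import json

class CompletionCache:
    """Exact-match cache keyed on a hash of the normalized request.
    At Copilot-like scale, even a simple cache can absorb a large share of
    traffic before any model is called."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def _key(self, prompt: str, model: str) -> str:
        normalized = json.dumps({"prompt": prompt.strip(), "model": model}, sort_keys=True)
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get_or_compute(self, prompt: str, model: str, call_llm) -> str:
        key = self._key(prompt, model)
        if key not in self._store:
            self._store[key] = call_llm(prompt)   # only pay for the model on a miss
        return self._store[key]

# Usage with a stand-in for the real model call:
cache = CompletionCache()
fake_llm = lambda prompt: f"completion for: {prompt}"
print(cache.get_or_compute("def add(a, b):", "small-code-model", fake_llm))  # miss -> model call
print(cache.get_or_compute("def add(a, b):", "small-code-model", fake_llm))  # hit  -> served from cache
```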
Optimizing performance is one half of the equation; the other is ensuring the quality of the output, which begins with the data used for evaluation.
4. Data, Evaluation, and the Power of User Behavior
The most critical, non-obvious asset for any AI team is the Subject Matter Expert (SME). SMEs are not just consultants; they are the foundation of a high-quality AI product. They are essential for creating effective prompts, designing intelligent workflows, and curating high-quality evaluation datasets. The initial time invested by an SME compounds into massive productivity gains for the entire team down the line.
Building Your Evaluation Flywheel
A robust evaluation process is not built overnight. It’s a continuous loop that improves over time.
- Start with a User Feedback Loop: From day one, implement simple UX mechanisms for users to provide feedback. This can be as simple as thumbs up/down buttons or a /report command. This user feedback is the seed for your evaluation dataset.
- Focus on Regressions, Not Absolutes: When using automated metrics like LLM-as-a-judge, the absolute score is less important than the trend over time. The key question is not “Is our score a 4.2?” but rather, “Are we getting better or worse?” This focus on regressions provides the most actionable signal.
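A regression-focused check can be as simple as comparing the mean judge score of the current candidate against the last release over the same evaluation set. The sketch below assumes scores on a 1-5 scale and a hypothetical tolerance; both are illustrative.

```python
from statistics import mean

def regression_check(previous_scores: list[float], current_scores: list[float],
                     tolerance: float = 0.1) -> dict:
    """Compare judge scores for the new prompt/model against the last release.
    The absolute value matters less than whether the trend is holding or dropping."""
    prev, curr = mean(previous_scores), mean(current_scores)
    return {
        "previous_mean": round(prev, 2),
        "current_mean": round(curr, 2),
        "regressed": curr < prev - tolerance,   # block the rollout if quality drops
    }

# Hypothetical 1-5 judge scores collected over the same evaluation set.
print(regression_check(
    previous_scores=[4.0, 4.5, 3.5, 4.0, 4.5],
    current_scores=[3.5, 4.0, 3.0, 3.5, 4.0],
))
```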
The Gold Standard: Client Analytics
Explicit user feedback is rare (often less than 0.5% of interactions) and is typically biased towards negative experiences. The richest and most reliable source of data for evaluation comes from observing implicit user behavior.
- For Autocomplete: Instead of asking if a suggestion was good, track if the user accepted it by pressing Tab. To get an even stronger signal, track if that accepted code remains in the codebase over several days or commits. Code that persists is a very strong indicator of quality.
- For API Debugging: Instead of judging if the AI’s suggested fix was “correct” in theory, monitor the user’s ultimate goal. After the user applies the AI-suggested fix, are they getting more successful 200 OK API responses and fewer 400 errors? This ties the evaluation directly to user success.
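Here is a minimal sketch of how these implicit signals might be aggregated from client telemetry. The event shapes and thresholds (e.g., code surviving three days) are assumptions for illustration, not any product’s actual schema.

```python
from dataclasses import dataclass

# Hypothetical event shapes -- real telemetry pipelines will differ.
@dataclass
class CompletionEvent:
    suggestion_id: str
    accepted: bool            # user pressed Tab
    survived_after_days: int  # how long the accepted code stayed in the repo

@dataclass
class ApiCallEvent:
    after_ai_fix: bool
    status_code: int

def implicit_quality_signals(completions: list[CompletionEvent],
                             api_calls: list[ApiCallEvent]) -> dict:
    accepted = [c for c in completions if c.accepted]
    persisted = [c for c in accepted if c.survived_after_days >= 3]
    post_fix = [c for c in api_calls if c.after_ai_fix]
    return {
        "acceptance_rate": len(accepted) / max(len(completions), 1),
        "persistence_rate": len(persisted) / max(len(accepted), 1),
        "post_fix_success_rate": sum(c.status_code < 400 for c in post_fix) / max(len(post_fix), 1),
    }

print(implicit_quality_signals(
    completions=[CompletionEvent("s1", True, 5), CompletionEvent("s2", False, 0)],
    api_calls=[ApiCallEvent(True, 200), ApiCallEvent(True, 400), ApiCallEvent(False, 200)],
))
```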
This focus on monitoring user outcomes provides a direct bridge to the practicalities of deploying and maintaining a live service.
5. Strategies for Robust Deployment and Maintenance
Maintaining a live AI service means preparing for failures that are often outside of your team’s direct control. A robust deployment strategy anticipates these issues to ensure a consistent user experience.
- Always Implement Fallbacks: A degraded experience is better than a broken one. Services like OpenAI can go down, and dependent services (like third-party moderation models) can change their behavior without warning. It is far better to have a fallback to a smaller, lower-quality model than to return a 500 error. The guiding principle is: “a broken product is way worse” than a temporarily degraded one.
- Keep Control on the Backend: Enable rapid fixes without forcing client-side updates. As much application logic as possible—especially prompt templates and AI actions—should be controlled by your backend services, not hardcoded in the client application (e.g., the desktop app or browser extension). This architecture allows your team to push rapid fixes, updates, and rollbacks without requiring users to download and install a new version of the app.
- Use Controlled Rollouts for Every Change: Mitigate the risk of unpredictable LLM behavior. Due to the unpredictable, black-box nature of LLMs, even a one-word change in a prompt can have unexpected negative consequences. Every change should be released via a controlled rollout using feature flags. Start by releasing to just 1% of users, monitor the impact on key metrics, and only proceed with a full release once you’ve confirmed there are no negative effects.
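The fallback and controlled-rollout ideas combine naturally in a thin serving layer. The sketch below uses a hypothetical feature flag and stand-in model clients; the percentage bucketing and fallback path are illustrative, not any specific product’s implementation.

```python
FLAG_ROLLOUT_PERCENT = {"new_prompt_v2": 1}   # start the new prompt at 1% of users

def flag_enabled(flag: str, user_id: int) -> bool:
    """Deterministic percentage rollout: the same user always sees the same variant."""
    return (user_id % 100) < FLAG_ROLLOUT_PERCENT.get(flag, 0)

def answer(prompt: str, user_id: int, primary_llm, fallback_llm) -> str:
    """Serve the flagged prompt only to the rollout cohort, and degrade gracefully
    to a smaller model instead of returning a 500 when the primary provider fails."""
    template = "v2 prompt: {p}" if flag_enabled("new_prompt_v2", user_id) else "v1 prompt: {p}"
    try:
        return primary_llm(template.format(p=prompt))
    except Exception:
        return fallback_llm(template.format(p=prompt))   # degraded, but not broken

# Stand-ins for real model clients:
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("provider down")

def small_fallback(prompt: str) -> str:
    return f"[small model] {prompt}"

print(answer("Why is this request failing?", user_id=7,
             primary_llm=flaky_primary, fallback_llm=small_fallback))
```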
These deployment strategies are crucial for all AI features, but they become even more complex when building more advanced, agentic systems.
6. Building Agentic Systems: A Lesson in “Controlled Autonomy”
The core design philosophy behind Postman’s Postbot is “controlled autonomy.” This approach was a deliberate choice, charting a middle ground between the full, uncontrolled autonomy of early agents like AutoGPT and the rigid, “no autonomy” nature of traditional, hardcoded automation (RPA). It provides a predictable framework while still allowing for dynamic, intelligent behavior.
This philosophy was born from rapid experimentation. The very first prototype of Postbot was built by Patil himself inside the OpenAI Playground, using the then-brand-new Assistants API just days after its release. This playground prototype was demoed directly to Postman’s CTO and co-founder, proving the concept and kicking off the project. The architecture that emerged operates as a state machine, where the agent follows a continuous loop: Act -> Observe -> Predict next action.
A key architectural decision was to build Postbot as a platform from day one. This foresight meant that the system was designed to be extensible. Other teams at Postman could easily integrate their own features as new tools or “AI actions” for the agent to use, creating a collaborative and scalable system.
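A minimal sketch of this pattern is shown below: a loop that predicts, acts, and observes, constrained to a fixed registry of AI actions and a hard step limit. The action names and the next-action stub are invented for illustration, not Postbot’s internals.

```python
# Controlled autonomy: the agent may only choose from registered actions,
# and the loop has a hard step limit.

ACTIONS = {  # other teams can register their features as new AI actions here
    "generate_test": lambda ctx: f"generated test for {ctx['request']}",
    "run_request": lambda ctx: "response: 200 OK",
    "done": lambda ctx: "finished",
}

def pick_next_action(context: dict, history: list[str]) -> str:
    """Stand-in for the LLM call that predicts the next action from the
    registered set; a real system would pass the history to the model."""
    plan = ["generate_test", "run_request", "done"]
    return plan[len(history)] if len(history) < len(plan) else "done"

def run_agent(request: str, max_steps: int = 5) -> list[str]:
    context, history = {"request": request}, []
    for _ in range(max_steps):                         # hard step limit
        action = pick_next_action(context, history)    # Predict next action
        observation = ACTIONS[action](context)         # Act
        history.append(f"{action} -> {observation}")   # Observe
        if action == "done":
            break
    return history

print(run_agent("add tests for the /orders endpoint"))
```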
This concept of building extensible, trusted platforms extends to the final and most critical topic: ethics and user trust.
7. Ensuring Trust and Mitigating Harm at the Product Level
It is fundamentally difficult to ensure safety and prevent harm at the LLM level alone. A model’s core strength—its ability to generalize—is naturally at odds with the need for strict, reliable guardrails.
Therefore, the most effective way to build user trust and mitigate risk is at the product and API level.
| Problem | Product-Level Solution |
| --- | --- |
| Verifying AI-Generated Information | Provide transparency through the user interface. Implement features like citations that link back to source documents, allowing users to verify the information themselves (e.g., Perplexity). |
| Understanding AI Reasoning | Leverage API features that expose the model’s thought process. Use parameters that return the reasoning trace or “whiteboard work” the LLM performed before giving its final answer, giving users insight into how the conclusion was reached. |
| Preventing Harmful/Unethical Outputs | Do not rely solely on the base model’s built-in safety features. The best and most reliable approach for businesses is to use an enterprise-grade platform (e.g., Azure OpenAI, AWS Bedrock) that contractually handles moderation and assumes legal liability for the outputs. |
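For the citation row in particular, the product-level mechanism can be as simple as carrying the retrieved sources alongside the generated answer so the UI can render verification links. The response shape below is a hypothetical sketch, not any specific product’s schema.

```python
# Hypothetical response shape: the assistant's answer travels with the chunks it was
# grounded on, so the UI can render "verify" links next to every claim.
def build_grounded_answer(answer: str, retrieved_chunks: list[dict]) -> dict:
    return {
        "answer": answer,
        "citations": [
            {"title": c["title"], "url": c["url"], "snippet": c["text"][:120]}
            for c in retrieved_chunks
        ],
    }

print(build_grounded_answer(
    "Use exponential backoff when the API returns 429.",
    retrieved_chunks=[{
        "title": "Rate limiting guide",
        "url": "https://example.com/docs/rate-limits",
        "text": "When a 429 is returned, clients should retry with exponential backoff...",
    }],
))
```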
Beyond the user-facing product, another powerful strategy is to design agent-friendly APIs. This means treating the AI agent as a primary user of your service. For example, Stripe’s API is praised for its design: when it returns a 400 error, the response doesn’t just fail; it includes a descriptive reason for the failure and a direct link to the relevant documentation. This isn’t just for human developers—it’s designed for an AI agent to read the feedback, understand its mistake, and attempt to self-correct on the next call, building a more resilient and reliable automated system.
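Here is a small sketch of that idea: an error payload that carries a machine-readable reason and a documentation link, plus an agent-side handler that uses it to retry. The field names and the correction logic are illustrative, not Stripe’s actual schema.

```python
# Illustrative error payload in the spirit of a descriptive 400 -- the field names
# here are made up for the example.
error_response = {
    "status": 400,
    "error": {
        "code": "parameter_missing",
        "message": "Missing required parameter: 'currency'.",
        "doc_url": "https://example.com/docs/errors#parameter_missing",
    },
}

def self_correct(request_body: dict, error: dict) -> dict:
    """An agent can parse the machine-readable reason and retry with a fix,
    instead of giving up on an opaque failure."""
    if error["error"]["code"] == "parameter_missing" and "currency" in error["error"]["message"]:
        return {**request_body, "currency": "usd"}   # assumed default for the retry
    return request_body

retry_body = self_correct({"amount": 1999}, error_response)
print(retry_body)  # {'amount': 1999, 'currency': 'usd'}
```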