The Product Bulb Podcast Ep. 07:

Lessons from Building AI Copilots – Secrets to Scaling AI Agents | with Rajaswa Patil

Here’s a power-packed episode – hear from a hands-on AI engineer about what it really takes to scale AI agents.

The world is captivated by the seemingly magical capabilities of large language models. But behind the curtain of every seamless AI assistant lies a complex reality of engineering, trade-offs, and hard-won lessons. Moving from a promising demo to a reliable, production-ready product used by millions is a journey filled with surprising and often counter-intuitive challenges.

To uncover these truths, we look to the expertise of Rajaswa Patil, a sharp, first-principles product thinker and AI engineer who has been in the trenches building groundbreaking products like Microsoft’s GitHub Copilot Chat and Postman’s AI assistant, Postbot. His experience provides a practical, in-the-weeds playbook for anyone building the next generation of AI tools, translating complex engineering realities into actionable product strategy. Here are six critical lessons learned from his work.

1. Sometimes, the Most Powerful Model Has Only 7 Parameters

It’s a common assumption that powerful AI products are driven exclusively by massive, cutting-edge LLMs. The surprising reality is that the most elegant solutions are often the simplest. GitHub Copilot, for instance, uses a tiny, seven-parameter logistic regression model as a critical gatekeeper.

This model’s specific job is to decide whether to trigger a full autocomplete query. It doesn’t run on a server; it’s so small that it is shipped directly in the VS Code extension itself, running on the user’s machine (on the edge). It makes its decision based on simple features like the number of keystrokes, the time since the last suggestion was accepted, and the programming language being used. The impact is enormous: it prevents Copilot’s servers from being overwhelmed with requests every time a user pauses while typing, making the system efficient and scalable for millions of users.
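To make the idea concrete, here is a minimal sketch in Python of what such a client-side gatekeeper could look like. It is purely illustrative: the feature names, weights, and threshold are assumptions, not Copilot’s actual values.

```python
import math

# Hypothetical weights for a tiny logistic regression gatekeeper.
# Copilot's real features and weights are not public; these are
# illustrative stand-ins for the kinds of signals described above.
WEIGHTS = {
    "keystrokes_since_last_suggestion": 0.08,
    "seconds_since_last_accept": -0.02,
    "language_is_supported": 1.5,
}
BIAS = -1.0          # assumed intercept
THRESHOLD = 0.5      # assumed decision threshold


def should_trigger_completion(features: dict) -> bool:
    """Decide, entirely on the client, whether to fire an autocomplete request."""
    z = BIAS + sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)
    probability = 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function
    return probability >= THRESHOLD


# Example: the user has typed a few characters in a supported language.
print(should_trigger_completion({
    "keystrokes_since_last_suggestion": 12,
    "seconds_since_last_accept": 30,
    "language_is_supported": 1,
}))
```

Because the model is just a handful of weights, it adds effectively no latency and is small enough to ship inside the extension itself.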

The broader lesson is that production AI systems are not a single, monolithic LLM. They are sophisticated orchestrations of large models, small specialized models, and traditional machine learning techniques working together to create an efficient and effective whole.

“co-pilot has a logistic regression model which is a seven parameter model uh it is actually shipped in the extension…and that model is just trained uh it’s just seven features…based on that it identifies do I trigger an autocomplete query here or not”

2. Rethink Fine-Tuning: It’s Rarely the Silver Bullet You Think

The impulse to fine-tune a foundation model on domain-specific data is strong, but it’s often an unnecessary and risky endeavor. For general-purpose tasks like code generation, Rajaswa argues that fine-tuning rarely adds significant value (“alpha”). Foundation models have already been trained on vast quantities of public source code and APIs, making them incredibly capable out of the box.

Instead of a default strategy, fine-tuning should be reserved for three specific, practical use cases:

1. Inducing new knowledge: This applies when the required knowledge is too vast to fit into a prompt, such as a massive, private database schema.

2. Improving performance: Fine-tuning can be used to distill the capabilities of a large, expensive model into a smaller, faster one for deployment.

3. Enforcing a specific output structure: When you need the model to consistently produce a specific format, like a diff for code edits, fine-tuning can enforce that structural constraint.

A perfect real-world example of this third case comes from Postman, where the team invented a new domain-specific language called FQL (Flows Query Language). Because it was a fork of SQL, every general-purpose LLM would “hallucinate with SQL operators.” This made it an ideal candidate for fine-tuning a model to correctly generate and use FQL, a task no foundation model could perform out of the box.
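The mechanics of that third use case are worth sketching. Teaching a model a niche language or output format typically means fine-tuning on prompt/completion pairs. The snippet below shows what such training data might look like; the JSONL format is a common convention, and the FQL-style queries are invented stand-ins rather than real FQL syntax or Postman’s actual training data.

```python
import json

# Illustrative fine-tuning examples in a prompt/completion (JSONL) format.
# The query syntax below is a made-up placeholder, not real FQL.
training_examples = [
    {
        "prompt": "Filter flow runs that failed in the last 24 hours",
        "completion": "select from runs where status = 'failed' and age < 24h",
    },
    {
        "prompt": "Count runs grouped by status",
        "completion": "select status, count() from runs group by status",
    },
]

with open("fql_finetune.jsonl", "w") as f:
    for example in training_examples:
        f.write(json.dumps(example) + "\n")
```

The point is not the file format but the pairing: the model sees enough examples of the new language that it stops falling back on the SQL operators it already knows.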

Furthermore, fine-tuning comes with significant risks. There’s a real danger of data leakage, where sensitive information from the training data, like API keys, could be inadvertently exposed in the model’s outputs. Perhaps more critically, the rapid pace of AI development means that a new, more capable foundation model could be released tomorrow, rendering months of your team’s fine-tuning work obsolete overnight.

3. Your Most Valuable Asset Isn’t the AI Engineer—It’s Your Subject Matter Expert

In the race to build AI products, it’s easy to believe that the AI engineer is the most critical member of the team. However, the true linchpin for success is the subject matter expert (SME).

Without an expert who deeply understands the domain—someone who knows what a “good API test” looks like or how to properly migrate a legacy codebase—the entire development process breaks down. An AI team without an SME cannot engineer the right workflows, write effective prompts, or, most importantly, create meaningful evaluations to measure product quality. When the SME is missing, “the whole flywheel around data and eval goes off.”

The AI development lifecycle should be viewed as an investment. The initial time and deep expertise of an SME are invested in curating high-quality data and evaluating the product’s early outputs. This initial investment then compounds into massive productivity gains as the AI learns their expert patterns and begins to automate their workflows at scale.
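One way to picture that flywheel in code: the SME curates a small golden set of examples and pass criteria, and every prompt or model change is scored against it. The sketch below is illustrative; the task (generating Postman test scripts) and the string-matching criteria are assumptions standing in for whatever the SME actually defines.

```python
# A minimal eval harness built around SME-curated golden examples.
golden_set = [
    {
        "input": "Write a test for GET /users that checks the status code",
        "must_contain": ["pm.test", "pm.response.to.have.status(200)"],
    },
    {
        "input": "Write a test asserting the response body has an 'id' field",
        "must_contain": ["pm.test", "'id'"],
    },
]


def generate_test(prompt: str) -> str:
    """Stand-in for the real model call; returns a canned answer here."""
    return "pm.test('status is 200', () => pm.response.to.have.status(200));"


def run_evals() -> float:
    passed = 0
    for case in golden_set:
        output = generate_test(case["input"])
        if all(snippet in output for snippet in case["must_contain"]):
            passed += 1
    return passed / len(golden_set)


print(f"golden-set pass rate: {run_evals():.0%}")  # track this across every change
```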

“I don’t think your most valuable asset is your developer or your AI engineer it’s your subject matter expert…the whole flywheel around data and eval goes off”

4. Stop Guessing at AI Quality. Start Tracking User Behavior.

How do you know if your AI feature is actually working? Relying on user-submitted feedback like thumbs-up/thumbs-down is insufficient; response rates are typically low (around 0.5%) and skewed toward negative experiences. Abstract AI quality scores are often disconnected from real-world utility.

The most reliable way to measure success is through client-side analytics that track actual user behavior. The best metrics are directly tied to the user achieving their end goal. Consider these concrete examples:

• For an autocomplete feature, success isn’t an abstract score but a tangible action. Track if a user accepts the suggestion by hitting Tab (a clear positive signal) or discards it by hitting Escape (a clear negative signal). An even more powerful metric is code longevity: if a user accepts a suggestion and that code survives multiple commits, staying in the codebase for days or weeks, it’s an incredibly strong indicator of high-quality generation.

• For an API debugging assistant, don’t just ask if the fix was “good.” Monitor the downstream impact. Does its usage lead to a measurable decrease in 400 errors and an increase in 200 success responses in the user’s workflow?

The most insightful monitoring dashboards are built around the user’s goals and behaviors, not just the model’s raw output.
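In practice, those behavioral signals reduce to a handful of client-side events. Here is a rough sketch of what the instrumentation might emit; the event shape and field names are assumptions, not any specific product’s telemetry schema.

```python
import time
from dataclasses import dataclass, asdict


@dataclass
class SuggestionEvent:
    """One client-side analytics event for an autocomplete suggestion."""
    suggestion_id: str
    action: str          # "accepted" (Tab) or "dismissed" (Escape)
    language: str
    timestamp: float


def emit(event: SuggestionEvent) -> None:
    # Stand-in for the real analytics pipeline (an HTTP call, a queue, etc.).
    print("analytics:", asdict(event))


# The user hits Tab on a suggestion: a clear positive signal.
emit(SuggestionEvent("sugg-123", "accepted", "python", time.time()))

# A separate periodic job can later check whether the accepted code still
# exists in the repository after several commits: the code-longevity signal.
```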

5. Build for “Controlled Autonomy,” Not Unbounded Agents

The initial vision for Postman’s Postbot was radical: “maybe natural language is the new UX and can we get rid of pixel-based UI.” Could every click and navigation become an AI action? While pursuing this ambitious future, the team explored the spectrum of automation. On one end were fully autonomous agents like AutoGPT, which proved too uncontrolled and unpredictable for a production environment. On the other was traditional, rigid automation that lacked the flexibility to handle complex workflows.

The team landed on a powerful middle ground they termed “controlled autonomy.” This approach operates like a state machine, where the agent can act and observe its environment but only within a defined, predictable, and controlled workflow. This prevents the agent from going off the rails while still providing powerful, multi-step capabilities—a pragmatic step toward an ambitious, UI-less future.

This architectural foresight meant the team built a system from scratch that mirrored the functionality of modern agentic frameworks like LangGraph before they were even released. The system was designed to be a modular platform, allowing other teams at Postman to easily plug their features in as new “tools” for the agent to use.
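A minimal sketch of the controlled-autonomy idea: the agent can observe and act, but only along transitions and tools the team has explicitly registered. The states, tools, and transition table here are invented for illustration, not Postbot’s actual design.

```python
# Illustrative "controlled autonomy" loop: the agent moves through a fixed
# state machine and may only call tools that other teams have registered.
ALLOWED_TRANSITIONS = {
    "plan": {"call_tool", "respond"},
    "call_tool": {"observe"},
    "observe": {"call_tool", "respond"},
    "respond": set(),  # terminal state
}

# Features plug in as tools; the agent cannot invent actions outside this set.
TOOLS = {
    "run_api_request": lambda args: {"status": 200, "body": "..."},
    "generate_test": lambda args: "pm.test(...)",
}


def step(state: str, next_state: str) -> str:
    if next_state not in ALLOWED_TRANSITIONS[state]:
        raise ValueError(f"transition {state} -> {next_state} is not allowed")
    return next_state


# Example walk through the workflow (in production, an LLM would pick the
# next allowed transition and tool, but it still cannot leave this graph).
state = "plan"
state = step(state, "call_tool")
result = TOOLS["run_api_request"]({"url": "/users"})
state = step(state, "observe")
state = step(state, "respond")
print("finished in state:", state, "with tool result:", result)
```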

6. Your Golden Rule for Shipping Fast: Keep Control on the Backend

This is a crucial “war wound” lesson for any team looking to ship AI products quickly and safely. The core principle is simple but powerful: keep as much control as possible on the backend, not hardcoded into the client application.

App releases are slow to roll out and incredibly difficult to roll back. AI is a new and fast-moving technology where things frequently break or need rapid iteration—a prompt might need a small tweak, or a model behavior might change unexpectedly. If that logic is baked into the client, fixing it requires a full app deployment. This is exactly why you don’t hardcode your AI actions in the client. By keeping control on the backend, teams can push fixes and updates instantly, dramatically increasing iteration speed and safety.

This strategy can be enhanced by using prompt management tools that allow you to version prompts and prompt partials separately from your main codebase. This decouples the AI logic from the application release cycle, giving you the agility needed to build and maintain a high-quality AI product.
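Concretely, the client stays thin and only sends user intent; the prompt template, its version, and the model call all live behind the backend, so a prompt tweak never requires an app release. A hedged sketch, with the template names, versioning scheme, and handler shape all assumed for illustration:

```python
# Backend-first sketch: the prompt is versioned server-side, so bumping it
# to a new version needs no client deployment.
PROMPT_TEMPLATES = {
    "debug_api_error:v2": (
        "You are an API debugging assistant.\n"
        "Request details: {request}\nError: {error}\n"
        "Suggest a fix."
    ),
}


def call_llm(prompt: str) -> str:
    # Stand-in for the real model call, which also stays on the backend.
    return "stubbed model response for: " + prompt[:40]


def handle_debug_request(payload: dict) -> str:
    """Backend handler: the client never sees the prompt or the model."""
    template = PROMPT_TEMPLATES["debug_api_error:v2"]
    prompt = template.format(request=payload["request"], error=payload["error"])
    return call_llm(prompt)


# The client only ships user intent as structured data.
print(handle_debug_request({"request": "GET /users", "error": "401 Unauthorized"}))
```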

Final thought:

The journey to building successful, production-grade AI products is less about a relentless pursuit of the biggest model and more about thoughtful systems thinking. The most effective solutions emerge from a holistic view of the product, where a seven-parameter gatekeeper model, a backend-first architecture, and an SME-centric workflow are just as critical as the LLM itself. It’s about pragmatic engineering and deep product thinking, not just model worship.

As we build the next wave of AI tools, which of these hard-won lessons challenges our current assumptions the most?

Tune in to Episode 7 of The Product Bulb Podcast on YouTube!
