Building Agentic Workflows: Lessons from Production
What I learned deploying LLM agents in enterprise environments. Tool use patterns, error handling, and why simple beats clever.
Over the past year, I've built and deployed several LLM-powered agent systems in production environments. Some succeeded, some failed spectacularly. Here's what I learned.
The Promise vs. Reality
The demos are impressive. An AI that can browse the web, write code, query databases, and orchestrate complex multi-step workflows. The reality? Most enterprise deployments need something far simpler—and that's actually good news.
The most successful agent deployments I've seen do one thing well. Not ten things poorly.
Pattern 1: Tool Use Should Be Boring
The sexiest part of agent systems—autonomous tool selection—is often the least reliable. In production, I've moved toward explicit tool routing:
# Instead of: "figure out which tool to use"
# Do this: "if query mentions X, use tool Y"
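# (Tool, DatabaseTool, RAGTool, DefaultTool are the app's own tool classes, defined elsewhere)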
def route_query(query: str) -> Tool:
    if "database" in query.lower():
        return DatabaseTool()
    if "document" in query.lower():
        return RAGTool()
    return DefaultTool()
Yes, it's less magical. It's also 10x more reliable and debuggable.
Pattern 2: Fail Loudly, Recover Gracefully
LLMs will hallucinate. Tools will fail. Networks will time out. Your agent needs to handle all of this without lying to users.
My rule: Never let the agent make up an answer when a tool fails. Instead:
- Acknowledge the failure explicitly
- Offer alternatives or manual fallbacks
- Log everything for debugging
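Here's a minimal sketch of what that looks like in practice. The call_tool_safely helper and the tool.run() interface are names I'm using for illustration, not part of any framework; the point is that a failure produces an honest message and a log entry, never an invented answer.

import logging

logger = logging.getLogger("agent")

def call_tool_safely(tool, query: str) -> dict:
    """Run a tool and report failures explicitly instead of hiding them."""
    try:
        # Assumes each tool exposes a run() method; adapt to your own interface.
        return {"ok": True, "result": tool.run(query)}
    except Exception:
        # Log everything for debugging, then tell the user the truth.
        logger.exception("Tool %s failed for query %r", type(tool).__name__, query)
        return {
            "ok": False,
            "user_message": (
                f"I couldn't complete that step because {type(tool).__name__} failed. "
                "You can retry, or I can try a different approach."
            ),
        }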
Pattern 3: Human-in-the-Loop by Default
Full autonomy sounds cool until your agent sends an email to your CEO at 3am with hallucinated data.
Start with confirmation for all external actions:
- ✅ Read from database → auto-approve
- ⚠️ Send email → require confirmation
- ⚠️ Modify data → require confirmation
- 🚫 Delete anything → manual only
You can relax these constraints over time as you build confidence. But starting strict saves you from disasters.
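If it helps, here's roughly how I encode that kind of policy. The action names and the Approval enum below are illustrative, not a prescribed schema; the detail that matters is that unknown actions default to the strictest tier.

from enum import Enum

class Approval(Enum):
    AUTO = "auto-approve"
    CONFIRM = "require confirmation"
    MANUAL_ONLY = "manual only"

# Hypothetical action names; map each external action to an approval tier.
ACTION_POLICY = {
    "read_database": Approval.AUTO,
    "send_email": Approval.CONFIRM,
    "modify_record": Approval.CONFIRM,
    "delete_record": Approval.MANUAL_ONLY,
}

def approval_for(action: str) -> Approval:
    # Anything you haven't classified yet is treated as manual-only.
    return ACTION_POLICY.get(action, Approval.MANUAL_ONLY)

Relaxing a constraint then becomes a one-line change you can review, rather than behavior that drifts quietly.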
Pattern 4: Observability is Everything
You cannot debug what you cannot see. Every agent system needs:
- Full conversation logs — every message, every tool call
- Latency tracking — where is time being spent?
- Cost tracking — token usage per request
- Error rates by tool — which integrations are flaky?
I use a simple structured logging approach that captures all of this without adding complexity to the agent code itself.
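As a rough sketch of that idea (not my exact code), a thin wrapper around each tool call can emit one structured record with the tool name, latency, and error status; token counts get merged in from the model response when they're available.

import json
import logging
import time

telemetry = logging.getLogger("agent.telemetry")

def log_tool_call(tool_name: str, fn, *args, **kwargs):
    """Run one tool call and emit a single structured log record for it."""
    start = time.perf_counter()
    record = {"tool": tool_name, "error": None}
    try:
        return fn(*args, **kwargs)
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        # Token usage per request would be added here if the tool/model returns it.
        telemetry.info(json.dumps(record))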
What Actually Works in Production
After all the experiments, here's my current stack for enterprise agent deployments:
- Claude or GPT-4 for reasoning (not fine-tuned models)
- Explicit tool routing over autonomous selection
- Synchronous execution over complex async orchestration
- Confirmation UIs for any external action
- Simple Python over agent frameworks
The last point is controversial. LangChain, CrewAI, AutoGen—they're all interesting. But for production systems where reliability matters, I've found that plain Python with a clear architecture beats magic abstractions every time.
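To make "plain Python with a clear architecture" concrete, here's the shape of a request handler that ties the earlier sketches together: synchronous, explicit, and boring. The names (route_query, approval_for, Approval, call_tool_safely, user_confirm) come from the sketches above and stand in for whatever your system actually uses.

def handle_request(query: str, action: str, user_confirm) -> str:
    """One synchronous pass: policy check, explicit routing, guarded execution."""
    policy = approval_for(action)                 # human-in-the-loop (Pattern 3)
    if policy is Approval.MANUAL_ONLY:
        return "That action is manual-only, so I won't do it automatically."
    if policy is Approval.CONFIRM and not user_confirm(action):
        return "Okay, cancelled."

    tool = route_query(query)                     # explicit routing (Pattern 1)
    outcome = call_tool_safely(tool, query)       # fail loudly (Pattern 2)
    return str(outcome["result"]) if outcome["ok"] else outcome["user_message"]

No orchestration graph, no callbacks, no hidden retries: every step is a line you can read, log, and test.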
Conclusion
Building agentic systems is less about the AI and more about good software engineering. Clear boundaries, explicit error handling, comprehensive logging, and human oversight.
The agents that succeed in production aren't the most autonomous—they're the most predictable.