Getting the most from your coding agent

When I first started using agentic LLM coding assistants, I was thrilled by their potential but also often frustrated by their unpredictability.

Some days, they produced near-perfect code; other times, inexplicable errors crept in, despite similar instructions. In fairness, sometimes it was me and sometimes it was them.

It didn’t take long to realize that AI coding isn’t just about the model ... it’s about how you interact with it ... what tools you use ... and it’s sometimes influenced by factors completely outside of your control.

📝 Precision in Prompts and Rules: The quality of your prompts and the clarity of system-level instructions drastically impact output. Structured approaches like SPARC excel because of their robust rule sets, precise directives, and iterative build/test/reflect/improve loops. Some agents (like Replit) try to incorporate these best practices behind the scenes, improving the quality of the output. Tools like Cursor and Augment Code work hard to refine system-level prompts and augment baseline models, which makes results more consistent and effective.
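To make the loop concrete, here is a minimal sketch of what a build/test/reflect/improve cycle looks like if you drive it yourself. `ask_agent` is a hypothetical stand-in for whatever agent or API you actually use, and the sketch assumes the agent edits files in your repo and that `pytest` runs your test suite; adapt the details to your own setup.

```python
import subprocess

def ask_agent(prompt: str) -> str:
    """Hypothetical stand-in for your coding agent; assumed to edit repo files directly."""
    raise NotImplementedError("wire this up to your own agent or LLM API")

def build_test_reflect(task: str, max_rounds: int = 3) -> bool:
    """Rough sketch of an iterative build/test/reflect/improve loop."""
    ask_agent(f"Implement this task, editing the repository directly:\n{task}")
    for _ in range(max_rounds):
        # Build/test: run the project's test suite after each change.
        result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
        if result.returncode == 0:
            return True  # tests pass -> accept this iteration
        # Reflect/improve: feed the failures back and ask for a root-cause fix.
        ask_agent(
            "The tests failed. Review the output below, explain the root cause, "
            f"then fix the code:\n{result.stdout}\n{result.stderr}"
        )
    return False  # still failing after max_rounds -> time for human review
```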

🤝 Managing Development Dialogue: Engaging thoughtfully with the agent throughout the entire development process is essential. Providing clear instructions, encouraging iteration and refinement, asking it to evaluate its own work, and structuring the dialogue deliberately all help maintain quality, even when external factors cause performance fluctuations. Don't rush. Be precise. Make it test its own work.

⏳ Timing and System Load: Quality can vary with time of day and server demand. In Eastern Time, mid-to-late afternoons and occasionally even weekends (vibe-code mania) seem to bring performance dips, likely due to heavier load and how resources get allocated ... while late evening sees a rebound. Recognizing these patterns allows for smarter task scheduling. If switching tools doesn't help, take a break, go for a walk, listen to a podcast, do some product research on Perplexity ... even do something analog 😀

🔄 Keeping Up with Platforms: Staying current on the latest LLM versions is crucial. The capability difference between models like Claude 4 vs. Claude 3.5, or o3 vs. Gemini 2.5 Pro, across diverse problem sets is material. Some models are clearly better at certain things than others. Knowing when to request deep thinking also matters, certainly on a cost-benefit basis. Get the model to document its work so there is a long-term record of what it has done and what it has been thinking. This really helps with limited context windows.
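One way to make that documentation pay off is to re-seed each fresh session from the log rather than relying on the model's memory of a long chat. This is a minimal sketch under assumed conventions: a hypothetical WORKLOG.md file that you ask the agent to keep updated, with only its most recent tail pasted into the opening prompt of a new session.

```python
from datetime import date
from pathlib import Path

WORKLOG = Path("WORKLOG.md")  # hypothetical file the agent is asked to keep updated

def log_entry(summary: str) -> None:
    """Append a dated note on what was done, why, and what is still open."""
    with WORKLOG.open("a", encoding="utf-8") as f:
        f.write(f"\n## {date.today().isoformat()}\n{summary}\n")

def seed_prompt(task: str, max_chars: int = 4000) -> str:
    """Open a fresh session with the tail of the work log instead of the full chat history."""
    history = WORKLOG.read_text(encoding="utf-8") if WORKLOG.exists() else ""
    return (
        "Recent work log for this project:\n"
        f"{history[-max_chars:]}\n\n"
        f"Continue from where it leaves off. Next task:\n{task}"
    )
```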

So, while LLM variability presents challenges, optimizing prompts, structuring interactions, being strategic in timing, and leveraging the right tools can significantly improve results.

Welcome to the entropy of emergent systems.