Spec-driven development (SDD) with AI: Making agents enterprise ready
Vibe coding is great for side projects, but not for enterprise-level development. Spec-driven development (SDD) offers an alternative approach.
Jun 18, 2026 • 10 Minute Read
Vibe coding with an agent is fast. But in enterprise settings, developers quickly hit a ceiling when they fail to replicate the same success they experience with side projects and prototypes. Spec-driven development (SDD) with AI is an approach that seeks to address these gaps. It’s also one the most productive teams are already practicing, whether they recognize it or not.
In this article, I’ll cover what spec-driven development is, the problems it’s designed to solve, and how to map a path from zero-governance vibe coding to SDD with AI.
The problem: Your team is on zero-governance vibe coding
You lead a backend team at a fintech startup. Your developers adopted AI coding agents six months ago and velocity tripled. Features that used to take a week shipped in two days. So far, so good.
Then, the cracks start to appear.
The first developer’s agent builds authentication with JSON Web Tokens (JWT). The second uses session cookies. The third builds a payments endpoint with no input validation.
The code compiles. The tests pass. But then, the security auditor asks: “Who decided to skip idempotency keys on the charges endpoint, and why?”
Nobody can answer. The decision was made by an LLM in a chat window which no longer exists.
Welcome to the vibe coding hangover
In early 2025, Andrej Karpathy coined the term vibe coding: an approach where you use LLMs to develop your code by fully “giving in to the vibes” and forgetting the code even exists.
For prototypes and side projects, this worked perfectly. Then, the enterprise adopted it.
Google’s DORA research found that following AI adoption, there was a 7.2% decrease in delivery stability. And after Amazon mandated 80% AI assistant usage, it suffered a six-hour outage affecting roughly 6.3 million orders. Meanwhile, nearly half of all AI-generated code contains security vulnerabilities.
How vibe coding fails to meet enterprise requirements
The problem is not that agents write bad code. The problem is that vibe coding cannot provide four properties enterprise teams need:
Auditability: Who decided what.
Reproducibility: Same input, same output.
Cost effectiveness: Less rework, less token waste.
Governance: Compliance trails.
With these limitations, developers quickly hit a ceiling when vibe coding in enterprise settings. The agent is fast, but falls short.
Spec-driven development is a discipline that seeks to address these gaps, and it’s one that the most productive teams already practice, whether they recognize it or not.
What is spec-driven development with AI?
Spec-driven development (SDD) with AI is the practice of writing specifications and providing them to an AI coding agent, rather than prompting and re-prompting. A human defines the structure: acceptance criteria, architecture decisions, and domain rules. The agent then codes and regularly validates its work against these specifications.
AI coding assistants are more than chat windows. These solutions have real internal machinery that developers can utilize to achieve better results. For example:
Cursor reads structured rules from .cursor/rules/ folders.
Claude Code reads hierarchical CLAUDE.md files that cascade from root down through subdirectories.
Both also support custom slash commands, hooks, and sub-agents. SDD is about writing specifications into that machinery.
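To make the cascade idea concrete, here is a minimal Python sketch of how context files could be collected from the repository root down to the directory being worked on. It illustrates the concept only and is not how Claude Code actually loads its files; the function name `cascading_context_files` and the example directory layout are assumptions made for illustration.

```python
from pathlib import Path


def cascading_context_files(repo_root: Path, working_dir: Path,
                            filename: str = "CLAUDE.md") -> list[Path]:
    """Return context files from repo_root down to working_dir, broadest first."""
    repo_root = repo_root.resolve()
    working_dir = working_dir.resolve()
    # relative_to raises ValueError if working_dir is outside the repository.
    relative = working_dir.relative_to(repo_root)

    # Build the chain of directories: root, then each nested level in turn.
    directories = [repo_root]
    for part in relative.parts:
        directories.append(directories[-1] / part)

    # Keep only the levels that actually contain a context file.
    return [d / filename for d in directories if (d / filename).is_file()]
```

Called with a repository root and a hypothetical `services/payments` subdirectory, the sketch returns the root file first and the most specific file last, which is the order in which narrower guidance would refine broader guidance.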
To use the analogy of hiring a building contractor:
Vibe coding is like describing rooms to a contractor over the phone: They build fast, but every room is a surprise. They follow building codes (your rules files), but these don’t tell them what to build, just how to build it safely.
SDD is like handing a contractor architectural blueprints: With the plans, they know what each room does and what they need to do.
Just like a contractor needs both the building codes and the blueprints to build what you want, the AI coding assistant needs the rules files and your architectural plans to execute successfully.
Thankfully, unlike real-world architecture, you don’t need to hire an expert to write those plans. You can simply decide what you want to build before you build it, and write that decision down in a format the agent can consume.
The SDD with AI Maturity Spectrum: Moving from raw vibes to enterprise outcomes
There are six stages of maturity in the journey towards spec-driven development with AI. Below is a table which maps each maturity level against the four enterprise dimensions mentioned earlier.
| Level | Auditability | Reproducibility | Cost Effectiveness | Governance |
|---|---|---|---|---|
| 0: Raw vibe coding | None. Decisions lost in chat. | None. Every session diverges. | Quadratic token waste. | Zero compliance trail. |
| 1: Single rules file | Minimal. Rules tracked in git. | Low. Style consistent, architecture not. | Some reduction in rework. | Rules exist but unscoped. |
| 2: Scoped rules | Partial. Conventions versioned. | Moderate. Style governed per directory. | Less drift, fewer iterations. | Standards enforced, decisions not. |
| 3: Spec-first | Strong. Spec records what and why. | High. Same spec, consistent output. | Front-loaded thinking cuts loops. | Spec serves as compliance evidence. |
| 4: Spec-anchored | Full. Spec evolves with code in repo. | High. Spec and code stay in sync. | Minimal rework. | Living audit trail. |
| 5: Constitutional SDD | Complete. Principle-to-code traceability. | Highest. Constraints enforced by construction. | Optimized. Validation prevents waste. | Machine-verifiable compliance proof. |
Here are the levels broken down into more detail.
Level 0: Raw vibe coding
This level is defined by raw prompting: no persistent context, every session starts from scratch, and every repeated explanation burns tokens you have already spent.
Most readers have already moved past raw vibe coding, and have likely progressed to a later maturity level.
Level 1: Single rules file
This level is defined by moving beyond raw vibe coding to leveraging the single rules file, like the original .cursorrules or a basic CLAUDE.md dropped in the project root. It is git-tracked, which is better, but it quickly becomes bloated and contradictory as the team piles in preferences.
Level 2: Scoped rules
This is where most teams sit today: scoped rules systems. Cursor's .cursor/rules/ directory uses .mdc files with frontmatter that scopes rules by glob pattern. Claude Code reads hierarchical CLAUDE.md files that cascade from global settings down through project root into subdirectories. The emerging AGENTS.md standard, now under Linux Foundation governance, aims to provide a single format that all tools can read.
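As a rough illustration of glob-based scoping, the Python sketch below selects which rules apply to a given file path. It is not Cursor's .mdc parser: the `ScopedRule` structure, the rule names, and the rule text are all hypothetical, and the standard library's `fnmatchcase` is only an approximation of real glob semantics because its `*` also matches across directory separators.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class ScopedRule:
    """A hypothetical rule: a name, the globs it applies to, and its text."""
    name: str
    globs: tuple[str, ...]
    body: str


def rules_for(path: str, rules: list[ScopedRule]) -> list[str]:
    """Return the names of every rule whose glob matches the given path."""
    return [
        rule.name
        for rule in rules
        if any(fnmatchcase(path, pattern) for pattern in rule.globs)
    ]


# Illustrative rules only; none of these come from a real project.
RULES = [
    ScopedRule("python-style", ("*.py",), "Use type hints on public functions."),
    ScopedRule("payments", ("services/payments/*",), "Validate all request input."),
    ScopedRule("frontend", ("web/*.tsx",), "Prefer function components."),
]

print(rules_for("services/payments/charges.py", RULES))
# prints ['python-style', 'payments']
```

The point of scoping is visible in the output: a payments file picks up the payments rule without the frontend rule ever entering the agent's context.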
Level 2 is a real improvement on the earlier stages:
Styles are consistent.
Context is version-controlled.
Teams share conventions.
But here is what it does not solve:
Two developers can still prompt the agent to implement contradictory features.
Rules encode how to code, not what to build.
You can’t prove what architectural decisions were made and why.
ETH Zurich found that adding more context files actually increases agent steps without improving success rates. Only high-signal, non-inferable content helps. The answer is not more rules. It is a different kind of artifact entirely.
Level 3: Spec-first
Spec-first means you write a specification, use it to guide the agent, and may discard it afterward. This is a major improvement over using rules alone, because while rules solve style drift, specs solve architectural drift. Kognitos documented logic drift as one of the five enterprise failure modes of vibe coding: the agent silently makes architectural choices that nobody reviews and nobody records.
Back to the analogy: rules are building codes. You also need blueprints. A specification that tells the agent what to build, why, what the acceptance criteria are, and how to verify it is done. Not documentation written after the fact. A contract written before the first line of code.
Most productive teams already practice this somewhat informally. They write markdown spec files, run /architect sessions, and build custom CLAUDE.md hierarchies, all without calling it SDD or adopting any SDD-specific tool. You do not need Spec Kit or Kiro to do SDD.
Moving to Level 3 is a weekend investment: write a markdown file with acceptance criteria before your next non-trivial feature and feed it to the agent as context.
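A spec of this kind can be very small. The Python sketch below models one as a data structure, checks that the "why" and the acceptance criteria are present, and renders it to markdown that could be handed to an agent as context. The `FeatureSpec` class, its section names, and the example content about idempotency keys are assumptions for illustration, not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSpec:
    """A hypothetical minimal spec: what to build, why, and how to verify it."""
    title: str
    what: str
    why: str
    acceptance_criteria: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)

    def problems(self) -> list[str]:
        """List the gaps that would make this spec useless as a contract."""
        issues = []
        if not self.why.strip():
            issues.append("missing rationale (why)")
        if not self.acceptance_criteria:
            issues.append("no acceptance criteria")
        return issues

    def to_markdown(self) -> str:
        """Render the spec as markdown suitable for pasting into agent context."""
        lines = [f"# Spec: {self.title}", "", "## What", self.what, "",
                 "## Why", self.why, "", "## Acceptance criteria"]
        lines += [f"- [ ] {item}" for item in self.acceptance_criteria]
        lines += ["", "## Edge cases"]
        lines += [f"- {item}" for item in self.edge_cases]
        return "\n".join(lines) + "\n"


# Example content is invented for illustration.
EXAMPLE = FeatureSpec(
    title="Idempotent charges endpoint",
    what="Accept an idempotency key on every charge request.",
    why="Retried requests must never charge a customer twice.",
    acceptance_criteria=[
        "A request without an idempotency key is rejected.",
        "Replaying a key returns the original response.",
    ],
    edge_cases=["Same key, different payload."],
)

if not EXAMPLE.problems():
    print(EXAMPLE.to_markdown())
```

Recording the "why" next to the acceptance criteria is what lets a reviewer later answer who decided what, which a chat transcript cannot do.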
Level 4: Spec-anchored
Spec-anchored means the spec lives alongside the code and evolves together, being maintained throughout the system’s lifecycle. Changes to behavior require updating both the spec and the code, keeping them synchronized. This requires a greater tooling commitment, but the governance payoff is transformational for regulated industries.
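One lightweight way to keep spec and code synchronized is a check that flags code changes arriving without a matching spec change. The Python sketch below assumes a hypothetical repository convention (code under `src/<feature>/`, specs under `specs/<feature>.md`); the function name and directory layout are illustrative and would need adapting to a real project.

```python
from pathlib import PurePosixPath


def unsynced_changes(changed_files: list[str],
                     code_dir: str = "src",
                     spec_dir: str = "specs") -> list[str]:
    """Return changed code files whose feature spec was not changed with them."""
    specs_touched: set[str] = set()
    code_by_feature: dict[str, list[str]] = {}

    for name in changed_files:
        parts = PurePosixPath(name).parts
        if len(parts) == 2 and parts[0] == spec_dir and name.endswith(".md"):
            # specs/payments.md covers the "payments" feature.
            specs_touched.add(PurePosixPath(name).stem)
        elif len(parts) >= 3 and parts[0] == code_dir:
            # src/payments/charges.py belongs to the "payments" feature.
            code_by_feature.setdefault(parts[1], []).append(name)

    return sorted(
        name
        for feature, names in code_by_feature.items()
        if feature not in specs_touched
        for name in names
    )


print(unsynced_changes([
    "src/payments/charges.py",
    "specs/payments.md",
    "src/auth/login.py",
]))
# prints ['src/auth/login.py']
```

In practice the list of changed files could come from `git diff --name-only` in a CI job, with a non-empty result prompting a reviewer to ask whether the spec should have changed too.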
Level 5: Constitutional SDD
For regulated industries, standard SDD may not be enough. Constitutional SDD, introduced by Marri (2026) on arXiv, takes it further by embedding non-negotiable security constraints as a versioned "constitution." Each principle maps directly to CWE and MITRE Top 25 entries with RFC 2119 enforcement levels: MUST, SHOULD, or MAY. A principle like SEC-001 maps to CWE-79 (cross-site scripting) and carries a MUST enforcement level, meaning the agent cannot generate code that violates it.
The results from a banking microservices case study are striking: 73% reduction in security defects compared to unconstrained generation, with full traceability from constitutional principles to specific code locations at file and line number granularity. This is governance by construction. Security is baked into the spec layer so AI-generated code adheres by default, rather than being caught by inspection after the fact.
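The shape of such a constitution can be sketched in a few lines of Python. This is not the format from the Marri paper: only the SEC-001 to CWE-79 mapping with a MUST level comes from the description above, while the `LOG-001` principle, the class names, and the `gate` function are hypothetical. Detecting violations is out of scope here; the findings are assumed to come from some external scanner.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    """RFC 2119 enforcement levels."""
    MUST = "MUST"
    SHOULD = "SHOULD"
    MAY = "MAY"


@dataclass(frozen=True)
class Principle:
    id: str
    cwe: str
    level: Level
    text: str


@dataclass(frozen=True)
class Finding:
    """A violation reported by a (hypothetical) scanner, with its location."""
    principle_id: str
    file: str
    line: int


CONSTITUTION = {
    p.id: p
    for p in [
        Principle("SEC-001", "CWE-79", Level.MUST, "Escape untrusted output."),
        # Hypothetical second principle, added only to show a non-blocking level.
        Principle("LOG-001", "CWE-532", Level.SHOULD, "Keep secrets out of logs."),
    ]
}


def trace(finding: Finding) -> str:
    """Link a finding back to its principle, CWE entry, and code location."""
    p = CONSTITUTION[finding.principle_id]
    return f"{p.id} ({p.cwe}, {p.level.value}) at {finding.file}:{finding.line}"


def gate(findings: list[Finding]) -> tuple[bool, list[str]]:
    """Return (passed, report); only MUST-level violations block the change."""
    report = [trace(f) for f in findings]
    blocked = any(
        CONSTITUTION[f.principle_id].level is Level.MUST for f in findings
    )
    return (not blocked, report)
```

The report lines give the principle-to-location traceability, and versioning the `CONSTITUTION` alongside the code would let an auditor see which principles were in force for any given commit.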
The regulatory pressure makes this urgent. EU AI Act obligations for high-risk systems begin in August 2026, with fines reaching 35 million euros or 7% of global turnover. Yet Deloitte found that only one in five companies has mature governance for autonomous AI agents. For fintech, healthcare, and other regulated industries, constitutional SDD closes the loop between "the agent wrote it" and "we can prove it meets compliance requirements."
The benefits of SDD with AI for enterprises
The enterprise evidence for SDD with AI is real, not theoretical.
The NYSE CTO described "rewiring our engineering process" with Claude Code, building custom agents that take Jira tickets through to committed code with internal spec-to-implementation flows, processing over a trillion messages on peak trading days.
Box has 85% of its developers on Cursor daily with .cursor/rules/ enforcing team conventions, and reports a 30 to 50% increase in roadmap throughput along with 80 to 90% faster migrations.
Prezi tried Spec Kit at a company offsite where engineers built complete apps in one to four hours. A staff engineer called the experience "both terrifying and exciting".
Spotify merges over 650 AI-generated pull requests per month with a 90% reduction in migration engineering time.
An unnamed financial services company in the arXiv study achieved 75% API cycle time reduction using spec-driven contract validation.
None of these companies call it SDD. All of them practice the discipline.
The workflow, whether custom or tool-assisted, follows a consistent pattern:
Specify what and why (acceptance criteria, edge cases)
Plan the technical approach (stack, architecture, constraints)
Break the plan into discrete testable tasks
Let the agent implement task by task using the spec as context.
Human checkpoints sit at each gate. The tools are optional scaffolding: Spec Kit is open-source with 84.7K stars and support for 14 or more agent platforms, Kiro is AWS-native with EARS notation and Agent Hooks, and OpenSpec takes a lightweight proposal-first approach. Or you can use your own markdown templates. The discipline matters more than the tool.
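The four steps and their checkpoints can be sketched as a small gated pipeline. In the Python below, `produce` stands in for whatever generates each artifact (a person, an agent, or a tool) and `approve` stands in for the human checkpoint; both callbacks, the stage names, and the function itself are illustrative assumptions rather than the API of Spec Kit, Kiro, or OpenSpec.

```python
from typing import Callable

# The four steps of the workflow, in order.
STAGES = ["specify", "plan", "tasks", "implement"]


def run_workflow(produce: Callable[[str, dict[str, str]], str],
                 approve: Callable[[str, str], bool]) -> dict[str, str]:
    """Run each stage in order and stop at the first artifact a human rejects."""
    artifacts: dict[str, str] = {}
    for stage in STAGES:
        # Each stage sees only the artifacts approved at earlier gates.
        artifact = produce(stage, dict(artifacts))
        if not approve(stage, artifact):
            break
        artifacts[stage] = artifact
    return artifacts


def draft(stage: str, prior: dict[str, str]) -> str:
    """Stand-in producer; a real one would call an agent or open an editor."""
    return f"{stage} draft ({len(prior)} inputs)"


approved = run_workflow(draft, approve=lambda stage, artifact: True)
print(list(approved))
# prints ['specify', 'plan', 'tasks', 'implement']

halted = run_workflow(draft, approve=lambda stage, artifact: stage != "plan")
print(list(halted))
# prints ['specify']
```

Because a rejected plan stops the run before any tasks or code exist, the expensive stages never execute against a design nobody signed off on.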
The waterfall critique of SDD with AI
One critique is that heavy up-front specification and big-bang releases turn SDD into an anti-pattern that repeats the typical drawbacks of the waterfall model. For example:
Marmelab tested Spec Kit on a date feature and generated 1,300 lines of spec for displaying the current date.
Böckeler, a Thoughtworks Distinguished Engineer, tried Kiro on a small bug and got four user stories with sixteen acceptance criteria, calling it "using a sledgehammer to crack a nut".
The METR randomized controlled trial found a 19% slowdown when experienced developers used AI tools on real-world open source tasks.
And while these are fair criticisms, there’s another side of this: waterfall feedback loops were measured in months, but SDD with AI agents iterates in minutes.
The answer is calibrating rigor to task size. Use a markdown file for small changes. Use a full spec workflow for production features. Marmelab itself pivoted to building vibe-spec, which generates specs from agent logs, inverting the workflow entirely. That is still SDD. It is just a different entry point.
Cost fits naturally into this picture. Agent loops without specs are quadratic: each iteration resends growing conversation history. Specs front-load the thinking, reducing iterations. Teams using focused CLAUDE.md files with "decisions not descriptions" report 20% token reduction. Combined context strategies yield 40 to 70% reduction in API spend. For teams spending $500 to $2,000 per month per developer on API calls, even 30% less rework pays for itself.
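The quadratic growth is easy to check with a little arithmetic. The Python sketch below totals the input tokens sent when every request resends the full conversation history. The token counts are made-up round numbers, and the model deliberately ignores output tokens, prompt caching, and context compaction, all of which change real bills.

```python
def tokens_sent(iterations: int, context_tokens: int, turn_tokens: int) -> int:
    """Total input tokens when every request resends the whole history."""
    total = 0
    history = context_tokens
    for _ in range(iterations):
        total += history        # the full history goes out again
        history += turn_tokens  # and grows by one more exchange
    return total


# Illustrative numbers only: 2,000 tokens of starting context,
# 1,000 tokens added to the history per iteration.
print(tokens_sent(10, 2_000, 1_000))  # prints 65000
print(tokens_sent(20, 2_000, 1_000))  # prints 230000

# A larger up-front spec (5,000 tokens) that converges in 4 iterations.
print(tokens_sent(4, 5_000, 1_000))   # prints 26000
```

The total follows the closed form n * c + t * n * (n - 1) / 2 for n iterations, starting context c, and per-turn growth t. Doubling the iterations from 10 to 20 more than triples the tokens sent in this example, which is why cutting the number of loops matters more than trimming any single prompt.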
How to upgrade your SDD with AI maturity level
If you are at Level 0
Add a CLAUDE.md or AGENTS.md to your repo today. Fifteen minutes of setup gives you immediate drift reduction and a shared baseline for every developer on the team.
If you are at Level 1 or 2
Write a markdown spec before your next non-trivial feature: acceptance criteria, edge cases, architectural constraints. Feed it to the agent as context and see if review cycles drop.
If you are at ten or more developers, evaluate Spec Kit or Kiro for workflow scaffolding, but remember that the spec files themselves are more valuable than the tool that generates them.
Beware of vendor lock-in
One warning about vendor lock-in, earned by evidence: OpenAI's Assistants API is sunsetting in August 2026. Claude Code went from nonexistent to the top AI coding tool in eight months. Cursor hit two billion dollars in annual recurring revenue but pricing shifts made teams roll back. Enterprise AI tool spend shifted from roughly 50% OpenAI to roughly 40% Anthropic in under a year.
The only durable investment is plain-markdown specification files and open standards like AGENTS.md. Invest in the discipline, not the tool, because the tools will change. The specs survive.
Conclusion
Karpathy moved from coining "vibe coding" to proposing "agentic engineering" in just one year. The community is moving from "prompt and pray" to "specify and verify." The developers who master spec-writing will define the next era of software development, because the agent writes the code, but the spec decides what gets built.
If you're interested in expanding your knowledge on AI systems, check out some of Axel Sirota's other guides:
- Multi-agent systems with MCP: Building AI teams that share tools
- Meter before you manage: How to cut LLM costs by up to 85%
- How to create agents with LlamaIndex
- How to use LangChain and LangGraph for Agentic AI
- Securing your RAG application: A comprehensive guide
- How to build a multimodal agentic RAG AI assistant
Additionally, if you're interested in learning more about implementing agentic AI in production environments, check out Pluralsight's learning path, "Integrating Agentic AI for Developers."