What Building a 15-Module AI Due Diligence Engine Taught Me About Production Agents

Most "AI in five minutes" demos are toys. The model gets a clean prompt, the data is hand-curated, and the output is something nobody would actually paste into a board meeting. The interesting question is what changes when you have to ship — when the system has to handle Apple, Aave, Terra/LUNA, and a memecoin without falling over, and produce a branded PDF a real CFO would read.

I've spent a few months on a system that does exactly that. It's a Claude Code skill called the Due-Diligence Engine. You point it at a company or a token, and it dispatches 5–6 parallel subagents (web scraper, Perplexity researcher, financial data, regulatory, sentiment, and a crypto-only dev-activity agent), scores findings across 15 modules against traffic-light criteria, and produces a consulting-grade PDF with a GO / NO-GO verdict. End-to-end run time: 5 to 8 minutes. Tests: 114. Live runs validated: 11, spanning Apple, NVIDIA, Circle, Ethereum, Solana, Aave, Dogecoin, and the regression case I most wanted to keep clean: Terra/LUNA, which still flags NO-GO as it should.

This is a build log. Three decisions made the difference between a slow toy and a system I'd ship to a paying client. Each one applies far beyond due diligence — they're the same calls that show up on most AI engagements I take through Auto Alpha's stage-gate process.

Decision 1: Parallel subagents over a single chain

The naive shape for an AI analyst is one chain that sequentially fetches data, reasons over it, scores it, and writes the report. It works in a notebook and dies in production. With 15 scoring modules each needing different inputs, a serial chain takes 30+ minutes per run. The total cost in tokens is fine. The wall-clock is the killer — nobody waits half an hour for an answer their analyst would have given them in the same time.

The fix is to parallelise the dispatch. The orchestrator's first job is to figure out which categories of data each module needs (financial filings? on-chain TVL? sentiment? regulatory filings?), batch the requests by category, and fan them out to specialist subagents that all work simultaneously. The web scraper hits Yahoo Finance and CoinGecko, the Perplexity agent runs disambiguation queries, the regulatory agent pulls SEC EDGAR filings, and the dev-activity agent fetches GitHub commit signals, all at the same time. Each returns structured JSON to the orchestrator, which then runs the scoring prompts.
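
The batch-then-fan-out step can be sketched roughly as follows. This is a minimal illustration using Python's asyncio, not the engine's actual code: the module names, the MODULE_NEEDS mapping, and the run_subagent stub are all hypothetical stand-ins.

```python
import asyncio

# Hypothetical mapping from scoring module to the data categories it needs.
MODULE_NEEDS = {
    "valuation": {"financial"},
    "regulatory_risk": {"regulatory"},
    "community": {"sentiment", "dev_activity"},
}

async def run_subagent(category: str, target: str) -> dict:
    """Stand-in for a specialist subagent; a real one would call external APIs."""
    await asyncio.sleep(0.01)  # simulate I/O latency
    return {"category": category, "target": target, "data": {}}

async def dispatch(target: str, modules: list[str]) -> dict[str, dict]:
    # Batch by category so each subagent runs once, however many modules need it.
    categories = sorted(set().union(*(MODULE_NEEDS[m] for m in modules)))
    # Fan out: all subagents run concurrently and results come back in order.
    results = await asyncio.gather(*(run_subagent(c, target) for c in categories))
    return dict(zip(categories, results))
```

Calling asyncio.run(dispatch("AAPL", ["valuation", "community"])) returns one structured result per category. Total latency is roughly that of the slowest subagent rather than the sum of all of them.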

Two side-effects of this design fell out for free.

First, specialisation made each subagent better, not just faster. A single agent trying to do "research" across financial filings, on-chain metrics, and dev activity ends up generic at all of them. Five subagents each with a tight scope can use a shorter, more disciplined prompt and return cleaner data. The financial data subagent doesn't know what "TVL" means. The dev-activity subagent doesn't know what a 10-K is. They don't need to.

Second, failure isolation came along for the ride. If the SEC EDGAR API is rate-limited that day, only the regulatory subagent fails. The system flags regulatory: degraded in the final report rather than crashing the whole run. Same for Etherscan when their token-holder endpoint went paid-tier-only — the on-chain agent gracefully drops to a Perplexity fallback for token economics, the run continues, and the PDF notes the data source. This is non-negotiable for a production system. Real APIs fail. The pipeline can't.
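
One way to get that isolation is to wrap each subagent so an exception becomes a status in the result instead of propagating. The sketch below assumes asyncio; fetch_regulatory and fetch_financial are hypothetical stubs, and the status strings are illustrative.

```python
import asyncio

async def fetch_regulatory(target: str) -> dict:
    # Stand-in for a rate-limited upstream API (hypothetical failure).
    raise TimeoutError("upstream rate-limited")

async def fetch_financial(target: str) -> dict:
    # Stand-in for a healthy data source.
    return {"revenue": "placeholder"}

async def run_isolated(name: str, coro) -> tuple[str, dict]:
    """Run one subagent; convert any failure into a 'degraded' status."""
    try:
        return name, {"status": "ok", "data": await coro}
    except Exception as exc:
        return name, {"status": "degraded", "error": str(exc)}

async def gather_findings(target: str) -> dict[str, dict]:
    # Each subagent is wrapped, so one failure cannot cancel the others.
    pairs = await asyncio.gather(
        run_isolated("regulatory", fetch_regulatory(target)),
        run_isolated("financial", fetch_financial(target)),
    )
    return dict(pairs)
```

Here the regulatory entry comes back marked degraded while the financial entry is intact, so the report stage can note the gap and carry on. A fallback source could be tried inside the except branch before giving up.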

The general lesson: most "AI agents" you read about should be a parallel dispatch with a synthesizer at the end, not a chain. Chains are easier to reason about. Parallel dispatch is what actually scales to real-world latency budgets. Anthropic hit the same wall building their own multi-agent research system: a lead agent orchestrating parallel Claude subagents outperformed single-agent Claude Opus 4 by 90.2% on their internal research eval, and parallelising the subagent dispatch and tool calls cut research time by up to 90% on complex queries. Same architecture, same reason.

Decision 2: Type-aware criteria, not one-size-fits-all scoring

The original scoring prompt was a single rubric: revenue, growth, market position, risk, fundamentals. It worked fine for Apple. It produced nonsense for Dogecoin.

The problem isn't that an LLM can't reason about a memecoin. It's that the criteria for assessing a memecoin are not the same as for an L1 chain, which are not the same as a DeFi protocol, which are not the same as a stablecoin issuer. A memecoin scoring high on "revenue / FDV ratio" is a category error — there's no revenue. A high token-issuance rate is bad for a DeFi protocol and structurally fine for a payments network. Treating these the same produces verdicts that are confidently wrong.

The fix was a token-type detection step that runs before scoring. The system pulls CoinGecko categories, classifies the asset into one of eight types (L1, L2, DeFi, memecoin, AI-DePIN, infrastructure, privacy, other) using a precedence rule (memecoin beats privacy beats L2 beats L1 beats AI-DePIN beats infra beats DeFi beats other), and then routes the run to a type-specific criteria table. The scoring prompts for each module reference the type and apply the right thresholds.
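
The precedence rule is simple to express in code. The sketch below is an assumption about how such a classifier might look: the precedence order comes from the description above, but the KEYWORDS mapping and the category strings are illustrative and may not match CoinGecko's real labels.

```python
# Precedence order as described in the text: earlier entries win.
PRECEDENCE = ["memecoin", "privacy", "L2", "L1", "AI-DePIN", "infrastructure", "DeFi"]

# Hypothetical mapping from lowercase category keywords to asset types.
KEYWORDS = {
    "meme": "memecoin",
    "privacy": "privacy",
    "layer 2": "L2",
    "layer 1": "L1",
    "depin": "AI-DePIN",
    "infrastructure": "infrastructure",
    "decentralized finance": "DeFi",
}

def classify(categories: list[str]) -> str:
    """Return the highest-precedence asset type matched by any category label."""
    matched = {
        asset_type
        for cat in categories
        for keyword, asset_type in KEYWORDS.items()
        if keyword in cat.lower()
    }
    for asset_type in PRECEDENCE:
        if asset_type in matched:
            return asset_type
    return "other"  # nothing matched: fall through to the catch-all type
```

With this ordering, an asset tagged as both a meme and a layer 1 classifies as a memecoin, so it is scored against the memecoin criteria table and not the L1 one.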

The Terra/LUNA regression test is the punchline. Before the type-aware system, the engine would score Terra as a high-revenue stablecoin protocol (technically true, before the depeg). After the rewrite, three independent rules fire (algorithmic-stablecoin classification, supply-pressure conditional weighting, and value-accrual mismatch) and the verdict comes out NO-GO. That outcome was unscripted. The criteria caught it. Worth remembering the stakes: Terra's death spiral erased roughly $40 billion in May 2022, per CNBC. A scoring system that calls that asset a high-revenue stablecoin protocol isn't neutral: a confidently wrong GO verdict is worse than no verdict at all.

The general lesson: production AI lives or dies on whether your evaluation framework matches the shape of the thing you're evaluating. "Better prompts" rarely help when the criteria are wrong. Pick a typology, encode it in the prompt structure, and let the criteria do the heavy lifting.

Decision 3: Held-out evaluation as a release gate

The single biggest mistake I see other AI projects make is shipping by vibes. The team eyeballs a few outputs, declares them "pretty good," and pushes. Two months later, a regression slips in and nobody notices because nobody's measuring. The base rate says this is the norm, not the exception: MIT's State of AI in Business 2025 study found 95% of organisations getting zero return on GenAI despite $30–40 billion in enterprise investment — just 5% of integrated pilots extract measurable value.

For the DD Engine, every release goes through a methodology gate. The change is categorised (A: model/prompt change, B: data source change, C: infrastructure), recall@15 is measured against a held-out partition of the Solodit corpus (so I'm not training and testing on the same examples), and the change doesn't ship unless the mean Δ clears the thresholds. When I went from V2.16a to V2.17, recall@15 doubled, from 25% to 50%, and that result is now a permanent fixture of the test suite. If a future change drops it back to 30%, the build fails.
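
A gate like this needs very little code. The sketch below shows one plausible shape, assuming recall@k is computed per held-out example and averaged. The function names and the min_delta default are hypothetical, since the actual thresholds are not stated here.

```python
def recall_at_k(ranked: list[str], relevant: set[str], k: int = 15) -> float:
    """Fraction of relevant items that appear in the top-k ranked results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def mean_recall(runs: list[tuple[list[str], set[str]]], k: int = 15) -> float:
    """Average recall@k over (ranked results, relevant set) pairs from the held-out set."""
    return sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)

def gate(baseline: float, candidate: float, min_delta: float = 0.0) -> bool:
    """Pass only if the candidate beats the baseline by at least min_delta (assumed threshold)."""
    return candidate - baseline >= min_delta
```

In CI this could be a single assertion such as assert gate(baseline, mean_recall(held_out_runs)), with the baseline stored next to the test. A drop from 0.50 to 0.30 would then fail the build.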

This is unsexy. It is also the thing that turns "an AI prototype" into "a system you can sell." The 114 tests aren't a vanity metric. They're what lets me push a change on a Friday afternoon and know I haven't broken Aave's TVL aggregation, or the L1-chain sector template, or the way the macro overlay handles a flat yield curve.

The general lesson, applied to any AI build: before you write the third prompt, write the harness that tells you whether prompt 1 is still working. Models change. APIs change. Your retrieval index drifts. The teams that ship are the teams that measure. The teams that don't end up in Gartner's 30% abandoned-after-proof-of-concept cohort. Gartner's Rita Sallam was blunt about the mechanism when that prediction landed: "After last year's hype, executives are impatient to see returns on GenAI investments, yet organizations are struggling to prove and realize value." Value you can't measure is value you can't prove, and the harness is what makes it measurable.

What carries over to client work

This is the kind of system Auto Alpha's automations lane is built around. When we run a stage-gate engagement — Map → Pilot → Ship → Run — these three decisions show up at every stage:

Map: pick a typology that matches the client's actual workflow shapes, not a generic one. A property firm scoring tenants is a different problem from a hospitality group scoring suppliers, even if both look like "AI scoring" at first glance.
Pilot: parallel where possible, fail-soft on every external API. The pilot lives or dies on whether it works on a Tuesday morning when one upstream service is degraded.
Ship: a held-out evaluation set before any deploy. If the system can't be regression-tested, the client can't safely change it later — and "can't safely change it" is the single biggest reason engagements turn into hostage situations.
Run: tests in CI on every change. Monitoring on every external dependency. Real cost: minimal. Real value: the system keeps working when nobody's watching.

The DD Engine isn't a client deliverable — it's an internal tool that sharpens how we think about every other build. But the three decisions above are the shape of every production AI system I've seen actually pay back.

If you're considering an AI engagement and you want to talk through what would map to these patterns in your own business, we run a free audit — half an hour, three concrete builds we'd ship for you, sent as a written brief.