multi-agent · retrieval-augmented-generation · enterprise-search · evidence-grounding · agentic-orchestration · long-form-synthesis

77% Win Rate. This Is the RAG Architecture Behind It.

Atlassian's ADORE framework replaces single-pass RAG pipelines with a multi-agent, evidence-audited research loop that achieves a 77% win rate against ChatGPT Deep Research on business consulting tasks and outperforms all competitors on the DeepResearch Bench.

March 29, 2026 · 9 min read

Source Paper

Orchestrating Specialized Agents for Trustworthy Enterprise RAG

Xincheng You, Qi Sun, Neha Bora, Huayi Li, Shubham Goel, Kang Li, Sean Culatana · Atlassian


The Enterprise Research Reports Your AI Is Getting Quietly Wrong

Your RAG system returned an answer. It was fluent, professionally formatted, and cited three internal documents. It also missed the regulatory exposure buried in document 17 of 40, because the system never retrieved document 17. Nobody flagged the gap. The report went to the CFO. A decision was made.

This is not a hallucination in the dramatic sense. The model did not invent a fact. It retrieved real documents accurately and synthesized them correctly. The failure happened upstream, in an architecture that was designed for simple lookup queries and then deployed against research-grade synthesis work it was never built to handle. "Analyze our Q3 vendor risks" is not a search query. It is a research brief. Single-pass RAG treats it like a search query every time.

Researchers at Atlassian published a paper in January 2026 addressing this directly. Their system, ADORE (Adaptive Deep Orchestration for Research in Enterprise), does not patch the RAG pipeline. It replaces the underlying architecture with something that more closely mirrors how a skilled human analyst actually works: clarify the brief first, build a structured evidence dossier, continuously audit for gaps, and only write when the evidence actually supports the claim. On the DeepResearch Bench, a public benchmark of 100 PhD-level research tasks, ADORE scored 52.65 and ranked first on the leaderboard at time of publication. In blind head-to-head evaluations against ChatGPT Deep Research on real business consulting tasks, it won 77.2% of the time and lost 4.4%. That gap is not a marginal quality improvement. It is what happens when you change the architecture entirely.

Why a Locked Evidence Store Changes Everything Downstream

ADORE is built around one central insight: a research report is only as trustworthy as the evidence it was built from, and that evidence must be tracked, constrained, and auditable at every step, not retrieved once and forgotten.

The framework introduces three interlocking mechanisms. Here is what each one does and what breaks without it:

| ADORE Mechanism | What It Does | What Breaks Without It |
| --- | --- | --- |
| Memory-locked synthesis | Report generation is architecturally constrained to a structured evidence store. The AI cannot write a claim it cannot cite. | The model fills evidence gaps with confident inference. The report looks complete. It is not. |
| Evidence-coverage-guided execution | The system continuously audits whether each planned section has sufficient source coverage. It stops only when the standard is met, not when a timer expires. | The system runs a fixed number of retrieval passes and generates a report regardless of coverage. Thin sections go undetected. |
| Section-packed long-context grounding | Each report section receives only the evidence relevant to that section, compressed and citation-preserved. | A 100-page document dump enters the context window. The model reads the beginning and end. Everything in the middle is systematically underweighted. |

The Memory Bank is the structural foundation for all three. It is a persistent claim-evidence graph: every fact and source the system encounters during research is logged with an explicit link between the claim being made and the document supporting it. Nothing enters the final report unless it first lives in the Memory Bank with a traceable source. The audit trail is built into the architecture before a single sentence is written, not bolted on afterward.
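To make the idea concrete, here is a minimal sketch of a claim-evidence store in the spirit of the Memory Bank. The class and field names are illustrative assumptions, not ADORE's actual implementation; the point is the invariant that a claim is citable only if it carries a traceable source.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Evidence:
    doc_id: str    # source document identifier
    excerpt: str   # the supporting passage
    section: str   # report section this evidence is scoped to

@dataclass
class MemoryBank:
    # claim text -> list of Evidence records (the claim-evidence graph edges)
    claims: dict = field(default_factory=dict)

    def record(self, claim: str, evidence: Evidence) -> None:
        """Log a claim with an explicit link to its supporting source."""
        self.claims.setdefault(claim, []).append(evidence)

    def cite(self, claim: str) -> list:
        """A claim may enter the final report only if this returns evidence."""
        return self.claims.get(claim, [])

bank = MemoryBank()
bank.record("Vendor X has unresolved regulatory exposure",
            Evidence(doc_id="doc-17", excerpt="...", section="regulatory"))
assert bank.cite("Vendor X has unresolved regulatory exposure")  # citable
assert not bank.cite("Vendor Y is low risk")  # blocked: no recorded source
```

Because the audit trail is the data structure itself, a reviewer can walk from any sentence back to `doc_id` without reconstructing anything.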

The specialized agent roster exists to serve this architecture. A Grounding Agent handles clarification before any retrieval begins. A Planning Agent converts the refined brief into a structured research outline. An Execution Agent runs the iterative retrieval loop. A Report Generation Agent writes each section using only its section-scoped Memory Bank evidence. The Orchestrator routes between these agents based on query complexity, sending simple factoid questions through a standard retrieval path and complex analytical tasks through the full multi-agent pipeline.
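The routing decision can be sketched in a few lines. The heuristic below (word count plus analytical keywords) is purely an assumption for illustration; the paper does not specify how the Orchestrator classifies complexity.

```python
def route(query: str) -> str:
    """Send simple factoid questions to standard retrieval,
    complex analytical briefs to the full multi-agent pipeline."""
    analytical_markers = ("analyze", "compare", "assess", "evaluate", "risks")
    q = query.lower()
    is_complex = len(q.split()) > 8 or any(m in q for m in analytical_markers)
    return "multi_agent_pipeline" if is_complex else "standard_retrieval"

assert route("What is our VPN policy?") == "standard_retrieval"
assert route("Analyze our Q3 vendor risks") == "multi_agent_pipeline"
```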

The Three Failure Modes RAG Was Never Designed to Catch

Every large organization deploying RAG today is hitting the same walls, whether they have named them or not.

Vague inputs, confident outputs. Business queries are informationally incomplete by nature. "Analyze our Q3 risks" contains no scope, no priority weighting, no success criteria. Standard RAG retrieves against the words in the prompt and produces a fluent, confident-sounding answer to a slightly different question than the one that actually needed answering. The error is invisible until a decision has already been made on it.

Unrecoverable retrieval failure. In a linear pipeline, a missed subtopic in the first retrieval pass simply disappears. There is no mechanism to notice the gap, flag it, and run a targeted follow-up. The report omits the section. Nobody catches it in review because the report does not say "I missed this." It just does not include it.

Long context does not equal long comprehension. This is the failure mode most enterprise teams have not fully internalized. Feeding a model a 100-page document does not mean the model reads all of it equally. Research consistently shows that information buried in the center of a long context window is systematically underweighted, a phenomenon known as the "lost in the middle" problem. Teams believe they are getting full-document analysis. They are often getting the opening and the conclusion.
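Section-packed grounding is the structural answer to this failure: instead of dumping the corpus into one prompt, each section's context is assembled only from evidence scoped to it, within a budget. The function below is a simplified sketch under assumed data shapes, not ADORE's packing algorithm.

```python
def pack_section_context(section: str, evidence: list, budget_chars: int = 2000) -> str:
    """Assemble a context window from evidence scoped to one report section,
    preserving citations, and stopping at the character budget."""
    scoped = [e for e in evidence if e["section"] == section]
    packed, used = [], 0
    for e in scoped:
        chunk = f'[{e["doc_id"]}] {e["excerpt"]}'
        if used + len(chunk) > budget_chars:
            break  # budget reached: trigger another pass rather than truncate silently
        packed.append(chunk)
        used += len(chunk)
    return "\n".join(packed)
```

A small, relevant context keeps every passage near the edges the model attends to, rather than buried in the middle of a 100-page dump.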

ADORE is designed around the premise that all three failures share a common root: the system has no mechanism to know what it does not know, and no way to go back and fix it once it proceeds.

How the System Decides When It Knows Enough

The workflow ADORE runs is closer to a research project management system than a chatbot. The sequence matters because each stage uses the output of the previous one to constrain what happens next.

  1. Query clarification before retrieval. The Grounding Agent intercepts the request before any retrieval begins. It investigates the user's environment, formulates clarifying questions, and converts a vague brief like "analyze our vendor risk" into a concrete research specification with defined sections, success criteria, and constraints. This step eliminates misaligned outputs before they have a chance to propagate downstream.

  2. Human-editable research plan. The Planning Agent converts the refined brief into a section-by-section outline, presented to the user as an editable artifact. Humans stay in the loop before heavy retrieval begins. If the CFO wants more weight on regulatory exposure and less on operational risk, that preference is baked in before a single document is retrieved.

  3. Iterative retrieval with self-auditing. The Execution Agent runs searches, then immediately runs a reflection step that compares retrieved evidence against the research outline and identifies which sections are underfed. A Self-Evolution Engine then rewrites search queries based on what is missing, not just what was originally asked. The system uses what it found to figure out what it still needs.

  4. Memory-locked report generation. The Report Generation Agent writes each section using only the Memory Bank evidence scoped to that section. Every claim maps to a citation. Every citation maps to a source document. The agent cannot write around missing evidence; it must go find more.

  5. Evidence-driven stop signal. The loop terminates when section-level coverage meets the plan's requirements. Not when a fixed iteration count is reached. Not when a timer expires. This is a meaningful departure from systems that run a preset number of passes and hope the coverage is sufficient.
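The retrieval loop in steps 3 through 5 can be sketched as follows. The thresholds, the query-rewriting shortcut, and the toy `retrieve` interface are all assumptions for illustration; the essential property is that termination is driven by per-section coverage, not an iteration count.

```python
def research_loop(plan, retrieve, min_sources=3, max_rounds=10):
    """Retrieve iteratively until every planned section has enough evidence.
    `plan` is a list of section names; `retrieve(query)` returns documents."""
    evidence = {section: [] for section in plan}
    for _ in range(max_rounds):
        # Reflection step: audit coverage against the outline.
        thin = [s for s in plan if len(evidence[s]) < min_sources]
        if not thin:
            return evidence  # evidence-driven stop signal: standard met
        # Targeted follow-up: query rewriting driven by what is missing.
        for section in thin:
            evidence[section].extend(retrieve(f"more on {section}"))
    return evidence  # budget exhausted; thin sections remain visible, not hidden
```

Even on the failure path, the gap stays observable: a section with too few sources comes back short instead of being papered over by fluent prose.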

What This Looks Like Inside a Financial Services Due Diligence Team

Consider a mid-market investment bank where junior analysts currently spend 40 or more hours preparing due diligence memos per deal. The process is slow, quality varies significantly by analyst, and senior partners spend a meaningful portion of their time fact-checking citations before anything goes to a client.

Before: A partner submits a vague deal brief. Analysts retrieve documents from multiple internal repositories, synthesize findings manually, and produce a report that goes through two or three revision cycles before it is trustworthy enough to share. The environmental liability section is thin because nobody flagged that it needed a targeted secondary search. A senior partner catches it in review, sends the analyst back. Three more hours.

After: The Grounding Agent clarifies scope, sector focus, key risk dimensions, and output format in under ten minutes, before any retrieval has run. The Execution Agent runs parallel searches across internal deal databases, regulatory filings, and public financials. During the reflection step, it identifies that the environmental liability section has insufficient source coverage and triggers a targeted secondary search automatically. The resulting memo has every claim linked to a specific document and page number, stored in the Memory Bank. Legal and compliance reviewers audit any sentence in the report back to its source in seconds, rather than recreating the citation trail from scratch.

The measurable shift: Memo preparation time drops substantially. Compliance review shortens because reviewers are checking an existing audit trail rather than building one. Senior partner time shifts from fact-checking to judgment and client work. The same pattern applies anywhere professionals currently spend disproportionate time assembling and verifying documented evidence: strategy consulting, pharmaceutical research synthesis, procurement vendor analysis, policy review in regulated industries.

The Infrastructure Decision That Either Compounds or Resets

This paper describes something deployable, not purely theoretical. The sequencing question is how to test it without betting the enterprise AI roadmap on a single implementation.

Phase 1: Prove the architecture gap is real (Weeks 1-8). Pick one report type currently produced manually with a clear quality standard. Deal memos, regulatory summaries, and market assessments work well because they have existing review processes with measurable revision cycles. Baseline precisely: how long it takes, how many revision cycles it requires, how often citation errors surface in review. Run ADORE-style agentic retrieval on 15 to 20 real tasks. The metric to watch is not speed. It is whether the system catches retrieval gaps that your current one-pass system misses. If it does, you have evidence of structural improvement, not marginal quality gain. Failure looks like a pilot where every retrieved document was one the analyst would have found anyway.

Phase 2: Scale within one function and build the integration layer (Weeks 9-24). Expand to the full team that owns the validated use case. The Memory Bank architecture requires permissioned access to your internal document repositories. This is where the data and security conversation with your infrastructure team needs to happen before deployment, not after. Build a human review protocol for high-stakes outputs: a step where a reviewer spot-checks Memory Bank linkages before documents are finalized. This is your risk layer. Failure looks like a model that retrieves the right documents but cannot access the proprietary corpus where the most critical evidence lives.
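A spot-check protocol like the one described above can be as simple as sampling claims from a finished report and verifying each resolves to at least one Memory Bank source. The function below is an illustrative sketch under assumed structures, not a prescribed implementation.

```python
import random

def spot_check(report_claims, memory_bank, sample_size=5, seed=0):
    """Sample claims from a report and return any that do not
    trace back to evidence in the memory bank (dict: claim -> sources)."""
    rng = random.Random(seed)  # seeded so the review sample is reproducible
    sample = rng.sample(report_claims, min(sample_size, len(report_claims)))
    unlinked = [c for c in sample if not memory_bank.get(c)]
    return unlinked  # empty list: all sampled claims trace to sources
```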

Phase 3: Build for compounding, not just coverage (Months 6-18). At enterprise scale, the compounding value is in the Memory Bank itself. An evidence store built during a deal memo should be queryable by the next analyst working in the same sector. A regulatory summary produced for one market should inform the next. Plan your infrastructure for this from the beginning, because retrofitting a stateless deployment into a persistent evidence architecture is significantly harder than building for it. The metrics that matter at this stage are reduction in senior reviewer hours, decrease in citation error rates caught during review, and improvement in decision cycle times for research-intensive functions.

The organizations that pull ahead in enterprise AI adoption will not be the ones running the most powerful models. They will be the ones that recognized, earlier than their competitors, that trustworthy AI output is an architecture decision, not a model selection decision.