skill-orchestration · multi-agent · DAG-planning · tool-use · benchmarking · retrieval

Agent Skills Don't Compound. The Framework That Changes That.

AgentSkillOS shows that structured skill composition via tree-based retrieval and DAG orchestration dramatically outperforms flat skill provisioning, even when agents have access to identical tools.

March 17, 2026 · 8 min read

Source Paper

Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, Shuyue Hu · Shanghai Artificial Intelligence Laboratory


Your AI Agent Has 280,000 Skills and Still Produces a Mediocre Deliverable

Your team spent three months curating an AI agent toolkit. Skills for data analysis, document formatting, visualization, web scraping, presentation generation. The demos worked. Production results look like a competent intern's first draft, not the polished, multi-format output you were promised. You add more tools. The results do not improve. In some cases they get worse.

This is not a model quality problem. It is a composition problem, and it gets structurally worse as your tool library grows. When an AI agent is given a flat list of hundreds of skills, it cannot see most of them. The ecosystem scales. The results do not.

Researchers at Shanghai Artificial Intelligence Laboratory published AgentSkillOS in March 2026, the first principled framework specifically designed to manage, retrieve, and orchestrate agent skills at ecosystem scale. The context that makes this paper urgent: as of late February 2026, more than 280,000 Claude agent skills were publicly available on open marketplaces, the overwhelming majority built and maintained by decentralized, third-party contributors. The paper's headline finding is precise and should unsettle every enterprise AI leader who has been investing in tool library expansion. Tested across ecosystem sizes from 200 to 200,000 skills, DAG-based orchestration substantially outperformed flat skill invocation even when both systems were given the exact same, hand-selected skills. Not a better library. Not a stronger model. The same skills, structured differently.

Why Composition Matters More Than the Size of Your Tool Library

The paper's central finding is counterintuitive enough to be worth stating twice. When the researchers gave the vanilla Claude Code agent the oracle skill set, meaning the ideal skills hand-selected by human experts who already knew which ones were needed, it still performed significantly worse than AgentSkillOS using DAG orchestration. The controlling variable was not which skills were available. It was whether those skills were composed in a structured, dependency-aware pipeline or invoked in a flat, unstructured sequence.

This reframes the enterprise AI investment question entirely. Most tool library spending is justified on the premise that more capabilities equal better outcomes. This paper provides direct experimental evidence that premise is wrong as a standalone strategy.

AgentSkillOS introduces two mechanisms to address this, and they work in sequence:

| Mechanism | What It Does | What Fails Without It |
| --- | --- | --- |
| Capability Tree | Organizes all skills into a hierarchical index built offline, before any task runs. The agent traverses this tree to find relevant skills rather than scanning a flat list. | Skills become statistically invisible as the library grows. An agent with 200,000 skills finds fewer relevant ones than an agent with 200 well-organized ones. |
| DAG Orchestration | Decomposes the task into subtasks, assigns each to a skill, and makes dependencies explicit. Skill B does not run until Skill A has produced the artifact B requires. Parallel subtasks run concurrently. | Skills are invoked in whatever sequence seems locally sensible. No skill knows what the others produced. The output reflects the last tool called, not a coordinated pipeline. |

The capability tree finding on its own is significant: tree-based retrieval approached oracle skill selection in experiments, meaning the skills it surfaced were nearly as good as those hand-picked by human experts. This matters operationally because it means you do not need a human curator maintaining a tight shortlist. The tree does that work automatically, and updates incrementally when new skills are added.

The Three Ways Flat Invocation Fails as Your Library Scales

The intuition behind the failure mode is worth holding carefully because it runs counter to how most organizations are currently investing.

Skills become invisible at scale. In a library of 200 skills, an agent can scan and reason about the full set. At 1,000 skills, signal degrades. At 200,000 skills, tools that are directly relevant to a task become statistically invisible without a navigational structure. The agent does not fail to use them because they are bad tools. It fails because it never finds them. Organizations respond by curating tighter shortlists, which works until the curation overhead becomes its own bottleneck.

Flat invocation transfers the composition problem to inference. Even when an agent finds the right skills, invoking them in a flat, unordered sequence means each skill operates without knowledge of what the others produced. There is no mechanism for Skill A's output to condition Skill B's inputs. The agent can theoretically compose. It does not compose in practice. You get the output of the last skill invoked, not the output of a coordinated pipeline.
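The difference is easy to see in miniature. In this sketch, the two skill functions are hypothetical stand-ins, not anything from the paper's implementation; the point is only that flat invocation drops the artifact a downstream skill needs:

```python
# Minimal contrast between flat invocation and a dependency-aware
# pipeline. Both "skills" are hypothetical stand-ins.

def analyze(data):
    # Skill A: produces a summary artifact.
    return {"mean": sum(data) / len(data)}

def make_chart(analysis):
    # Skill B: consumes Skill A's artifact.
    return f"chart(mean={analysis.get('mean')})"

# Flat invocation: each skill runs without seeing prior artifacts,
# so the chart skill never receives the analysis it depends on.
flat_outputs = [analyze([1, 2, 3]), make_chart({})]
# flat_outputs[1] == "chart(mean=None)"

# Pipelined invocation: Skill A's output conditions Skill B's input.
analysis = analyze([1, 2, 3])
chart = make_chart(analysis)
print(chart)  # chart(mean=2.0)
```

Both runs call the same two skills with the same underlying capability; only the wiring between them differs.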

The ecosystem's decentralized growth creates a governance gap. With the majority of the 280,000 available skills built by third-party contributors, enterprise platforms face overlapping functionalities, inconsistent naming, varying quality, and no global view of what the ecosystem can actually do. The paper explicitly flags malicious skill injection as an emerging threat vector. Without a structured management layer, the ecosystem is not just unnavigable. It is a security surface.

How the System Retrieves, Plans, and Executes

The AgentSkillOS workflow runs in three stages before a single skill is invoked. Understanding each stage clarifies what you would actually be building or evaluating if you adopted this approach.

  1. Capability tree traversal. Given a task, the agent traverses the tree layer by layer, selecting relevant branches before reaching individual skills. This is fundamentally different from keyword search or embedding similarity retrieval. The tree-guided approach allows the agent to surface non-obvious, complementary skills that pure semantic search would miss. The resulting candidate set is pruned for duplicates, ranked by relevance, and capped at a shortlist. Skills not included in the active tree are covered by a dormant index with vector similarity search as a fallback.
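A layer-by-layer traversal of a capability tree can be sketched in a few lines. The tree contents, the `relevant` predicate, and all skill names below are illustrative assumptions, not the paper's actual index; a real system would score branches with an LLM or embeddings rather than keyword matching:

```python
# Hedged sketch of tree-guided retrieval: select relevant branches
# layer by layer before reaching individual skills. Tree contents
# and skill names are invented for illustration.

TREE = {
    "data": {"analysis": ["portfolio_stats", "csv_profiler"]},
    "design": {"charts": ["bar_chart", "candlestick_chart"]},
    "documents": {"pdf": ["pdf_report_builder"]},
}

def relevant(label, task_keywords):
    # Placeholder relevance check; a production system would use
    # an LLM judgment or embedding score here.
    return any(k in label for k in task_keywords)

def traverse(tree, task_keywords):
    """Walk the tree top-down, pruning irrelevant branches early."""
    hits = []
    for branch, children in tree.items():
        if not relevant(branch, task_keywords):
            continue  # an entire subtree is skipped, never scanned
        for sub, skills in children.items():
            if relevant(sub, task_keywords):
                hits.extend(skills)
    return hits

print(traverse(TREE, ["data", "analysis"]))
# -> ['portfolio_stats', 'csv_profiler']
```

The key property is that the agent never scans the flat skill list: irrelevant subtrees are pruned at the branch level, which is what keeps retrieval tractable as the library grows toward ecosystem scale.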

  2. DAG construction with strategy selection. Once the relevant skills are identified, the system decomposes the task into subtasks and maps explicit dependencies between them. Three strategies produce structurally distinct execution graphs:

    • Quality-First: Adds preparation and refinement stages to maximize output polish. Best for external deliverables where quality is the constraint.
    • Efficiency-First: Maximizes parallel execution by identifying independent subtasks that can run concurrently. Best when speed or cost is the constraint.
    • Simplicity-First: Produces the most compact graph with only essential nodes. Best for rapid prototyping or cost-sensitive workflows.
  3. Dependency-managed execution. Nodes in the same DAG layer run in parallel. Nodes in different layers run sequentially, each receiving a structured prompt that specifies its inputs, its expected outputs, and how downstream skills will consume what it produces. Every generated artifact is saved with an execution summary. Orchestration plans are cached and reused for similar future tasks, compounding efficiency gains over time.
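The layering rule in step 3 can be sketched as follows. The graph shape and helper function are illustrative, not the paper's implementation; same-layer nodes are merely grouped here, where a real executor would dispatch them concurrently:

```python
# Sketch of dependency-managed execution: group DAG nodes into
# layers by topological depth. Nodes in the same layer have no
# dependencies on each other and could run in parallel.
from collections import defaultdict

def layers(deps):
    """deps maps node -> set of prerequisite nodes (a DAG)."""
    depth = {}

    def d(node):
        if node not in depth:
            depth[node] = 1 + max((d(p) for p in deps[node]), default=0)
        return depth[node]

    for node in deps:
        d(node)
    grouped = defaultdict(list)
    for node, k in sorted(depth.items()):
        grouped[k].append(node)
    return [grouped[k] for k in sorted(grouped)]

dag = {
    "analyze": set(),
    "charts": {"analyze"},
    "pdf": {"charts"},
    "dashboard": {"charts"},  # pdf and dashboard share a layer
}
print(layers(dag))
# -> [['analyze'], ['charts'], ['dashboard', 'pdf']]
```

Handing each layer to a thread or process pool (for example, Python's `concurrent.futures.ThreadPoolExecutor`) gives the concurrent execution described above while preserving the sequential ordering between layers.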

What This Looks Like When a Team Is Producing Complex Deliverables Daily

Consider a corporate finance operations team using an AI agent to produce weekly client portfolio review packages. The required deliverable combines data analysis of portfolio performance, client-facing charts, a formatted PDF report, and an internal interactive dashboard for the investment committee.

Without structured orchestration: The agent has access to a large skill library but invokes them in a flat sequence. It finds the data analysis skill. It probably misses the PDF formatting skill because the name uses different keywords than the query. It almost certainly never surfaces the interactive dashboard skill, which is buried under 50 similar-sounding visualization tools. The output is a functional but visually plain document. The dashboard never gets built. Someone on the team spends two hours cleaning up the output and building what the agent missed.

With AgentSkillOS: The capability tree surfaces all four relevant skill categories simultaneously: data processing, visual design, document creation, and web interaction. The DAG maps the dependency chain: data analysis must complete before chart generation; charts must be ready before the PDF is assembled; the PDF and the dashboard can be built in parallel under Efficiency-First orchestration. Each skill receives a structured prompt that tells it what came before and what comes after. The visualization skill knows the chart dimensions required by the PDF template. The PDF compilation step knows which sections require data overlays. The output is a coordinated, multi-format package. The orchestration plan is cached so the following week's report runs faster with less compute.
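The plan-caching behavior at the end of that scenario can be sketched minimally. The signature function below is a naive keyword set, purely for illustration; the paper does not specify how task similarity is computed:

```python
# Illustrative sketch of orchestration-plan caching: plans for
# similar tasks are reused instead of rebuilt from scratch.

plan_cache = {}

def task_signature(task):
    # Naive similarity key: the set of lowercase words. A real
    # system would use embeddings or an LLM-derived canonical form.
    return frozenset(task.lower().split())

def get_plan(task, planner):
    key = task_signature(task)
    if key not in plan_cache:
        plan_cache[key] = planner(task)  # expensive DAG construction
    return plan_cache[key]

calls = []
def planner(task):
    calls.append(task)
    return ["analyze", "charts", ["pdf", "dashboard"]]

get_plan("weekly portfolio review", planner)
get_plan("portfolio weekly review", planner)  # same signature: cache hit
print(len(calls))  # planner ran once
```

This is where the compounding efficiency gain comes from: the expensive planning step amortizes across every recurrence of a similar task.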

The researchers validated this qualitative gap with a 30-task benchmark spanning data computation, document creation, motion video, visual design, and web interaction. Evaluated through LLM-based pairwise comparison with position-bias mitigation, aggregated via a Bradley-Terry model, AgentSkillOS consistently achieved the highest scores across all ecosystem sizes. The vanilla baseline, with the same underlying model and access to the same skills, produced outputs that were measurably lower quality across every category.
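For readers unfamiliar with Bradley-Terry aggregation, the idea is to turn pairwise win counts into a single score per system via the classic iterative maximum-likelihood update. The win matrix below is invented for illustration and is not the paper's data:

```python
# Minimal Bradley-Terry fit using the standard iterative MLE update:
# p_i <- W_i / sum_j (n_ij / (p_i + p_j)), then normalize.
# Win counts are invented for illustration only.

def bradley_terry(wins, iters=200):
    """wins[i][j] = number of times system i beat system j."""
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins for system i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new.append(w_i / denom if denom else p[i])
        s = sum(new)
        p = [x / s for x in new]  # normalize for numerical stability
    return p

# Systems: [system A, system B], with A winning most comparisons.
wins = [[0, 24], [6, 0]]
scores = bradley_terry(wins)
print(scores[0] > scores[1])  # True: more pairwise wins -> higher score
```

The resulting scores induce a global ranking from purely pairwise judgments, which is why it pairs naturally with LLM-based pairwise comparison.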

The Investment Thesis Your AI Roadmap Is Missing

The finding that DAG orchestration outperforms flat invocation given identical skills is not just a research result. It is a direct critique of the spending pattern currently dominant in enterprise AI.

The question every AI leader should be asking right now is not "how many tools does our agent have access to?" It is "does our agent have an explicit mechanism for managing dependencies between tool invocations?" If the answer is no, adding more tools to the library will produce diminishing returns. The ceiling is set by the composition architecture, not the library size.

The strategic implication runs deeper than tooling decisions. Organizations that recognize this first will not just get better outputs from their current skill investments. They will build the infrastructure that makes every future skill addition actually compound. Each new capability slots into an orchestration layer that knows how to use it in context, knows what it depends on, and knows what depends on it. That compounding dynamic is structurally unavailable to organizations running flat invocation at scale.

The 280,000-skill ecosystem will keep growing. The organizations that treat their AI skill library as a managed, structured, governed asset, with hierarchical organization, dependency-aware execution, and continuous quality signals feeding back into the system, will widen their output quality gap with every new skill that enters the market. The ones that keep adding integrations to a flat list will find themselves in the same position in 18 months: more tools, no better results.