Tools vs. Skills vs. CLI vs. MCP vs. A2A

Why these approaches get conflated

“Connecting agents” is overloaded. People use the phrase to mean at least four different integration problems—often in the same system—so it’s easy to compare apples to oranges.

One problem is how an LLM triggers an action (e.g., “call this function with these arguments”). That’s the territory of tools (tool calling / function calling) in most commercial LLM APIs.

A second problem is how you package repeatable expertise/workflows so an agent can do a task reliably without you hardcoding every step in your orchestration layer. That’s where filesystem-based skills (typically SKILL.md + optional scripts/resources) fit: they’re an instruction-and-assets distribution mechanism with progressive disclosure to control context usage.

A third problem is how an agent talks to external systems in a vendor-neutral way, so multiple agent hosts/clients can reuse the same connectors. That’s what MCP targets: standardizing access to tools, resources, and prompts over a JSON-RPC-based protocol with defined transports, an authorization spec, and security guidance.

A fourth problem is how agents talk to other agents as peers (not as “just another tool”), including discovery, task lifecycle, and modality negotiation. That’s what A2A is designed for, and its own docs explicitly position it as complementary to MCP (MCP = tool/context access; A2A = agent collaboration).

Finally, “API specs” (usually OpenAPI) are not a connectivity layer by themselves; they’re a contract surface you can compile into tool definitions (for direct tool calling) or into an MCP server (manual or generated).

The upshot: these are not mutually exclusive. Many production stacks end up using A2A for agent-to-agent coordination, MCP for agent-to-system access, tools for atomic actions, and skills for workflow packaging and token-efficient instruction loading, with OpenAPI as a “source of truth” feeding one or more of those layers.

A practical comparison framework

A useful comparison is less “which is best?” and more “which layer does this solve, and what are the trade-offs?” The aspects below are the ones that repeatedly show up in real-world docs, security guidance, and postmortems:

  • Primary integration target (system tools vs workflows vs agent peers)
  • Token footprint drivers (tool schema preload, instruction load, intermediate results, caching/deferral)
  • Determinism and type-safety (strict schemas, stable error handling, validation)
  • Security model (credential isolation, supply-chain risk, prompt injection exposure, auth standards)
  • Interop and governance (open standard vs vendor feature; portability across hosts)
  • Observability and operational fit (auditing, rate limits, approvals, idempotency, long-running tasks)

A compact “layer map” that keeps comparisons honest:

  • Tools standardize structured action invocation from an LLM (functions and built-in tools). Misused: too many tools → token/latency tax, tool confusion, brittle multi-step chains.
  • Skills standardize instruction-plus-assets packaging with progressive disclosure via filesystem/shell/scripts. Misused: supply-chain and shell/script risk; “instructions as code” non-determinism.
  • API specs (OpenAPI) standardize a machine-readable contract for generating tool surfaces or clients. Misused: endpoints ≠ agent intents; auto-wrapping yields massive catalogs of low-level tools.
  • MCP standardizes a vendor-neutral protocol for tools/resources/prompts, plus transports and auth guidance/spec. Misused: context bloat from tool definitions/results; unsafe remote servers or prompt injection via external content.
  • A2A standardizes a vendor-neutral protocol for agent discovery, task lifecycle, modality negotiation, and updates. Misused: treated as a generic RPC layer without task/identity discipline; unclear trust boundaries without proper security schemes.

This layer map is synthesized directly from the normative docs/specs plus ecosystem writeups that document the recurring operational issues (schema bloat, multi-step brittleness, security exposure).

Tools and “direct API” integration

Tools are the lowest-latency conceptual bridge between a model and an action: the model outputs a structured tool call; something executes it; results come back; the model continues.

Where tools run matters

A persistent design fork is client-executed vs provider-hosted tool execution.

With classic function/tool calling, the model emits a call and your application executes it, then sends tool output back to the model in a subsequent request (or a later step in an agent loop).
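
To make that loop concrete, here is a minimal sketch in Python using the OpenAI-style Chat Completions tool-calling surface. The get_weather tool, its stub implementation, and the model name are illustrative assumptions, not anything prescribed by the docs discussed above.

```python
import json
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()

# Hypothetical tool surface: one intent-shaped function with a strict schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city: str) -> dict:
    # Stand-in for your real API call; credentials stay inside your app.
    return {"city": city, "temp_c": 18, "conditions": "overcast"}

messages = [{"role": "user", "content": "What's the weather in Oslo?"}]
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = response.choices[0].message

# The model only *proposes* calls; this code executes them and loops results back.
if msg.tool_calls:
    messages.append(msg)
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = get_weather(**args)  # dispatch on call.function.name in real code
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": json.dumps(result)})
    final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
    print(final.choices[0].message.content)
```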

By contrast, some “built-in tools” are executed inside the provider’s orchestration environment (web search, file search, code interpreter, computer/shell environments, or remote MCP calling depending on the platform). For example, one vendor describes an orchestrated loop where the API forwards commands into a container runtime and streams output back into the model context, including controls like bounded output and parallel sessions.

This execution choice affects:

  • Security boundaries (credentials stay with your app vs handled by a hosted connector/server).
  • Latency and caching (hosted systems can optimize the loop; client loops give you full control but require more engineering).
  • Observability and approvals (client-side is easier to gate; hosted surfaces often provide their own approval/allowlisting interfaces).

“Tool sets” in practice: deferring and searching tools

A modern response to “too many tools” is deferred loading + tool search, which keeps only a high-level searchable surface in context and loads detailed schemas on demand.

One vendor’s tool-search guide is explicit about the motivation (reduce upfront token cost), the mechanism (mark functions/namespaces/MCP servers as defer_loading), and the practical guidance (“use namespaces or MCP servers” as the search surface; keep namespaces small).

This is a key bridge between “tools” and “MCP”: once you defer entire MCP servers as the searchable unit, you’re treating each MCP server as a tool set that can be lazily expanded by the model.
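
A host-side sketch of the pattern, independent of any vendor’s exact API: the model initially sees only a search meta-tool, and full schemas are injected into the active tool list only after a match. The catalog entries and the search_tools surface are hypothetical.

```python
# Hypothetical catalog: name -> (one-line summary, full JSON schema).
CATALOG = {
    "jira_create_issue": ("Create a Jira issue",
                          {"type": "object",
                           "properties": {"title": {"type": "string"}},
                           "required": ["title"]}),
    "jira_search": ("Search Jira issues",
                    {"type": "object",
                     "properties": {"jql": {"type": "string"}},
                     "required": ["jql"]}),
    # ...hundreds more entries in a real deployment
}

# The only tool preloaded into context: a searchable surface over the catalog.
SEARCH_TOOL = {
    "name": "search_tools",
    "description": "Find tools by keyword; matching tool schemas are then loaded.",
    "parameters": {"type": "object",
                   "properties": {"query": {"type": "string"}},
                   "required": ["query"]},
}

def search_tools(query: str, active_tools: list) -> list[str]:
    """Match catalog summaries and expand the in-context tool list on demand."""
    hits = [name for name, (summary, _) in CATALOG.items()
            if query.lower() in summary.lower()]
    for name in hits:
        summary, schema = CATALOG[name]
        active_tools.append({"name": name, "description": summary,
                             "parameters": schema})
    return hits  # token cost is paid only for tools the task actually needs

active = [SEARCH_TOOL]
print(search_tools("jira", active), len(active))  # two hits; three active tools
```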

When “direct API calls” are the right answer

“Direct API calls” can mean two different things:

  1. Model proposes a tool call; your code calls the API. This is still “tools,” but you keep the API knowledge in your app, not in the model context.
  2. Model reasons over an API contract (OpenAPI) and calls endpoints via generated functions/tools.

The first approach is often best when you want high-level, intention-shaped operations (e.g., refund_customer_by_email) rather than exposing low-level endpoints. Multiple ecosystem critiques point out that “atomic, discoverable APIs” are great for humans but expensive for agents because each choice and each multi-step chain imposes token + latency costs.
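
As a sketch of what “intention-shaped” means in practice, here is a hypothetical refund_customer_by_email tool whose multi-endpoint composition lives in application code rather than in the model’s context. The schema, enum values, and stubbed steps are all illustrative.

```python
# Hypothetical intent-level tool: one call that hides a multi-endpoint flow
# (look up customer -> find charge -> issue refund) inside application code.
REFUND_TOOL = {
    "type": "function",
    "function": {
        "name": "refund_customer_by_email",
        "description": "Refund the most recent charge for the customer with "
                       "this email. Requires a reason.",
        "parameters": {
            "type": "object",
            "properties": {
                "email": {"type": "string", "format": "email"},
                "reason": {"type": "string",
                           "enum": ["duplicate", "requested_by_customer", "fraud"]},
            },
            "required": ["email", "reason"],
        },
    },
}

def refund_customer_by_email(email: str, reason: str) -> dict:
    # Each step below would be a separate low-level endpoint if exposed 1:1;
    # here they are stubbed so the sketch runs standalone.
    customer_id = f"cus_{hash(email) % 10_000}"      # stub: GET /customers?email=...
    charge_id = f"ch_{customer_id}_latest"           # stub: GET /charges?customer=...
    return {"refunded": charge_id, "reason": reason} # stub: POST /refunds
```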

The second approach is attractive when you already have a robust OpenAPI spec and want fast prototyping. OpenAI has long published examples converting OpenAPI specs into function/tool definitions.
But multiple practitioners warn that doing this naively (whether as direct function tools or as auto-generated MCP tools) tends to create huge catalogs of low-level operations, which increases context load and can worsen tool-selection reliability.
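
A minimal sketch of that conversion step, using a toy inline spec rather than any real API: each OpenAPI operation compiles to one function tool, which is exactly how auto-wrapping produces large, low-level catalogs.

```python
# Toy OpenAPI fragment; illustrative only.
SPEC = {
    "paths": {
        "/pets/{petId}": {
            "get": {
                "operationId": "getPetById",
                "summary": "Find a pet by ID",
                "parameters": [
                    {"name": "petId", "in": "path", "required": True,
                     "schema": {"type": "integer"}},
                ],
            }
        }
    }
}

def operation_to_tool(path: str, method: str, op: dict) -> dict:
    """Compile one OpenAPI operation into a function-tool definition."""
    props, required = {}, []
    for p in op.get("parameters", []):
        props[p["name"]] = p["schema"]
        if p.get("required"):
            required.append(p["name"])
    return {
        "type": "function",
        "function": {
            "name": op["operationId"],
            "description": op.get("summary", f"{method.upper()} {path}"),
            "parameters": {"type": "object", "properties": props,
                           "required": required},
        },
    }

tools = [operation_to_tool(path, method, op)
         for path, methods in SPEC["paths"].items()
         for method, op in methods.items()]
print(len(tools))  # one tool per operation; a large spec yields a large catalog
```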

Skills as workflow packaging

Skills are best understood as a distribution format for agent competence: instructions plus optional scripts/resources, loaded only when relevant.

The core mechanism: progressive disclosure

In the Claude skills docs, progressive disclosure is described as a three-level system: load small metadata for each skill at startup; load the SKILL.md body only when triggered; and keep larger resources in the filesystem, where scripts can be executed and only their outputs enter context.

A separate vendor’s “agent skills” docs describe essentially the same idea: metadata is visible first; the full SKILL.md is loaded only when the skill is selected; skills are directories containing SKILL.md plus optional scripts/references/assets; and the ecosystem is positioned as an open standard.
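
A rough sketch of how a host might implement the three levels. The frontmatter parsing and directory layout here are simplified assumptions, not a normative loader.

```python
from pathlib import Path

def read_frontmatter(skill_md: Path) -> dict:
    """Naive YAML-frontmatter parse; enough for flat name:/description: keys."""
    meta, in_fm = {}, False
    for line in skill_md.read_text().splitlines():
        if line.strip() == "---":
            if in_fm:
                break          # closing delimiter: stop before the body
            in_fm = True       # opening delimiter
            continue
        if in_fm and ":" in line:
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip()
    return meta

# Level 1: only name/description for every skill enters the system prompt.
def startup_index(skills_root: Path) -> list[dict]:
    return [read_frontmatter(p) | {"path": str(p)}
            for p in skills_root.glob("*/SKILL.md")]

# Level 2: the full SKILL.md body is loaded only when the skill triggers.
def load_skill_body(skill_md_path: str) -> str:
    return Path(skill_md_path).read_text()

# Level 3: bundled scripts/references stay on disk; the agent shells out to
# them, and only their *outputs* (not their source) enter the context.
```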

This approach is fundamentally about token economics and instruction reliability:

  • Token economics: load only what you need.
  • Instruction reliability: skills can encode a stable SOP (checklists, guardrails, “when to trigger”) that a generic base agent might not infer consistently.

“Leanest implementation” really means “filesystem + shell”

A skill system becomes particularly lean when the agent already has filesystem access and a shell: you can package workflows as scripts and docs, and let the model run commands through a shell tool to fetch live data, call CLIs, and transform results.

That same vendor’s shell-tool article emphasizes a crucial point: the model only proposes commands; an orchestrator executes them and loops the results back. It also highlights controls like bounded output to keep logs from consuming context.
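
A sketch of that orchestrator contract, with an illustrative allowlist and output cap. Both are policy choices assumed for this example, not vendor defaults.

```python
import subprocess

MAX_OUTPUT_CHARS = 4_000  # bounded output: keep long logs from flooding context
ALLOWED = {"ls", "cat", "grep", "python3"}  # crude illustrative allowlist

def run_proposed_command(command: str) -> str:
    """Execute a model-proposed command and return a context-safe result.

    The model never executes anything itself; this orchestrator does, and it
    can refuse, sandbox, or require approval before running anything."""
    parts = command.split()
    if not parts or parts[0] not in ALLOWED:
        return f"refused: command is not on the allowlist: {command!r}"
    proc = subprocess.run(command, shell=True, capture_output=True,
                          text=True, timeout=30)  # raises on hangs past 30s
    output = proc.stdout + proc.stderr
    if len(output) > MAX_OUTPUT_CHARS:
        output = (output[:MAX_OUTPUT_CHARS]
                  + f"\n[truncated {len(output) - MAX_OUTPUT_CHARS} chars]")
    return output  # only this bounded string re-enters the model context

print(run_proposed_command("ls -la"))
print(run_proposed_command("rm -rf /"))  # refused by the allowlist
```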

The hard trade-off: flexibility vs attack surface

Skills intentionally blur “prompt” and “code.” That power shows up directly in threat modeling and supply-chain research.

Snyk’s “ToxicSkills” research (February 2026) reports scanning thousands of publicly available skills and finding a meaningful fraction with critical security issues, including malicious payloads, prompt injection risk, and exposed secrets; it explicitly frames skills as a supply-chain security problem because they can inherit shell/filesystem/API access from the host agent.
Separate Snyk writeups also focus on how trivially a SKILL.md plus shell execution can become a remote code execution pathway in poorly controlled environments.

This security posture is not an argument against skills; it’s an argument for treating them like packages with capabilities, requiring the same kinds of controls you’d apply to plugin ecosystems: provenance, sandboxing, least privilege, and auditing.

MCP for tool and context interoperability

MCP is explicitly framed as a “USB-C for AI applications” style standard: a way for AI hosts/clients to connect to external tools, data sources, and workflows through a shared protocol surface.

What MCP standardizes

At the protocol level, MCP uses JSON-RPC as its message encoding and defines standard transports including stdio and streamable HTTP.
The spec’s overview enumerates scope such as lifecycle management, authorization for HTTP transports, and server features (resources/prompts/tools).
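
To make the wire format concrete, here is what the handshake and a tool call look like as JSON-RPC 2.0 messages. The method names follow the MCP spec; the search_issues tool and the exact protocolVersion string are illustrative assumptions.

```python
import json

# MCP handshake and a tool call as JSON-RPC 2.0 messages.
initialize = {
    "jsonrpc": "2.0", "id": 1, "method": "initialize",
    "params": {
        "protocolVersion": "2025-06-18",  # assumption: use your spec revision
        "capabilities": {},
        "clientInfo": {"name": "example-host", "version": "0.1"},
    },
}
list_tools = {"jsonrpc": "2.0", "id": 2, "method": "tools/list"}
call_tool = {
    "jsonrpc": "2.0", "id": 3, "method": "tools/call",
    "params": {"name": "search_issues",  # hypothetical server-provided tool
               "arguments": {"query": "open bugs"}},
}

# Over the stdio transport, each message is a newline-delimited JSON object
# written to the server's stdin; responses come back the same way on stdout.
for msg in (initialize, list_tools, call_tool):
    print(json.dumps(msg))
```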

The adoption narrative is now also institutional: Linux Foundation press announcements describe MCP being anchored inside the Agentic AI Foundation, with founding contributions including MCP and other agent-adjacent standards.
Anthropic also describes donating MCP into that foundation, co-founded with Block and OpenAI, with support from Google, Microsoft, Amazon Web Services, Cloudflare, and Bloomberg.

That governance shift matters in practice because MCP is increasingly treated as an integration substrate across ecosystems, including major AI platforms’ “remote MCP server” support.

The “N×M” promise vs the “token bloat” reality

MCP’s architectural pitch is to reduce custom one-off connectors (many agent hosts × many systems) into a standard protocol interaction pattern (clients talk to servers via MCP).

However, MCP’s most visible operational critique is that naïve clients/hosts often load all tool definitions up front, and that intermediate results are repeatedly streamed through the model context, driving cost, latency, and sometimes failure on large artifacts.
Anthropic’s own engineering writeup sketches a concrete token-cost scenario: routing a long transcript through the model loop can add tens of thousands of tokens and can exceed context limits.

Mitigations that actually work

A pattern that is repeatedly recommended (by protocol implementers and platform vendors) is some form of lazy or selective tool exposure:

  • Tool search / deferred loading at the host/tooling layer, so the model sees only a high-level server/namespace description until it needs details.
  • Curated, intent-level tools instead of 1:1 endpoint exposure, because agents are weak at brittle multi-step flows and pay a high tax per tool call.
  • Compute-side filtering/aggregation (code execution modes) to keep large intermediate results out of the model context and return only summaries or selected slices; see the sketch after this list.
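
A sketch of the compute-side filtering idea: the connector’s full result stays in the execution environment, and only a bounded slice re-enters context as the tool result. The transcript fetcher here is a stand-in for a real connector call.

```python
def fetch_transcript(meeting_id: str) -> str:
    # Stand-in for a connector call whose raw result would cost tens of
    # thousands of tokens if streamed through the model context.
    return "\n".join(f"[{i:05d}] speaker text line {i}" for i in range(5_000))

def tool_result_for_model(meeting_id: str, keyword: str,
                          max_lines: int = 20) -> str:
    transcript = fetch_transcript(meeting_id)   # never shown to the model
    hits = [ln for ln in transcript.splitlines() if keyword in ln][:max_lines]
    return (f"{len(hits)} matching lines (showing up to {max_lines}):\n"
            + "\n".join(hits))

# Only this small summary enters the model context, not the raw artifact.
print(tool_result_for_model("m-123", "line 49"))
```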

These mitigations also show up as platform guidance. For example, OpenAI’s remote MCP server documentation explicitly frames MCP servers as a powerful extension mechanism but devotes substantial space to risks (prompt injection via untrusted content, malicious servers, data leakage) and recommends strong authentication (OAuth, dynamic client registration) and careful trust decisions (prefer official servers; minimize sensitive data exposed in tool metadata).

Benchmarks and empirical signals about MCP in the wild

A 2025 research paper on “making REST APIs agent-ready” provides unusually concrete data: it reports mining GitHub and identifying 22,722 MCP-tagged repositories in the six months after MCP’s release, but only 1,164 that contained functional server implementations, suggesting both rapid interest and substantial “boilerplate/implementation effort” friction.
The same work introduces a compiler (AutoMCP) that generates MCP servers from OpenAPI specs and evaluates it on 50 APIs (>5,000 endpoints), finding recurring failure modes largely driven by incomplete or inconsistent specs.

This aligns with practitioner guidance: auto-generation is useful for bootstrapping, but production-quality MCP tool surfaces usually require curation and intent-shaping.

A2A for agent-to-agent collaboration

A2A is an open protocol explicitly designed to let agents collaborate “as agents,” even when they do not share internal memory, tools, or context. The original Google announcement frames it as complementary to MCP and oriented toward large-scale multi-agent deployments.
The Linux Foundation launch announcement similarly emphasizes secure agent-to-agent communication, discovery, and collaboration across platforms, vendors, and frameworks.

What A2A does that MCP can’t

MCP can expose an “agent-like service” as a tool server, but MCP’s normative surface is still tools/resources/prompts for a single host-model loop; it does not standardize peer-agent collaboration semantics like task objects, agent cards, or modality negotiation.
A2A’s spec and reference materials make these peer semantics first-class:

Agent discovery and capability advertisement (AgentCard).
A2A standardizes discovery via “Agent Cards” that describe capabilities and connection information, and the spec includes explicit capability validation rules (e.g., how clients should interpret streaming/push-notification capability flags) plus security scheme declarations in the AgentCard.
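
An illustrative AgentCard, following the general shape described in the spec. The agent, URL, and skill are hypothetical, and the normative field set should be taken from the spec itself.

```python
# Illustrative AgentCard as a Python dict (serialize to JSON for publication).
agent_card = {
    "name": "invoice-analyzer",
    "description": "Extracts and reconciles line items from invoice documents.",
    "url": "https://agents.example.com/a2a/invoice-analyzer",
    "version": "1.2.0",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "defaultInputModes": ["text/plain", "application/pdf"],
    "defaultOutputModes": ["application/json"],
    "securitySchemes": {"bearer": {"type": "http", "scheme": "bearer"}},
    "skills": [
        {
            "id": "reconcile",
            "name": "Reconcile invoices",
            "description": "Match invoice line items against purchase orders.",
        }
    ],
}
# Per the spec's capability validation rules, a client should not attempt
# streaming against this agent unless capabilities["streaming"] is true.
```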

Task-oriented collaboration, not single function calls.
A2A defines a Task object with lifecycle/state, and treats collaboration as task fulfillment. The Hugging Face explainer highlights the task lifecycle and names the output object (Artifact).
By contrast, MCP tools are invoked as discrete operations (even if the tool itself triggers a long-running process), and the protocol’s “unit” is a tool/resource/prompt interaction rather than a standardized, cross-agent task lifecycle.
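
A sketch of the task-centric shape. The state names follow those used in the A2A spec, though the exact enumeration varies by spec version, and the task payload below is illustrative.

```python
from enum import Enum

class TaskState(str, Enum):
    """Representative task lifecycle states named in the A2A spec
    (assumption: the full set may differ across spec versions)."""
    SUBMITTED = "submitted"
    WORKING = "working"
    INPUT_REQUIRED = "input-required"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

# A task-shaped exchange: the client sends a message, the server returns a
# Task whose status evolves over time; outputs arrive as Artifacts attached
# to the task rather than as a single function-call return value.
task = {
    "id": "task-42",
    "status": {"state": TaskState.WORKING.value},
    "artifacts": [],  # filled in as the remote agent produces output
}
print(task["status"]["state"])  # "working"
```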

Modality and UX negotiation.
A2A messages are explicitly built from “parts” with content types, allowing agents/clients to negotiate formats and UI features.
This is materially different from MCP’s focus on tool schemas and resource/prompt retrieval; MCP does not define a peer negotiation mechanism for “what UI modalities do we both support for this task?” as a first-class interoperable concept.

Streaming updates and asynchronous delivery as protocol objects.
A2A specifies streaming events for task status and artifact updates and supports push notification configuration.
MCP can stream via its transport mechanisms, but it does not define a standardized cross-agent “task status update event” model or a native push-notification control plane the way A2A does.

Multiple bindings beyond JSON-RPC.
A2A defines multiple standard protocol bindings, including JSON-RPC and gRPC, plus an HTTP+JSON/REST binding with SSE streaming described in the spec.
MCP’s spec similarly defines transports/bindings, but the point here is what is being bound: in A2A it’s a task-and-agent-collaboration model; in MCP it’s a tool/resource/prompt model.

Summarizing the boundary in one line: MCP answers “how do I give a model access to tools and context?” while A2A answers “how do I make agents discover each other and collaborate on tasks without becoming each other’s tools?”

When direct API calls make sense vs wrapping in MCP vs describing in skills

These choices are most coherent when you treat them as different points on a spectrum of intent-shaping vs reuse vs operational overhead.

Direct API calls (via your orchestration code) usually win when:

You want to expose a small number of high-level actions with strict validation and clear approvals, and you don’t need cross-host interoperability for that integration. This matches the standard “tool calling flow” where the model emits a call and your application executes it.
It also aligns with repeated critiques that agents do poorly when forced to chain many low-level API endpoints; putting the composition burden in your code can reduce tool calls and context pollution.

Wrapping APIs in MCP servers tends to make sense when:

You want one connector surface that multiple hosts/agent frameworks can reuse, you want to standardize discovery/metadata/security policies around the integration, or you need to expose not just actions but also resources/prompts in a consistent way.
The empirical AutoMCP work suggests there is real engineering cost in manual MCP development (boilerplate, low-churn single-maintainer implementations), which is why MCP server generation from OpenAPI is attractive, but it also highlights that spec quality becomes a limiting factor.

A practical “middle way” that shows up across practitioner guidance is: use OpenAPI-to-MCP to bootstrap, then curate into intent-level tools and implement selective exposure so the agent doesn’t ingest thousands of endpoints.

Describing API usage in SKILL.md (or skills in general) tends to make sense when:

You have a shell/CLI-capable agent environment and the integration is best expressed as a workflow (“run these commands; parse results; apply policy checks”), especially when you can hide heavy logic in scripts whose outputs (not source code) enter context.
This can be extremely token-efficient for broad competencies because only metadata is always loaded and detailed instructions/scripts are pulled in on demand.

But the security posture is fundamentally different: skills can become a supply-chain and shell-execution risk surface, and recent ecosystem scanning suggests a non-trivial rate of vulnerable or malicious skills in public registries.
So, skills are most appropriate when you can enforce provenance and sandboxing (enterprise policy, curated registries, least privilege), not as an unvetted “download and run” ecosystem.

The “A2A + MCP + tools/skills” architecture that actually scales

Many sources converge on a layered architecture:

  • Use A2A to find and coordinate specialized agents (task delegation, updates, artifacts).
  • Inside each specialized agent, use MCP (or native tool calling) to access external systems with standardized controls.
  • Use tool search / deferred loading and progressive disclosure (skills) to keep context lean as the integration surface grows.

This is also consistent with A2A’s own reference materials, which explicitly teach A2A–MCP complementarity and emphasize preserving opacity (agents collaborate without exposing internal tools/state).