One Flexible Tool Beats a Hundred Dedicated Ones

If you wanted an LLM agent to talk to a system at the start of 2026, the default move was to install an MCP server for it.

GitHub. Jira. Slack. Linear. Postgres. Neo4j. Each one ships a server that exposes a tidy menu of tools (create_issue, list_pull_requests, merge_pull_request, get_repository, search_code, and so on), and you point your agent at it.

It’s a great onboarding experience. It’s also, for a surprising number of real workloads, the wrong shape.

The thesis is short: MCP design usually wraps each service as a pile of dedicated tools; a CLI hands the agent one really flexible tool. With today’s models, the flexible tool wins.

The two shapes ask the model to do different work. With a pile of dedicated tools, the agent just has to pick the right one off a menu. With a flexible tool, it has to figure out how to put the pieces together itself. That second part used to be the hard one. Models would hallucinate flags, lose the thread on long pipelines, and misread help text, so wrapping every operation in a pre-baked tool was a sensible defense. That just isn’t true anymore. Today’s models read a --help page or SKILL.md when they need to, know the canonical CLIs from training, string together bash without supervision, and retry when they get a flag wrong. The hard part got easy, the easy part was always easy, and all those neatly wrapped tools now mostly just bloat the model’s context for nothing.

Of course it’s not all roses and sunshine. Handing the agent a terminal also hands it a much bigger blast radius. The same flexibility that lets it compose gh | jq | xargs into something useful also lets a prompt injection talk it into something a lot worse than a hostile Cypher query. So yes, there’s a trade-off, and you have to actually think about it (sandbox, allowlist, separate OS user, read-only role on the database, the usual stuff).

But if you can give the agent a terminal in a reasonably safe way, the flexible side still comes out ahead.

Where the CLI shines

The same “wrap a service as a pile of dedicated tools” pattern shows up wherever MCP does. Postgres MCPs vs. psql. Kubernetes MCPs vs. kubectl. Filesystem MCPs vs. cat, ls, mv, grep glued by pipes. Same instinct every time, same CLI counterpart every time. And the same three failure modes too, because they aren’t really about any one product.

Nothing in the MCP spec actually requires this approach of piling up dedicated tools. The protocol asks for typed tools, nothing more; it says nothing about how narrow each tool has to be. Implementations just gravitate toward many small, narrow tools for historical reasons. You can build flexible tools that take a single expressive input the agent shapes however it wants, and most of the time you probably should.

To make it concrete, we’ll look at an example pitting the Neo4j MCP server against the Neo4j CLI.

Disclaimer up front: I work at Neo4j. The choice is just convenience, but the lessons apply to most other CLIs.

The Neo4j MCP server is the official server that exposes Neo4j to agents through MCP, shipping a handful of dedicated tools such as read query, write query, and get schema. neo4j.sh is the official command-line interface for Neo4j, a single binary you run in a terminal with credential profiles for each database you talk to. To keep the comparison honest, we’ll only look at the read-query and schema pair on the MCP side against the equivalent query invocation in neo4j.sh. Same operations, same database, same Cypher going over the wire. The only thing that changes is whether the agent reaches them through a typed tool schema or through a string passed to a shell.

Querying throughout environments

We already saw how a pile of dedicated tools eats the context window with descriptions, and that some servers now ship deferred tools to push that cost off until the agent actually reaches for them. But there’s a second multiplier nobody talks about: what happens when you want to talk to more than one instance of the same service. With MCP, the tool count doesn’t just grow with features, it grows with environments.

Connecting to multiple databases via MCP or CLI.

The agent wants a node count from dev, staging, and prod. Through MCP, you stand up a neo4j-mcp-server per environment, each one carrying its four tool schemas into the agent’s context on every turn. Three databases is twelve schemas in the model’s window, the same four schemas three times over, before the agent has done anything.

Through the CLI, it’s a for loop:

$ for c in dev staging prod-ro; do
    neo4j-cli query -c "$c" --format toon \
      "MATCH (n) RETURN count(n) AS nodes"
  done

One binary, three credential profiles, zero per-turn context cost. Adding a fourth environment is one more credential dbms add, not one more MCP server process. The same shape carries over to any “reach out to N similar things” workflow you might need: snapshotting prod before a risky deploy, diffing the schema between staging and prod, running a health check across every database the agent knows about.
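
As a rough sketch, the fan-out loop above can be factored into a small helper so that one loop serves any per-environment check. The names for_each_profile and count_nodes are illustrative and not part of neo4j-cli; the profile names and flags are simply the ones used in the loop above.

```shell
# Profile names taken from the loop above.
PROFILES="dev staging prod-ro"

# Illustrative wrapper around the node-count query; $1 is the profile name.
count_nodes() {
  neo4j-cli query -c "$1" --format toon \
    "MATCH (n) RETURN count(n) AS nodes"
}

# Run a command once per profile, passing the profile as the last argument.
# Every profile is visited even if one fails; the exit status reports failure.
for_each_profile() {
  local profile status=0
  for profile in $PROFILES; do
    printf '== %s ==\n' "$profile"
    "$@" "$profile" || status=1
  done
  return "$status"
}

# Usage: for_each_profile count_nodes
```

Labelling each result with its profile keeps the output readable when the agent skims it, and visiting every profile before failing means one unreachable environment does not hide the others.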

Chaining queries

Say the agent is investigating a known fraud account: from a single seed, find every account it transacted with, then find which other accounts those counterparties transact with most often. Two queries against the same database, where the second’s parameters are the output of the first.

Through MCP, the model has to be the pipe. It calls read-cypher, the result comes back as a list of, say, 80 counterparty IDs, those 80 IDs now sit in the model’s context, the model formats them into the parameter for the second read-cypher call, and only then can query two run. The intermediate list rides the conversation verbatim, and every extra ID is another row of context the agent pays for whether it ever reads it again or not.

Through the CLI, the pipe is a literal |:

# Cypher is single-quoted so the shell leaves $seed and $counterparties for Neo4j.
$ neo4j-cli query -c prod-ro --format json \
    --param "seed=acct_19f3" \
    'MATCH (:Account {id: $seed})-[:TRANSACTED]-(c:Account)
     WHERE c.id <> $seed
     RETURN collect(DISTINCT c.id) AS counterparties' \
  | neo4j-cli query -c prod-ro --params-from-stdin \
      'MATCH (a:Account)-[:TRANSACTED]-(b:Account)
       WHERE a.id IN $counterparties
         AND NOT b.id IN $counterparties + ["acct_19f3"]
       RETURN b.id, count(DISTINCT a) AS edges_into_cluster
       ORDER BY edges_into_cluster DESC LIMIT 20'

--params-from-stdin reads the previous query’s JSON result and binds it as a parameter for the next. The counterparties list never enters the model’s context; the agent’s token cost is the same whether the cluster has 5 counterparties or 500.

This is where the shell starts to feel like a different class of tool altogether. The agent isn’t picking from a menu of operations anymore; it’s composing pipelines, and the intermediate data never has to surface. A two-step query becomes a |. A fan-out becomes a for loop. A join across two databases becomes one query piped into another with --params-from-stdin. Each of those would be three or four MCP round trips with every intermediate result paraded through the context window, and at that point the agent has spent more tokens shuffling rows than thinking about them.

Pipe across many CLIs

Same problem, bigger scale. Say the agent wants to materialize a project’s recent GitHub issues into Neo4j: an :Issue node per ticket, a :User node per author, a :TAGGED relationship per label. The data lives in one CLI (gh), needs reshaping (jq does that), and lands in another CLI (neo4j-cli). Three different tools in one line. Through MCP, you’d hit GitHub’s MCP server for the issue list, every issue body lands in the model’s context, the model extracts the fields it wants, and write-cypher fires once per issue. A lot of round trips through the model, every issue body sitting in the conversation along the way.

Through the CLI, three programs in a pipe:

# Cypher is single-quoted so the shell leaves $data for Neo4j; --param stays double-quoted so $issue expands.
$ gh issue list --repo neo4j/neo4j --limit 100 \
    --json number,title,author,labels \
  | jq -c '.[]' \
  | while IFS= read -r issue; do
      neo4j-cli query --rw -c prod \
        --param "data=$issue" \
        'WITH apoc.convert.fromJsonMap($data) AS i
         MERGE (n:Issue {number: i.number}) SET n.title = i.title
         MERGE (u:User {login: i.author.login})
         MERGE (u)-[:OPENED]->(n)
         FOREACH (label IN i.labels |
           MERGE (l:Label {name: label.name})
           MERGE (n)-[:TAGGED]->(l))'
    done

gh pulls the issues, jq reshapes each one into a single JSON line, and the while loop hands each line to neo4j-cli as a Cypher parameter. The model writes this script once and then steps off; the data flows through bash, not through the agent. A hundred issues or ten thousand, the agent’s token cost is the same.

The shape generalizes well beyond GitHub. Swap gh for any other CLI that emits JSON (jira issue list, linear, curl against a webhook, your own internal dump command), swap the Cypher pattern for whatever database you’re building, and the pipeline carries. Two MCP tools can’t pipe to each other; two CLIs can, and so can ten.
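
One way to make that swap cheap is to pull the per-line loop into a reusable function. The sketch below is an assumption-laden illustration: each_json_line and show_record are made-up names, and the handler shown only prints what it would load. A real handler would wrap the neo4j-cli call from the pipeline above.

```shell
# Feed each non-empty line on stdin to a handler command as its last argument.
# Stops at the first failure so a bad record is noticed instead of skipped.
each_json_line() {
  local line count=0
  while IFS= read -r line || [ -n "$line" ]; do
    [ -n "$line" ] || continue
    # </dev/null keeps the handler from swallowing the remaining input lines.
    "$@" "$line" </dev/null || {
      echo "stopped after $count record(s)" >&2
      return 1
    }
    count=$((count + 1))
  done
  echo "handled $count record(s)" >&2
}

# Stand-in handler for a dry run (hypothetical; replace with a real loader).
show_record() {
  printf 'would load: %s\n' "$1"
}

printf '%s\n' '{"number":1}' '{"number":2}' | each_json_line show_record
```

The progress message goes to stderr so that stdout stays clean for a further pipe stage, and only that one short summary line needs to reach the agent, however many records flow through.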

Terminal control is powerful, and that’s the catch

The terminal isn’t a fixed surface; it’s the most flexible tool you can hand an agent because it composes with everything else on the box.

That power is also the catch. A flexible tool used badly does flexible damage. With great terminal access comes the obvious responsibility: sandbox the shell, allowlist the verbs you actually want, run the agent as a separate OS user, bind credentials to roles that physically can’t do the dangerous thing. None of this is novel; it’s just sysadmin hygiene applied to an LLM that types fast. And if you can’t do any of that, an MCP server with a small fixed surface is still the right answer; the protocol-level guarantee that the agent can’t cat ~/.ssh/id_rsa is a real thing.
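
A minimal sketch of the allowlist idea, assuming the agent harness launches commands as an argument vector instead of handing a raw string to a shell. run_allowed and ALLOWED_CMDS are hypothetical names. This is one thin layer, not a sandbox: an allowed program can still do damage through its own subcommands, which is why the separate OS user and read-only roles still matter.

```shell
# Hypothetical allowlist: the three programs used in this post.
ALLOWED_CMDS="gh jq neo4j-cli"

# Run "$@" only if its command name is on the allowlist.
# Arguments are passed through as-is and never re-parsed by a shell,
# so characters like ; | > inside a Cypher string stay inert.
run_allowed() {
  local cmd
  for cmd in $ALLOWED_CMDS; do
    if [ "$1" = "$cmd" ]; then
      "$@"
      return   # propagate the command's own exit status
    fi
  done
  echo "blocked: $1" >&2
  return 126
}

# Usage: run_allowed gh issue list --limit 5 | run_allowed jq -c '.[]'
```

With this shape the harness, not the model, decides where pipes go, which trades some of the shell’s flexibility for a smaller blast radius.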

The broader point holds even if you stay entirely inside MCP. The reason the terminal wins isn’t that bash is special, it’s that bash is one tool with very flexible input. Pipes, variables, substitution, looping. That’s the shape worth copying. Read the terminal as MCP’s limit case and design toward it: fewer tools, each one accepting expressive input, the agent doing the composing instead of you anticipating every combination upfront. Most MCP servers are a long list of narrow endpoints because that’s how the underlying API was already shaped, not because the agent works better that way. The servers that age well will be the ones that picked a smaller, more expressive surface on purpose.

All images in this blog post were created by the author.