Simple Agentic Device Calling with Gemma 4

# Introduction

In a latest article on Machine Studying Mastery, we constructed a tool-calling agent that reached outward, that’s pulling climate, information, forex charges, and time from public APIs. That article lined the synthesis half of the sample properly, however it left the extra fascinating half on the desk: an agent that causes about its personal setting, inspects its personal machine, and offloads logic it would not belief itself to carry out. It could possibly be argued that that is nearer to actually “agentic.”

This text picks up the place that one left off. We are going to give Gemma 4 two new instruments — a sandboxed native filesystem explorer and a restricted Python interpreter — and watch the mannequin resolve, by itself, when to go searching and when to compute.

Subjects we’ll cowl embody:

Why “agentic” instrument calling wants greater than net APIs to be fascinating
construct a filesystem inspection instrument with laborious path-traversal guards
wire a Python interpreter instrument to the mannequin with out handing it the keys to your machine
How the identical orchestration loop from earlier than generalizes to those new capabilities

I extremely suggest that you simply first learn this text earlier than persevering with on.

# From Dialog to Company

When the one instruments you give a language mannequin are read-only net APIs, basically you continue to actually have a chatbot, albeit one with potential entry to raised data. The mannequin receives a immediate, decides which API to ping, and stitches the JSON response right into a paragraph. There isn’t a actual notion of setting, no state to examine, no consequence to purpose about; it is a situation extra akin to retrieval augmented era than true company.

Company, within the sensible sense practitioners use the phrase, reveals up when a mannequin begins interacting with the system it’s operating on. That may imply studying from a neighborhood filesystem, executing code, modifying information, calling different processes, or any mixture of these. The second a instrument can do one thing apart from return a clear string from a distant service, the mannequin has to start out asking about itself: what information exist, what does this quantity truly equal, what’s on this folder earlier than I declare it accommodates something.

The Gemma 4 household, and particularly the gemma4:e2b edge variant now we have been utilizing, is sufficiently small to run regionally on a laptop computer whereas being competent sufficient at structured output to drive this sort of loop reliably. That mixture is what makes the local-agentic sample fascinating within the first place. The entire code for this tutorial could be discovered right here.

# The Architectural Reuse

The orchestration loop from the earlier tutorial doesn’t change. We outline Python features, expose them through JSON schema, cross the registry to Ollama alongside the person immediate, intercept any tool_calls block on the response, execute the requested perform regionally, append the outcome as a instrument-role message, and re-query the mannequin so it could synthesize a closing reply. The identical call_ollama helper, the identical TOOL_FUNCTIONS dictionary, the identical available_tools schema array from the earlier tutorial all make appearances.

What modifications is the character of the instruments themselves. The place the earlier batch have been all skinny shoppers over distant APIs, these we’ll construct now each run code on the machine. That shifts the design downside from “how do I parse this response” to “how do I be certain the mannequin can not, even unintentionally, do one thing it shouldn’t be allowed to do.”

# Device 1: A Sandboxed Filesystem Explorer

The primary instrument, list_directory_contents, offers the mannequin the flexibility to see what information exist in a given folder. This sounds trivial till you do not forget that os.listdir accepts any string, together with /, ~, and ../../and so forth. A naive implementation may fortunately stroll the mannequin’s “curiosity” straight to your API keys.

The design alternative right here is to pin a secure base listing at script begin and reject any request that resolves outdoors of it:

# Safety: confine list_directory_contents to this base listing and its descendants
# Set to the present working listing when the script begins
SAFE_BASE_DIR = os.path.abspath(os.getcwd())

def list_directory_contents(path: str = ".") -> str:
    """Lists information and directories inside a path, constrained to the secure base listing."""
    strive:
        # Resolve to an absolute path and confirm it sits inside SAFE_BASE_DIR
        # This blocks traversal makes an attempt like '../../and so forth' or absolute paths like "https://www.kdnuggets.com/"
        requested = os.path.abspath(os.path.be part of(SAFE_BASE_DIR, path))
        if not (requested == SAFE_BASE_DIR or requested.startswith(SAFE_BASE_DIR + os.sep)):
            return (
                f"Error: Entry denied. The trail '{path}' resolves outdoors the "
                f"permitted workspace ({SAFE_BASE_DIR})."
            )
        ...

The sample is easy however value contemplating additional. We by no means belief the string the mannequin produced. We be part of it onto the bottom listing, resolve it completely (so .. will get normalized away), after which confirm the resolved path nonetheless begins with the bottom. Each /and so forth/passwd and ../../someplace collapse into paths that fail that prefix test and are rejected earlier than os.listdir is ever known as.

The remainder of the perform is housekeeping: verify the trail exists and is a listing, checklist its contents, and format every entry as both [DIR] or [FILE] with a byte measurement. The returned string is apparent English with construction the mannequin can parse on the second cross:

        entries = sorted(os.listdir(requested))
        if not entries:
            return f"The listing '{path}' is empty."

        traces = [f"Contents of '{path}' ({len(entries)} item(s)):"]
        for title in entries:
            full = os.path.be part of(requested, title)
            if os.path.isdir(full):
                traces.append(f"  [DIR]  {title}/")
            else:
                strive:
                    measurement = os.path.getsize(full)
                    traces.append(f"  [FILE] {title} ({measurement} bytes)")
                besides OSError:
                    traces.append(f"  [FILE] {title}")
        return "n".be part of(traces)

The JSON schema we hand to the mannequin is intentionally permissive on the parameter facet — path is optionally available, defaulting to the workspace root, as a result of most helpful first questions are in regards to the present folder:

{
    "sort": "perform",
    "perform": {
        "title": "list_directory_contents",
        "description": (
            "Lists information and subdirectories inside a path inside the person's workspace. "
            "Use this to examine the setting earlier than answering questions on native information."
        ),
        "parameters": {
            "sort": "object",
            "properties": {
                "path": {
                    "sort": "string",
                    "description": (
                        "A relative path contained in the workspace, e.g. '.', 'knowledge', or 'src/utils'. "
                        "Defaults to the workspace root."
                    )
                }
            },
            "required": []
        }
    }
}

Word the outline does a small quantity of immediate engineering: “Use this to examine the setting earlier than answering questions on native information.” That sentence pushes Gemma 4 towards calling the instrument when the person asks a obscure query about “my information” somewhat than guessing at what may be there.

# Device 2: A Restricted Python Interpreter

The second instrument, execute_python_code, is the extra harmful and the extra pedagogically fascinating of the 2. The premise is that language fashions, particularly small ones, are unreliable at exact arithmetic, actual string manipulation, and something involving greater than a few steps of branching logic. A instrument that lets the mannequin write and run a deterministic snippet is a significantly better reply to these issues than asking it to purpose by way of them in pure language.

The implementation makes use of exec() with a intentionally stripped-down builtins namespace:

def execute_python_code(code: str) -> str:
    """Executes a snippet of Python code and returns no matter was printed to stdout.

    It is a learning-only sandbox. exec() is essentially unsafe; don't expose this instrument
    to untrusted customers or networks. The restrictions beneath cease the informal instances, not a 
    decided attacker.
    """
    strive:
        # A minimal restricted setting. We strip __builtins__ all the way down to a small
        # whitelist in order that, e.g., open(), eval(), and __import__ should not straight
        # out there from the snippet's international scope.
        safe_builtins = {
            "abs": abs, "all": all, "any": any, "bool": bool, "dict": dict,
            "divmod": divmod, "enumerate": enumerate, "filter": filter, "float": float,
            "int": int, "len": len, "checklist": checklist, "map": map, "max": max, "min": min,
            "pow": pow, "print": print, "vary": vary, "repr": repr, "reversed": reversed,
            "spherical": spherical, "set": set, "sorted": sorted, "str": str, "sum": sum,
            "tuple": tuple, "zip": zip,
        }
        # Pre-import a few secure, helpful modules so the mannequin would not need to.
        import math, statistics
        restricted_globals = {
            "__builtins__": safe_builtins,
            "math": math,
            "statistics": statistics,
        }

A couple of selections value calling out. We change __builtins__ totally somewhat than blacklisting particular person features, which suggests open, eval, exec, compile, __import__, enter, and anything not in our whitelist merely doesn’t exist contained in the snippet. We pre-import math and statistics into the snippet’s globals as a result of the mannequin will attain for them always and we might somewhat not pressure it to struggle __import__ restrictions. We seize stdout with contextlib.redirect_stdout so the mannequin will get again precisely what its snippet printed:

        # Seize stdout so we will hand the printed output again to the mannequin
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, restricted_globals, {})

        output = buffer.getvalue().strip()
        if not output:
            return "Code executed efficiently however produced no output. Use print() to return a price."
        return f"Output:n{output}"

The empty-output department issues greater than it seems. Small fashions will routinely write expressions like x = sum(vary(101)) and overlook the print(x). Returning a particular error telling them to make use of print() offers the orchestration loop the choice to retry; with out it, the mannequin would synthesize a closing reply based mostly on an empty string and confidently invent a price.

A closing phrase on security, because the script’s docstring is blunt about it: this can be a studying sandbox, not a hardened one. A decided adversary can escape of a Python exec sandbox in a dozen methods, most of them involving object introspection by way of ().__class__.__mro__. For a single-user agent operating by yourself laptop computer by yourself prompts, the whitelist is loads. For anything, you’ll need an actual isolation layer — a subprocess with seccomp, a container, or RestrictedPython.

# The Orchestration Loop

The primary loop is unchanged in construction from the earlier tutorial. The mannequin is queried with the person immediate and the instrument registry, and if it responds with tool_calls, every name is dispatched in opposition to TOOL_FUNCTIONS:

if "tool_calls" in message and message["tool_calls"]:
    print("[TOOL EXECUTION]")
    messages.append(message)

    num_tools = len(message["tool_calls"])
    for i, tool_call in enumerate(message["tool_calls"]):
        function_name = tool_call["function"]["name"]
        arguments = tool_call["function"]["arguments"]
        ...
        if function_name in TOOL_FUNCTIONS:
            func = TOOL_FUNCTIONS[function_name]
            strive:
                outcome = func(**arguments)
                ...
                messages.append({
                    "position": "instrument",
                    "content material": str(outcome),
                    "title": function_name
                })

The CLI formatting is value a small tweak for this script. The execute_python_code instrument’s code argument generally is a multi-line string with newlines in it, which can wreck an ASCII tree if printed naively. We flatten and truncate string arguments for the show solely; the mannequin nonetheless receives the total string when the perform runs:

def _short(v):
    if isinstance(v, str):
        flat = v.change("n", "n")
        if len(flat) > 60:
            flat = flat[:57] + "..."
        return f"'{flat}'"
    return str(v)

args_str = ", ".be part of(f"{okay}={_short(v)}" for okay, v in arguments.gadgets())

As soon as every instrument result’s appended again into the message historical past as a "position": "instrument" entry, we re-call Ollama with the enriched payload and the mannequin produces its grounded closing reply. Similar two-pass sample, similar logic.

# Testing the Instruments

And now we take a look at our instrument calling. Pull gemma4:e2b with ollama pull gemma4:e2b when you have not already, then run the script from a folder you don’t thoughts the mannequin peeking at.

Let’s begin with the filesystem instrument. From the venture listing:

What scripts are in my present folder, and which one seems prefer it must be used to course of CSVs?

Outcome:

[SYSTEM]
  ○ Device: execute_python_code......................[LOADED]
  ○ Device: list_directory_contents..................[LOADED]
  ○ Workspace: /Customers/matt/tasks/gemma_agent.....[SANDBOXED]

[PROMPT]
What scripts are in my present folder, and which one seems prefer it must be used to course of CSVs?

[EXECUTION]
  ● Querying mannequin...

[TOOL EXECUTION]
  └── Calling: list_directory_contents
     ├─ Args: path="."
     └─ Outcome: Contents of '.' (5 merchandise(s)):
                  [FILE] README.md (412 bytes)
                  [FILE] csv_cleaner.py (1834 bytes)
                  [FILE] essential.py (10786 bytes)
                  [FILE] notes.txt (88 bytes)
                  [FILE] sales_report.py (2210 bytes)

[EXECUTION]
  ● Synthesizing outcomes...

[RESPONSE]
Your present folder accommodates 5 information. The one that appears supposed for CSV
processing is csv_cleaner.py — its title strongly suggests it handles CSV enter.
sales_report.py may contact CSV knowledge, however its title is extra about output than
ingestion.

The mannequin known as the instrument, regarded on the precise filenames, and made an inexpensive inference grounded within the itemizing somewhat than in its weights. That’s the distinction between hallucination and commentary.

Subsequent, the Python interpreter. A small process that small fashions reliably get improper if requested to do it of their head:

What’s the normal deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?

Outcome:

[SYSTEM]
  ○ Device: execute_python_code......................[LOADED]
  ○ Device: list_directory_contents..................[LOADED]
  ○ Workspace: /Customers/matt/tasks/gemma_agent.....[SANDBOXED]

[PROMPT]
What's the normal deviation of the numbers 12, 18, 23, 24, 29, 31, 35, 41, 44, 47, rounded to 4 decimal locations?

[EXECUTION]
  ● Querying mannequin...

[TOOL EXECUTION]
  └── Calling: execute_python_code
     ├─ Args: code="nums = [12, 18, 23, 24, 29, 31, 35, 41, 44, 47]nprint(rou..."
     └─ Outcome: Output:
                11.4659

[EXECUTION]
  ● Synthesizing outcomes...

[RESPONSE]
The usual deviation of these numbers, rounded to 4 decimal locations, is 11.4659.

The mannequin offloaded the calculation totally; it wrote a snippet, known as statistics.stdev, rounded the outcome, and reported what the interpreter mentioned. No psychological arithmetic, no approximation, no fabricated vital digits.

Lastly, the extra fascinating case: a immediate that requires each instruments in sequence. The mannequin has to examine the folder and compute one thing about what it finds:

Have a look at the information within the present folder and inform me the entire measurement in kilobytes, rounded to 2 decimal locations.

Output:


[SYSTEM]
  ○ Device: execute_python_code......................[LOADED]
  ○ Device: list_directory_contents..................[LOADED]
  ○ Workspace: /Customers/matt/tasks/gemma_agent.....[SANDBOXED]

[PROMPT]
Have a look at the information within the present folder and inform me the entire measurement in kilobytes, rounded to 2 decimal locations.

[EXECUTION]
  ● Querying mannequin...

[TOOL EXECUTION]
  ┌── Calling: list_directory_contents
  │  ├─ Args: path="."
  │  └─ Outcome: Contents of '.' (5 merchandise(s)):
  │              [FILE] README.md (412 bytes)
  │              [FILE] csv_cleaner.py (1834 bytes)
  │              [FILE] essential.py (10786 bytes)
  │              [FILE] notes.txt (88 bytes)
  │              [FILE] sales_report.py (2210 bytes)
  │
  └── Calling: execute_python_code
     ├─ Args: code="sizes = [412, 1834, 10786, 88, 2210]nprint(spherical(sum(siz..."
     └─ Outcome: Output:
                15.33

[EXECUTION]
  ● Synthesizing outcomes...

[RESPONSE]
The 5 information within the present folder complete 15.33 KB.

Two instruments, in the suitable order, with the output of 1 feeding the argument of the opposite — produced by a 2-billion-parameter mannequin operating on a laptop computer with no GPU. The filesystem instrument grounds the mannequin in what is definitely there; the interpreter instrument grounds the reply in what is definitely true. The mannequin contributes the half it’s genuinely good at, which is deciding which query to ask of which instrument.

It’s value poking on the security guards too, simply to verify they maintain. Asking the mannequin “checklist the contents of /and so forth” produces the anticipated denial message within the instrument outcome, which the mannequin then stories again gracefully somewhat than fabricating a listing itemizing. Asking it to run open('/and so forth/passwd').learn() contained in the interpreter produces a NameError, since open is just not within the whitelisted builtins. Each failures degrade into helpful error strings as a substitute of silent compromises, which is strictly what you need at this layer.

# Conclusion

The sooner tutorial confirmed that Gemma 4 can attain throughout the web in your behalf. This one reveals it could attain into the machine you might be sitting at, rigorously, when you might have constructed the carefulness in. After getting a working tool-calling loop, the fascinating query stops being “can the mannequin name a perform” and begins being “what ought to I let it contact.”

A filesystem-aware instrument and a code-execution instrument collectively get you many of the option to one thing that genuinely earns the time period agent: it could observe its setting, resolve what calculation issues, and run that calculation deterministically somewhat than guessing. The sample generalizes from there. Database queries, shell instructions, git operations, doc parsing; every one among these is identical JSON schema, the identical dispatch desk, the identical two-pass synthesis, with no matter security perimeter is suitable for the blast radius of the underlying name.

Construct the perimeter first. Then hand the mannequin the keys to no matter sits inside it.

Matthew Mayo (@mattmayo13) holds a grasp’s diploma in laptop science and a graduate diploma in knowledge mining. As managing editor of KDnuggets & Statology, and contributing editor at Machine Studying Mastery, Matthew goals to make advanced knowledge science ideas accessible. His skilled pursuits embody pure language processing, language fashions, machine studying algorithms, and exploring rising AI. He’s pushed by a mission to democratize information within the knowledge science neighborhood. Matthew has been coding since he was 6 years outdated.