
# Introducing MCP
Every developer building with local AI eventually hits the same wall. The model works. It reasons well, writes solid code, and answers complex questions. But it cannot do everything. It cannot query your database, open a GitHub issue, or call your internal API. You are left writing custom Python wrappers for every tool you need, hardcoding the glue between model output and tool execution, and maintaining those wrappers every time an API changes.
The Model Context Protocol (MCP) was designed to solve exactly this. It is an open standard from Anthropic: a universal, pluggable protocol for AI tool connectivity. Define a tool once as an MCP server. Any MCP-compatible client, any model, any framework can discover and call it with zero custom integration code per model.
Qwen3.6-35B-A3B is the most capable local model for this kind of work right now. It has a 262,144-token context window, a Mixture of Experts (MoE) architecture that activates only 3B of its 35B parameters per forward pass (which is why it runs on hardware that should not be able to run a 35B model), and was explicitly trained and evaluated on MCP-based agentic tasks.
This article builds a local GitHub developer assistant: an agent that reads a repository's open issues, searches the relevant code, drafts a fix, and creates a pull request. The whole thing runs on your hardware, through MCP servers, with no cloud dependency.
# Understanding Qwen3.6-35B-A3B
Understanding the architecture matters here because it directly explains what hardware you need and why the model performs the way it does on agentic tasks.
The name encodes the key fact: 35B total parameters, with A3B meaning 3B activated per forward pass. It is an MoE model with 256 experts per layer, routing each token to 8 experts plus 1 shared expert. You get the knowledge capacity of a 35B model at the inference compute cost of a 3B model. That trade-off is why it runs at usable speed on hardware that would collapse under a dense 35B model, although all 35B parameters still have to be held in memory.
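To make the routing idea concrete, here is a minimal pure-Python sketch of top-k expert selection. The 256-expert and 8-routed figures come from the text above; the random router scores, the softmax renormalisation over the selected experts, and the function name `route_token` are illustrative assumptions, not the model's actual implementation.

```python
import math
import random

NUM_EXPERTS = 256  # experts per MoE layer (from the text)
TOP_K = 8          # routed experts per token (from the text); the shared expert always runs


def route_token(router_logits: list[float], top_k: int = TOP_K) -> dict[int, float]:
    """Pick the top_k highest-scoring experts and renormalise their weights.

    Renormalising with a softmax over only the selected experts is an
    assumption made for illustration.
    """
    ranked = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = ranked[:top_k]
    peak = max(router_logits[i] for i in top)
    exp_scores = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exp_scores.values())
    return {i: score / total for i, score in exp_scores.items()}


# Stand-in router scores for one token (random, seeded for repeatability)
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
weights = route_token(logits)

print(f"experts used for this token: {len(weights)} routed + 1 shared, out of {NUM_EXPERTS}")
print(f"fraction of routed experts active: {len(weights) / NUM_EXPERTS:.1%}")
```

Only the selected experts' feed-forward weights take part in the computation for that token, which is where the compute saving comes from. The unselected experts still occupy memory.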
The hidden layout is where Qwen3.6 diverges most from other MoE models. Each block in the 40-layer stack follows a 3:1 ratio of Gated DeltaNet layers to Gated Attention layers. DeltaNet is a linear attention mechanism; it processes sequences more efficiently than full quadratic attention, especially at long context lengths. The interleaved full Gated Attention layers provide the deep relational reasoning that linear attention alone misses. For an agent working through a 500-file repository, that combination matters: efficient processing at length combined with precise reasoning on the relevant sections.
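The 3:1 interleaving can be pictured with a few lines of Python. The 40-layer depth and the ratio come from the text; the position of the attention layer within each group of four (last, in this sketch) is an assumption for illustration.

```python
NUM_LAYERS = 40

# One repeating group: three linear-attention layers, then one full-attention layer.
# The ordering inside the group is assumed.
GROUP = ["gated_deltanet"] * 3 + ["gated_attention"]

layers = [GROUP[i % len(GROUP)] for i in range(NUM_LAYERS)]
counts = {kind: layers.count(kind) for kind in GROUP}

print(layers[:8])
print(counts)
```

Under this layout, only a quarter of the layers pay the quadratic attention cost, which is what keeps long-context processing affordable.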
The context window is 262,144 tokens natively, extensible to 1,010,000 with YaRN scaling. For agent work, context length is not a comfort feature; it is an operational constraint. An agent reading source files, maintaining tool call history, tracking a multi-step plan, and injecting tool results back into context needs real headroom. Many 7B and 13B models cap at 8k or 32k tokens. Running out of context mid-task means the agent loses its own history and starts hallucinating tool results.
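A rough budget calculation shows why the window size is an operational constraint. This is a back-of-the-envelope sketch: the 262,144 figure is from the text, while the system-prompt size, the per-turn token cost, and the output reserve are hypothetical numbers chosen only to illustrate the arithmetic.

```python
CONTEXT_WINDOW = 262_144  # native window, from the text


def turns_until_full(
    context_window: int,
    system_tokens: int,
    tokens_per_turn: int,
    reserve_for_output: int = 4_096,
) -> int:
    """Estimate how many agent turns fit before the history overflows the window."""
    usable = context_window - system_tokens - reserve_for_output
    return max(usable // tokens_per_turn, 0)


# Assumed workload: 1,500-token system prompt, about 6,000 tokens added per turn
# (reasoning trace + tool call + tool result).
budget = {
    window: turns_until_full(window, system_tokens=1_500, tokens_per_turn=6_000)
    for window in (8_192, 32_768, CONTEXT_WINDOW)
}

for window, turns in budget.items():
    print(f"{window:>7} tokens -> about {turns} turns")
```

With these assumed numbers, an 8k window cannot complete a single turn and a 32k window runs out after a handful, while the native window leaves room for a long session.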
Qwen3.6 was explicitly trained and evaluated on MCP-based agentic benchmarks. Two headline features came out of that training:
- Agentic Coding. Frontend workflows and repository-level reasoning: the model handles multi-file refactoring tasks with coherent reasoning across files, not just single-file edits in isolation.
- Thinking Preservation. A `preserve_thinking` flag that keeps reasoning traces from prior turns in a multi-turn conversation. When an agent reasons through a plan in turn one and then executes tool calls in turns two through five, `preserve_thinking=True` keeps the turn-one reasoning available in the KV cache. Each subsequent turn benefits from that prior reasoning without paying the cost of re-deriving it.
# System Necessities
There are three practical deployment paths, and which one you use depends entirely on your hardware.
- GPU inference (recommended for production agent workloads). Qwen3.6-35B-A3B in bfloat16 requires roughly 70 GB of VRAM. In Q4 quantization, it fits in roughly 20–24 GB. A single RTX 4090 (24 GB) handles Q4. Two RTX 3090s with tensor parallelism handle Q4 as well. An A100 80 GB handles the full bfloat16 model.
- CPU/hybrid via KTransformers. KTransformers is the accessible path for developers without a 24 GB GPU. It offloads compute-heavy layers to the GPU when one is available and runs the rest on the CPU. With 64 GB of system RAM, you can run Qwen3.6-35B-A3B in a usable (if slower) configuration. Response latency will be 30–120 seconds per turn depending on your CPU, which is workable for an agent doing background repository analysis but not for interactive coding sessions.
- Smaller models for tutorial testing. The entire MCP integration pattern in this article is the same regardless of model size. If you want to follow along without the hardware for the full 35B model, use `Qwen/Qwen2.5-7B-Instruct` via Ollama (`ollama pull qwen2.5:7b`) or the Qwen3-8B model. The serving API is the same, the code is the same, and you can swap in the 35B model when hardware allows.
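The VRAM figures in the GPU path follow from simple arithmetic on the parameter count. The sketch below covers weights only; the 4.5 bits-per-parameter figure for Q4 is an assumed average (4-bit values plus per-group scales), and KV cache and runtime overhead come on top of it.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


TOTAL_PARAMS = 35e9  # every expert must be resident, not just the ~3B active per token

bf16_gb = weight_memory_gb(TOTAL_PARAMS, 16)
q4_gb = weight_memory_gb(TOTAL_PARAMS, 4.5)  # assumed effective bit width for 4-bit formats

print(f"bfloat16 weights: ~{bf16_gb:.0f} GB")
print(f"Q4 weights:       ~{q4_gb:.1f} GB (before KV cache and runtime overhead)")
```

The Q4 weights land just under 20 GB, which is why the usable range quoted above is 20–24 GB once the cache and framework overhead are included.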
Software requirements:
```bash
# Python 3.11+ required
python --version
python -m venv qwen-mcp-env
source qwen-mcp-env/bin/activate          # macOS / Linux
# Windows: qwen-mcp-env\Scripts\activate

# Core packages
pip install \
  "openai>=1.30.0" \
  "qwen-agent>=0.0.10" \
  "mcp>=1.0.0" \
  "httpx>=0.27.0"

# Serving framework -- choose one
pip install "vllm>=0.19.0"      # NVIDIA GPU
pip install "sglang>=0.5.10"    # NVIDIA GPU (faster prefill for long context)
pip install "ktransformers"     # CPU/hybrid

# Node.js 18+ is required for pre-built MCP servers installed via npx
node --version
```
# Serving Qwen3.6 Locally with an OpenAI-Compatible API
Before wiring in any MCP servers, you need a working inference server. Both SGLang and vLLM expose an OpenAI-compatible API that the MCP integration layer talks to: the same API surface, just pointed at localhost instead of api.openai.com.
// SGLang (Recommended for Long-Context Agent Workloads)
```bash
# Install SGLang with full dependencies
pip install "sglang[all]>=0.5.10"

# Serve Qwen3.6-35B-A3B with reasoning and tool-call parsers enabled.
# --reasoning-parser qwen3 correctly handles the <think>...</think> blocks.
# --tool-call-parser qwen3_coder routes tool call outputs to the right format.
# --enable-prefix-caching is critical for agent workloads -- enables KV cache reuse
# across turns, which is what makes preserve_thinking efficient in practice.
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 30000 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --tp 2   # tensor parallel across 2 GPUs; remove if using a single GPU
```
// vLLM
```bash
pip install "vllm>=0.19.0"

# vLLM equivalent with the same critical flags
vllm serve Qwen/Qwen3.6-35B-A3B \
  --host 0.0.0.0 \
  --port 8000 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-prefix-caching \
  --tensor-parallel-size 2
```
// Smaller Model via Ollama
```bash
ollama pull qwen2.5:7b
ollama serve
# Ollama's API is OpenAI-compatible at http://localhost:11434/v1
```
Once the server is running, verify it before going any further:
```bash
# Health check -- should return {"status": "ok"} or similar
# (port 30000 is the SGLang example above; use 8000 for vLLM)
curl http://localhost:30000/health

# Test the chat completions endpoint with a simple query
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.6-35B-A3B",
    "messages": [{"role": "user", "content": "Reply with: ready"}],
    "max_tokens": 10
  }'
```
If you get a JSON response with a `choices` array, the server is ready. Do not proceed to MCP setup until this works. Every integration failure you encounter later is easier to debug when the serving layer is solid.
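The same readiness check can be scripted so it runs before the agent starts. This is a minimal sketch using only the standard library; the helper names `has_choices` and `check_server` are hypothetical, and the URL and model name are the ones used in the SGLang example above.

```python
import json
import urllib.request


def has_choices(payload: dict) -> bool:
    """True if the payload looks like a successful chat completion."""
    choices = payload.get("choices")
    return isinstance(choices, list) and len(choices) > 0


def check_server(base_url: str, model: str, timeout: float = 30.0) -> bool:
    """POST a tiny chat request and report whether the server answered sensibly."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": "Reply with: ready"}],
        "max_tokens": 10,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return has_choices(json.load(response))
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error, or a non-JSON body
        return False
```

Calling `check_server("http://localhost:30000/v1", "Qwen/Qwen3.6-35B-A3B")` at the top of an agent script and exiting early on `False` separates serving problems from MCP problems.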
# Understanding MCP and Why It Changes the Agent Architecture
Before writing any agent code, it helps to understand what MCP actually does at the protocol level, because that understanding prevents a class of bugs that come from treating MCP as just a fancier function-calling API.
MCP is a JSON-RPC 2.0 protocol running over stdio or HTTP transport. When an MCP client connects to a server, the first thing it does after the initialization handshake is call `tools/list` to discover what tools the server exposes. Each tool comes back with a name, a description, and an input schema defined in JSON Schema. The model reads this schema. It is the model's contract with the tool.
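The discovery exchange is easier to picture as data. The sketch below shows the shape of a `tools/list` request and a reply, then converts the reply into the OpenAI function-calling format that the serving endpoint expects. The `read_file` tool and its schema are a hypothetical example, and `to_openai_tool` is an illustrative helper name.

```python
import json

# What the client sends to discover tools
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# A hypothetical server reply describing a single tool
list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "read_file",
                "description": "Read a UTF-8 text file and return its contents.",
                "inputSchema": {
                    "type": "object",
                    "properties": {"path": {"type": "string"}},
                    "required": ["path"],
                },
            }
        ]
    },
}


def to_openai_tool(mcp_tool: dict) -> dict:
    """Convert one MCP tool description into the OpenAI function-calling format."""
    return {
        "type": "function",
        "function": {
            "name": mcp_tool["name"],
            "description": mcp_tool.get("description", ""),
            "parameters": mcp_tool["inputSchema"],
        },
    }


openai_tools = [to_openai_tool(tool) for tool in list_response["result"]["tools"]]
print(json.dumps(openai_tools, indent=2))
```

The JSON Schema passes through unchanged, so whatever the server declares is exactly what the model sees.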
When the model wants to call a tool, it emits a structured tool call object. The MCP client, not the model, actually executes the call by sending a `tools/call` request to the server. The server handles execution and returns a result. The client injects that result back into the conversation as a `tool` role message. The model reads the result and decides the next step.
This separation is important. The model decides what to call and with what arguments. The client handles execution. The server handles the actual work. Your code never hardwires a tool to a model; you just tell the client which servers are available.
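The three roles can be shown in a toy, in-process form. Nothing here speaks real MCP: the "server" is a dict of functions, the "model" is a scripted stand-in, and every name (`FAKE_FILES`, `fake_model`, `run_client`) is hypothetical. The point is only to show which component does what.

```python
from typing import Callable

# "Server": owns the actual work. One hypothetical tool backed by an in-memory dict.
FAKE_FILES = {"README.md": "# demo project"}


def read_file(path: str) -> str:
    return FAKE_FILES.get(path, f"Error: {path} not found")


SERVER_TOOLS: dict[str, Callable[..., str]] = {"read_file": read_file}


# "Model": only decides what to call. This stand-in is scripted, not a real LLM.
def fake_model(messages: list[dict]) -> dict:
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "read_file", "arguments": {"path": "README.md"}}}
    return {"content": f"The README says: {messages[-1]['content']}"}


# "Client": executes the calls and feeds results back as tool-role messages.
def run_client(task: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = fake_model(messages)
        call = reply.get("tool_call")
        if call is None:
            return reply["content"]
        result = SERVER_TOOLS[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "name": call["name"], "content": result})
    return "max_turns reached"


answer = run_client("What does the README say?")
print(answer)
```

Swapping `fake_model` for a real endpoint or `SERVER_TOOLS` for a real MCP session changes one component at a time, which is the practical benefit of the separation.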
There are two ways to use MCP with Qwen3.6:
- Via Qwen-Agent: the official `qwen_agent` library handles tool discovery, call parsing, result injection, and multi-turn conversation management automatically. Less code, less control. Right for most use cases.
- Via the MCP Python SDK directly: you handle the agentic loop yourself using `mcp.ClientSession`. More code, full visibility into every message, full control over error handling and retry logic. Right for production systems where you need to monitor every step.
This article covers both, starting with Qwen-Agent.
# Building the Local GitHub Developer Assistant
The agent does four things in sequence: reads open issues from a GitHub repository, finds the relevant code, drafts a fix, and opens a pull request. All locally, all through MCP.
// Part 1: Environment and MCP Server Setup
```bash
# Set your GitHub personal access token
# Required by the GitHub MCP server for API calls
export GITHUB_TOKEN=ghp_your_token_here

# Pre-built MCP servers install via npx -- no separate install step
# npx handles this on first use when the agent starts the servers
# Verify npx is available:
npx --version
```
Create a project directory:
```bash
mkdir qwen-github-agent
cd qwen-github-agent
```
// Part 2: Qwen-Agent Implementation
The fastest path to a working agent. Qwen-Agent handles the full loop automatically.
```python
# github_agent_qwenagent.py
# Prerequisites: pip install qwen-agent openai
#                npm / npx must be installed for the MCP servers
#                GITHUB_TOKEN env var must be set
#                Local serving endpoint must be running (see previous section)
#
# How to run:
#   python github_agent_qwenagent.py

import os
import re

from qwen_agent.agents import Assistant

# Matches the reasoning trace that Qwen3.6 emits in thinking mode
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

# ── Server configuration ──────────────────────────────────────────────────────
# Point at your local serving endpoint.
# Change the base_url to match whichever server you started:
#   SGLang: http://localhost:30000/v1
#   vLLM:   http://localhost:8000/v1
#   Ollama: http://localhost:11434/v1
LLM_CONFIG = {
    "model": "Qwen/Qwen3.6-35B-A3B",
    "model_server": "http://localhost:30000/v1",
    "api_key": "EMPTY",  # Local servers don't require a real key
    # Thinking mode sampling params (from the official model card best practices)
    "generate_cfg": {
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,
        "min_p": 0.0,
        "thought_in_history": True,  # This is the preserve_thinking flag in Qwen-Agent
    },
}

# ── MCP server configuration ──────────────────────────────────────────────────
# Each server key names the server; the value is the stdio launch command.
# Qwen-Agent starts each server as a subprocess and manages the MCP sessions.
GITHUB_TOKEN = os.environ.get("GITHUB_TOKEN", "")

MCP_SERVERS = {
    "mcpServers": {
        "filesystem": {
            "command": "npx",
            "args": [
                "-y",
                "@modelcontextprotocol/server-filesystem",
                # Grant the agent access to the current working directory
                # In production, restrict to the specific repository path
                ".",
            ],
        },
        "github": {
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-github"],
            "env": {
                # Read the token in Python: a literal "${GITHUB_TOKEN}" string
                # is not expanded inside a dict. The reference GitHub server
                # looks for GITHUB_PERSONAL_ACCESS_TOKEN, so set both names.
                "GITHUB_TOKEN": GITHUB_TOKEN,
                "GITHUB_PERSONAL_ACCESS_TOKEN": GITHUB_TOKEN,
            },
        },
    }
}

# ── System prompt ─────────────────────────────────────────────────────────────
SYSTEM_PROMPT = """You are a senior software engineer with full access to a GitHub repository
via MCP tools.

When given a repository and task:
1. List open issues to understand what needs fixing
2. Use filesystem tools to read relevant source files and tests
3. Identify the root cause based on the code and the issue description
4. Write a targeted fix -- minimal changes, no refactoring unrelated to the bug
5. Create a pull request with a clear title and description referencing the issue

Always explain your reasoning at each step. Think through edge cases before writing code.
If you are unsure about a file's purpose, read it before modifying it."""

# ── Agent setup ───────────────────────────────────────────────────────────────
agent = Assistant(
    llm=LLM_CONFIG,
    name="GitHub Developer Assistant",
    description="Reads issues, fixes bugs, opens pull requests -- locally via MCP.",
    system_message=SYSTEM_PROMPT,
    # Qwen-Agent takes MCP server configs as entries in function_list
    function_list=[MCP_SERVERS],
)


# ── Run the agent ─────────────────────────────────────────────────────────────
def run_agent(task: str):
    """
    Run the agent on a task description and stream the output.
    The agent makes tool calls automatically; Qwen-Agent handles
    the full loop including tool execution and result injection.
    """
    messages = [{"role": "user", "content": task}]
    print(f"Task: {task}\n{'─' * 70}")

    # Qwen-Agent's run() is a generator that yields intermediate steps.
    # Each yielded value is the list of messages produced so far;
    # the last message holds the most recent output.
    for response in agent.run(messages=messages):
        last = response[-1]
        role = last.get("role", "")
        content = last.get("content", "")

        if role == "assistant" and content:
            # Show the thinking block separately for readability
            thinking = THINK_RE.search(content)
            if thinking:
                print(f"[thinking] {thinking.group(1).strip()[:200]}...")
            clean = THINK_RE.sub("", content).strip()
            if clean:
                print(f"[agent] {clean}")
        elif role in ("function", "tool"):
            # Tool results; the role name depends on the Qwen-Agent version
            tool_name = last.get("name", "unknown_tool")
            print(f"[tool:{tool_name}] result received")


if __name__ == "__main__":
    run_agent(
        "In the repository myorg/my-api-project, find the open issue about "
        "the login endpoint returning 200 for invalid tokens. Read the relevant "
        "code and tests, fix the bug, and open a pull request."
    )
```
How to run:
```bash
python github_agent_qwenagent.py
```
// Part 3: Raw MCP SDK Implementation
For teams who need full control over every protocol message, custom error handling, per-tool retry logic, and audit logging of every tool call and result:
```python
# github_agent_raw.py
# Prerequisites: pip install mcp openai httpx
#                GITHUB_TOKEN env var must be set, local server must be running
#
# How to run:
#   python github_agent_raw.py

import asyncio
import json
import os
import re

from openai import AsyncOpenAI
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

# ── Local serving client ──────────────────────────────────────────────────────
client = AsyncOpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY",
)
MODEL = "Qwen/Qwen3.6-35B-A3B"

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


# ── Response processing ───────────────────────────────────────────────────────
def strip_thinking(text: str) -> str:
    """Remove <think>...</think> blocks. Used when we only need the action."""
    return THINK_RE.sub("", text).strip()


def extract_thinking(text: str) -> str:
    """Extract the content of the thinking block for logging."""
    m = THINK_RE.search(text)
    return m.group(1).strip() if m else ""


def process_response(response, preserve_thinking: bool = True) -> dict:
    """
    Process a chat completion response from Qwen3.6.

    Handles two output formats:
    1. Tool call via the API's tool_calls field (when --tool-call-parser is active)
    2. Tool call embedded in the message content as JSON

    Args:
        response: The OpenAI-compatible completion response
        preserve_thinking: If True, keep thinking content in the output for
            the next turn's KV cache benefit

    Returns:
        dict with thinking, tool_calls, final_answer, has_tool_calls, is_terminal
    """
    message = response.choices[0].message

    # Path 1: Tool calls in the structured field (preferred -- requires tool-call-parser flag)
    if message.tool_calls:
        tool_calls = [
            {
                "name": tc.function.name,
                "arguments": json.loads(tc.function.arguments),
                "call_id": tc.id,
            }
            for tc in message.tool_calls
        ]
        thinking = extract_thinking(message.content or "")
        return {
            "thinking": thinking if preserve_thinking else "",
            "tool_calls": tool_calls,
            "final_answer": "",
            "has_tool_calls": True,
            "is_terminal": False,
        }

    # Path 2: Tool calls embedded in content text (fallback)
    content = message.content or ""
    tool_calls = []
    for raw in TOOL_CALL_RE.findall(content):
        try:
            tool_calls.append(json.loads(raw.strip()))
        except json.JSONDecodeError:
            pass  # Skip malformed tool-call JSON

    thinking = extract_thinking(content)
    final_answer = strip_thinking(TOOL_CALL_RE.sub("", content))
    return {
        "thinking": thinking if preserve_thinking else "",
        "tool_calls": tool_calls,
        "final_answer": final_answer,
        "has_tool_calls": len(tool_calls) > 0,
        "is_terminal": len(tool_calls) == 0 and bool(final_answer),
    }


# ── Core agent loop ───────────────────────────────────────────────────────────
async def run_github_agent(task: str, repo: str, max_turns: int = 20):
    """
    Run the GitHub developer assistant agent.

    Connects to filesystem and GitHub MCP servers, discovers their tools,
    and runs the Qwen3.6 agent loop until the task is complete or max_turns is reached.
    """
    # Start both MCP servers and establish sessions
    fs_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-filesystem", "."],
    )
    gh_params = StdioServerParameters(
        command="npx",
        args=["-y", "@modelcontextprotocol/server-github"],
        env={
            **os.environ,
            # The reference GitHub server looks for this variable name
            "GITHUB_PERSONAL_ACCESS_TOKEN": os.environ.get("GITHUB_TOKEN", ""),
        },
    )

    async with (
        stdio_client(fs_params) as (fs_read, fs_write),
        ClientSession(fs_read, fs_write) as fs_session,
        stdio_client(gh_params) as (gh_read, gh_write),
        ClientSession(gh_read, gh_write) as gh_session,
    ):
        # Initialize both sessions
        await fs_session.initialize()
        await gh_session.initialize()

        # Discover all available tools from both servers
        fs_tools_result = await fs_session.list_tools()
        gh_tools_result = await gh_session.list_tools()

        # Build the OpenAI-format tool list for the model
        all_tools = []
        tool_to_session = {}  # Maps tool name to the MCP session that owns it
        for session, listing in ((fs_session, fs_tools_result), (gh_session, gh_tools_result)):
            for tool in listing.tools:
                all_tools.append({
                    "type": "function",
                    "function": {
                        "name": tool.name,
                        "description": tool.description,
                        "parameters": tool.inputSchema,
                    },
                })
                tool_to_session[tool.name] = session

        print(f"Tools available: {len(all_tools)} ({len(fs_tools_result.tools)} filesystem, "
              f"{len(gh_tools_result.tools)} GitHub)")

        # Build conversation history
        system_prompt = f"""You are a senior software engineer with access to the repository {repo}.
Use the available tools to investigate issues, read code, write fixes, and create pull requests.
Think step by step. Read before you modify. Minimal changes only."""

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": task},
        ]

        # ── Agent loop ─────────────────────────────────────────────────────────
        for turn in range(max_turns):
            print(f"\n[Turn {turn + 1}]")

            # Call the model
            response = await client.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=all_tools if all_tools else None,
                tool_choice="auto",
                # Thinking mode sampling params from the official best practices
                temperature=0.6,
                top_p=0.95,
                max_tokens=4096,
                extra_body={
                    # top_k and min_p are not OpenAI SDK arguments, so they
                    # travel in extra_body alongside the server-specific flag
                    "top_k": 20,
                    "min_p": 0.0,
                    # preserve_thinking keeps reasoning context across turns
                    # for KV cache efficiency on long agent sessions
                    "preserve_thinking": True,
                },
            )
            message = response.choices[0].message
            result = process_response(response, preserve_thinking=True)

            if result["thinking"]:
                print(f"[thinking] {result['thinking'][:200]}...")

            # Terminal state -- agent has produced a final answer
            if result["is_terminal"]:
                print(f"\n[DONE]\n{result['final_answer']}")
                return result["final_answer"]

            # Tool call state -- execute each tool and inject results
            if result["has_tool_calls"]:
                # Append the assistant's message with tool calls to history
                assistant_msg = {"role": "assistant", "content": message.content or ""}
                if message.tool_calls:
                    assistant_msg["tool_calls"] = [tc.model_dump() for tc in message.tool_calls]
                messages.append(assistant_msg)

                for call in result["tool_calls"]:
                    tool_name = call["name"]
                    tool_args = call.get("arguments", {})
                    call_id = call.get("call_id", "")
                    print(f"[tool] {tool_name}({json.dumps(tool_args)[:80]}...)")

                    session = tool_to_session.get(tool_name)
                    if not session:
                        result_content = f"Error: tool '{tool_name}' not found"
                    else:
                        try:
                            tool_result = await session.call_tool(tool_name, tool_args)
                            # Join text parts; fall back to str() for non-text content
                            result_content = "\n".join(
                                getattr(part, "text", str(part)) for part in tool_result.content
                            )
                            # Truncate very long results to protect the context budget
                            if len(result_content) > 12000:
                                result_content = result_content[:12000] + "\n...[truncated]"
                        except Exception as e:
                            result_content = f"Error: {e}"

                    print(f"[result] {result_content[:150]}...")
                    messages.append({
                        "role": "tool",
                        "content": result_content,
                        "tool_call_id": call_id,
                        "name": tool_name,
                    })

        print(f"[WARNING] max_turns ({max_turns}) reached without terminal state")


# ── Entry point ───────────────────────────────────────────────────────────────
if __name__ == "__main__":
    asyncio.run(run_github_agent(
        task=(
            "Find the open issue about the login endpoint returning 200 for invalid tokens. "
            "Read src/auth.py and tests/test_auth.py to understand the bug. "
            "Fix the verify_token function and open a pull request with your changes."
        ),
        repo="myorg/my-api-project",
    ))
```
How to run:
```bash
python github_agent_raw.py
```
The raw SDK path gives you what Qwen-Agent abstracts away: you can see every tool call, every result, and every message injected into the conversation history. The `tool_to_session` routing dict is the key mechanism; it maps each tool name to the MCP session that owns it, so the agent can call any tool from any connected server without knowing which server provides it.
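One limitation of a plain name-to-session dict is that two servers exposing a tool with the same name silently overwrite each other. The sketch below shows one possible way to handle that by prefixing only the colliding names. The `build_routing_table` helper, the `<server>__<tool>` naming scheme, and the tool lists are all hypothetical; they are not part of MCP or of the servers used above.

```python
def build_routing_table(servers: dict[str, list[str]]) -> dict[str, tuple[str, str]]:
    """Map a model-facing tool name to (server_name, original_tool_name).

    Names that appear on more than one server are exposed as "<server>__<tool>";
    unique names are left untouched.
    """
    seen: dict[str, int] = {}
    for tool_names in servers.values():
        for name in tool_names:
            seen[name] = seen.get(name, 0) + 1

    table: dict[str, tuple[str, str]] = {}
    for server, tool_names in servers.items():
        for name in tool_names:
            exposed = f"{server}__{name}" if seen[name] > 1 else name
            table[exposed] = (server, name)
    return table


# Hypothetical tool lists with one collision ("search")
routing = build_routing_table({
    "filesystem": ["read_file", "search"],
    "github": ["create_pull_request", "search"],
})

for exposed, (server, original) in routing.items():
    print(f"{exposed:<22} -> {server}.{original}")
```

The model sees the exposed name, and the client looks up the owning server and the original name before sending the call, so the server never needs to know about the prefix.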
# Writing a Custom MCP Server
Pre-built MCP servers handle the filesystem and GitHub. When you need something that does not exist yet (querying an internal database, wrapping a CI/CD API, running code analysis tools), you write an MCP server. Here is a complete `code_quality` server that exposes ruff and pytest as MCP tools.
```python
# code_quality_server.py
# A custom MCP server exposing code quality tools to Qwen3.6.
#
# Prerequisites:
#   pip install mcp ruff pytest pytest-json-report
#
# How to run standalone (for testing):
#   python code_quality_server.py
#
# To add to the Qwen-Agent config:
#   "code_quality": {
#       "command": "python",
#       "args": ["/absolute/path/to/code_quality_server.py"]
#   }

import json
import os
import subprocess
import sys
import tempfile

from mcp.server.fastmcp import FastMCP

# FastMCP is a high-level MCP server framework -- reduces boilerplate significantly
mcp = FastMCP("code_quality")


@mcp.tool()
def run_linter(file_path: str, fix: bool = False) -> str:
    """
    Run ruff linter on a Python file and return structured lint results.

    Use this before modifying a file to understand its current quality state,
    and after making changes to verify the fix did not introduce new issues.

    Args:
        file_path: Absolute or relative path to the Python file to lint.
        fix: If true, automatically fix safe issues in place.

    Returns:
        JSON string with issues list, issue count, and whether auto-fix ran.
    """
    # sys.executable runs ruff from the same environment as this server
    cmd = [sys.executable, "-m", "ruff", "check", file_path, "--output-format=json"]
    if fix:
        cmd.append("--fix")

    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except subprocess.TimeoutExpired:
        return json.dumps({"error": "Linter timed out after 30s", "file": file_path})

    # ruff returns exit code 1 when issues are found -- not an error
    try:
        issues = json.loads(result.stdout) if result.stdout.strip() else []
    except json.JSONDecodeError:
        # Non-JSON output usually means ruff is missing or crashed
        return json.dumps({
            "error": "Could not parse ruff output -- is ruff installed? (pip install ruff)",
            "file": file_path,
            "stderr": result.stderr[:1000],
        })

    formatted = [
        {
            "line": issue.get("location", {}).get("row", 0),
            "col": issue.get("location", {}).get("column", 0),
            "code": issue.get("code", ""),
            "message": issue.get("message", ""),
            "fix_available": issue.get("fix") is not None,
        }
        for issue in issues
        if isinstance(issue, dict)
    ]
    return json.dumps({
        "file": file_path,
        "issues": formatted,
        "total_issues": len(formatted),
        "fixed": "auto-fix applied" if fix else "no auto-fix",
    }, indent=2)


@mcp.tool()
def run_tests(target: str, verbose: bool = False) -> str:
    """
    Run pytest on a module or directory and return structured pass/fail results.

    Use this after writing a fix to verify the fix makes failing tests pass
    without breaking other tests.

    Args:
        target: Path to the test file or directory to run (e.g. tests/, tests/test_auth.py)
        verbose: If true, include full pytest output in the result.

    Returns:
        JSON string with pass count, fail count, failure details, and duration.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        # pytest-json-report writes its report to a file, so use a temp path
        report_path = os.path.join(tmp_dir, "report.json")
        cmd = [
            sys.executable, "-m", "pytest", target,
            "--json-report", f"--json-report-file={report_path}", "-q",
        ]
        if verbose:
            cmd.append("-v")

        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
        except subprocess.TimeoutExpired:
            return json.dumps({"error": f"Tests timed out after 120s for target: {target}"})

        try:
            with open(report_path, encoding="utf-8") as fh:
                report = json.load(fh)
        except (OSError, json.JSONDecodeError):
            # Fallback: return raw output if the JSON report is not available
            # (for example when pytest or pytest-json-report is not installed)
            return json.dumps({
                "target": target,
                "stdout": result.stdout[:3000],
                "stderr": result.stderr[:1000],
                "exit_code": result.returncode,
            })

    summary = report.get("summary", {})
    failures = [
        {
            "test": t["nodeid"],
            "message": str(t.get("call", {}).get("longrepr", ""))[:500],
        }
        for t in report.get("tests", [])
        if t.get("outcome") == "failed"
    ]
    return json.dumps({
        "target": target,
        "passed": summary.get("passed", 0),
        "failed": summary.get("failed", 0),
        "errors": summary.get("error", 0),
        "total": summary.get("total", 0),
        "duration": report.get("duration", 0),
        "failures": failures,
        "stdout": result.stdout[:2000] if verbose else "",
    }, indent=2)


if __name__ == "__main__":
    mcp.run(transport="stdio")
```
Add it to either agent implementation's server config:
```python
# In the Qwen-Agent MCP_SERVERS dict:
"code_quality": {
    "command": "python",
    "args": ["/absolute/path/to/code_quality_server.py"]
}

# In the raw SDK, add a third StdioServerParameters:
cq_params = StdioServerParameters(
    command="python",
    args=["/absolute/path/to/code_quality_server.py"],
)
```
Test the server standalone before connecting the agent:
```bash
# Test the server in MCP inspector mode
npx @modelcontextprotocol/inspector python code_quality_server.py
# Opens a browser UI where you can call run_linter and run_tests directly
```
# Tuning Thinking Mode and Preserving Reasoning
The thinking mode decision affects latency significantly enough that it is worth treating as an explicit architecture choice, not an afterthought.
In thinking mode, Qwen3.6 generates a chain-of-thought reasoning trace inside `<think>` tags before producing its action. For a 5-step agent task, that trace adds 1,000 to 5,000 tokens per turn depending on task complexity. Those tokens take time to generate and consume context budget.
When that cost is worth paying:
- Planning steps where the agent decides what to do next.
- Debugging sessions where the problem is genuinely ambiguous.
- Multi-file refactoring where the agent needs to reason about side effects across files.
In these cases, the reasoning trace catches errors before they become tool calls with wrong arguments.
When it is not worth paying: mechanical tool-call loops where each step is unambiguous, such as list directory → read file → write file → commit. The model does not need to think hard about these steps. Non-thinking mode is faster and typically produces output of the same quality on them.
Switch modes per request, not globally:
```python
# Thinking mode (planning, debugging, complex multi-file tasks)
THINKING_PARAMS = {
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
}

# Non-thinking mode (mechanical loops, quick status checks)
# Pass enable_thinking=False in the chat template, or use the system prompt:
# add "/no_think" to the system prompt to suppress thinking mode.
NON_THINKING_PARAMS = {
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
    "min_p": 0.0,
}
```
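A per-request switch can be as small as a lookup keyed on the kind of step. This is a sketch under stated assumptions: the step labels (`plan`, `debug`, `refactor`) and the `sampling_for` helper are hypothetical, and the sampling values simply repeat the two parameter sets given in this section.

```python
THINKING = {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0}
NON_THINKING = {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0}

# Hypothetical step labels; an orchestrator would assign these to each request
REASONING_STEPS = {"plan", "debug", "refactor"}


def sampling_for(step_kind: str) -> tuple[bool, dict]:
    """Return (enable_thinking, sampling_params) for one request."""
    thinking = step_kind in REASONING_STEPS
    # Copy so callers can modify the result without touching the shared defaults
    return thinking, dict(THINKING if thinking else NON_THINKING)


for kind in ("plan", "read_file", "debug", "commit"):
    enabled, params = sampling_for(kind)
    print(f"{kind:<10} thinking={enabled!s:<5} temperature={params['temperature']}")
```

With vLLM or SGLang, the boolean is commonly forwarded as `extra_body={"chat_template_kwargs": {"enable_thinking": enabled}}`; check the documentation of your serving version, since that wiring is an assumption here rather than something this article's code relies on.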
The `preserve_thinking` flag, the Qwen3.6-specific capability that keeps reasoning context across turns, directly affects inference efficiency when prefix caching is active. Here is why it matters in practice: in a 10-turn agent session, each turn shares a prefix of the conversation history with the previous one. When `preserve_thinking=True`, the full reasoning trace from prior turns stays in the history, so the prompt for each new turn begins with exactly the tokens the server has already processed. The KV cache on the server side recognizes the shared prefix across turns and avoids recomputing it. The effective tokens-per-second rate for long sessions is meaningfully higher than without it, particularly when serving infrastructure like SGLang with `--enable-prefix-caching` is running.
The practical rule: use `preserve_thinking=True` for agent sessions that will run for more than 5 turns. Use `preserve_thinking=False` (or non-thinking mode) for single-turn queries and short pipelines where the overhead is a waste.
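The cache effect can be illustrated with a toy comparison of token sequences. This is a simplified model, not how any particular server implements prefix caching: real caches work on token IDs, often at block granularity, and the placeholder "tokens" below are hypothetical.

```python
def shared_prefix_len(previous: list[str], current: list[str]) -> int:
    """Count how many leading tokens two sequences have in common."""
    length = 0
    for a, b in zip(previous, current):
        if a != b:
            break
        length += 1
    return length


# What the server has already processed after turn 1:
# the prompt, the generated reasoning trace, and the tool call.
cached = ["sys", "user", "<think>", "plan-a", "plan-b", "</think>", "call:list_issues"]

# Turn 2 prompt with the reasoning kept: the old sequence plus the tool result
kept = cached + ["tool:result"]

# Turn 2 prompt with the reasoning stripped from history
stripped = ["sys", "user", "call:list_issues", "tool:result"]

reused_kept = shared_prefix_len(cached, kept)
reused_stripped = shared_prefix_len(cached, stripped)

print(f"thinking kept:     {reused_kept} of {len(kept)} prompt tokens reusable")
print(f"thinking stripped: {reused_stripped} of {len(stripped)} prompt tokens reusable")
```

In this toy model, stripping the trace shortens the prompt but breaks the match at the first removed token, so everything after that point has to be recomputed. Keeping the trace makes the prompt longer but almost entirely reusable.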
# Conclusion
Qwen3.6-35B-A3B's MoE architecture gives you 35B model quality at 3B activation cost. Its 262k context window gives you room to hold an entire code review session in context. Its explicit training on MCP-based agentic benchmarks means it knows how to use tools correctly, not just call them.
MCP provides the connective tissue. Define a tool once as an MCP server. Every Qwen3.6 session and every other MCP-compatible model can discover and call it without custom glue. The GitHub and filesystem servers in this article are two of hundreds of pre-built servers in the MCP ecosystem. The custom `code_quality` server shows the pattern for anything that does not exist yet.
The GitHub developer assistant in this article is one application of the pattern. The same architecture (local model, MCP tools, and agentic loop) works for a research assistant that searches academic databases and drafts literature reviews, a DevOps agent that reads CloudWatch logs and opens incident tickets, or a data pipeline agent that reads SQL schemas, writes transformation code, and validates outputs. The MCP ecosystem is growing fast. The local model capability is already there.
Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.
















