grow more complex, conventional logging and monitoring fall short. What teams really need is observability: the ability to trace agent decisions, evaluate response quality automatically, and detect drift over time, without writing and maintaining large amounts of custom evaluation and telemetry code.
Teams therefore need to adopt the right observability platform and integrate their application with it with minimal overhead to their functional code, so they can focus on the core task of building and improving the agents' orchestration. In this article, I'll demonstrate how to set up an open-source AI observability platform to do the following using a minimal-code approach:
- LLM-as-a-Judge: Configure pre-built evaluators to score responses for Correctness, Relevance, Hallucination and more. Display scores across runs with detailed logs and analytics.
- Testing at scale: Set up datasets to store regression test cases for measuring accuracy against expected ground-truth responses. Proactively detect LLM and agent drift.
- MELT data: Monitor metrics (latency, token usage, model drift), events (API calls, LLM calls, tool usage), and logs (user interaction, tool execution, agent decision making) with detailed traces, all without writing detailed telemetry and instrumentation code.
We will be using Langfuse for observability. It is open-source and framework-agnostic, and works with popular orchestration frameworks and LLM providers.
Multi-agent application
For this demonstration, I've attached the LangGraph code of a Customer Service application. The application accepts tickets from the user, classifies each as Technical, Billing or Both using a Triage agent, and then routes it to the Technical Support agent, the Billing Support agent, or both. A Finalizer agent then synthesizes the responses from both agents into a coherent, more readable format. The flowchart is as follows:

The code is attached here:
# --------------------------------------------------
# 0. Load .env
# --------------------------------------------------
from dotenv import load_dotenv
load_dotenv(override=True)

# --------------------------------------------------
# 1. Imports
# --------------------------------------------------
import os
from typing import TypedDict

from langgraph.graph import StateGraph, END
from langchain_openai import AzureChatOpenAI
from langfuse import Langfuse
from langfuse.langchain import CallbackHandler

# --------------------------------------------------
# 2. Langfuse Client (WORKING CONFIG)
# --------------------------------------------------
langfuse = Langfuse(
    host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)
langfuse_callback = CallbackHandler()
os.environ["LANGGRAPH_TRACING"] = "false"

# --------------------------------------------------
# 3. Azure OpenAI Setup
# --------------------------------------------------
llm = AzureChatOpenAI(
    azure_deployment=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    temperature=0.2,
    callbacks=[langfuse_callback],  # 🔑 enables token usage tracking
)

# --------------------------------------------------
# 4. Shared State
# --------------------------------------------------
class AgentState(TypedDict, total=False):
    ticket: str
    category: str
    technical_response: str
    billing_response: str
    final_response: str

# --------------------------------------------------
# 5. Agent Definitions
# --------------------------------------------------
def triage_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="triage_agent",
        input={"ticket": state["ticket"]},
    ) as span:
        span.update_trace(name="Customer Service Query - LangGraph Demo")
        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "Classify the query as one of: "
                    "Technical, Billing, Both. "
                    "Respond with only the label."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])
        raw = response.content.strip().lower()
        if "both" in raw:
            category = "Both"
        elif "technical" in raw:
            category = "Technical"
        elif "billing" in raw:
            category = "Billing"
        else:
            category = "Technical"  # ✅ safe fallback
        span.update(output={"raw": raw, "category": category})
        return {"category": category}

def technical_support_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="technical_support_agent",
        input={
            "ticket": state["ticket"],
            "category": state.get("category"),
        },
    ) as span:
        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "You are a technical support specialist. "
                    "Provide a clear, step-by-step solution."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])
        answer = response.content
        span.update(output={"technical_response": answer})
        return {"technical_response": answer}

def billing_support_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="billing_support_agent",
        input={
            "ticket": state["ticket"],
            "category": state.get("category"),
        },
    ) as span:
        response = llm.invoke([
            {
                "role": "system",
                "content": (
                    "You are a billing support specialist. "
                    "Answer clearly about payments, invoices, or accounts."
                ),
            },
            {"role": "user", "content": state["ticket"]},
        ])
        answer = response.content
        span.update(output={"billing_response": answer})
        return {"billing_response": answer}

def finalizer_agent(state: dict) -> dict:
    with langfuse.start_as_current_observation(
        as_type="span",
        name="finalizer_agent",
        input={
            "ticket": state["ticket"],
            "technical": state.get("technical_response"),
            "billing": state.get("billing_response"),
        },
    ) as span:
        parts = [
            f"Technical:\n{state['technical_response']}"
            for k in ["technical_response"]
            if state.get(k)
        ] + [
            f"Billing:\n{state['billing_response']}"
            for k in ["billing_response"]
            if state.get(k)
        ]
        if not parts:
            final = "Error: No agent responses available."
        else:
            response = llm.invoke([
                {
                    "role": "system",
                    "content": (
                        "Combine the following agent responses into ONE clear, professional, "
                        "customer-facing answer. Do not mention agents or internal labels. "
                        f"Answer the user's query: '{state['ticket']}'."
                    ),
                },
                {"role": "user", "content": "\n\n".join(parts)},
            ])
            final = response.content
        span.update(output={"final_response": final})
        return {"final_response": final}

# --------------------------------------------------
# 6. LangGraph Construction
# --------------------------------------------------
builder = StateGraph(AgentState)
builder.add_node("triage", triage_agent)
builder.add_node("technical", technical_support_agent)
builder.add_node("billing", billing_support_agent)
builder.add_node("finalizer", finalizer_agent)
builder.set_entry_point("triage")

# Conditional routing
builder.add_conditional_edges(
    "triage",
    lambda state: state["category"],
    {
        "Technical": "technical",
        "Billing": "billing",
        "Both": "technical",
        "__default__": "technical",  # ✅ never dead-end
    },
)

# Sequential routing decision
builder.add_conditional_edges(
    "technical",
    lambda state: state["category"],
    {
        "Both": "billing",  # Continue to billing if Both
        "__default__": "finalizer",
    },
)
builder.add_edge("billing", "finalizer")
builder.add_edge("finalizer", END)
graph = builder.compile()

# --------------------------------------------------
# 9. Main
# --------------------------------------------------
if __name__ == "__main__":
    print("===============================================")
    print(" Conditional Multi-Agent Support System (Ready)")
    print("===============================================")
    print("Enter 'exit' or 'quit' to stop the program.\n")
    while True:
        # Get user input for the ticket
        ticket = input("Enter your support query (ticket): ")
        # Check for exit command
        if ticket.lower() in ["exit", "quit"]:
            print("\nExiting the support system. Goodbye!")
            break
        if not ticket.strip():
            print("Please enter a non-empty query.")
            continue
        try:
            # --- Run the graph with the user's ticket ---
            result = graph.invoke(
                {"ticket": ticket},
                config={"callbacks": [langfuse_callback]},
            )
            # --- Print Results ---
            category = result.get("category", "N/A")
            print(f"\n✅ Triage Classification: **{category}**")
            # Check which agents were executed based on the presence of a response
            executed_agents = []
            if result.get("technical_response"):
                executed_agents.append("Technical")
            if result.get("billing_response"):
                executed_agents.append("Billing")
            print(f"🛠️ Agents Executed: {', '.join(executed_agents) if executed_agents else 'None (Triage Failed)'}")
            print("\n================ FINAL RESPONSE ================\n")
            print(result["final_response"])
            print("\n" + "=" * 60 + "\n")
        except Exception as e:
            # This is important for debugging: print the exception type and message
            print(f"\nAn error occurred during processing ({type(e).__name__}): {e}")
            print("\nPlease try another query.")
            print("\n" + "=" * 60 + "\n")
Observability Configuration
To set up Langfuse, go to https://cloud.langfuse.com/ and create an account on a billing tier of your choice (a Hobby tier with generous limits is available), then set up a Project. In the project settings, you can generate the public and secret keys that must be provided at the beginning of the code. You also need to add the LLM connection, which will be used for the LLM-as-a-Judge evaluation.
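Before instrumenting the application, it is worth confirming that the credentials are picked up correctly. The short sketch below is a minimal check, assuming the keys are stored in a local .env file under the same variable names the application code reads; it initializes the client and verifies authentication:
import os
from dotenv import load_dotenv
from langfuse import Langfuse

# Expects LANGFUSE_PUBLIC_KEY and LANGFUSE_SECRET_KEY in a local .env file
load_dotenv(override=True)

langfuse = Langfuse(
    host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)

# auth_check() returns True when the keys and host are valid
print("Langfuse connection OK:", langfuse.auth_check())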

LLM-as-a-Judge setup
This is the core of the performance evaluation setup for agents. Here you can configure various pre-built evaluators from the Evaluator Library, which score responses on criteria such as Conciseness, Correctness, Hallucination, Answer Critic and so on. These should suffice for most use cases; otherwise, custom evaluators can be set up as well. Here is a view of the Evaluator Library:

Select the evaluator you wish to use, say Relevance. You can choose to run it on new or existing traces or on dataset runs. In addition, review the evaluation prompt to ensure it satisfies your evaluation objective. Most importantly, the query, generation and other variables should be correctly mapped to their source (usually the Input and Output of the application trace). In our case, these are the ticket entered by the user and the response generated by the Finalizer agent, respectively. In addition, for dataset runs, you can compare the generated responses to the ground-truth responses stored as expected outputs (explained in the next sections).
Here is the configuration for the 'GT Accuracy' evaluation I set up for new dataset runs, together with the variable mapping. The evaluation prompt preview is also shown. Most of the evaluators score within a range of 0 to 1:


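To make the variable mapping concrete, here is a rough sketch of the kind of judge prompt such an evaluator runs. The wording and the GT_ACCURACY_PROMPT name below are illustrative only, not Langfuse's built-in template; the {{query}}, {{generation}} and {{ground_truth}} placeholders stand for the mapped trace input, trace output and dataset expected output:
# Illustrative sketch of a ground-truth accuracy judge prompt.
# The {{...}} placeholders correspond to the variable mapping configured in the UI.
GT_ACCURACY_PROMPT = """
You are an impartial evaluator.

User query: {{query}}
Generated answer: {{generation}}
Expected (ground-truth) answer: {{ground_truth}}

Score how accurately the generated answer matches the expected answer
on a scale from 0 (completely wrong) to 1 (fully accurate),
and briefly explain your reasoning.
"""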
For the customer service demo, I've configured three evaluators: Relevance and Conciseness, which run on all new traces, and GT Accuracy, which is deployed for dataset runs only.

Datasets setup
Create a dataset to use as a test-case repository. Here, you can store test cases with the input query and the ideal expected response. To create the dataset, there are three choices: create one record at a time, upload a CSV of queries and expected responses, or, quite conveniently, add inputs and outputs directly from application traces whose responses have been judged to be of good quality by human experts. The same repository can also be seeded from the SDK, as sketched below.
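This is a minimal sketch of SDK-based seeding, assuming the dataset is named 'Regression' to match the one used later; the example ticket and expected output are made up for illustration:
from langfuse import Langfuse

langfuse = Langfuse()  # reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY (and optionally LANGFUSE_HOST) from the environment

# Create (or reuse) the regression test-case repository
langfuse.create_dataset(name="Regression")

# Add one test case: the input query and the ideal expected response
langfuse.create_dataset_item(
    dataset_name="Regression",
    input={"ticket": "My invoice shows a duplicate charge and the app will not open."},
    expected_output="Apologize, confirm the duplicate charge will be refunded, and give steps to update or reinstall the app.",
)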
Here is the dataset I've created for the demo. The records are a mix of technical, billing and 'Both' queries, and I've created all of them from application traces:

That's it! The configuration is done and we are ready to run observability.
Observability Results
The Langfuse Home page is a dashboard of several useful charts. It shows the count of execution traces, scores and averages at a glance, traces over time, model usage and cost, and so on.

MELT data
The most useful observability data is available under the 'Tracing' option, which displays summarized and detailed views of all executions. Here is a view of the dashboard showing the time, name, input, output and the important latency and token-usage metrics. Note that for every agent execution of our application, two evaluation traces are generated, one for each of the Conciseness and Relevance evaluators we set up.


Let's look at the details of one of the executions of the Customer Service application. On the left panel, the agent flow is depicted both as a tree and as a flowchart. It shows the LangGraph nodes (agents) and the LLM calls along with the token usage. If our agents had tool calls or human-in-the-loop steps, they would be depicted here as well. Note that the evaluation scores for Conciseness and Relevance are also shown at the top, which are 0.40 and 1 respectively for this run. Clicking on them shows the reasoning for the score and a link to the evaluator trace.
On the right, for every agent, LLM and tool call, we can see the input and the generated output. For instance, here we see that the query was classified as 'Both', and accordingly the left panel shows that both the Technical and Billing Support agents were called, which confirms our flow is working as expected.

At the top of the right-hand panel is the 'Add to datasets' button. At any step of the tree, clicking this button opens a panel like the one shown below, where you can add the input and output of that step directly to a test dataset created in the previous section. This is a useful feature for human experts to add frequently occurring user queries and good responses to the dataset during normal agent operations, thereby building a regression test repository with minimal effort. Later, when there is a major upgrade or release of the application, the regression dataset can be run and the generated outputs scored against the expected outputs (ground truth) recorded here, using the 'GT Accuracy' evaluator we created during the LLM-as-a-Judge setup. This helps to detect LLM drift (or agent drift) early and take corrective steps.

Here is one of the evaluation traces (Conciseness) for this application trace. The evaluator provides the reasoning for the score of 0.4 it assigned to this response.

Scores
The Scores option in Langfuse provides a list of all the evaluation runs from the various active evaluators along with their scores. More pertinent is the Analytics dashboard, where two scores can be selected and metrics such as mean and standard deviation, along with trend lines, can be viewed.


Regression testing
With Datasets, we are ready to run regression testing using the test-case repository of queries and expected outputs. We have saved four queries in our Regression dataset, with a mix of technical, billing and 'Both' queries.
For this, we can run the attached code, which fetches the relevant dataset and runs the experiment. All the test runs are logged along with their average scores. We can view the result of a particular test, with Conciseness, GT Accuracy and Relevance scores for each test case, in a single dashboard. And as needed, the detailed trace can be accessed to see the reasoning behind a score.
You can view the code here:
from langfuse import Langfuse
from langchain_openai import AzureChatOpenAI
from dotenv import load_dotenv
import os

# Initialize clients
load_dotenv(override=True)

langfuse = Langfuse(
    host="https://cloud.langfuse.com",
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
)

llm = AzureChatOpenAI(
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT_NAME"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION", "2025-01-01-preview"),
    temperature=0.2,
)

# Define your task function: for each dataset item, produce the output to be scored.
# Note: this task calls the LLM directly; to regression-test the full agent flow,
# the task could invoke the compiled LangGraph instead.
def my_task(*, item, **kwargs):
    question = item.input["ticket"]
    response = llm.invoke([{"role": "user", "content": question}])
    raw = response.content.strip().lower()
    return raw

# Get the dataset from Langfuse
dataset = langfuse.get_dataset("Regression")

# Run the experiment directly on the dataset
result = dataset.run_experiment(
    name="Production Model Test",
    description="Monthly evaluation of our production model",
    task=my_task,  # see above for the task definition
)

# Use the format method to display results
print(result.format())


Key Takeaways
- AI observability doesn't have to be code-heavy. Most evaluation, tracing, and regression-testing capabilities for LLM agents can be enabled through configuration rather than custom code, significantly reducing development and maintenance effort.
- Rich evaluation workflows can be defined declaratively. Capabilities such as LLM-as-a-Judge scoring (relevance, conciseness, hallucination, ground-truth accuracy), variable mapping, and evaluation prompts are configured directly in the observability platform, without writing bespoke evaluation logic.
- Datasets and regression testing are configuration-first features. Test-case repositories, dataset runs, and ground-truth comparisons can be set up and reused through the UI or simple configuration, allowing teams to run regression tests across agent versions with minimal additional code.
- Full MELT observability comes "out of the box." Metrics (latency, token usage, cost), events (LLM and tool calls), logs, and traces are automatically captured and correlated, avoiding the need for manual instrumentation across agent workflows.
- Minimal instrumentation, maximum visibility. With lightweight SDK integration, teams gain deep visibility into multi-agent execution paths, evaluation results, and performance trends, freeing developers to focus on agent logic rather than observability plumbing.
Conclusion
As LLM agents become more complex, observability is no longer optional. Without it, multi-agent systems quickly turn into black boxes that are difficult to evaluate, debug, and improve.
An AI observability platform shifts this burden away from developers and application code. Using a minimal-code, configuration-first approach, teams can enable LLM-as-a-Judge evaluation, regression testing, and full MELT observability without building and maintaining custom pipelines. This not only reduces engineering effort but also accelerates the path from prototype to production.
By adopting an open-source, framework-agnostic platform like Langfuse, teams gain a single source of truth for agent performance, making AI systems easier to trust, evolve, and operate at scale.
Want to know more? The Customer Service agentic application presented here follows a manager-worker architecture pattern, which doesn't work in CrewAI. Check out how observability helped me fix this well-known issue with CrewAI's manager-worker hierarchical process, by tracing agent responses at each step and refining them to get the orchestration to work as it should. Full analysis here: Why CrewAI's Manager-Worker Architecture Fails — and How to Fix It
Connect with me and share your comments at www.linkedin.com/in/partha-sarkar-lets-talk-AI
All images and data used in this article are synthetically generated. Figures and code created by me.