You’ve built a complex LLM application that responds to user queries about a specific domain. You’ve spent days setting up the complete pipeline, from refining your prompts to adding context retrieval, chains, tools and finally presenting the output. However, after deployment, you realize that the application’s responses are missing the mark: either you aren’t satisfied with their quality, or they take an exorbitant amount of time to arrive. Whether the problem is rooted in your prompts, your retrieval, API calls, or elsewhere, monitoring and observability can help you sort it out.
In this tutorial, we’ll start by learning the basics of LLM monitoring and observability. Then, we’ll explore the open-source ecosystem, closing our discussion with Langfuse. Finally, we’ll implement monitoring and observability of a Python-based LLM application using Langfuse.
What is Monitoring and Observability?
Monitoring and observability are crucial concepts in maintaining the health of any IT system. While the terms ‘monitoring’ and ‘observability’ are often clubbed together, they represent slightly different concepts.
According to IBM’s definition, monitoring is the process of collecting and analyzing system data to track performance over time. It relies on predefined metrics to detect anomalies or potential failures. Common examples include tracking a system’s CPU and memory usage and alerting when certain thresholds are breached.
Observability provides a deeper understanding of the system’s internal state based on external outputs. It allows you to diagnose and understand why something is happening, not just that something is wrong. For example, observability allows you to trace inputs and outputs through various parts of the system to spot where a bottleneck is occurring.
The above definitions are also valid in the realm of LLM applications. It is through monitoring and observability that we can trace the internal states of an LLM application, such as how a user query is processed through various modules (e.g., retrieval, generation) and what the associated latencies and costs are.

Here are some key terms used in monitoring and observability:
Telemetry: Telemetry is a broad term that encompasses collecting data from your application while it is running and processing it to understand the behavior of the application.
Instrumentation: Instrumentation is the process of adding code to your application to collect telemetry data. For LLM applications, this means adding hooks at various key points to capture internal states, such as API calls to the LLM or the retriever’s outputs.
Trace: A trace, a direct consequence of instrumentation, captures the detailed execution journey of a request through the entire application. This encompasses the input/output at each key point and the corresponding time taken at each point. Each trace is made up of a sequence of spans.
Observation: Each trace is made up of one or more observations, which can be of type Span, Event or Generation.
Span: A span is a unit of work or operation, which describes the process being carried out at each key point.
Generation: A generation is a special kind of span that tracks the input request sent to the LLM and its output response.
Logs: Logs are time-stamped records of events and interactions within the LLM application.
Metrics: Metrics are numerical measurements that provide aggregate insights into the LLM’s behavior and performance, such as hallucination rate or answer relevancy.
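To make these terms concrete, here is a minimal sketch in plain Python of how a trace might collect timed spans. The Trace and Span classes and the span context manager are hypothetical names used for illustration only; they are not part of Langfuse or any other SDK.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str            # the operation being measured, e.g. "retrieval"
    kind: str = "span"   # "span", "event" or "generation"
    duration_s: float = 0.0
    output: object = None


@dataclass
class Trace:
    name: str
    spans: list = field(default_factory=list)

    @contextmanager
    def span(self, name, kind="span"):
        # Instrumentation hook: time the wrapped block and record it.
        record = Span(name=name, kind=kind)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record.duration_s = time.perf_counter() - start
            self.spans.append(record)


trace = Trace(name="user-query")
with trace.span("retrieval") as s:
    s.output = ["doc-1", "doc-2"]   # stand-in for retrieved context
with trace.span("llm-call", kind="generation") as s:
    s.output = "Islamabad"          # stand-in for the model response

for s in trace.spans:
    print(f"{s.kind:<10} {s.name:<10} {s.duration_s:.6f}s")
```

A real SDK does the same bookkeeping for you and ships the records to a server instead of printing them.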

Why is LLM Monitoring and Observability Important?
As LLM applications become increasingly complex, LLM monitoring and observability can play a crucial role in optimizing application performance. Here are some reasons why it is important:
Reliability: LLM applications are critical to organizations; performance degradation can directly impact their businesses. Monitoring ensures that the application is performing within acceptable limits in terms of quality, latency, uptime, etc.
Debugging: A complex LLM application can be unpredictable; it can produce erroneous responses or encounter errors. Monitoring and observability can help identify problems in the application by sifting through the complete lifecycle of each request and pinpointing the root cause.
User Experience: Monitoring user experience and feedback is vital for LLM applications that directly interact with the customer base. This allows organizations to enhance user experience by monitoring user conversations and making informed decisions. Most importantly, it allows collection of users’ feedback to improve the model and downstream processes.
Bias and Fairness: LLMs are trained on publicly available data and therefore often internalize the possible bias in that data. This might cause them to produce offensive or harmful information. Observability can help in mitigating such responses through proper corrective measures.
Cost Management: Monitoring can help you track and optimize costs incurred during regular operations, such as the LLM’s API costs per token. You can also set up alerts in case of overuse.
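As a rough illustration of cost tracking with an alert, the sketch below estimates the cost of each call from its token counts and compares the total to a budget. The prices and the budget are hypothetical placeholders, not real provider rates, and request_cost and over_budget are illustrative helper names.

```python
# Hypothetical prices in USD per 1,000 tokens; replace with your provider's rates.
PRICE_PER_1K = {"input": 0.0005, "output": 0.0015}


def request_cost(input_tokens, output_tokens, prices=PRICE_PER_1K):
    """Return the estimated cost of one LLM call in USD."""
    return (input_tokens / 1000) * prices["input"] + (output_tokens / 1000) * prices["output"]


def over_budget(costs, daily_budget_usd):
    """Return True when the summed costs exceed the daily budget."""
    return sum(costs) > daily_budget_usd


costs = [request_cost(1200, 300), request_cost(800, 150)]
print(f"total: ${sum(costs):.6f}")
print("alert!" if over_budget(costs, daily_budget_usd=0.001) else "within budget")
```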
Tools for Monitoring and Observability
There are many excellent tools and libraries available for enabling monitoring and observability of LLM applications. A lot of these tools are open source, offering free self-hosting solutions on local infrastructure as well as enterprise-level deployment on their respective cloud servers. Each of these tools offers common features such as tracing, token count, latencies, total requests, time-based filtering, etc. Apart from this, each solution has its own set of distinct features and strengths.
Here, we’re going to name a few open-source tools that offer free self-hosting solutions.
Langfuse: A popular open-source LLM monitoring tool, which is both model and framework agnostic. It offers a wide range of monitoring options using client SDKs purpose-built for Python and JavaScript/TypeScript.
Arize Phoenix: Another popular tool that offers both self-hosting and Phoenix Cloud deployment. Phoenix offers SDKs for Python and JavaScript/TypeScript.
AgentOps: AgentOps is a well-known solution that tracks LLM outputs and retrievers, allows benchmarking, and ensures compliance. It offers integration with several LLM providers.
Grafana: A classic and widely used monitoring tool that can be combined with OpenTelemetry to provide detailed LLM tracing and monitoring.
Weave: Weights & Biases’ Weave is another LLM monitoring and experimentation tool for LLM-based applications, which offers both self-managed and dedicated cloud environments. The client SDKs are available in Python and TypeScript.
Introducing Langfuse
Note: Langfuse should not be confused with LangSmith, which is a proprietary monitoring and observability tool, developed and maintained by the LangChain team. You can learn more about the differences here.
Langfuse offers a wide variety of features such as LLM observability, tracing, LLM token and cost monitoring, prompt management, datasets and LLM security. Additionally, Langfuse offers evaluation of LLM responses using various techniques such as LLM-as-a-Judge and user feedback. Moreover, Langfuse offers an LLM playground to its premium users, which allows you to tweak your LLM prompts and parameters on the spot and watch how the LLM responds to those changes. We’ll discuss more details later in our tutorial.
Langfuse’s solution to LLM monitoring and observability consists of two parts:
- Langfuse SDKs
- Langfuse Server
The Langfuse SDKs are the coding side of Langfuse, available for various platforms, which allow you to enable instrumentation in your application’s code. They are nothing more than a few lines of code that can be used appropriately in your application’s codebase.
The Langfuse server, on the other hand, is the UI-based dashboard, along with other underlying services, which can be used to log, view and persist all the traces and metrics. The Langfuse dashboard is usually accessible through any modern web browser.
Before setting up the dashboard, it’s important to note that Langfuse offers three different ways of hosting dashboards, which are:
- Self-hosting (local)
- Managed hosting (using Langfuse’s cloud infrastructure)
- On-premises deployment
The managed and on-premises deployments are beyond the scope of this tutorial. You can visit Langfuse’s official documentation to get all the relevant information.
A self-hosting solution, as the name implies, lets you simply run an instance of Langfuse on your own machine (e.g., PC, laptop, virtual machine or web service). However, there is a catch in this simplicity. The Langfuse server requires a persistent Postgres database server to continuously maintain its state and data. This means that along with a Langfuse server, we also need to set up a Postgres server. But don’t worry, we’ve got things under control. You can either use a Postgres server hosted on any cloud service (such as Azure or AWS), or you can simply self-host it, just like the Langfuse service. Capiche?
How is Langfuse’s self-hosting done? Langfuse offers several ways to do that, such as using docker/docker-compose or Kubernetes and/or deploying on cloud servers. For the time being, let’s stick to leveraging docker commands.
Setting Up a Langfuse Server
Now, it’s time to get hands-on experience with setting up a Langfuse dashboard for an LLM application and logging traces and metrics onto it. When we say Langfuse server, we mean the Langfuse dashboard and the other services that allow the traces to be logged, viewed and persisted. This requires a basic understanding of docker and its related concepts. You can go through this tutorial if you are not already familiar with docker.
Using docker-compose
The most convenient and quickest way to set up Langfuse on your own machine is to use a docker-compose file. This is just a two-step process, which involves cloning Langfuse on your local machine and simply invoking docker-compose.
Step 1: Clone the Langfuse repository:
$ git clone https://github.com/langfuse/langfuse.git
$ cd langfuse
Step 2: Start all services
$ docker compose up
And that’s it! Go to your web browser and open http://localhost:3000 to see the Langfuse UI running. Also appreciate the fact that docker-compose takes care of the Postgres server automatically.
From this point, we can safely move on to the section on setting up the Python SDK and enabling instrumentation in our code.
Using docker
The docker setup of the Langfuse server is similar to the docker-compose implementation, with an obvious difference: we’ll set up both containers (Langfuse and Postgres) separately and connect them using an internal network. This might be helpful in scenarios where docker-compose is not the appropriate first choice, maybe because you already have your Postgres server running, or you want to run both services separately for more control, such as hosting both services separately on Azure Web App Services due to resource limitations.
Step 1: Create a custom network
First, we need to set up a custom bridge network, which will allow both containers to communicate with each other privately.
$ docker network create langfuse-network
This command creates a network by the name langfuse-network. Feel free to change it according to your preferences.
Step 2: Set up a Postgres service
We’ll start by running the Postgres container, since the Langfuse service depends on it, using the following command:
$ docker run -d \
    --name postgres-db \
    --restart always \
    -p 5432:5432 \
    --network langfuse-network \
    -v database_data:/var/lib/postgresql/data \
    -e POSTGRES_USER=postgres \
    -e POSTGRES_PASSWORD=postgres \
    -e POSTGRES_DB=postgres \
    postgres:latest
Explanation:
This command will run a docker image of postgres:latest as a container with the name postgres-db, on a network named langfuse-network, and publish this service on port 5432 of your local machine. For persistence (i.e. to keep data intact for future use), it will create a named volume called database_data and mount it at /var/lib/postgresql/data inside the container. Additionally, it will set three essential environment variables for the Postgres server’s superuser: POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB.
Step 3: Set up the Langfuse service
$ docker run -d \
    --name langfuse-server \
    --network langfuse-network \
    -p 3000:3000 \
    -e DATABASE_URL=postgresql://postgres:postgres@postgres-db:5432/postgres \
    -e NEXTAUTH_SECRET=mysecret \
    -e SALT=mysalt \
    -e ENCRYPTION_KEY=0000000000000000000000000000000000000000000000000000000000000000 \
    -e NEXTAUTH_URL=http://localhost:3000 \
    langfuse/langfuse:2
Explanation:
Likewise, this command will run a docker image of langfuse/langfuse:2 in detached mode (-d), as a container with the name langfuse-server, on the same network called langfuse-network, and publish this service on port 3000. It will also assign values to the required environment variables. The NEXTAUTH_URL must point to the URL where the langfuse-server will be deployed.
ENCRYPTION_KEY must be 256 bits, i.e. 64 characters in hex format. You can generate this on Linux via:
$ openssl rand -hex 32
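If openssl is not at hand, Python’s standard library can produce a key of the same shape. This is a minimal sketch of an alternative, not a step from the Langfuse documentation:

```python
import secrets

# 32 random bytes rendered as hex give 64 characters, i.e. 256 bits.
encryption_key = secrets.token_hex(32)
print(len(encryption_key))
```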
The DATABASE_URL is an environment variable that defines the complete database path and credentials. The general format for a Postgres URL is:
postgresql://[POSTGRES_USER[:POSTGRES_PASSWORD]@][host[:port]]/[POSTGRES_DB]
Here, the host is the host name (i.e. container name) of our PostgreSQL server or its IP address.
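To see how the pieces of this URL fit together, here is a small sketch that assembles one from its parts and parses it back using only the standard library. build_database_url is a hypothetical helper written for this illustration; the values mirror the docker commands above.

```python
from urllib.parse import quote, urlsplit


def build_database_url(user, password, host, port, db):
    """Assemble a Postgres connection URL, percent-encoding the credentials."""
    return f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}/{db}"


url = build_database_url("postgres", "postgres", "postgres-db", 5432, "postgres")
print(url)

# Parse it back to confirm which part is the host, the port and the database.
parts = urlsplit(url)
print(parts.hostname, parts.port, parts.path.lstrip("/"))
```

Percent-encoding matters when a password contains characters such as @ or :, which would otherwise be read as URL delimiters.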
Finally, go to your web browser and open http://localhost:3000 to verify that the Langfuse server is available.
Configuring Langfuse Dashboard
Once you have successfully set up the Langfuse server, it’s time to configure the Langfuse dashboard before you can start tracing application data.
Go to http://localhost:3000 in your web browser, as explained in the previous section. You must create a new organization, members and a project under which you will be tracing and logging all your metrics. Follow the process on the dashboard, which takes you through all the steps.
For example, here we have set up an organization by the name of datamonitor, added a member by the name data-user1 with the “Owner” role, and a project named data-demo. This will lead us to the following screen:

This screen displays both the public and secret API keys, which will be used while setting up tracing using SDKs; keep them saved for future use. And with this step, we are finally done with configuring the Langfuse server. The only other task left is to start the instrumentation process on the code side of our application.
Enabling Langfuse Tracing using SDKs
Langfuse offers an easy way to enable tracing of LLM applications with minimal lines of code. As mentioned earlier, Langfuse offers tracing solutions for various languages, frameworks and LLM models, such as Langchain, LlamaIndex, OpenAI and others. You can even enable Langfuse tracing in serverless functions such as AWS Lambda.
But before we trace our application, let’s actually create a sample application using OpenAI’s framework. We’ll create a very simple chat completion application using OpenAI’s gpt-4o-mini for demonstration purposes only.
First, install the required packages:
$ pip install openai python-dotenv
import os
import openai
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('OPENAI_KEY','')
client = openai.OpenAI(api_key=api_key)
country = 'Pakistan'
query = f"Name the capital of {country} in one word only"
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": query}],
    max_tokens=100,
)
print(response.choices[0].message.content)
Output:
Islamabad.
Let’s now enable Langfuse tracing in the given code. You have to make minor adjustments to the code, beginning with installing the langfuse package.
Install all the required packages once again:
$ pip install langfuse openai --upgrade
The code with Langfuse enabled looks like this:
import os
#import openai
from langfuse.openai import openai
from dotenv import load_dotenv
load_dotenv()
api_key = os.getenv('OPENAI_KEY','')
client = openai.OpenAI(api_key=api_key)
LANGFUSE_SECRET_KEY="sk-lf-..."
LANGFUSE_PUBLIC_KEY="pk-lf-..."
LANGFUSE_HOST="http://localhost:3000"
os.environ['LANGFUSE_SECRET_KEY'] = LANGFUSE_SECRET_KEY
os.environ['LANGFUSE_PUBLIC_KEY'] = LANGFUSE_PUBLIC_KEY
os.environ['LANGFUSE_HOST'] = LANGFUSE_HOST
country = 'Pakistan'
query = f"Name the capital of {country} in one word only"
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant"},
        {"role": "user", "content": query}],
    max_tokens=100,
)
print(response.choices[0].message.content)
As you can see, we have just replaced import openai with from langfuse.openai import openai to enable tracing.
If you now go to your Langfuse dashboard, you will observe traces of the OpenAI application.
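If no traces show up, a common first thing to check is whether the three Langfuse settings are actually present in the environment. The sketch below is a small pre-flight check; missing_langfuse_settings is a hypothetical helper written for this illustration and is not part of the Langfuse SDK.

```python
import os

REQUIRED = ("LANGFUSE_SECRET_KEY", "LANGFUSE_PUBLIC_KEY", "LANGFUSE_HOST")


def missing_langfuse_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


# Example with a plain dict standing in for the process environment.
example_env = {"LANGFUSE_PUBLIC_KEY": "pk-lf-...", "LANGFUSE_HOST": "http://localhost:3000"}
print(missing_langfuse_settings(example_env))
```

Calling missing_langfuse_settings() with no argument inspects the real process environment.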
A Complete End-to-End Example
Now let’s dive into enabling monitoring and observability on a complete LLM application. We’ll implement a RAG pipeline, which fetches relevant context from a vector database. We’re going to use ChromaDB as the vector database.
We’ll use the Langchain framework to build our RAG-based application (refer to the ‘basic LLM-RAG application’ figure above). You can learn Langchain by following this tutorial on how to build LLM applications with Langchain.
If you want to learn the basics of RAG, this tutorial can be a good starting point. As for the vector database, refer to this tutorial on setting up ChromaDB.
This section assumes that you have already set up and configured the Langfuse server on localhost, as done in the previous section.
Step 1: Installation and Setup
Install all required packages, including langchain, chromadb and langfuse.
$ pip install -U langchain-community langchain-openai chromadb langfuse beautifulsoup4 python-dotenv
Next, we import all the required packages and libraries:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from langfuse.callback import CallbackHandler
from dotenv import load_dotenv
load_dotenv()  # read OPENAI_API_KEY and other settings from the .env file
The load_dotenv function loads all environment variables that are stored in a .env file. Make sure that your OpenAI secret key is stored as OPENAI_API_KEY in the .env file.
Finally, we integrate Langfuse’s Langchain callback system to enable tracing in our application.
langfuse_handler = CallbackHandler(
    secret_key="sk-lf-...",
    public_key="pk-lf-...",
    host="http://localhost:3000"
)
Step 2: Set up Knowledge Base
To mimic a RAG system, we’ll:
- Scrape some insightful articles from the Confiz blogs section using WebBaseLoader
- Break them into smaller chunks using RecursiveCharacterTextSplitter
- Convert them into vector embeddings using OpenAI’s embeddings
- Ingest them into our Chroma vector database. This will serve as the knowledge base for our LLM to look up and answer user queries.
urls = [
    "https://www.confiz.com/blog/a-cios-guide-6-essential-insights-for-a-successful-generative-ai-launch/",
    "https://www.confiz.com/blog/ai-at-work-how-microsoft-365-copilot-chat-is-driving-transformation-at-scale/",
    "https://www.confiz.com/blog/setting-up-an-in-house-llm-platform-best-practices-for-optimal-performance/",
]
loader = WebBaseLoader(urls)
docs = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=20,
    length_function=len,
)
chunks = text_splitter.split_documents(docs)
# Create the vector store
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    persist_directory="chroma_db",
    collection_name="confiz_blog"
)
retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 3})
We’ve assumed a chunk size of 500 characters with an overlap of 20 characters in the Recursive Text Splitter (with length_function=len, sizes are measured in characters rather than tokens), which considers various factors before chunking at the given size. The vectordb object of ChromaDB is converted into a retriever object, allowing us to use it conveniently in the Langchain retrieval pipeline.
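To build intuition for what chunk size and overlap mean, here is a deliberately simplified sketch that cuts text into fixed-size character windows. It is not how RecursiveCharacterTextSplitter works internally (that splitter first tries to break on separators such as paragraphs and sentences); chunk_text is a hypothetical helper for illustration.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=20):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each new window starts chunk_overlap characters before the previous one ended.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "x" * 1200
pieces = chunk_text(sample)
print([len(p) for p in pieces])
```

The overlap means the last 20 characters of one chunk reappear at the start of the next, which helps avoid losing context at chunk boundaries.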
Step 3: Set up RAG pipeline
The next step is to set up the RAG chain, using the power of the LLM along with the knowledge base of the vector database to answer user queries. As before, we’ll use OpenAI’s gpt-4o-mini as our base model.
model = ChatOpenAI(
    model_name="gpt-4o-mini",
)
template = """
You are an AI assistant providing helpful information based on the given context.
Answer the question using only the provided context.
Context:
{context}
Question:
{question}
Answer:
"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
qa_chain = RetrievalQA.from_chain_type(
    llm=model,
    retriever=retriever,
    chain_type_kwargs={"prompt": prompt},
)
We’ve used RetrievalQA, which implements an end-to-end pipeline comprising document retrieval and the LLM’s question answering capability.
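Conceptually, with its default "stuff" chain type, such a chain places the retrieved chunks into the {context} slot of the prompt before calling the LLM. The sketch below imitates only that prompt-filling step in plain Python; build_prompt and the sample chunks are hypothetical and no Langchain code is involved.

```python
TEMPLATE = (
    "Answer the question using only the provided context.\n"
    "Context:\n{context}\n"
    "Question:\n{question}\n"
    "Answer:"
)


def build_prompt(question, retrieved_chunks):
    """Join retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(retrieved_chunks)
    return TEMPLATE.format(context=context, question=question)


prompt_text = build_prompt(
    "What is Langfuse?",
    ["Langfuse is an open-source LLM monitoring tool.", "It offers Python and JS/TS SDKs."],
)
print(prompt_text)
```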
Step 4: Run RAG pipeline
It’s time to run our RAG pipeline. Let’s come up with a few queries related to the articles ingested in ChromaDB and observe the LLM’s responses in the Langfuse dashboard.
queries = [
    "What are the ways to deal with compliance and security issues in generative AI?",
    "What are the key considerations for a successful generative AI launch?",
    "What are the key benefits of Microsoft 365 Copilot Chat?",
    "What are the best practices for setting up an in-house LLM platform?",
]
for query in queries:
    response = qa_chain.invoke({"query": query}, config={"callbacks": [langfuse_handler]})
    print(response)
    print('-'*60)
As you might have noticed, the callbacks argument passed to qa_chain is what gives Langfuse the ability to capture traces of the complete RAG pipeline. Langfuse supports various frameworks and LLM libraries, which can be found here.
Step 5: Observing the traces
Finally, it’s time to open the Langfuse dashboard running in the web browser and reap the fruits of our hard work. If you have followed our tutorial from the beginning, you will recall that we created a project named data-demo under the organization named datamonitor. On the landing page of your Langfuse dashboard, you will find this project. Click on ‘Go to project’ and you will find a dashboard with various panels such as traces, model costs, etc.

As seen, you can adjust the time window and add filters according to your needs. The cool part is that you don’t need to manually add the LLM’s description and input/output token costs to enable cost tracking; Langfuse automatically does it for you. But that’s not all; in the left bar, select Tracing > Traces to look at all the individual traces. Since we have asked four queries, we will observe four different traces, each representing the complete pipeline for one query.

Each trace is identified by an ID and a timestamp, and contains the corresponding latency and total cost. The usage column shows the total input and output token usage for each trace.
If you click on any of these traces, Langfuse will depict the complete picture of the underlying processes, such as the inputs and outputs of each stage, covering everything from retrieval to the LLM call and the generation. Insightful, isn’t it?

Evaluation Metrics
As a bonus feature, let’s also add our custom metrics related to the LLM’s responses on the same dashboard. On a self-hosted solution, just like the one we have implemented, this can be made possible by fetching all traces from the dashboard, applying customized evaluation on those traces and publishing the results back to the dashboard.
The evaluation can be applied by simply employing another LLM with suitable prompts. Otherwise, we can use evaluation frameworks such as DeepEval or promptfoo, which also use LLMs under the hood. We will go with DeepEval, which is an open-source framework developed to evaluate the responses of LLMs.
Let’s do this process in the following steps:
Step 1: Installation and Setup
First, we install the deepeval framework:
$ pip install deepeval
Next, we make the necessary imports:
from langfuse import Langfuse
from datetime import datetime, timedelta
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase
from dotenv import load_dotenv
load_dotenv()
Step 2: Fetching the traces from the dashboard
The first step is to fetch all the traces, within the given time window, from the running Langfuse server into our Python code.
langfuse_handler = Langfuse(
    secret_key="sk-lf-...",
    public_key="pk-lf-...",
    host="http://localhost:3000"
)
now = datetime.now()
five_am_today = datetime(now.year, now.month, now.day, 5, 0)
five_am_yesterday = five_am_today - timedelta(days=1)
traces_batch = langfuse_handler.fetch_traces(
    limit=5,
    from_timestamp=five_am_yesterday,
    to_timestamp=datetime.now()
).data
print(f"Traces in first batch: {len(traces_batch)}")
Note that we are using the same secret and public keys as before, since we are fetching the traces from our data-demo project. Also note that we are fetching traces from 5 am yesterday until the current time.
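The time-window arithmetic can be pulled out into a small function so that it is easy to check with a fixed clock. This is only a sketch; trace_window is a hypothetical helper name and the sample date is arbitrary.

```python
from datetime import datetime, timedelta


def trace_window(now, start_hour=5):
    """Return (start, end): start_hour on the previous day up to `now`."""
    start_today = datetime(now.year, now.month, now.day, start_hour, 0)
    return start_today - timedelta(days=1), now


start, end = trace_window(datetime(2025, 3, 1, 9, 30))
print(start.isoformat(), "->", end.isoformat())
```

Because timedelta handles calendar boundaries, the window is still correct on the first day of a month or year.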
Step 3: Applying Evaluation
Once we have the traces, we can apply various evaluation metrics such as bias, toxicity, hallucinations and relevance. For simplicity, let’s stick only to the AnswerRelevancyMetric metric.
def calculate_relevance(trace):
    relevance_model = 'gpt-4o-mini'
    relevancy_metric = AnswerRelevancyMetric(
        threshold=0.7,
        model=relevance_model,
        include_reason=True
    )
    test_case = LLMTestCase(
        input=trace.input['query'],
        actual_output=trace.output['result']
    )
    relevancy_metric.measure(test_case)
    return {"score": relevancy_metric.score, "reason": relevancy_metric.reason}
# Do this for each trace
for trace in traces_batch:
    try:
        relevance_measure = calculate_relevance(trace)
        langfuse_handler.score(
            trace_id=trace.id,
            name="relevance",
            value=relevance_measure['score'],
            comment=relevance_measure['reason']
        )
    except Exception as e:
        print(e)
        continue
In the above code snippet, we have defined the calculate_relevance function to calculate the relevance of a given trace using DeepEval’s standard metric. Then we loop over all the traces and calculate each trace’s relevance score individually. The langfuse_handler object takes care of logging that score back to the dashboard against each trace ID.
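Once scores are available, it can be handy to summarize them locally, for example to list the traces that fall below the 0.7 threshold used above. The sketch below works on a plain dictionary of hypothetical scores; summarize_scores is an illustrative helper and does not call Langfuse or DeepEval.

```python
from statistics import mean


def summarize_scores(scores, threshold=0.7):
    """Return the mean score and the trace IDs that fall below the threshold."""
    failing = [trace_id for trace_id, score in scores.items() if score < threshold]
    return {"mean": mean(scores.values()), "failing": failing}


# Hypothetical relevance scores keyed by trace ID.
example = {"trace-a": 0.9, "trace-b": 0.5, "trace-c": 0.7}
print(summarize_scores(example))
```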
Step 4: Observing the metrics
Now if you look at the same dashboard as earlier, the ‘Scores’ panel has been populated as well.

You’ll notice that the relevance score has been added to the individual traces as well.

You can also view the feedback provided by DeepEval for each trace individually.

This example showcases a simple way of logging evaluation metrics on the dashboard. Of course, there is more to it in terms of metrics calculation and handling, but let’s keep that for the future. Also importantly, you might wonder what the most appropriate way is to log evaluation metrics on the dashboard of a running application. For the self-hosting solution, a straightforward answer is to run the evaluation script as a cron job at specific times. For the enterprise version, Langfuse offers live evaluation metrics of the LLM responses, as they are populated on the dashboard.
Advanced Features
Langfuse offers many advanced features, such as:
Prompt Management
This allows management and versioning of prompts using the Langfuse dashboard UI. It enables users to keep an eye on evolving prompts as well as record all metrics against each version of the prompt. Additionally, it also supports a prompt playground to tweak prompts and model parameters and observe their effects on the overall LLM response, directly in the Langfuse UI.
Datasets
The datasets feature allows users to create a benchmark dataset to measure the performance of the LLM application against different model parameters and tweaked prompts. As new edge cases are reported, they can be directly fed into the existing datasets.
User Management
This feature allows organizations to track the costs and metrics associated with each user. This also means that organizations can trace the activity of each user, encouraging fair use of the LLM application.
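The idea of per-user tracking boils down to grouping trace-level numbers by a user identifier. As a rough sketch in plain Python, with hypothetical records and a hypothetical cost_per_user helper (Langfuse does this aggregation for you in the dashboard):

```python
from collections import defaultdict


def cost_per_user(records):
    """Sum the cost of each (user_id, cost) record by user."""
    totals = defaultdict(float)
    for user_id, cost in records:
        totals[user_id] += cost
    return dict(totals)


# Hypothetical (user_id, cost in USD) pairs taken from traces.
records = [("alice", 0.002), ("bob", 0.001), ("alice", 0.003)]
print(cost_per_user(records))
```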
Conclusion
In this tutorial, we have explored LLM monitoring and observability and their related concepts. We implemented monitoring and observability using Langfuse, an open-source framework offering free and enterprise solutions. Opting for the self-hosting solution, we set up the Langfuse dashboard using docker, along with a PostgreSQL server for persistence. We then enabled instrumentation in our sample LLM application using the Langfuse Python SDK. Finally, we observed all the traces in the dashboard and also performed evaluation on these traces using the DeepEval framework.
In a future tutorial, we could explore advanced features of the Langfuse framework or explore other open-source frameworks such as Arize Phoenix. We could also work on the deployment of the Langfuse dashboard on a cloud service such as Azure, AWS or GCP.
















