LLMs are trained on a vast corpus of text data, where they, during their pre-training phase, essentially consume the entire internet. LLMs thrive when they have access to all the relevant data needed to respond to user questions correctly. However, in many cases, we limit the capabilities of our LLMs by not providing them with enough data. In this article, I’ll discuss why you should care about feeding your LLM more data, how to fetch this data, and specific applications.
I’ll also start with a new feature in my articles: writing out my main goal, what I want to achieve with the article, and what you should know after reading it. If this works well, I’ll start including it in each of my articles:
My goal for this article is to highlight the importance of providing LLMs with relevant data, and how you can feed it into your LLMs for improved performance.

You can also read my articles on How to Analyze and Optimize Your LLMs in 3 Steps and Document QA using Multimodal LLMs.
Why add more data to LLMs?
I’ll start the article off by stating why this is important. LLMs are incredibly data hungry, meaning they require a lot of data to work well. This is most evident in the pre-training corpus of LLMs, which consists of trillions of text tokens used to train the model.
However, the principle of utilizing a lot of data also applies to LLMs at inference time (when you use the LLM in production). You need to provide the LLM with all the data it needs to answer a user’s request.
In a lot of cases, you inadvertently reduce the LLM’s performance by not providing it with relevant information.
Imagine, for example, that you create a question answering system where users can upload files and talk to them. Naturally, you provide the text contents of each file so that the user can chat with the document; however, you could, for example, forget to add the filenames of the documents to the context the user is chatting with. This can hurt the LLM’s performance if some information is only present in the filename, or if the user references the filename in the chat (see the sketch after the list below). Some other specific LLM applications where additional data is useful are:
- Classification
- Information extraction
- Keyword search for finding relevant documents to feed to the LLM
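As a minimal sketch of the filename example above, here is how you could include each file’s name alongside its contents when building the chat context (the function and the file dict layout are my own illustration, not from any specific library):

```python
def build_context(files: list[dict]) -> str:
    """Assemble a chat context that contains each file's name, not just its text.

    Each file is assumed to be a dict like {"filename": ..., "content": ...}.
    """
    sections = []
    for file in files:
        # Prepending the filename lets the LLM answer questions that reference
        # the filename, or that depend on information only present in it.
        sections.append(f"Filename: {file['filename']}\n\n{file['content']}")
    return "\n\n---\n\n".join(sections)
```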
In the rest of the article, I’ll discuss where you can find such data, techniques for retrieving additional data, and some specific use cases for the data.
Data you already have available
In this section, I’ll discuss data you likely already have available in your application. One example is the analogy above, where you have a question answering system for files but forget to add the filename to the context. Some other examples are:
- File extensions (.pdf, .docx, .xlsx)
- Folder path (if the user uploaded a folder)
- Timestamps (required if, for example, a user asks about the most recent document)
- Page numbers (the user might ask the LLM to fetch specific information located on page 5)
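A minimal sketch of gathering this kind of metadata at upload time, using only the Python standard library (the dict layout is my own choice; page numbers would come from your document parser instead):

```python
from datetime import datetime, timezone
from pathlib import Path

def collect_file_metadata(path: str) -> dict:
    """Gather metadata you already have available for an uploaded file."""
    p = Path(path)
    stat = p.stat()
    return {
        "filename": p.name,
        "extension": p.suffix,    # e.g. ".pdf", ".docx", ".xlsx"
        "folder": str(p.parent),  # relevant if the user uploaded a whole folder
        "size_bytes": stat.st_size,
        "modified_at": datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc).isoformat(),
    }
```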

There are a ton of other such examples of data you likely already have available, or that you can quickly fetch and add to your LLM’s context.
The type of data you have available will vary widely from application to application. A lot of the examples I’ve provided in this article are tailored to text-based AI, since that’s the domain I spend the most time in. However, if you work more on, for example, visual AI or audio-based AI, I urge you to find similar examples in your own domain.
For visual AI, it could be:
- Location data for where the image/video was taken
- The filename of the image/video file
- The author of the image/video file
Or for audio AI, it could be:
- Metadata about who is speaking when
- Timestamps for each sentence
- Location data for where the audio was recorded
My point being: there is a plethora of data available out there; all you need to do is look for it and consider how it can be useful for your application.
Retrieving additional data
Often, the data you already have available is not enough. You want to provide your LLM with even more data to help it answer questions correctly, in which case you need to retrieve additional data. Naturally, since we are in the age of LLMs, we’ll utilize LLMs to fetch this data.
Retrieving information beforehand
The simplest approach is to fetch additional data before processing any live requests. For document AI, this means extracting specific information from documents while they are being processed. You might extract the type of document (legal document, tax document, or sales brochure) or specific information contained in the document (dates, names, locations, …).
The advantages of fetching the information beforehand are:
- Speed (in production, you only have to fetch the value from your database)
- Cost (you can take advantage of batch processing to reduce costs)
Today, fetching this kind of information is rather simple. You set up an LLM with a specific system prompt for information extraction, and feed the prompt along with the text into the LLM. The LLM then processes the text and extracts the relevant information for you. You might want to evaluate the performance of your information extraction, in which case you can read my article on Evaluating 5 Million LLM Requests with Automated Evals.
You likely also want to map out all the data points you intend to retrieve, for example:
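For a document AI system, a sketch of such a list could look like the following (the field names are illustrative, drawn from the document examples above; pick the ones your application actually needs):

```python
# Data points to extract from every document during ingestion.
METADATA_FIELDS = [
    "document_type",        # e.g. legal document, tax document, sales brochure
    "dates_mentioned",
    "names_mentioned",
    "locations_mentioned",
]
```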
Once you have created this list, you can retrieve all your metadata and store it in your database.
However, the main downside of fetching information beforehand is that you have to predetermine which information to extract. This is difficult in a lot of scenarios, in which case you can perform live information retrieval instead, which I cover in the next section.
On-demand information retrieval
If you can’t determine beforehand which information to retrieve, you can fetch it on demand. This means setting up a generic function that takes in a data point to extract and the text to extract it from. For example:
```python
import json

def retrieve_info(data_point: str, text: str) -> str:
    """Ask the LLM to extract a single data point from the given text."""
    prompt = f"""
    Extract the following data point from the text below and return it in a JSON object.
    Data Point: {data_point}
    Text: {text}
    Example JSON Output: {{"result": "example value"}}
    """
    # call_llm is assumed to send the prompt to your LLM and return its raw text response
    return json.loads(call_llm(prompt))["result"]
```
You define this function as a tool your LLM has access to, which it can call whenever it needs information. This is essentially how Anthropic has set up their deep research system, where they create one orchestrator agent that can spawn sub-agents to fetch additional information. Note that giving your LLM access to additional prompts like this can lead to a lot of token usage, so you should pay attention to your LLM’s token spend.
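A minimal sketch of exposing `retrieve_info` as a tool, using an Anthropic-style JSON-schema tool definition (the exact field names vary between providers, so treat this as an assumption rather than a specific API):

```python
retrieve_info_tool = {
    "name": "retrieve_info",
    "description": "Extract a specific data point from a piece of text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "data_point": {"type": "string", "description": "The data point to extract"},
            "text": {"type": "string", "description": "The text to extract it from"},
        },
        "required": ["data_point", "text"],
    },
}
# Pass this definition in your LLM call's tool list; when the model requests the
# tool, run retrieve_info(...) yourself and feed the result back to the model.
```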
Until now, I’ve discussed why you should utilize additional data and how to get hold of it. To round off the article, I’ll also show specific applications where this data improves LLM performance.
Metadata filtering search

My first example is that you can perform search with metadata filtering. Providing information such as:
- File type (pdf, xlsx, docx, …)
- File size
- Filename
can help your application when fetching relevant information, for example content retrieved to be fed into your LLM’s context, as when performing RAG. You can utilize the additional metadata to filter away irrelevant data.
A user might ask a question that pertains only to Excel documents. Using RAG to fetch chunks from files other than Excel documents is, therefore, a poor use of the LLM’s context window. You should instead filter the available chunks down to Excel documents only, and utilize those chunks to best answer the user’s query. You can learn more about handling LLM contexts in my article on building effective AI agents.
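As a sketch of this filtering step (the chunk layout and the `embed` and `similarity` helpers are hypothetical stand-ins for your embedding model and distance function):

```python
def search_chunks(query: str, chunks: list[dict], file_type: str, top_k: int = 5) -> list[dict]:
    """Filter chunks by file type before ranking, so chunks from irrelevant
    files never take up space in the LLM's context window.

    Each chunk is assumed to look like
    {"text": ..., "embedding": ..., "file_type": ...}.
    """
    candidates = [c for c in chunks if c["file_type"] == file_type]
    query_embedding = embed(query)
    candidates.sort(key=lambda c: similarity(query_embedding, c["embedding"]), reverse=True)
    return candidates[:top_k]
```

For the Excel example above, you would call `search_chunks(query, chunks, file_type="xlsx")` and only feed the returned chunks to the LLM.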
AI agent web search
Another example is when you ask your AI agent questions about recent events that occurred after the LLM’s pre-training cutoff. LLMs typically have a training data cutoff for their pre-training data, because the data has to be carefully curated, and keeping it fully up to date is difficult.
This presents a problem when users ask questions about recent history, for example recent events in the news. In this case, the AI agent answering the query needs access to an internet search (essentially performing information extraction on the internet). This is an example of on-demand information retrieval.
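Wiring this up follows the same pattern as the `retrieve_info` tool above: you expose a search function to the agent (`search_client` below is a hypothetical stand-in for whatever search API you use):

```python
def web_search(query: str) -> str:
    """Fetch up-to-date information that the LLM's pre-training data cannot contain.

    search_client.search is a hypothetical stand-in returning result snippets.
    """
    results = search_client.search(query, max_results=5)
    return "\n\n".join(result["snippet"] for result in results)

# Expose web_search as a tool, so the agent can call it whenever a question
# falls after its training cutoff.
```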
Conclusion
In this article, I’ve discussed how to significantly enhance your LLM by providing it with additional data. You can either find this data in your existing metadata (filenames, file sizes, location data), or you can retrieve it through information extraction (document type, names mentioned in a document, etc.). This information is often critical to an LLM’s ability to successfully answer user queries, and in many scenarios, the lack of this data essentially guarantees that the LLM will fail to answer a question correctly.
👉 Find me on socials:
🧑‍💻 Get in touch
✍️ Medium