Semantically Compress Text to Save On LLM Costs | by Lou Kratz

LLMs are great… if they can fit all of your data

Photo by Christopher Burns on Unsplash

Originally published at https://blog.developer.bazaarvoice.com on October 28, 2024.

Large language models are fantastic tools for unstructured text, but what if your text doesn’t fit in the context window? Bazaarvoice faced exactly this challenge when building our AI Review Summaries feature: millions of user reviews simply won’t fit into the context window of even newer LLMs and, even if they did, it would be prohibitively expensive.

In this post, I share how Bazaarvoice tackled this problem by compressing the input text without loss of semantics. Specifically, we use a multi-pass hierarchical clustering approach that lets us explicitly adjust the level of detail we want to lose in exchange for compression, regardless of the embedding model chosen. The final technique made our Review Summaries feature financially feasible and set us up to continue to scale our business in the future.

Bazaarvoice has been collecting user-generated product reviews for nearly 20 years, so we have a lot of data. These product reviews are completely unstructured, varying in length and content. Large language models are excellent tools for unstructured text: they can handle unstructured data and identify relevant pieces of information among distractors.

LLMs have their limitations, however, and one such limitation is the context window: how many tokens (roughly the number of words) can be put into the network at once. State-of-the-art large language models, such as Anthropic’s Claude version 3, have extremely large context windows of up to 200,000 tokens. This means you can fit small novels into them, but the internet is still a vast, ever-growing collection of data, and our user-generated product reviews are no different.

We hit the context window limit while building our Review Summaries feature, which summarizes all of the reviews of a specific product on our clients’ websites. Over the past 20 years, however, many products have garnered thousands of reviews that quickly overloaded the LLM context window. In fact, we even have products with millions of reviews that would require immense re-engineering of LLMs to be able to process in one prompt.

Even if it were technically feasible, the costs would be quite prohibitive. All LLM providers charge based on the number of input and output tokens. As we approach the context window limit for each product, of which we have millions, we can quickly run up cloud hosting bills in excess of six figures.
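To make that scale concrete, here is a back-of-the-envelope estimate. Only the 200,000-token window comes from the text above; the product count and the per-token price are hypothetical placeholders chosen for illustration, not Bazaarvoice's actual figures or any provider's real rate.

```python
# Rough input-token cost of filling the context window once per product.
CONTEXT_WINDOW_TOKENS = 200_000        # context window size cited above
NUM_PRODUCTS = 1_000_000               # ASSUMPTION: lower bound for "millions" of products
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # ASSUMPTION: illustrative USD price, not a real quote


def estimate_input_cost(tokens_per_product: int,
                        num_products: int,
                        price_per_million: float) -> float:
    """Return the total input cost in USD for one pass over every product."""
    total_tokens = tokens_per_product * num_products
    return total_tokens / 1_000_000 * price_per_million


cost = estimate_input_cost(CONTEXT_WINDOW_TOKENS, NUM_PRODUCTS,
                           PRICE_PER_MILLION_INPUT_TOKENS)
print(f"${cost:,.0f}")
```

Under these assumed numbers a single pass already lands in six figures, before counting output tokens or re-generating summaries as new reviews arrive.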

To ship Review Summaries despite these technical and financial limitations, we focused on a rather simple insight into our data: many reviews say the same thing. In fact, the whole idea of a summary relies on this: review summaries capture the recurring insights, themes, and sentiments of the reviewers. We realized that we can capitalize on this data duplication to reduce the amount of text we need to send to the LLM, saving us from hitting the context window limit and reducing the operating cost of our system.

To achieve this, we needed to identify segments of text that say the same thing. Such a task is easier said than done: often people use different words or phrases to express the same thing.

Fortunately, the task of identifying whether text is semantically similar has been an active area of research in the natural language processing field. The work by Agirre et al. 2013 (SEM 2013 shared task: Semantic Textual Similarity. In Second Joint Conference on Lexical and Computational Semantics) even published a human-labeled dataset of semantically similar sentences known as the STS Benchmark. In it, they ask humans to indicate if textual sentences are semantically similar or dissimilar on a scale of 1–5, as illustrated in the table below (from Cer et al., SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation):

The STSBenchmark dataset is often used to evaluate how well a text embedding model can associate semantically similar sentences in its high-dimensional space. Specifically, Pearson’s correlation is used to measure how well the embedding model represents the human judgements.

Thus, we can use such an embedding model to identify semantically similar phrases from product reviews, and then remove repeated phrases before sending them to the LLM.

Our approach is as follows:

First, product reviews are segmented into sentences.
An embedding vector is computed for each sentence using a network that performs well on the STS benchmark.
Agglomerative clustering is used on all embedding vectors for each product.
An example sentence (the one closest to the cluster centroid) is retained from each cluster to send to the LLM, and other sentences within each cluster are dropped.
Any small clusters are considered outliers, and those are randomly sampled for inclusion in the LLM prompt.
The number of sentences each cluster represents is included in the LLM prompt to ensure the weight of each sentiment is considered.
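The steps above can be sketched in a few dozen lines. This is a minimal, single-pass illustration under stated assumptions, not the production implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, the clustering is a naive centroid-linkage loop written out by hand, and the `threshold`, `min_cluster_size`, and `outlier_sample` parameters are names and values invented for this example.

```python
import math
import random
import re
from collections import Counter


def split_sentences(review: str) -> list[str]:
    # Naive splitter on ., ! and ?; a real system would likely use an NLP library.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", review) if s.strip()]


def toy_embed(sentence: str, vocab: list[str]) -> list[float]:
    # HYPOTHETICAL stand-in for an embedding model: unit-length word counts.
    counts = Counter(re.findall(r"[a-z']+", sentence.lower()))
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def cosine_distance(a: list[float], b: list[float]) -> float:
    # Inputs are assumed to be unit length, so the dot product is the cosine.
    return 1.0 - sum(x * y for x, y in zip(a, b))


def centroid(vectors: list[list[float]]) -> list[float]:
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    norm = math.sqrt(sum(v * v for v in mean)) or 1.0
    return [v / norm for v in mean]


def agglomerative(vectors: list[list[float]], threshold: float) -> list[list[int]]:
    # Start with one cluster per sentence and repeatedly merge the two clusters
    # whose centroids are closest, until no pair is within `threshold`.
    # Centroids are recomputed every pass: simple, but slow on large inputs.
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            ci = centroid([vectors[k] for k in clusters[i]])
            for j in range(i + 1, len(clusters)):
                cj = centroid([vectors[k] for k in clusters[j]])
                d = cosine_distance(ci, cj)
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters


def compress(reviews: list[str], threshold: float = 0.5, min_cluster_size: int = 2,
             outlier_sample: int = 1, seed: int = 0) -> list[tuple[str, int]]:
    """Return (representative sentence, number of sentences it stands for) pairs."""
    sentences = [s for r in reviews for s in split_sentences(r)]
    vocab = sorted({w for s in sentences for w in re.findall(r"[a-z']+", s.lower())})
    vectors = [toy_embed(s, vocab) for s in sentences]
    kept, outliers = [], []
    for members in agglomerative(vectors, threshold):
        if len(members) < min_cluster_size:
            outliers.extend(members)  # small clusters are treated as outliers
            continue
        c = centroid([vectors[k] for k in members])
        rep = min(members, key=lambda k: cosine_distance(vectors[k], c))
        kept.append((sentences[rep], len(members)))
    rng = random.Random(seed)
    for k in rng.sample(outliers, min(outlier_sample, len(outliers))):
        kept.append((sentences[k], 1))
    return kept


reviews = [
    "Great battery life. The screen is sharp.",
    "Battery life is great!",
    "The battery life is great. Shipping was slow.",
]
for sentence, count in compress(reviews):
    print(f"- {sentence} (represents {count} sentence(s))")
```

The printed lines are the kind of text that would go into the prompt: one representative per cluster plus its count, so the LLM can weigh a theme mentioned many times more heavily than a one-off remark.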

This may seem straightforward when written in a bulleted list, but there were some devils in the details we had to sort out before we could trust this approach.

First, we had to ensure the model we used effectively embedded text in a space where semantically similar sentences are close, and semantically dissimilar ones are far away. To do this, we simply used the STS benchmark dataset and computed the Pearson correlation for the models we wanted to consider. We use AWS as a cloud provider, so naturally we wanted to evaluate their Titan Text Embedding models.
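As a rough sketch of what such an evaluation involves: embed both sentences of each labeled pair, take the cosine similarity, and correlate those similarities with the human scores. The function names here (`pearson`, `cosine_similarity`, `sts_score`) and the `embed` callable are assumptions made for illustration; in practice `embed` would wrap a call to whichever embedding model is being evaluated.

```python
import math
from typing import Callable, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sts_score(embed: Callable[[str], Sequence[float]],
              pairs: Sequence[tuple[str, str, float]]) -> float:
    """Correlate embedding similarity with human similarity scores.

    `pairs` holds (sentence_a, sentence_b, human_score) triples, as in an
    STS-style dataset. `embed` is a placeholder for the model under test.
    """
    similarities = [cosine_similarity(embed(a), embed(b)) for a, b, _ in pairs]
    human_scores = [score for _, _, score in pairs]
    return pearson(similarities, human_scores)
```

A model whose cosine similarities rise and fall with the human ratings scores close to 1, which is the property the clustering step depends on.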

Below is a table showing the Pearson’s correlation on the STS Benchmark for different Titan Embedding models:
