Artificial intelligence chatbots can be too chatty when answering questions about government services, swamping accurate information, and make errors if told to be more concise, according to research.
The Open Data Institute (ODI) tested 11 large language models (LLMs) on more than 22,000 questions, comparing their responses to answers based on material from the official GOV.UK website. Its researchers judged the LLM output on verbosity, accuracy, and how often the models refused to answer.
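To picture the kind of scoring that setup implies, here is a minimal sketch in Python. The heuristics are deliberately naive stand-ins of my own, not the ODI's actual judging method, which the summary does not spell out:

```python
# Toy evaluation over a batch of model answers, scoring the three metrics
# the ODI reports: verbosity, accuracy, and refusal rate. The marker list
# and overlap threshold are illustrative assumptions, not the study's method.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i don't know")

def is_refusal(answer: str) -> bool:
    # Flag answers that decline to respond.
    return any(marker in answer.lower() for marker in REFUSAL_MARKERS)

def is_accurate(answer: str, reference: str) -> bool:
    # Naive token-overlap check against a GOV.UK-based reference answer;
    # a real judge would be far stricter than this.
    ref_tokens = set(reference.lower().split())
    ans_tokens = set(answer.lower().split())
    return len(ref_tokens & ans_tokens) / max(len(ref_tokens), 1) > 0.6

def evaluate(answers: list[str], references: list[str]) -> dict:
    # Aggregate the three headline metrics over the batch.
    n = len(answers)
    return {
        "mean_verbosity": sum(len(a.split()) for a in answers) / n,
        "accuracy": sum(is_accurate(a, r) for a, r in zip(answers, references)) / n,
        "refusal_rate": sum(is_refusal(a) for a in answers) / n,
    }
```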
They found that models often waffled, burying the key facts or going beyond authoritative government information, while telling them to be more concise reduced their accuracy.
“Verbosity is a known behaviour of LLMs – they are prone to ‘word salad’ responses that make them harder to use and reduce their reliability,” the researchers wrote in a summary [PDF].
Some, including Anthropic’s Claude 4.5 Haiku, were more verbose than others.
The researchers added that LLMs are good at combining material from multiple sources, which is useful in some situations but makes errors more likely in this one. They recommended that users be informed about the risks and about where to find authoritative information.
The ODI research found that while models often answered correctly, they made errors inconsistently and unpredictably. ChatGPT-OSS-20B said someone would only be eligible for Guardian’s Allowance – a benefit paid to people caring for a child whose parents have died – if the child themselves had died.
Llama 3.1 8B advised that a court order was required to add an ex-partner’s name to a child’s birth certificate, when it actually just requires re-registration of the birth, and Qwen3-32B wrongly said the £500 Sure Start Maternity Grant is available in Scotland.
The researchers observed models attempting to answer almost every question asked, regardless of whether or not they were capable of doing so accurately. They described this failure to refuse to answer as “a dangerous trait” since it could lead people to act on misinformation.
Smaller, cheaper-to-run LLMs can deliver comparable results to large closed-source ones such as OpenAI’s ChatGPT 4.1, the ODI said. This showed the need for flexibility in adopting AI and for avoiding long-term contracts that lock organizations into using specific suppliers.
“If language models are to be used safely in citizen-facing services, we need to understand where the technology can be trusted and where it cannot,” said ODI director of research Professor Elena Simperl. “That means being open about uncertainty, keeping answers tightly focused on authoritative sources such as GOV.UK, and addressing the high levels of inconsistency seen in current systems.”
The research used CitizenQuery-UK, a set of 22,066 synthetically generated questions citizens might ask, with corresponding answers based on GOV.UK material, which the ODI has released on the Hugging Face platform.
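Readers who want to examine the questions themselves should be able to pull the set with the Hugging Face datasets library, along these lines; note the repository ID below is an assumption, not a confirmed path:

```python
# Sketch of loading CitizenQuery-UK with the Hugging Face `datasets` library.
# The repo ID is a guess; check the ODI's Hugging Face page for the actual
# path and field names before running this.
from datasets import load_dataset

dataset = load_dataset("theodi/CitizenQuery-UK")  # hypothetical repo ID
print(dataset)  # shows splits, columns, and row counts
```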
In December, the Government Digital Service said it planned to add a chatbot to its GOV.UK app in early 2026, followed by its website. Since then, the government has said it will work with supplier Anthropic to build such a service for jobseekers, and the Department for Work and Pensions is experimenting with one for Universal Credit claimants. ®