Google DeepMind unveils ‘superhuman’ AI system that excels in fact-checking, saving costs and improving accuracy

A new study from Google’s DeepMind research unit has found that an artificial intelligence system can outperform human fact-checkers when evaluating the accuracy of information generated by large language models.

The paper, titled “Long-form factuality in large language models” and published on the pre-print server arXiv, introduces a method called Search-Augmented Factuality Evaluator (SAFE). SAFE uses a large language model to break down generated text into individual facts, and then uses Google Search results to determine the accuracy of each claim.

“SAFE utilizes an LLM to break down a long-form response into a set of individual facts and to evaluate the accuracy of each fact using a multi-step reasoning process comprising sending search queries to Google Search and determining whether a fact is supported by the search results,” the authors explained.
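To make that pipeline concrete, here is a minimal sketch of a SAFE-style evaluator. The helpers `split_into_facts`, `rate_fact`, and the `llm` and `web_search` callables are illustrative stand-ins, not DeepMind’s actual implementation or a real search API.

```python
# Minimal sketch of a SAFE-style fact-checking pipeline (illustrative only).
# `llm` is any callable that maps a prompt string to a text completion;
# `web_search` is a hypothetical wrapper that returns search-result snippets.

def split_into_facts(llm, response: str) -> list[str]:
    """Ask the LLM to decompose a long-form response into atomic factual claims."""
    prompt = f"List each individual factual claim in the text below, one per line:\n{response}"
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]

def rate_fact(llm, web_search, fact: str, max_steps: int = 3) -> str:
    """Multi-step loop: issue search queries, then judge the claim against the evidence."""
    evidence: list[str] = []
    for _ in range(max_steps):
        query = llm(f"Write a search query to verify: {fact}\nEvidence so far: {evidence}")
        evidence.append(web_search(query))  # hypothetical search wrapper
    verdict = llm(
        f"Claim: {fact}\nEvidence: {evidence}\n"
        "Answer 'supported', 'not supported', or 'irrelevant'."
    )
    return verdict.strip().lower()

def safe_evaluate(llm, web_search, response: str) -> dict[str, str]:
    """Return a verdict for every atomic fact extracted from the response."""
    return {fact: rate_fact(llm, web_search, fact) for fact in split_into_facts(llm, response)}
```

The key design idea the paper describes is that both steps — splitting and rating — are delegated to the language model itself, with Google Search supplying the external evidence.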

‘Superhuman’ performance sparks debate

The researchers pitted SAFE against human annotators on a dataset of roughly 16,000 facts, finding that SAFE’s assessments matched the human ratings 72% of the time. Even more notably, in a sample of 100 disagreements between SAFE and the human raters, SAFE’s judgment was found to be correct in 76% of cases.
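For a back-of-the-envelope view of what those percentages mean, the sketch below works through the reported figures; the absolute fact count is an illustrative assumption, not a number taken from the paper.

```python
# Illustrative arithmetic behind the reported agreement figures.
total_facts = 16_000                       # approximate size of the annotated fact set
agreed = round(0.72 * total_facts)         # SAFE matched human labels ~72% of the time

disagreements_sampled = 100                # disagreement cases the authors re-examined
safe_correct = 76                          # SAFE judged correct in 76 of those cases
remainder = disagreements_sampled - safe_correct  # not necessarily all human wins

print(f"Agreed on ~{agreed:,} of {total_facts:,} facts")
print(f"On sampled disagreements, SAFE was correct in {safe_correct}; remainder: {remainder}")
```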


While the paper asserts that “LLM agents can achieve superhuman rating performance,” some experts are questioning what “superhuman” really means here.

Gary Marcus, a well-known AI researcher and frequent critic of overhyped claims, suggested on Twitter that in this case, “superhuman” may simply mean “better than an underpaid crowd worker, rather than a true human fact checker.”

“That makes the characterization misleading,” he said. “Like saying that 1985 chess software was superhuman.”

Marcus raises a valid point. To truly demonstrate superhuman performance, SAFE would need to be benchmarked against expert human fact-checkers, not just crowdsourced workers. The specific details of the human raters, such as their qualifications, compensation, and fact-checking process, are crucial for properly contextualizing the results.

Cost savings and benchmarking top models

One clear advantage of SAFE is cost: the researchers found that using the AI system was about 20 times cheaper than human fact-checkers. As the volume of information generated by language models continues to explode, having a cost-effective and scalable way to verify claims will be increasingly vital.

The DeepMind team used SAFE to evaluate the factual accuracy of 13 top language models across four families (Gemini, GPT, Claude, and PaLM-2) on a new benchmark called LongFact. Their results indicate that larger models generally produced fewer factual errors.

However, even the best-performing models generated a significant number of false claims. This underscores the risks of over-relying on language models that can fluently express inaccurate information. Automatic fact-checking tools like SAFE could play a key role in mitigating those risks.

Transparency and human baselines are essential

While the SAFE code and LongFact dataset have been open-sourced on GitHub, allowing other researchers to scrutinize and build upon the work, more transparency is still needed around the human baselines used in the study. Understanding the specifics of the crowdworkers’ backgrounds and process is essential for assessing SAFE’s capabilities in proper context.

As the tech giants race to develop ever more powerful language models for applications ranging from search to virtual assistants, the ability to automatically fact-check the outputs of these systems could prove pivotal. Tools like SAFE represent an important step toward building a new layer of trust and accountability.

However, it is vital that the development of such consequential technologies happens in the open, with input from a broad range of stakeholders beyond the walls of any one company. Rigorous, transparent benchmarking against human experts, not just crowdworkers, will be essential to measure true progress. Only then can we gauge the real-world impact of automated fact-checking on the fight against misinformation.
