Nearly 80% of Training Datasets May Be a Legal Hazard for Enterprise AI



A recent paper from LG AI Research suggests that supposedly ‘open’ datasets used for training AI models may be offering a false sense of security – finding that nearly four out of five AI datasets labeled as ‘commercially usable’ actually contain hidden legal risks.

Such risks range from the inclusion of undisclosed copyrighted material to restrictive licensing terms buried deep in a dataset’s dependencies. If the paper’s findings are accurate, companies relying on public datasets may need to reconsider their current AI pipelines, or risk legal exposure downstream.

The researchers propose a radical and potentially controversial solution: AI-based compliance agents capable of scanning and auditing dataset histories faster and more accurately than human lawyers.

The paper states:

‘This paper advocates that the legal risk of AI training datasets cannot be determined solely by reviewing surface-level license terms; a thorough, end-to-end analysis of dataset redistribution is essential for ensuring compliance.

‘Since such analysis is beyond human capabilities due to its complexity and scale, AI agents can bridge this gap by conducting it with greater speed and accuracy. Without automation, critical legal risks remain largely unexamined, jeopardizing ethical AI development and regulatory adherence.

‘We urge the AI research community to recognize end-to-end legal analysis as a fundamental requirement and to adopt AI-driven approaches as the viable path to scalable dataset compliance.’

Examining 2,852 popular datasets that appeared commercially usable based on their individual licenses, the researchers’ automated system found that only 605 (around 21%) were actually legally safe for commercialization once all their components and dependencies were traced.

The new paper is titled Do Not Trust Licenses You See — Dataset Compliance Requires Massive-Scale AI-Powered Lifecycle Tracing, and comes from eight researchers at LG AI Research.

Rights and Wrongs

The authors highlight the challenges faced by companies pushing forward with AI development in an increasingly uncertain legal landscape – as the former academic ‘fair use’ mindset around dataset training gives way to a fractured environment where legal protections are unclear and safe harbor is no longer guaranteed.

As one publication pointed out recently, companies are becoming increasingly defensive about the sources of their training data. Author Adam Buick comments*:

‘[While] OpenAI disclosed the main sources of data for GPT-3, the paper introducing GPT-4 revealed only that the data on which the model had been trained was a mixture of ‘publicly available data (such as internet data) and data licensed from third-party providers’.

‘The motivations behind this move away from transparency have not been articulated in any particular detail by AI developers, who in many cases have given no explanation at all.

‘For its part, OpenAI justified its decision not to release further details regarding GPT-4 on the basis of concerns regarding ‘the competitive landscape and the safety implications of large-scale models’, with no additional rationalization throughout the report.’

Transparency can be a disingenuous term – or simply a mistaken one; for instance, Adobe’s flagship Firefly generative model, trained on stock data that Adobe had the rights to exploit, supposedly offered customers reassurances about the legality of their use of the system. Later, some evidence emerged that the Firefly data pot had become ‘enriched’ with potentially copyrighted data from other platforms.

As we discussed earlier this week, there are growing initiatives designed to assure license compliance in datasets, including one that will only scrape YouTube videos with flexible Creative Commons licenses.

The problem is that the licenses in themselves may be erroneous, or granted in error, as the new research seems to indicate.

Examining Open Source Datasets

It is difficult to develop an evaluation system such as the authors’ NEXUS when the context is constantly shifting. Therefore the paper states that the NEXUS Data Compliance framework is based on ‘various precedents and legal grounds at this point in time’.

NEXUS utilizes an AI-driven agent called AutoCompliance for automated data compliance. AutoCompliance is comprised of three key modules: a navigation module for web exploration; a question-answering (QA) module for information extraction; and a scoring module for legal risk assessment.

AutoCompliance begins with a user-provided webpage. The AI extracts key details, searches for related resources, identifies license terms and dependencies, and assigns a legal risk score. Source: https://arxiv.org/pdf/2503.02784

These modules are powered by fine-tuned AI models, including the EXAONE-3.5-32B-Instruct model, trained on synthetic and human-labeled data. AutoCompliance also uses a database for caching results to enhance efficiency.

AutoCompliance starts with a user-provided dataset URL and treats it as the root entity, searching for its license terms and dependencies, and recursively tracing linked datasets to build a license dependency graph. Once all connections are mapped, it calculates compliance scores and assigns risk classifications.
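The recursive tracing step lends itself to a short illustration. Below is a minimal, self-contained sketch of how such a license dependency graph might be built; the paper does not release AutoCompliance’s code, so the REGISTRY lookup here is a toy stand-in for the web crawling and extraction its navigation and QA modules actually perform, and all names are invented.

```python
# Minimal sketch of recursive license-dependency tracing (illustrative only).
# REGISTRY stands in for the pages AutoCompliance's navigation and QA
# modules would crawl and extract; all names here are invented.

REGISTRY = {
    "dataset-a": {"license": "CC-BY-4.0",    "deps": ["dataset-b"]},
    "dataset-b": {"license": "CC-BY-NC-4.0", "deps": []},  # non-commercial
}

def build_license_graph(entity, graph=None, visited=None):
    """Recursively trace an entity's dependencies into a license graph."""
    if graph is None:
        graph, visited = {}, set()
    if entity in visited:          # avoid cycles and repeated lookups
        return graph
    visited.add(entity)
    record = REGISTRY[entity]      # stands in for live web extraction
    graph[entity] = record
    for dep in record["deps"]:
        build_license_graph(dep, graph, visited)
    return graph

print(build_license_graph("dataset-a"))
```

Here the root dataset looks permissive on its face, but the trace surfaces a non-commercial dependency – exactly the kind of buried restriction the paper argues surface-level license review misses.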

The Data Compliance framework outlined in the new work identifies various entity types involved in the data lifecycle, including datasets, which form the core input for AI training; data processing software and AI models, which are used to transform and utilize the data; and Platform Service Providers, which facilitate data handling.

The system holistically assesses legal risks by considering these various entities and their interdependencies, moving beyond rote evaluation of the datasets’ licenses to include a broader ecosystem of the components involved in AI development.

Data Compliance assesses legal risk across the full data lifecycle. It assigns scores based on dataset details and on 14 criteria, classifying individual entities and aggregating risk across dependencies.
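The aggregation step can also be pictured in miniature. The paper does not spell out its exact aggregation rule, so the sketch below assumes the common convention that an entity can be no safer than its riskiest dependency; the risk labels and example graph are invented for illustration.

```python
# Sketch of risk aggregation across a dependency graph, assuming the
# (unconfirmed) rule that an entity inherits its riskiest dependency's
# classification. Labels and graph are invented.

RISK_ORDER = {"safe": 0, "conditional": 1, "prohibited": 2}

def aggregate_risk(entity, graph, own_risk):
    """Return the worst classification among an entity and its dependencies."""
    worst = own_risk[entity]
    for dep in graph[entity]["deps"]:
        dep_risk = aggregate_risk(dep, graph, own_risk)
        if RISK_ORDER[dep_risk] > RISK_ORDER[worst]:
            worst = dep_risk
    return worst

graph = {"dataset-a": {"deps": ["dataset-b"]},
         "dataset-b": {"deps": []}}
own_risk = {"dataset-a": "safe", "dataset-b": "prohibited"}
print(aggregate_risk("dataset-a", graph, own_risk))   # -> prohibited
```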

Training and Metrics

The authors extracted the URLs of the top 1,000 most-downloaded datasets at Hugging Face, randomly sub-sampling 216 items to constitute a test set.

The EXAONE model was fine-tuned on the authors’ custom dataset, with the navigation module and question-answering module using synthetic data, and the scoring module using human-labeled data.

Ground-truth labels were created by five legal experts trained for at least 31 hours in similar tasks. These human experts manually identified dependencies and license terms for 216 test cases, then aggregated and refined their findings through discussion.

With the trained, human-calibrated AutoCompliance system tested against ChatGPT-4o and Perplexity Pro, notably more dependencies were discovered within the license terms:

Accuracy in identifying dependencies and license terms for 216 evaluation datasets.

The paper states:

‘The AutoCompliance significantly outperforms all other agents and Human expert, achieving an accuracy of 81.04% and 95.83% in each task. In contrast, both ChatGPT-4o and Perplexity Pro show relatively low accuracy for Source and License tasks, respectively.

‘These results highlight the superior performance of the AutoCompliance, demonstrating its efficacy in handling both tasks with remarkable accuracy, while also indicating a substantial performance gap between AI-based models and Human expert in these domains.’

In terms of efficiency, the AutoCompliance approach took just 53.1 seconds to run, in contrast to 2,418 seconds for equivalent human evaluation on the same tasks.

Further, the evaluation run cost $0.29 USD, compared to $207 USD for the human experts. It should be noted, however, that this is based on renting a GCP a2-megagpu-16gpu node monthly at a rate of $14,225 per month – signifying that this kind of cost-efficiency relates primarily to a large-scale operation.
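The per-run figure is easy to sanity-check from the paper’s own numbers (assuming a 30-day billing month):

```python
# Back-of-the-envelope check of the paper's cost figures: at $14,225 per
# month, one second of node time costs roughly half a cent, so a
# 53.1-second run lands at about $0.29.

monthly_rate = 14_225                          # USD per month, as cited
per_second = monthly_rate / (30 * 24 * 3600)   # ~ $0.0055 per second
print(round(per_second * 53.1, 2))             # -> 0.29
```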

Dataset Investigation

For the evaluation, the researchers selected 3,612 datasets, combining the 3,000 most-downloaded datasets from Hugging Face with 612 datasets from the 2023 Data Provenance Initiative.

The paper states:

‘Starting from the 3,612 target entities, we identified a total of 17,429 unique entities, where 13,817 entities appeared as the target entities’ direct or indirect dependencies.

‘For our empirical analysis, we consider an entity and its license dependency graph to have a single-layered structure if the entity does not have any dependencies and a multi-layered structure if it has one or more dependencies.

‘Out of the 3,612 target datasets, 2,086 (57.8%) had multi-layered structures, whereas the other 1,526 (42.2%) had single-layered structures with no dependencies.’
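The single-layered/multi-layered distinction is simple enough to restate in code; this trivial sketch (with an invented graph) just encodes the authors’ definition:

```python
# Restating the authors' definition: an entity with no dependencies has a
# single-layered structure; one or more dependencies makes it multi-layered.
# The example graph is invented.

def structure_type(entity, graph):
    return "multi-layered" if graph[entity]["deps"] else "single-layered"

graph = {"dataset-a": {"deps": ["dataset-b"]},
         "dataset-b": {"deps": []}}
print(structure_type("dataset-a", graph))   # -> multi-layered
print(structure_type("dataset-b", graph))   # -> single-layered
```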

Copyrighted datasets can only be redistributed with legal authority, which may come from a license, copyright law exceptions, or contract terms. Unauthorized redistribution can lead to legal consequences, including copyright infringement or contract violations. Therefore clear identification of non-compliance is essential.

Distribution violations found under the paper’s cited Criterion 4.4 of Data Compliance.

The study found 9,905 cases of non-compliant dataset redistribution, split into two categories: 83.5% were explicitly prohibited under licensing terms, making redistribution a clear legal violation; and 16.5% involved datasets with conflicting license conditions, where redistribution was allowed in theory but failed to meet the required terms, creating downstream legal risk.

The authors concede that the risk criteria proposed in NEXUS are not universal and may vary by jurisdiction and AI application, and that future improvements should focus on adapting to changing global regulations while refining AI-driven legal review.

Conclusion

This is a prolix and largely unfriendly paper, but it addresses perhaps the biggest retarding factor in current industry adoption of AI – the possibility that apparently ‘open’ data will later be claimed by various entities, individuals and organizations.

Under the DMCA, violations can legally entail massive fines on a per-case basis. Where violations can run into the millions, as in the cases uncovered by the researchers, the potential legal liability is truly significant.

Additionally, companies that can be proven to have benefited from upstream data cannot (as usual) claim ignorance as an excuse, at least in the influential US market. Neither do they currently have any realistic tools with which to penetrate the labyrinthine implications buried in supposedly open-source dataset license agreements.

The problem in formulating a system such as NEXUS is that it would be challenging enough to calibrate it on a per-state basis inside the US, or a per-nation basis inside the EU; the prospect of creating a truly global framework (a kind of ‘Interpol for dataset provenance’) is undermined not only by the conflicting motives of the various governments involved, but by the fact that both these governments and the state of their current laws in this regard are constantly changing.

 

* My substitution of hyperlinks for the authors’ citations.
Six types are prescribed in the paper, but the final two are not defined.

First published Friday, March 7, 2025
