Training data is a key component in developing the AI models that are taking over the IT industry.

Leading technology businesses including Google, Meta, OpenAI, Anthropic, and Microsoft are all looking for new data sources. Meta had explored buying Simon & Schuster, one of the world’s largest publishing houses.

Part of the issue is that publishers are increasingly accusing these corporations of hoarding copyrighted material. They would like to be paid for their efforts. Meta and OpenAI contended in comments to the US Copyright Office that posting copyrighted material on the internet makes it “publicly available” and thus subject to fair use.

However, they will still have to make that argument in court, as the corporation is facing many lawsuits over the copyrighted material.

READ MORE: Voice Of Siri, Susan Bennett Scarjo: Right To Lawyer Against Open AI

The Center for Investigative Reporting (CIR), a news group that amalgamated with Mother Jones and Reveal earlier this year, sued OpenAI and Microsoft in federal court last week. The lawsuit charges OpenAI with being “built on the exploitation of copyrighted works belonging to creators around the world, including CIR.”

Lawyers for the CIR accused OpenAI and Microsoft of utilizing copyrighted Mother Jones content to train their GPT and Copilot AI models.

“OpenAI and Microsoft started vacuuming up our stories to make their product more powerful, but they never asked for permission or offered compensation, unlike other organizations that license our material,” Monika Bauerlein, CEO of the Center for Investigative Reporting, said in an announcement about the lawsuit. “This free rider behavior is not only unfair, it is a violation of copyright.”

READ MORE: Microsoft, OpenAI, And Nvidia Are Being Probed Over Monopoly Rules

According to the lawsuit, “16,793 distinct URLs from Mother Jones’s web domain” appeared on a published list of the top web domains included in the company’s WebText training set.

In another class action case filed by the writers Guild, two writers claimed that the corporation exploited information from their books to teach ChatGPT. The New York Times launched a similar lawsuit against the business in December 2023.

In May, court records in the Author’s Guild complaint revealed that OpenAI erased two massive datasets used to train GPT 3. Lawyers for the group stated that the two sets likely contained “more than 100,000 published books.”

READ MORE: OpenAI Is Heading To Hollywood To Pitch Their Revolutionary “Sora”

According to court records, the two employees who compiled the data no longer work at OpenAI.

OpenAI has begun to establish licensing agreements with news organizations to ensure that their work is fairly used. The Associated Press, The Wall Street Journal and New York Post publishers, The Atlantic, Prisa Media, Le Monde daily, Financial Times, and Axel Springer, the parent company of Business Insider, have all inked similar agreements.

However, the volume of content required for these bots to continuously learn will necessitate far more than a few licensing deals.

One alternative is to use synthetic data, which is created intentionally rather than acquired from the real world and can be easily generated using machine learning algorithms.

OpenAI has contemplated using synthetic data to train its models, although CEO Sam Altman has expressed worries about producing quality data.

“As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, everything will be fine,” Altman stated at a tech conference in May 2023. The company has also investigated a method in which AI models collaborate – one AI system generates data, while another evaluates it.