- Publishers hope to receive compensation from OpenAI for using their work to train AI models.
- The Center for Investigative Reporting filed a lawsuit against the company this week.
- The New York Times and other media outlets have filed similar lawsuits against OpenAI.
OpenAI trains ChatGPT using any publicly available data, including books and articles on the internet, and now the people who own that data want to get paid for their work.
Training data is a vital component in creating the AI models that are taking the tech world by storm. Big tech companies like Google, Meta, OpenAI, Anthropic and Microsoft are rushing to find new data sources. Meta even considered acquiring Simon & Schuster, one of the world's largest publishers.
Part of the problem is that publishers increasingly accuse these companies of siphoning off copyrighted data, and they want to be paid for their work. Meta and OpenAI have argued to the U.S. Copyright Office that copyrighted material published on the internet is “publicly available” and that using it therefore constitutes fair use.
But OpenAI will likely have to make its case in court, as it is being sued by multiple parties over copyrighted material.
The Center for Investigative Reporting (CIR), a nonprofit news outlet that merged with Mother Jones and Reveal earlier this year, sued OpenAI and Microsoft in federal court last week, accusing OpenAI of being “built on the exploitation of copyrighted material from creators around the world, including CIR.”
CIR's lawyers accused OpenAI and Microsoft of using Mother Jones' copyrighted material to train their GPT and Copilot AI models.
“OpenAI and Microsoft began siphoning off our content to power their own products, but unlike other organizations that license our content, they never asked for permission or offered to compensate us,” Monika Bauerlein, CEO of the Center for Investigative Reporting, said in announcing the lawsuit. “This free-riding behavior is not only unfair, it is also a violation of copyright.”
The complaint said OpenAI's publicly available list of top web domains in its WebText training set included “16,793 distinct URLs from the Mother Jones web domain.”
In a separate class action lawsuit filed by the Authors Guild, two authors alleged that OpenAI used information from their books to train ChatGPT. The New York Times also filed a similar lawsuit against the company in December 2023.
In May, court documents in the Authors Guild lawsuit revealed that OpenAI had deleted two massive datasets it used to train GPT-3, with lawyers for the Guild saying the two sets likely contained “more than 100,000 published books.”
The two employees who compiled the data no longer work at OpenAI, according to court documents.
OpenAI has begun signing licensing agreements with news organizations to cover its use of their copyrighted material. The company has such agreements with The Associated Press; News Corp, publisher of The Wall Street Journal and the New York Post; The Atlantic; Prisa Media; Le Monde; the Financial Times; and Axel Springer, parent company of Business Insider.
But the scale of content these bots need to continually learn will require much more than a few licensing agreements.
One solution is synthetic data, which is produced artificially by machine learning algorithms rather than collected from the real world.
OpenAI is exploring synthetic data as an option for training models, but CEO Sam Altman has expressed concerns about generating quality data.
“If we can get beyond the synthetic data event horizon, where the models get smart enough to create good synthetic data, then everything will be fine,” Altman said.
The company is also exploring ways for AI models to work together, with one AI system generating data and another interpreting it.
OpenAI did not immediately respond to Business Insider's request for comment.