Written by Katie Paul and Anna Tong
NEW YORK (Reuters) – At its peak in the early 2000s, Photobucket was the world's top image hosting site. It served as the media backbone for once-popular services like Myspace and Friendster, boasted 70 million users and accounted for nearly half of the U.S. online photo market.
According to analytics tracker Similarweb, only 2 million people still use Photobucket. But the generative AI revolution may breathe new life into the company.
CEO Ted Leonard, who runs the 40-person company from Edwards, Colo., told Reuters he is in talks with multiple technology companies to license Photobucket's 13 billion photos and videos for use in training generative AI models that can produce new content in response to text prompts.
He said he is discussing rates of 5 cents to $1 per photo and more than $1 per video, with prices varying widely depending on the buyer and the type of images sought.
“We've spoken with companies that say, 'We need more,'” Leonard said, adding that one buyer told him it wanted more than a billion videos, exceeding what his platform has.
“You scratch your head and say, where are you going to get that?”
Photobucket declined to identify the prospective buyers, citing commercial confidentiality. The previously unreported talks, which suggest the company may be sitting on billions of dollars' worth of content, offer a glimpse of a bustling data market emerging as companies race to dominate generative AI technology.
Big tech companies like Google, Meta, and Microsoft-backed OpenAI initially used vast amounts of data scraped from the internet for free to train generative AI models like ChatGPT that can mimic human creativity. They maintain that doing so is both legal and ethical, but they face lawsuits from a string of copyright holders over the practice.
At the same time, these tech companies are quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to long-forgotten personal photos on faded social media apps.
“There is a rush right now to go to copyright holders that have private collections of content that is not available to be scraped,” said Edward Klaris of the law firm Klaris Law, which advises content owners on deals he said are worth tens of millions of dollars apiece to license archives of photos, films, and books for AI training.
Reuters interviewed more than 30 people familiar with the AI data trade, including current and former executives, lawyers, and consultants, to provide the first in-depth examination of this nascent market, detailing the types of content being bought and the prices attached, as well as emerging concerns about the risk of personal data being incorporated into AI models without people's knowledge or explicit consent.
OpenAI, Google, Meta, Microsoft, Apple, and Amazon all declined to comment on the specific data deals and discussions described in this article. Microsoft and Google referred Reuters to their supplier codes of conduct, which include data privacy provisions.
Google added that if it discovers violations, it will “take immediate action, up to and including termination” of contracts with suppliers.
Many major market research firms say they have not even begun to estimate the size of the opaque AI data market, where companies often keep their deals secret. Researchers who do track it, such as Business Research Insights, put the market at roughly $2.5 billion today and project it could grow to nearly $30 billion within a decade.
A Generative Data Gold Rush
The data land grab comes as makers of large generative AI “foundation” models face mounting pressure to account for the massive amounts of content they feed into their systems, a process known as “training” that requires intensive computing power and often takes months to complete.
Tech companies say the technology would be prohibitively expensive without access to vast archives of scraped web data, such as those provided by the nonprofit repository Common Crawl, which they describe as “publicly available.”
Nevertheless, their approach has sparked a wave of copyright lawsuits and regulatory scrutiny, prompting publishers to add code to their websites that blocks scraping.
In response, AI model makers have begun hedging risk and securing their data supply chains, both through deals with content owners and via the burgeoning data broker industry that has sprung up to meet demand.
For example, in the months since ChatGPT debuted in late 2022, companies including Meta, Google, Amazon, and Apple have all struck deals with stock image provider Shutterstock to use hundreds of millions of images, videos, and music files in its library for training, according to people familiar with the arrangements.
Shutterstock Chief Financial Officer Jarrod Yahes told Reuters the initial deals with big tech companies ranged from $25 million to $50 million each, though most were later expanded. Smaller tech players have followed suit, he added, with a fresh burst of activity in the past two months.
Yahes declined to comment on individual contracts. The size of the deals and the involvement of Apple and others had not previously been disclosed.
Freepik, a Shutterstock competitor, told Reuters it has struck deals with two large tech companies to license most of its archive of 200 million images at 2 to 4 cents per image. CEO Joaquín Cuenca Abela said five more similar deals were in the works, declining to identify the buyers.
OpenAI, an early Shutterstock customer, also has licensing agreements with at least four news organizations, including The Associated Press and Axel Springer. Separately, Thomson Reuters, owner of Reuters News, said it has signed deals to license news content to help train large AI language models, without providing further details.
“Ethically Sourced” Content
A cottage industry of AI data specialists is also emerging, securing rights to real-world content such as podcasts, short-form videos, and interactions with digital assistants, while also building networks of short-term contract workers who create custom visuals and audio samples from scratch, in a kind of Uber-esque data gig economy.
Seattle-based Defined.ai licenses data to a variety of companies including Google, Meta, Apple, Amazon, and Microsoft, CEO Daniela Braga told Reuters.
Pricing varies by buyer and content type, but Braga said companies are generally willing to pay $1 to $2 per image, $2 to $4 per short video, and $100 to $300 per hour of longer film. The market rate for text is $0.001 per word, she added.
Nude images, which require the most delicate handling, fetch $5 to $7 apiece, she said.
Braga said Defined.ai splits its revenue 50-50 with content providers. The company markets its datasets as “ethically sourced” because it obtains consent from the people whose data it uses and strips out personally identifying information, she said.
One of the company's suppliers, a Brazil-based entrepreneur, said he pays the owners of the photos, podcasts, and medical data he sources about 20% to 30% of the total deal value.
The most expensive images in his portfolio are those used to train AI systems that block content banned by tech companies, such as graphic violence, said the supplier, who spoke on condition of anonymity, citing commercial sensitivities.
To meet that demand, he obtains images of crime scenes, conflict violence, and surgery, mainly from police, freelance photojournalists, and medical students, respectively, often in South America and Africa, where distribution of graphic imagery is more commonplace, he said.
He said he has received images from freelance photographers in Gaza since the war began in October, as well as from Israel at the start of the fighting.
He added that his company employs nurses accustomed to seeing violent injuries to anonymize and annotate the images, which can be disturbing to the untrained eye.
“I Think It's Very Dangerous”
Licensing could solve some of the industry's legal and ethical issues, many of the insiders interviewed agreed. But reviving the archives of old internet names like Photobucket to feed modern AI models raises others, especially around user privacy.
AI systems have been caught spitting back exact copies of their training data, including watermarks from Getty Images, verbatim paragraphs from New York Times articles, and images of real people. That means an individual's private photos or intimate thoughts posted decades ago could end up in generative AI output without notice or explicit consent.
Photobucket CEO Leonard said the company is on solid legal footing, pointing to an October update of its terms of service that granted it the “unrestricted right” to sell uploaded content for the purpose of training AI systems. He sees licensing data as an alternative to selling ads.
“We need to pay our bills, and this could allow us to continue supporting free accounts,” he said.
Defined.ai's Braga said she avoids acquiring content from “platform” companies like Photobucket, preferring to source from influencers who create social media photos and can more clearly assert licensing rights.
“I think it's very dangerous,” Braga said of the platform's content. “If you have an AI that generates something that resembles a photo of someone you haven’t authorized, that’s a problem.”
Photobucket isn't the only platform pursuing licensing. Tumblr's parent company Automattic said last month that it was sharing content with “select AI companies.” In February, Reuters reported that Reddit had struck a deal with Google to make its content available for training Google's AI models.
Ahead of its initial public offering in March, Reddit disclosed that its data licensing business was under investigation by the U.S. Federal Trade Commission and acknowledged that it could run afoul of evolving privacy and intellectual property regulations.
The FTC warned companies in February against retroactively changing their terms of service to permit AI use, but declined to comment on the Reddit investigation or say whether it was examining other training data deals.
(Reporting by Katie Paul in New York and Anna Tong in San Francisco; Additional reporting by Krystal Hu in New York; Editing by Kenneth Li and Pravin Char)