The race to lead in AI has become a desperate search for the digital data needed to advance the technology. To obtain that data, tech companies including OpenAI, Google and Meta have discussed cutting corners, ignoring corporate policies and bending the law, according to a New York Times investigation.
At Meta, which owns Facebook and Instagram, executives, lawyers and engineers last year discussed buying the publisher Simon & Schuster to source long-form works, according to records of internal meetings obtained by the Times. They also discussed collecting copyrighted data from across the Internet, even if it meant facing lawsuits. Negotiating licenses with publishers, artists, musicians and the news industry would take too long, they said.
Like OpenAI, Google transcribed YouTube videos to collect text for its AI models, five people familiar with the company's practices said. That may have violated the copyrights to the videos, which belong to their creators.
Last year, Google also expanded its terms of service. One motivation for the change, according to members of the company's privacy team and internal messages seen by the Times, was to allow Google to use publicly available Google Docs, restaurant reviews on Google Maps and other online material for more of its AI products.
The companies' actions demonstrate how online information, including news articles, works of fiction, bulletin board posts, Wikipedia articles, computer programs, photos, podcasts and movie clips, has become the lifeblood of the burgeoning AI industry. Creating innovative systems depends on having enough data to teach the technology to instantly produce text, images, audio and video similar to those created by humans.