- Tech giants seek unconventional data sources as high-quality data becomes scarce.
- Google and Meta explore consumer data and creative acquisitions for AI training.
- OpenAI considers synthetic data and leverages YouTube videos transcribed by its Whisper tool.
As the AI arms race intensifies, tech giants like Meta, Google, and OpenAI are facing a potential shortage of high-quality data to train their AI models.
According to Epoch, an AI research institute, the supply of high-quality data could be exhausted by 2026. This has prompted major tech companies to explore new and creative data sources to keep their AI systems learning and growing.
Google’s internal discussions
Google’s legal department has been asking employees to broaden the language around using consumer data, potentially including data from free consumer versions of Google Docs, Sheets, Slides, and even restaurant reviews on Google Maps.
However, Google maintains that it hasn’t expanded the types of data it uses to train AI models, despite updating its privacy policy in July 2023.
Meta’s acquisition ideas
Meanwhile, Meta executives have been brainstorming alternatives to address the waning supply of usable data. One idea floated during their meetings was to acquire the famed publishing house Simon & Schuster, which was recently purchased by private equity firm KKR for $1.62 billion.
A more budget-friendly option suggested was paying $10 a book to obtain the full licensing rights to new titles.
OpenAI’s synthetic data and Whisper tool
OpenAI has been considering using synthetic data, which is generated by AI systems, to train its models. The risk is that training on AI-generated data can reinforce the mistakes and limitations of the systems that produced it. To address this, OpenAI is working on a process in which one AI system checks the output of another.
Additionally, OpenAI has built Whisper, a speech recognition tool that can transcribe YouTube videos and podcasts.
The company’s latest large language model, GPT-4, was trained on over one million hours of YouTube videos transcribed by Whisper. OpenAI’s president, Greg Brockman, emphasized that the company relies on “numerous sources” of data for its systems.
Photobucket’s potential licensing deal
Photobucket, once known as “the world’s top image-hosting site,” hosted photos for early social media platforms like Myspace and Friendster.
Its vast database of pictures might soon be licensed to tech companies for training their AI systems, although Photobucket declined to identify prospective buyers.