Entertainment

What was Sora trained on? Creatives demand answers.

Published February 16, 2024

Editor

On Thursday, OpenAI once again shook up the AI world with a video generation model called Sora.

The demos showed photorealistic videos with crisp detail and complexity, generated from simple text prompts. A video based on the prompt “Reflections in the window of a train traveling through the Tokyo suburbs” looked like it was filmed on a phone, with shaky camera work and reflections of train passengers included. No weird distorted hands in sight.

A video from the prompt, “A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors” looked like a Christopher Nolan-Wes Anderson hybrid.

Another, of golden retriever puppies playing in the snow, rendered soft fur and fluffy snow so realistic you could reach out and touch them.

The 7 trillion dollar question is, how did OpenAI achieve this? We don’t actually know, because OpenAI has shared almost nothing about its training data. But a model this advanced needs lots of video data, so it is reasonable to assume Sora was trained on video scraped from all corners of the internet. Some are speculating that the training data included copyrighted works. OpenAI did not immediately respond to a request for comment on Sora’s training data.

OpenAI’s technical paper largely focuses on the method for achieving these results: Sora is a diffusion model that turns visual data into “patches,” or pieces of data that the model can understand. But there’s scant mention of where the visual data came from.

OpenAI says it “take[s] inspiration from large language models which acquire generalist capabilities by training on internet-scale data.” That vague “take inspiration” phrasing is the closest the paper comes to naming the source of Sora’s training data, and it is an evasive one. Further down in the paper, OpenAI says, “training text-to-video generation systems requires a large amount of videos with corresponding text captions.” The only place to find that much visual data is the internet, which is another hint at where Sora’s training data comes from.

The legal and ethical question of how training data is acquired for AI models has been around ever since OpenAI launched ChatGPT. Both OpenAI and Google have been accused of “stealing” data to train their language models, meaning they used data scraped from social media, online forums like Reddit and Quora, Wikipedia, databases of private books, and news sites.

Until now, the rationale for scraping the entirety of the internet for training data has been that it’s publicly available. But publicly available doesn’t always translate to public domain. Case in point: the New York Times is suing OpenAI and Microsoft for copyright infringement, alleging that OpenAI’s models reproduced the Times’ works word for word or incorrectly cited its stories.

Now it looks like OpenAI is doing the same thing, but with video. If that is the case, expect heavy hitters in the entertainment industry to have something to say about it.

But the problem remains: We still don’t know the source of Sora’s training data. “The company (despite its name) has been characteristically close-lipped about what they have trained the models on,” wrote Gary Marcus, an AI expert who testified at the U.S. Senate AI Oversight Committee hearing. “Many people have [speculated] that there’s probably a lot of stuff in there that is generated from game engines like Unreal. I would not at all be surprised if there also had been lots of training on YouTube, and various copyrighted materials,” said Marcus, before adding, “Artists are presumably getting really screwed here.”

Despite OpenAI’s refusal to divulge its secrets, artists and creatives are assuming the worst. Justine Bateman, a filmmaker and SAG-AFTRA generative AI advisor, didn’t mince words. “Every nanosecond of this #AI garbage is trained on stolen work by real artists,” posted Bateman on X. “Repulsive,” she added.

Others in creative industries are concerned about how the rise of Sora and other video-generating models will affect their jobs. “I work in film vfx, practically everyone I know is doom and gloom, panicking about what to do now,” posted @jimmylanceworth.

OpenAI didn’t completely ignore the explosive impact Sora might have. But its attention is largely focused on potential harms involving deepfakes and misinformation. Sora is currently in a red-teaming phase, which means it’s being stress-tested for inappropriate and harmful content. Towards the end of its announcement, OpenAI said it will be “engaging policymakers, educators and artists around the world to understand their concerns and to identify positive use cases for this new technology.”

But that doesn’t address the harms that may have already occurred in the making of Sora in the first place.

Topics
Artificial Intelligence
OpenAI
