TED: How AI models steal creative work, and what to do about it


The technology and vision behind generative AI is amazing, but stealing the work of the world's creators to build it is not.


There are three key things that AI companies need to build their models, three key resources -- people, compute and data.


That is, engineers to build the models, GPUs to run the training process and data to train the models on.


AI companies spend vast sums on the first two, sometimes a million dollars per engineer and up to a billion dollars per model.


But they expect to take the third resource, training data, for free.


Right now, many AI companies train on creative work they haven't paid for or even asked permission to use.


This is unfair and unsustainable.


But if we reset, and license our training data, we can build a better generative AI ecosystem that works for everyone, both the AI companies themselves and the creators, without whose work these models would not exist.


Most AI companies today do not license the majority of their training data.


They use web scrapers to find, download and train on as much content as they can gather.


They're often pretty secretive about what they do train on, but what's clear is that training on copyrighted work without a license is rife.


For instance, when the Mozilla Foundation looked at 47 large language models released between 2019 and 2023, they found that 64 percent of them were trained, in part, on Common Crawl, a dataset that includes copyrighted works, such as newspaper articles from major publications.


And a further 21 percent didn't reveal enough information to know either way.


Training on copyrighted work without a license has rapidly become standard across much of the generative AI industry.


But this training, this unlicensed training on creative work, has serious negative consequences for the people behind that work.


And this is for the simple reason that generative AI competes with its training data.


This is not the narrative that AI companies like to portray.


We like to talk about democratization, about letting more people be creative.


But the fact that AI competes with its training data is inescapable.


A large language model trained on short stories can create competing short stories.


An AI image model trained on stock images can create competing stock images.


An AI music model trained on music that's licensed to TV shows can create competing music to license to TV shows.


These models, however imperfect, are so quick and easy to use that this competition is inevitable.


And this isn't just theoretical.


Generative AI is still pretty new, but we're already seeing exactly the sort of effects you'd expect in a world in which generative AI competes with its training data.


For instance, the well-known filmmaker Ram Gopal Varma recently said that he'll use AI music in all his projects going forward.


Indeed, there are multiple reports of people starting to listen to AI music in place of human-produced music, and recently, an AI song hit number 48 in the German charts.


In all these cases, AI music is competing with the songs it was trained on.


Or take Kelly McKernan.


Kelly is an artist from Nashville.


For 10 years, they made enough money selling their work that art was their full-time income.


But in 2022, a dataset that included their works was used to train a popular AI image model.


Their name was one of many used by huge numbers of people to create art in the style of specific human artists.


Kelly's income fell by 33 percent almost overnight.


Illustrators around the world report similar stories, being outcompeted by AI models they have reason to believe were trained on their work.


The freelance platform Upwork wrote a white paper in which they looked at the effects of generative AI that they've seen on the job market.


They looked at how job postings on their platform have changed since the introduction of ChatGPT, and sure enough, they found exactly what you'd expect, that generative AI has reduced the demand for freelance writing tasks by 8 percent, which increases to 18 percent if you look at only what they term lower-value tasks.


So the initial data we have, plus the individual stories we hear, all align with the logical assumption: "Generative AI competes with the work it's trained on." It's so quick and easy to use, it's inevitable, and it competes with the people behind that work.


Now creators argue this training is illegal.


The legal framework of copyright affords creators the exclusive right to authorize copies of their work, and AI training involves copying.


Here, in the US, many AI companies argue that training AI falls under the fair use copyright exception, which allows unlicensed copying in a limited set of circumstances, such as creating parodies of a work.


Creators and rights holders strongly disagree, saying there's no way this narrow exception can be used to legitimize the mass exploitation of creative work to create automated competitors to that work.


And for the record, I entirely agree.


Of course, this question is previously untested in the courts, and there are currently around 30 ongoing lawsuits brought by rights holders against AI companies, which will help to address this question.


But this will take time, and creators are suffering from what they see as unjust competition right now.


So they propose a solution that has been used and worked before -- licensing.


If a commercial entity wants to use copyrighted work, be it for merchandise manufacturing or building a streaming service, they license that work.


Now AI companies have a bunch of reasons why this shouldn't apply to them.


There's the fair use legal exception that I've already mentioned.


There's also the argument that since humans can train on copyrighted work without a license, AI should be allowed to, too.


But this is a very hard claim to justify.


Artists have been learning from each other for centuries.


When you create, you expect other people to learn from you.


You learn from a range of sources, from other art to textbooks to taking lessons.


Much of this you or someone else paid for, supporting the entire ecosystem.


In generative AI, commercial entities valued at millions or billions of dollars scrape as much content as they can, often against creators' will, without payment, making multiple copies along the way -- which are subject to copyright law -- to create a highly scalable competitor to what they're copying.


So scalable, in fact, that there are AI image generators estimated to be making 2.5 million images a day and AI song generators outputting 10 songs a second.
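To put those two figures on a common footing, here is a minimal arithmetic sketch. It assumes only the numbers quoted above (10 songs a second, 2.5 million images a day) and a constant output rate across a 24-hour day, which is a simplification and not a claim from the talk.

```python
# Convert the two quoted output rates into comparable units.
# Assumption: output is constant across a full 24-hour day.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

songs_per_second = 10        # figure quoted in the talk
images_per_day = 2_500_000   # figure quoted in the talk

songs_per_day = songs_per_second * SECONDS_PER_DAY
images_per_second = images_per_day / SECONDS_PER_DAY

print(f"{songs_per_day:,} songs per day")            # 864,000 songs per day
print(f"{images_per_second:.1f} images per second")  # 28.9 images per second
```

At those rates, a single song generator would output several hundred thousand songs a day.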


To argue that human learning and AI training are the same and should be treated the same is preposterous.


AI companies also argue that licensing their training data would be impractical.


They use so much training data, they say, that individual payments to each creator behind the data would be small.


But this is true of many content-licensing markets.


Creators still want to get paid, even if the payments are small.


AI companies also argue that they simply use too much data for licensing to even be feasible.


But this is harder and harder to believe in a world in which there is such a range of datasets that you can access with permission.


You can license data from media companies.


There have been 27 major deals between AI companies and rights holders in the last year alone, and that's to say nothing of the smaller ones that don't get reported.


There are marketplaces of training data where you can get more data.


You can expand this with data that's in the public domain -- that is, in which no copyright exists, like the 500-billion-word dataset Common Corpus.


You can expand this further with synthetic data, that is, data that is itself created by an AI model, in which usually no copyright exists.


So there are multiple options available to you if you want to build your model without infringing copyright.


But the strongest evidence that it's possible to license all your data is that there are multiple companies doing it already.


I know, because I've done it myself.


I've worked in what we now call generative AI for over a decade, and last September, my team at Stability AI released an AI music model that trained on licensed music.


A number of other companies have done the same thing, and I founded Fairly Trained in order to highlight this fact, and these companies.


Fairly Trained is a nonprofit that certifies generative AI companies that don't train on copyrighted work without a license.


We launched in January of this year, and we've already certified 18 companies.


Now these companies take a variety of approaches to licensing their training data.


We have an AI voice model that's trained on individual voices it has licensed.


We have an AI music model that has licensed more than 40 music catalogs.


We have a large language model that's trained only on data in the public domain, mostly from government documents and records.


We have companies who have paid upfront fees for their data.


We have companies who share their revenue with their data providers.


There is no one answer to the exact specifics of how one of these licensing deals has to work.


The beauty of licensing is that the two parties can come together and figure out what works for them.


And this is happening more and more now.


You will hear that a requirement to license training data somehow stifles innovation, that it's only the big AI companies that can afford these huge upfront licensing fees.


But in reality, it's the smaller start-ups who are bothering to license all their data, and they're doing so, often, without hefty upfront licensing fees, but using models such as revenue shares.


And there's another major upside to licensing your training data.


All of this training on copyrighted work is forcing publishers to shut off access to their content.


The Data Provenance Initiative looked at 14,000 websites commonly used in AI training sets, and they found that, over the course of a single year, looking at only the domains of the highest value for AI training, the number that was restricted via opt-outs or terms of service increased from three percent to between 20 and 33 percent.
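The talk does not say which opt-out mechanism each site used. One widely used convention is a robots.txt file that names the crawlers a site wants to exclude. The sketch below shows how a crawler could check such a file before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the `ExampleAIBot` user agent and the `example.com` URL are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one made-up AI crawler, allows everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/article"  # placeholder URL
print(may_fetch(ROBOTS_TXT, "ExampleAIBot", url))  # False: this crawler is opted out
print(may_fetch(ROBOTS_TXT, "SomeBrowser", url))   # True: other agents are allowed
```

A robots.txt file is only a request. It has an effect only when the crawler chooses to check it and respect the answer.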


The web is being gradually closed due to unlicensed training.


Now this is bad for new AI models, for new entrants to the market, but also for everyone -- researchers, consumers and more, who benefit from an open internet.


It should come as no surprise that the general public do not agree with AI companies about what they can train their models on.


One poll from the AI Policy Institute, in April, asked people about the common policy among AI companies of training on publicly available data.


This is data that is openly available online, which of course includes a lot of copyrighted work, like news articles and, often, pirated media.


60 percent of people said this should not be allowed versus only 19 percent who said it should.


The same poll went on to ask whether AI companies should compensate data providers.


74 percent said yes, and only nine percent said no.


Time and time again, when we ask the public these questions, they show support for requirements around permission and payment, and a rejection of the notion that something being publicly available somehow makes it fair game.


And the people who make the art that society consumes feel the same way.


Today, we launched a "Statement on AI Training," a short, simple open letter, which simply reads: "The unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted." This has already been signed by 11,000 and counting creators around the world, including Nobel-winning authors, Academy Award-winning actors and Oscar-winning composers.


And if you agree with this sentiment, I encourage you to sign it today at aitrainingstatement.org.


What this statement and previous ones like it make abundantly clear is that these artists, these creators, view the unlicensed training on their work by generative AI models as totally unjust and potentially catastrophic to their professions.


So if you are an advocate for unlicensed AI training, just remember that the people who wrote the music that you are listening to and the books you're reading probably disagree.


So where does this leave us?


Well, right now, many of the world's artists, writers, musicians, creators straight-up hate generative AI.


And we know, from their own words, that one of the reasons for this is that we're training on their work without asking them.


But it doesn't have to be this way.


The AI industry and the creative industries can be and should be mutually beneficial.


But for this mutually beneficial relationship to emerge, we have to start from a position of respect for the value of the works being trained on and the rights of the people who made them.


I'm not arguing that all AI development should be halted.


I'm not arguing that AI should not exist.


What I'm arguing is that the resources used to build generative AI should be paid for.


Licensing is hard work.


It will slow you down in the short term, but you'll ultimately reach exactly the same point -- models that are just as capable, just as powerful -- and you'll do so without forcing the world's publishers to batten down the hatches and destroy the commons, and without pitting the world's creators against you.


So I hope that more AI companies will follow the example set by those we've certified at Fairly Trained, and license all their training data.


I hope that employees at these companies will demand this of their employers.


And I hope that everyone who uses generative AI will ask what their favorite models were trained on.


There is a future in which generative AI and human creativity can coexist, not just peacefully, but symbiotically.


It's been a rough start, but it's not too late to change course.


Thank you.


