TED: How AI Models Steal Creative Work, and What to Do About It

TED:人工智能模型如何窃取创意,以及如何应对

B站影视 韩国电影 2025-03-19 21:04

The technology and vision behind generative AI is amazing, but stealing the work of the world's creators to build it is not.

生成式人工智能背后的技术和愿景令人惊叹,但为了构建它而窃取全球创作者的劳动成果则不可取。

There are three key things that AI companies need to build their models, three key resources -- people, compute and data.

构建AI模型需要三大关键要素,即三大核心资源:人才、计算能力和数据。

That is, engineers to build the models, GPUs to run the training process and data to train the models on.

即工程师构建模型,GPU运行训练过程, 以及用于训练模型的数据。

AI companies spend vast sums on the first two, sometimes a million dollars per engineer and up to a billion dollars per model.

AI公司在前面两项上投入巨资,每位工程师的费用有时高达百万美元, 而每个模型的成本甚至可达到十亿美元。

But they expect to take the third resource, training data, for free.

但他们期望能免费获取第三种资源——训练数据。

Right now, many AI companies train on creative work they haven't paid for or even asked permission to use.

目前,许多人工智能公司正在利用未经付费或未获许可使用的创意作品进行训练。

This is unfair and unsustainable.

这是不公平且不可持续的。

But if we reset, and license our training data, we can build a better generative AI ecosystem that works for everyone, both the AI companies themselves and the creators, without whose work these models would not exist.

但如果我们重新设定规则, 对训练数据进行授权,我们就能构建一个更优秀的生成式AI生态系统, 造福所有人,包括AI公司自身和内容创作者,没有他们的贡献, 这些模型根本无法存在。

Most AI companies today do not license the majority of their training data.

当今多数人工智能公司并未对其大部分训练数据进行授权。

They use web scrapers to find, download and train on as much content as they can gather.

他们利用网络爬虫尽可能多地查找、下载内容,并将其用于训练。

They're often pretty secretive about what they do train on, but what's clear is that training on copyrighted work without a license is rife.

他们通常对其训练所用的数据相当保密,但显而易见的是, 未经许可使用受版权保护的作品进行训练的现象十分普遍。

For instance, when the Mozilla Foundation looked at 47 large language models released between 2019 and 2023, they found that 64 percent of them were trained, in part, on Common Crawl, a dataset that includes copyrighted works, such as newspaper articles from major publications.

例如,当Mozilla基金会审视2019年至2023年间发布的47个大型语言模型时,他们发现其中64%的模型部分基于Common Crawl数据集进行训练,该数据集包含了受版权保护的作品,比如来自主要出版物的新闻文章。

And a further 21 percent didn't reveal enough information to know either way.

另有21%的模型未能提供足够信息,使得人们无法判断其训练数据是否涉及版权内容。

Training on copyrighted work without a license has rapidly become standard across much of the generative AI industry.

在未经许可的情况下使用受版权保护的作品进行训练,已迅速成为生成式人工智能行业大部分领域的常态。

But this training, this unlicensed training on creative work, has serious negative consequences for the people behind that work.

然而, 这种未经授权对创意作品的训练,对这些作品背后的创作者造成了严重的负面影响。

And this is for the simple reason that generative AI competes with its training data.

而这一点,简而言之,是因为生成式人工智能与其训练数据形成了竞争关系。

This is not the narrative that AI companies like to portray.

这并非人工智能公司乐于描绘的叙事。

We like to talk about democratization, about letting more people be creative.

我们乐于探讨民主化,即让更多人发挥创造力。

But the fact that AI competes with its training data is inescapable.

但人工智能与其训练数据之间的竞争这一事实是不可避免的。

A large language model trained on short stories can create competing short stories.

一个以短篇小说为训练基础的大型语言模型能够创作出与之竞争的短篇小说。

An AI image model trained on stock images can create competing stock images.

一个基于图库图片训练的人工智能图像模型能够生成与之竞争的图库图片。

An AI music model trained on music that's licensed to TV shows can create competing music to license to TV shows.

一个基于已授权给电视节目的音乐训练的人工智能音乐模型,能够创作出可与现有音乐竞争、同样适合授权给电视节目的音乐作品。

These models, however imperfect, are so quick and easy to use that this competition is inevitable.

这些模型尽管尚不完美,却因其快速便捷的使用特性,使得这种竞争不可避免。

And this isn't just theoretical.

而这种现象并非仅停留在理论层面。

Generative AI is still pretty new, but we're already seeing exactly the sort of effects you'd expect in a world in which generative AI competes with its training data.

生成式人工智能尚处于起步阶段,但我们已经看到了在一个生成式人工智能与其训练数据竞争的世界中,预期会出现的那种效应。

For instance, the well-known filmmaker Ram Gopal Varma recently said that he'll use AI music in all his projects going forward.

例如,知名导演拉姆·戈帕尔·维尔马近日表示,他将在未来的所有项目中使用AI音乐。

Indeed, there are multiple reports of people starting to listen to AI music in place of human-produced music, and recently, an AI song hit number 48 in the German charts.

事实上,有多份报道称人们开始选择聆听AI音乐, 而非人类创作的音乐,最近,一首AI歌曲在德国音乐排行榜上更是攀升至第48位。

In all these cases, AI music is competing with the songs it was trained on.

在这些情况下,AI音乐实际上是在与其训练数据中的歌曲进行竞争。

Or take Kelly McKernan.

或者以凯利·麦肯纳为例。

Kelly is an artist from Nashville.

凯利是来自纳什维尔的艺术家。

For 10 years, they made enough money selling their work that art was their full-time income.

十年来,他们通过出售作品赚取了足够收入,使得艺术创作成为他们的全职职业。

But in 2022, a dataset that included their works was used to train a popular AI image model.

然而在2022年,包含他们作品的数据集被用来训练了一款广受欢迎的AI图像模型。

Their name was one of many used by huge numbers of people to create art in the style of specific human artists.

许多艺术家的名字被大量用户用来生成特定人类艺术家风格的作品,他们的名字就是其中之一。

Kelly's income fell by 33 percent almost overnight.

凯利的收入几乎在一夜之间骤降了33%。

Illustrators around the world report similar stories, being outcompeted by AI models they have reason to believe were trained on their work.

全球各地的插画师也讲述了类似的经历:他们在竞争中输给了AI模型,而他们有理由相信这些模型正是用自己的作品训练出来的。

The freelance platform Upwork wrote a white paper in which they looked at the effects, that they've seen on the job market, of generative AI.

自由职业平台Upwork撰写了一份白皮书,探讨了他们观察到的生成式AI对就业市场的影响。

They looked at how job postings on their platform have changed since the introduction of ChatGPT, and sure enough, they found exactly what you'd expect, that generative AI has reduced the demand for freelance writing tasks by 8 percent, which increases to 18 percent if you look at only what they term lower-value tasks.

他们考察了自ChatGPT问世以来,平台上职位发布情况的变化,果不其然,发现的结果正如预期:生成式AI使得自由职业写作任务的需求下降了8%,而如果仅看他们所称的低价值任务,这一降幅则上升至18%。

So the initial data we have, plus the individual stories we hear, all align with the logical assumption: "Generative AI competes with the work it's trained on." It's so quick and easy to use, it's inevitable, and it competes with the people behind that work.

因此,我们现有的初步数据,加上所听到的个人经历,都与这一合乎逻辑的假设相吻合:“生成式人工智能与其训练所用的作品形成了竞争。”它使用起来既快捷又简便,这种竞争不可避免,而且它与这些作品背后的创作者形成了竞争。

Now creators argue this training is illegal.

现在,创作者们辩称这种训练是非法的。

The legal framework of copyright affords creators the exclusive right to authorize copies of their work, and AI training involves copying.

版权法律框架赋予创作者对其作品复制进行授权的专有权利,而AI训练则涉及复制行为。

Here, in the US, many AI companies argue that training AI falls under the fair use copyright exception, which allows unlicensed copying in a limited set of circumstances, such as creating parodies of a work.

在美国,许多AI公司辩称,AI训练属于版权法中的合理使用例外,这一例外允许在少数特定情形下未经许可进行复制,例如对作品进行戏仿。

Creators and rights holders strongly disagree, saying there's no way this narrow exception can be used to legitimize the mass exploitation of creative work to create automated competitors to that work.

创作者和版权持有者强烈反对,认为这一狭窄的例外条款无法被用来合法化大规模利用创意作品, 以创造出与原作竞争的自动化产品。

And for the record, I entirely agree.

郑重声明,我完全赞同。

Of course, this question is previously untested in the courts, and there are currently around 30 ongoing lawsuits brought by rights holders against AI companies, which will help to address this question.

当然, 这个问题在法庭上尚未经过检验,目前约有30起由权利持有人对人工智能公司提起的正在进行中的诉讼,这些案件将有助于解决这一问题。

But this will take time, and creators are suffering from what they see as unjust competition right now.

但这需要时间,目前创作者们正承受着他们认为的不公平竞争之苦。

So they propose a solution that has been used and worked before -- licensing.

因此,他们提出了一种曾经被采用并证明有效的解决方案——许可制度。

If a commercial entity wants to use copyrighted work, be it for merchandise manufacturing or building a streaming service, they license that work.

如果一家商业实体希望使用受版权保护的作品,无论是用于商品制造还是搭建流媒体服务,他们都需要获得该作品的许可。

Now AI companies have a bunch of reasons why this shouldn't apply to them.

如今,AI公司有一系列理由认为这不该适用于它们。

There's the fair use legal exception that I've already mentioned.

首先是我已经提到的合理使用这一法律例外。

There's also the argument that since humans can train on copyrighted work without a license, AI should be allowed to, too.

还有观点认为,既然人类无需许可就能在受版权保护的作品上进行训练,那么AI也应享有同样的权利。

But this is a very hard claim to justify.

但这一论点很难站得住脚。

Artists have been learning from each other for centuries.

艺术家们相互学习已有数百年之久。

When you create, you expect other people to learn from you.

创作之际,你期望他人能从你的作品中汲取智慧。

You learn from a range of sources, from other art to textbooks to taking lessons.

你从多种来源汲取知识,从其他艺术作品到教科书,再到参加课程。

Much of this you or someone else paid for, supporting the entire ecosystem.

其中大部分内容是你或他人付费获取的,支撑了整个生态系统。

In generative AI, commercial entities valued at millions or billions of dollars scrape as much content as they can, often against creators' will, without payment, making multiple copies along the way -- which are subject to copyright law -- to create a highly scalable competitor to what they're copying.

在生成式人工智能领域,那些估值达数百万甚至数十亿美元的商业实体尽可能多地抓取内容,这往往违背创作者的意愿,且不支付任何费用,在此过程中还会制作多份副本(这些复制行为受版权法约束),从而打造出一个高度可扩展的竞争对手,与它们所复制的内容展开竞争。

So scalable, in fact, that there are AI image generators estimated to be making 2.5 million images a day and AI song generators outputting 10 songs a second.

实际上,这种扩展性如此之高,以至于据估计,某些AI图像生成器每天能产出250万张图片,而AI歌曲生成器则每秒可创作10首歌曲。

To argue that human learning and AI training are the same and should be treated the same is preposterous.

声称人类学习和人工智能训练是相同的,并且应受到同等对待,这简直是荒谬的。

AI companies also argue that licensing their training data would be impractical.

人工智能公司还辩称,许可其训练数据将是不切实际的。

They use so much training data, they say, that individual payments to each creator behind the data would be small.

他们声称, 由于使用了海量的训练数据,对每位数据背后创作者的个别支付金额将会微乎其微。

But this is true of many content-licensing markets.

但这种情况在许多内容授权市场中普遍存在。

Creators still want to get paid, even if the payments are small.

创作者依然希望能获得报酬,即便金额微薄。

AI companies also argue that they simply use too much data for licensing to even be feasible.

人工智能公司则辩称,他们使用的数据量过大,以至于通过许可来获取数据变得不切实际。

But this is harder and harder to believe in a world in which there is such a range of datasets that you can access with permission.

但在这样一个数据集种类繁多、且经许可即可访问的世界里,这种说法越来越难以令人信服。

You can license data from media companies.

你可以从媒体公司获取授权数据。

There have been 27 major deals between AI companies and rights holders in the last year alone, and that's to say nothing of the smaller ones that don't get reported.

仅在过去一年中, 人工智能公司与版权持有者之间就达成了27笔重大交易,更不用说那些未被报道的小规模交易了。

There are marketplaces of training data where you can get more data.

存在训练数据市场,您可以在那里获取更多数据。

You can expand this with data that's in the public domain -- that is, in which no copyright exists, like the 500-billion-word dataset Common Corpus.

你可以通过使用公共领域的数据来扩展它——即那些不存在版权问题的数据,比如包含5000亿词汇的通用语料库(Common Corpus)。

You can expand this further with synthetic data, that is, data that's created itself by an AI model, in which usually no copyright exists.

你可以进一步通过合成数据来扩展,即由AI模型自行生成的数据,这类数据通常不存在版权问题。

So there are multiple options available to you if you want to build your model without infringing copyright.

因此,如果你想在不侵犯版权的情况下构建模型,有多种选择可供你采用。

But the strongest evidence that it's possible to license all your data is that there are multiple companies doing it already.

但要证明所有数据都可以获得授权,最有力的证据就是已经有多家公司在这样做了。

I know, because I've done it myself.

我深知这一点,因为我自己就曾亲身实践过。

I've worked in what we now call generative AI for over a decade, and last September, my team at Stability AI released an AI music model that trained on licensed music.

我在我们如今称之为生成式人工智能的领域工作了十多年,去年九月, 我在Stability AI的团队发布了一个基于授权音乐训练的人工智能音乐模型。

A number of other companies have done the same thing, and I founded Fairly Trained in order to highlight this fact, and these companies.

许多其他公司也采取了相同做法,我创立Fairly Trained就是为了强调这一事实并表彰这些公司。

Fairly Trained is a nonprofit that certifies generative AI companies that don't train on copyrighted work without a license.

Fairly Trained是一家非营利组织,它认证那些在未经许可的情况下不使用受版权保护作品进行训练的生成式AI公司。

We launched in January of this year, and we've already certified 18 companies.

我们于今年一月启动,迄今已认证了18家公司。

Now these companies take a variety of approaches to licensing their training data.

如今,这些公司在授权其训练数据方面采取了多种策略。

We have an AI voice model that's trained on individual voices it's licensed.

我们拥有一个AI语音模型,该模型基于已授权的个体声音进行训练。

We have an AI music model that's licensed more than 40 music catalogs.

我们拥有一个AI音乐模型,该模型已获得超过40个音乐目录的授权。

We have a large language model that's trained only on data in the public domain, mostly from government documents and records.

我们拥有一个大型语言模型, 该模型仅基于公共领域数据进行训练,主要来源于政府文件和记录。

We have companies who have paid upfront fees for their data.

在我们认证的公司中,有的为数据预先支付了费用。

We have companies who share their revenue with their data providers.

在我们认证的公司中,也有的与数据提供者分享收入。

There is no one answer to the exact specifics of how one of these licensing deals has to work.

这类许可协议具体应如何运作,并没有唯一的答案。

The beauty of licensing is that the two parties can come together and figure out what works for them.

授权之美在于,双方能够共同商讨并找出最适合他们的合作方式。

And this is happening more and more now.

这种现象现在正变得越来越普遍。

You will hear that a requirement to license training data somehow stifles innovation, that it's only the big AI companies that can afford these huge upfront licensing fees.

你会听到, 要求授权训练数据在某种程度上抑制了创新,只有大型人工智能公司才能承担得起这些高昂的预付授权费用。

But in reality, it's the smaller start-ups who are bothering to license all their data, and they're doing so, often, without hefty upfront licensing fees, but using models such as revenue shares.

但实际上, 是那些小型初创企业愿意费心去获取所有数据的许可,而且他们通常无需支付高昂的预付许可费,而是采用诸如收入分成等模式来实现。

And there's another major upside to licensing your training data.

授权您的训练数据还有另一个重大优势。

All of this training on copyrighted work is forcing publishers to shut off access to their content.

所有这些基于版权作品的训练正迫使出版商关闭对其内容的访问权限。

The Data Provenance Initiative looked at 14,000 websites commonly used in AI training sets, and they found that, over the course of a single year, looking at only the domains of the highest value for AI training, the number that was restricted via opt-outs or terms of service increased from three percent to between 20 and 33 percent.

数据来源倡议组织调查了14,000个常用于AI训练集的网站,发现仅在一年内,针对AI训练最有价值的域名中,通过选择退出或服务条款限制访问的比例从3%上升至20%到33%之间。

The web is being gradually closed due to unlicensed training.

由于未经授权的训练,互联网正逐渐变得封闭。

Now this is bad for new AI models, for new entrants to the market, but also for everyone -- researchers, consumers and more, who benefit from an open internet.

这对新的人工智能模型、市场新进入者不利,也对所有人——包括研究人员、消费者等从开放互联网中获益的人群——构成了负面影响。

It should come as no surprise that the general public do not agree with AI companies about what they can train their models on.

公众普遍不认同AI公司关于训练模型所用数据的主张,这一点不足为奇。

One poll from the AI Policy Institute, in April, asked people about the common policy among AI companies of training on publicly available data.

四月,AI政策研究所进行的一项民意调查询问了公众对AI公司普遍采用的、基于公开可用数据进行模型训练的政策看法。

This is data that is openly available online, which of course includes a lot of copyrighted work, like news articles and, often, pirated media.

这些数据是公开在线获取的,当然其中包含大量受版权保护的作品,如新闻文章, 且常常涉及盗版媒体。

60 percent of people said this should not be allowed versus only 19 percent who said it should.

60%的人认为这不应该被允许,而只有19%的人认为应该允许。

The same poll went on to ask whether AI companies should compensate data providers.

同一项调查进一步询问了人工智能公司是否应向数据提供者提供补偿。

74 percent said yes, and only nine percent said no.

74%的人表示赞同,仅有9%的人持反对意见。

Time and time again, when we ask the public these questions, they show support for requirements around permission and payment, and a rejection of the notion that something being publicly available somehow makes it fair game.

一次次地, 当我们向公众提出这些问题时,他们表现出对许可和支付要求的认同,以及对那种认为公开可得之物便可以随意使用的观念的反对。

And the people who make the art that society consumes feel the same way.

创作社会所消费艺术的人们也抱有同样的看法。

Today, we launched a "Statement on AI Training," a short, simple open letter, which simply reads: "The unlicensed use of creative works for training generative AI is a major, unjust threat to the livelihoods of the people behind those works, and must not be permitted." This has already been signed by 11,000 and counting creators around the world, including Nobel-winning authors, Academy Award-winning actors and Oscar-winning composers.

今日,我们发布了《关于AI训练的声明》,这是一封简短而直接的公开信,信中明确指出:“未经授权使用创意作品来训练生成式AI,是对这些作品背后创作者生计的重大不公威胁,绝不应被允许。”目前,全球已有超过11,000位创作者签署了这份声明,其中包括诺贝尔文学奖得主、奥斯卡金像奖得主演员以及奥斯卡奖得主作曲家。

And if you agree with this sentiment, I encourage you to sign it today at aitrainingstatement.org.

若您认同此观点,我鼓励您今天就在aitrainingstatement.org签署这份声明。

What this statement and previous ones like it make abundantly clear is that these artists, these creators, view the unlicensed training on their work by generative AI models as totally unjust and potentially catastrophic to their professions.

这一声明及其类似先例明确无误地表明, 这些艺术家、创作者们认为,生成式人工智能模型未经许可对其作品进行训练, 是完全不公正的, 并可能对其职业造成灾难性影响。

So if you are an advocate for unlicensed AI training, just remember that the people who wrote the music that you are listening to and the books you're reading probably disagree.

因此,若你支持未经许可的AI训练,请记住,那些创作了你正在聆听的音乐和你正在阅读的书籍的人, 很可能持不同意见。

So where does this leave us?

那么,这把我们置于何地呢?

Well, right now, many of the world's artists, writers, musicians, creators straight-up hate generative AI.

就目前而言,全球许多艺术家、作家、音乐家和创作者都对生成式人工智能深恶痛绝。

And we know, from their own words, that one of the reasons for this is that we're training on their work without asking them.

我们通过他们自己的言辞了解到,其中一个原因是我们在未经他们许可的情况下,利用他们的作品进行训练。

But it doesn't have to be this way.

但这并非必然如此。

The AI industry and the creative industries can be and should be mutually beneficial.

人工智能产业与创意产业能够且应当实现互利共赢。

But for this mutually beneficial relationship to emerge, we have to start from a position of respect for the value of the works being trained on and the rights of the people who made them.

但为了实现这种互利关系,我们必须从尊重被训练作品的价值以及创作者权利的立场出发。

I'm not arguing that all AI development should be halted.

我并不是主张应该停止所有人工智能的发展。

I'm not arguing that AI should not exist.

我并不是主张人工智能不应存在。

What I'm arguing is that the resources used to build generative AI should be paid for.

我主张的是,构建生成式人工智能所使用的资源应当付费获得。

Licensing is hard work.

许可工作并不轻松。

It will slow you down in the short term, but you'll ultimately reach exactly the same point -- models that are just as capable, just as powerful -- and you'll do so without forcing the world's publishers to batten down the hatches and destroy the commons, and without pitting the world's creators against you.

短期内,这可能会拖慢你的步伐,但最终你将达到同样的高度——模型同样能干、同样强大——而且你这样做无需迫使全球出版商紧闭门户、破坏公共资源,也无需与全球创作者对立。

So I hope that more AI companies will follow the example set by those we've certified at Fairly Trained, and license all their training data.

因此, 我希望更多人工智能公司能效仿我们在Fairly Trained认证的那些企业的做法,对所有训练数据进行授权。

I hope that employees at these companies will demand this of their employers.

我希望这些公司的员工会向雇主要求这一点。

And I hope that everyone who uses generative AI will ask what their favorite models were trained on.

我希望每个使用生成式AI的人都会询问他们喜爱的模型是基于什么训练的。

There is a future in which generative AI and human creativity can coexist, not just peacefully, but symbiotically.

未来存在着一种可能性,生成式人工智能与人类创造力不仅能够和平共处,更能实现共生共荣。

It's been a rough start, but it's not too late to change course.

虽然起步艰难,但改变方向仍为时不晚。

Thank you.

谢谢。

来源:英语东
