Microsoft CEO of AI: Your online content is ‘freeware’ fodder for training models

Mustafa Suleyman, CEO of Microsoft AI, said this week that machine learning companies can “scrape” most of the content published online and use it to train neural networks because it is essentially “freeware.”

Shortly after, the Center for Investigative Reporting sued OpenAI and its largest investor Microsoft “for using the nonprofit news organization’s content without permission or offering compensation.”

That suit follows in the footsteps of eight newspapers that sued OpenAI and Microsoft in April over alleged content misuse, and of The New York Times, which did the same four months earlier.

Then there are the two authors who sued OpenAI and Microsoft in January for training AI models on their works without permission. And back in 2022, several unidentified developers sued OpenAI and GitHub, claiming the two used publicly posted programming code to train generative models in violation of software license terms.

When asked in an interview with CNBC’s Andrew Ross Sorkin at the Aspen Ideas Festival whether AI companies have effectively stolen the world’s intellectual property, Suleyman acknowledged the controversy and tried to distinguish between content that individuals post online and content backed by corporate copyright holders.

“I think with respect to content that is already on the open web, the social contract of that content since the 1990s has been that it is fair use,” he opined. “Anyone can copy it, recreate it, reproduce it. That was freeware, if you like. That was the deal.”

Suleyman admitted that there is another category of content, namely the stuff published by companies with lawyers.

“There’s a separate category where a website or publisher or news organization has explicitly said ‘don’t scrape or crawl me for any reason other than to index me,’ so that other people can find that content,” he explained. “But that’s the gray area. And I think that will go through the courts.”

That’s putting it mildly. While Suleyman’s comments will certainly offend content creators, he’s not entirely wrong: it’s not clear where the legal boundaries lie regarding AI model training and model output.
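
In practice, the “explicit” opt-out Suleyman describes is usually expressed through a site’s robots.txt file, a voluntary convention rather than a binding legal instrument, which is part of why the boundaries remain blurry. As a rough sketch of how a well-behaved crawler honours those rules, here’s the check using Python’s standard library; the “GPTBot” user agent is OpenAI’s published crawler name, and the URLs are placeholders rather than anything cited in this story:

    # Minimal sketch: ask a site's robots.txt whether a given crawler
    # may fetch a given page. URLs below are illustrative placeholders.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # fetch and parse the site's crawl rules

    if rp.can_fetch("GPTBot", "https://example.com/some-article"):
        print("Crawling permitted for this agent and path")
    else:
        print("Site has asked this agent not to crawl here")

Honouring that file is entirely at the crawler’s discretion, which is why the argument has moved from web conventions to courtrooms.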

Most people who post content online as individuals have likely compromised their rights in some way by accepting the Terms of Service offered by major social media platforms. Reddit’s decision to license its users’ posts to OpenAI wouldn’t have happened if the social media giant thought its users had a valid claim to their memes and manifestos.

The fact that OpenAI and other AI model makers are striking deals with major publishers shows that a strong brand, sufficient financial resources and a legal team can bring major tech companies to the negotiating table.

In other words, those who create content and post it online are producing freeware, unless they can hire or attract lawyers willing to challenge Microsoft and its ilk.

In a paper distributed last month via SSRN, Frank Pasquale, a law professor at Cornell Tech and Cornell Law School in the US, and Haochen Sun, an associate professor of law at the University of Hong Kong, explore the legal uncertainty surrounding the use of copyrighted data to train AI and whether courts will find such use fair. They conclude that AI needs to be addressed at the policy level, because current laws are ill-suited to the questions now being raised.

“Given that there is significant uncertainty about the legality of AI providers’ use of copyrighted works, lawmakers will need to formulate a bold new vision for rebalancing rights and responsibilities, just as they did in the aftermath of the development of the Internet (leading to the Digital Millennium Copyright Act of 1998),” they argue.

The authors suggest that the continued uncompensated harvesting of creative works threatens not only writers, composers, journalists, actors, and other creative professionals, but also generative AI itself, which will ultimately be starved of training data. People will stop making work available online, they predict, if it’s only used to power AI models that reduce the marginal cost of creating content to zero and deprive creators of any chance of a reward.

That is the future that Suleyman anticipates. “The economics of information is about to change radically, because we can reduce the cost of knowledge production to zero marginal cost,” he said.

Any of this freeware you may have helped create can be yours for a small monthly subscription fee. ®
