OpenAI Ordered To Turn Over 20 Million ChatGPT Logs in New York Times Copyright Fight

Thursday, December 4, 2025, 20:34, by eWeek

OpenAI has been ordered to turn over 20 million de-identified ChatGPT conversation logs to a coalition of news publishers, including The New York Times, in a closely watched copyright battle over generative AI.

A US magistrate judge in Manhattan rejected OpenAI’s effort to keep the logs out of discovery, finding that the anonymized records are relevant to the case and protected by multiple privacy safeguards. The decision raises the stakes for both OpenAI and the publishers pressing claims that ChatGPT unlawfully used and reproduced their work.

Judge rejects OpenAI’s privacy arguments

US Magistrate Judge Ona Wang of the Southern District of New York denied OpenAI’s motion to reconsider an earlier order directing the company to produce a sample of 20 million consumer ChatGPT output logs for discovery in the consolidated copyright litigation involving the Times and other publishers.

The publishers argued that the logs are critical to determining whether ChatGPT reproduced their copyrighted articles and to testing OpenAI’s defenses, including fair use and substantial non-infringing uses.

OpenAI opposed the request, saying that turning over the logs would risk exposing confidential user information and that “99.99%” of the transcripts were irrelevant to the plaintiffs’ claims. Judge Wang rejected that characterization, noting that the 20 million logs represent only a small fraction of the “tens of billions” of consumer ChatGPT logs that OpenAI retains, and that the sample is relevant to issues including alleged reproductions, damages, and fair use.

The court stressed that there are “multiple layers of protection” for user privacy, including OpenAI’s de-identification of the logs, an existing protective order, and an “attorneys’ eyes only” designation for the data.

How the ChatGPT logs became a legal flashpoint

Publishers have been seeking output log data for more than a year to understand how ChatGPT interacts with their content. Early discovery requests swept in consumer, enterprise, and API logs, but the parties eventually narrowed the focus to a consumer log sample for merits discovery.

By mid-2025, the plaintiffs asked for a sample of 120 million logs spanning a two-year period. OpenAI countered with a proposal for 20 million logs, arguing that a smaller sample would be easier to de-identify and still useful for statistical analysis. The plaintiffs agreed to proceed on that basis.

After OpenAI finished, or nearly finished, de-identifying the logs, it told the publishers that it would not produce the full sample and instead suggested using keyword searches to narrow the set. The publishers moved to compel, and Wang granted the motion. OpenAI then sought reconsideration and also appealed the order to the presiding district judge.

Wang wrote that reconsideration is an “extraordinary remedy” and found that OpenAI had not pointed to any controlling law or facts that the court had previously overlooked.

News publishers have described the dispute in sharp terms. MediaNews Group executive editor Frank Pine said OpenAI’s leadership was “hallucinating when they thought they could get away with withholding evidence about how their business model relies on stealing from hardworking journalists,” according to reporting by Reuters.

OpenAI leans on privacy and security rhetoric

OpenAI has tried to frame its position around privacy and security concerns. A company spokesperson pointed to a blog post by Chief Information Security Officer Dane Stuckey, saying that the Times’ demand for chat logs “disregards long-standing privacy protections” and “breaks with common-sense security practices.”

In court, OpenAI argued that handing over the logs would compromise user confidentiality despite de-identification and the protective order. Judge Wang was not persuaded, noting that existing privacy protections were adequate.

The opinion also raised questions about OpenAI’s litigation strategy. Wang observed that if OpenAI never intended to produce all 20 million logs, it was unclear why the company invested time and money in de-identifying the entire sample. She suggested that either OpenAI changed its mind after initially planning to produce the data or de-identified the full set as a tactic or for some other reason that it did not disclose.

Bigger picture: Copyright, AI, and publisher leverage

The Times first sued in 2023, alleging that OpenAI and, in related cases, other technology companies used copyrighted material to train AI models without permission. Those suits have since been consolidated, and the case is emerging as a test of how existing copyright doctrines apply to AI training and outputs.

For publishers, the ordered log production could provide rare visibility into how LLMs actually handle news content, whether they reproduce it, paraphrase it, or avoid it. For AI developers, the ruling underscores that courts may not accept generalized privacy and burden arguments when faced with a limited, de-identified dataset that is central to the claims and defenses at issue.

Enterprise IT and legal teams will be watching the case for discovery standards as much as for the outcome. The way this court balances privacy, proportionality, and transparency could influence what regulators, plaintiffs, and partners can demand of AI systems that remain largely opaque to outside scrutiny.

In separate research, OpenAI is testing whether models can be taught to confess their own shortcuts and errors.