GEMA v OpenAI: Landmark European Ruling on AI Training Data and Copyright Infringement
- Case Background
- Key Court Findings
- Post-GEMA Ruling: AI Training Data Classification
- Case Prospects
- Practical Recommendations
On 11 November 2025, the Munich Regional Court became the first court in Europe to rule that an AI company had infringed copyright by training its models. OpenAI lost its case to GEMA, the German performing and mechanical rights organisation representing composers and lyricists. The court ruled that ChatGPT had illegally used song lyrics for training and subsequently reproduced them in response to simple user prompts.
What does this have to do with game development? Quite a lot. Game studios actively use generative AI to create assets, generate procedural content, write dialogue, and compose music and sound effects. If you train your model on third-party assets, concept art, music, or texts, this ruling puts you at risk: what was previously considered a legal grey area has now been explicitly classified by a court as infringement.
The Arbitration & IT Disputes team at REVERA Law Firm has prepared a comprehensive analysis of the ruling and its implications for AI startups. We examine what data can now be used for training models and what data constitutes a direct path to litigation.
Case Background
GEMA, the German music rights management society representing over 95,000 members, filed a lawsuit against OpenAI in November 2024. GEMA demonstrated that ChatGPT reproduced original song lyrics without accessing the internet, proving the lyrics had been included in the model's training data. The dispute concerned nine famous German songs, including "Atemlos" (Kristina Bach), "Männer" (Herbert Grönemeyer), and "Über den Wolken" (Reinhard Mey).
OpenAI mounted a defence on three fronts:
- The company argued that its models do not store specific data but reflect statistical correlations, and any text is generated based on user prompts outside the company's control.
- OpenAI invoked the exceptions for Text and Data Mining (TDM) under Articles 3 and 4 of EU Directive 2019/790.
- The company attempted to claim status as a non-commercial research organisation.
Key Court Findings
The court examined OpenAI's technical arguments and rejected them all. The core conclusion: if a model can reproduce song lyrics in response to simple prompts like "What does the text [song title] sound like?", this constitutes copyright infringement. GEMA proved that training data is embedded in the model's weights and remains extractable, and the court agreed, citing information technology research. The judges drew an analogy to MP3 compression: the model does not store texts character-for-character, but it can recreate them recognisably. That sufficed to establish the fact of reproduction.

The court dismissed OpenAI's defence based on the TDM exception. TDM, it held, is intended for extracting abstract patterns (syntactic rules, common terms, semantic relationships), whereas the memorisation of specific song lyrics constitutes direct copying. Although the German legislator explicitly mentioned "machine learning as a foundational technology for artificial intelligence", the court drew a clear line between training a model in general and the memorisation of entire works.

The court also rejected OpenAI's attempt to shift liability onto users: the company selected the training data, built the system, and defined its architecture, while user prompts merely trigger processes embedded within the model and do not create independent liability.

OpenAI's remaining defences failed as well. Training AI does not constitute "ordinary use" of a work to which authors would have tacitly consented, and references to quotation or parody were likewise not accepted.
The court's decision applies specifically to the older models GPT-4 and GPT-4o, on which GEMA conducted its testing. During the proceedings, GEMA and OpenAI disagreed over whether newer model versions infringe copyright; the court did not investigate this question within the current case. This means OpenAI could, in theory, argue that it has implemented technical measures to prevent memorisation in later versions.
However, the court established a fundamental legal principle: the mere fact of memorisation and the ability to reproduce protected content constitutes an infringement, irrespective of technical arguments about "statistical correlations."
For practical purposes, this means all AI models must be tested for their ability to reproduce training data via simple prompts, and the existence of such capability creates legal risk.
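For illustration, this kind of test can be sketched in a few lines of Python. Here `query_model` is a hypothetical wrapper around your model's API, and the 5-gram overlap measure with a 0.2 threshold is an illustrative choice, not a legal standard:

```python
# Minimal sketch of a memorisation probe, mirroring the GEMA test:
# send a simple prompt naming the work, then measure how much of the
# protected reference text the output reproduces verbatim.

def ngram_overlap(reference: str, output: str, n: int = 5) -> float:
    """Fraction of the reference's word n-grams that appear in the output."""
    ref_words = reference.lower().split()
    out_words = output.lower().split()
    ref_ngrams = {tuple(ref_words[i:i + n]) for i in range(len(ref_words) - n + 1)}
    out_ngrams = {tuple(out_words[i:i + n]) for i in range(len(out_words) - n + 1)}
    if not ref_ngrams:
        return 0.0
    return len(ref_ngrams & out_ngrams) / len(ref_ngrams)

def probe_memorisation(query_model, title: str, reference_text: str,
                       threshold: float = 0.2) -> bool:
    """Return True if a simple title-based prompt elicits substantial
    verbatim reproduction of the reference text."""
    prompt = f"What does the text of '{title}' sound like?"
    output = query_model(prompt)  # query_model is a hypothetical API wrapper
    return ngram_overlap(reference_text, output) >= threshold
```

Running such probes across the catalogue of works known to be in (or near) the training set gives a documented record of whether the memorisation risk the court described is present.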
The court held OpenAI liable even for a 15-word fragment, finding the reproduced segments long enough to rule out coincidence.
The court ordered OpenAI to immediately cease the infringements, disclose information about the infringing activity, and pay compensation. Crucially, in the oral explanation of the ruling the judge stated that OpenAI had acted at least negligently, which led the court to deny a six-month grace period for implementing the changes needed to keep the service running in Germany. The court acknowledged the technical difficulty of removing data from a trained model, but the obligation to prevent infringements remains with the provider. In practice, this calls for multi-layered measures: output filters, model retraining, and licensing strategies.
Post-GEMA Ruling: AI Training Data Classification
| Data Category | Status | Description & Risks |
|---|---|---|
| Public Domain Works | Safe | Works where copyright protection has expired (70 years post-author's death in the EU) can be used without restriction. Main challenge – verifying status for content from different jurisdictions with varying terms. The only unconditionally safe category. |
| Open Licences | ||
| CC0 and equivalents | Safe | Use without restrictions. However, the quantity of high-quality CC0 content is limited, especially for specialised domains. |
| CC BY | Technically Problematic | Licence requires attribution, which is impracticable for tens of thousands of sources embedded in model weights. Some jurisdictions may deem this technical non-compliance with licence terms. Practical risk remains unclear. |
| CC BY-NC, CC BY-ND | Prohibited | The first prohibits commercial use (and most AI models are commercial); the second prohibits the creation of derivative works. Post-GEMA, a trained model is highly likely to be considered a derivative work. |
| ShareAlike Licences (e.g., CC BY-SA, GPL) | Viral Trap | Require distribution of derivatives under the same licence. If the model is a derivative work, the entire model must be open-sourced under a copyleft licence. This is fatal for commercial proprietary models. Particularly dangerous: GPL (for code) and ODbL (for databases like OpenStreetMap). |
| Content Not Under Open Licences | ||
| Protected, Publicly Available Content | High Risk | Song lyrics, articles, social media posts, images. Use without a licence has now been deemed infringement by the court. The TDM exception offers no protection where memorisation occurs. |
| Protected, Not Publicly Available Content | Absolutely Prohibited | Internal documents, private databases, content behind paywalls. No legal basis without explicit consent. |
| Licensed Content | Safe Path | The only safe path for commercial AI training in the EU. Licences must explicitly permit three actions: 1) model training, 2) memorisation within parameters, and 3) reproduction in outputs. Wording must be extremely specific—general phrases about "use in AI" are insufficient post-GEMA. |
| Synthetic Data | Conditionally Safe | If the synthetic data is generated by a model that was itself trained on protected content, the infringement carries over to the derivative data. In addition, training on AI-generated content raises the problem of model collapse. |
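For dataset audits, this classification can be encoded as a simple lookup applied to per-item licence metadata. The identifiers below are illustrative (SPDX-style tags where they exist), and the category labels simply follow the table; this is a sketch, not an established taxonomy:

```python
# Risk lookup encoding the post-GEMA data classification table.
# Licence identifiers use SPDX-style tags where they exist; the
# "proprietary-*" and "licensed"/"synthetic" tags are made-up labels
# for the table's remaining categories.

LICENCE_RISK = {
    "public-domain": "safe",
    "CC0-1.0":       "safe",
    "CC-BY-4.0":     "technically problematic",  # attribution impracticable at scale
    "CC-BY-NC-4.0":  "prohibited",               # non-commercial only
    "CC-BY-ND-4.0":  "prohibited",               # no derivative works
    "CC-BY-SA-4.0":  "viral trap",               # ShareAlike / copyleft
    "GPL-3.0-only":  "viral trap",
    "ODbL-1.0":      "viral trap",
    "proprietary-public":  "high risk",          # protected but publicly available
    "proprietary-private": "absolutely prohibited",
    "licensed":      "safe path",                # explicit training licence in place
    "synthetic":     "conditionally safe",
}

def audit(items):
    """Map each (source_id, licence_tag) pair to its risk category."""
    return {src: LICENCE_RISK.get(lic, "unknown") for src, lic in items}
```

Anything the lookup returns as "unknown" should default to the strictest treatment until its legal status is verified.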
It is important to emphasise the key distinction from another well-known case, Getty Images v Stability AI, in the United Kingdom. In that parallel case, the claimant failed to provide convincing evidence that the model created near-identical copies of the training data; Getty even withdrew its claims regarding training infringements. In the GEMA case, the situation was diametrically opposite: GEMA demonstrated to the court specific examples where ChatGPT reproduced song lyrics almost verbatim in response to simple prompts like "What does the song text sound like?".
Case Prospects
OpenAI has announced plans to appeal. The ruling may be reviewed by the Munich Higher Regional Court or referred to the Court of Justice of the European Union. Concurrently, GEMA is continuing litigation against Suno AI regarding AI-generated music (a hearing is scheduled for 26 January 2026). If these cases reach the German Federal Court of Justice or the Court of Justice of the European Union, they will create binding precedents for the whole of Europe.
GEMA had offered OpenAI a special licensing framework in September 2024, but the company refused. Proactive licensing is substantially cheaper than litigation. Licensing agreements must explicitly permit model training, memorisation within parameters, and reproduction in outputs, with clear terms of use.
The GEMA v OpenAI ruling establishes strict legal boundaries: for commercial AI systems in the EU, the only reliable path involves working with public domain or licensed content.
The concept of memorisation as infringement means technical arguments about statistical correlations do not provide a legal defence against liability. For AI startups, this constitutes an imperative to radically revise data strategy: investment in proper licensing and technical protective measures is a necessary condition for sustainable business development.
Practical Recommendations
Conduct an audit of data used for model training, documenting sources, legal status, and existing licences. It is critical to test models for their ability to reproduce training data via simple prompts. If memorisation is detected, implement output filters, consider model retraining, and obtain licences for problematic works.
A multi-layered defence strategy is essential: at the dataset level, prioritise the use of licensed content or public domain material; at the training level, apply techniques to minimise memorisation and utilise differential privacy; at the output level, implement filters and infringement detection systems.
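The output-level layer can be sketched as an n-gram filter that rejects responses reproducing long verbatim runs from a registry of protected texts. The registry, the brute-force matching, and the 10-word threshold below are illustrative assumptions; the threshold is deliberately set below the 15-word fragment the court found sufficient for infringement:

```python
# Sketch of an output filter: block any response containing a long
# verbatim run from a registry of protected texts. PROTECTED_TEXTS
# is a placeholder to be populated from rights-holder data.

PROTECTED_TEXTS: list[str] = []

def longest_verbatim_run(protected: str, output: str) -> int:
    """Length in words of the longest contiguous run from `protected`
    that appears verbatim in `output`."""
    p, o = protected.lower().split(), output.lower().split()
    best = 0
    for i in range(len(p)):
        for j in range(len(o)):
            k = 0
            while i + k < len(p) and j + k < len(o) and p[i + k] == o[j + k]:
                k += 1
            best = max(best, k)
    return best

def passes_filter(output: str, max_run: int = 10) -> bool:
    """Reject outputs with a verbatim run at or above the threshold.
    10 words is a conservative, illustrative choice: the court found
    even a 15-word fragment infringing."""
    return all(longest_verbatim_run(t, output) < max_run for t in PROTECTED_TEXTS)
```

In production, the quadratic scan would be replaced by an index (e.g. hashed n-grams) over the registry, but the legal logic is the same: detect and suppress recognisable reproduction before it reaches the user.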
The Arbitration & IT Disputes team provides comprehensive legal support to AI companies: dataset audits and risk assessment; development of licensing strategies; preparation of defences against potential claims; negotiations with rights holders and collective management organisations.
In the new legal reality post-GEMA v OpenAI, a pre-emptive legal strategy is a critical success factor.