If I ask you to picture an archive, what do you see? Dusty shelves, rows of books? Curiosities behind glass? Or perhaps a warehouse of defunct media? Personally, I don’t see a real space – or at least it’s not ‘real’ in the traditional sense. As part of my research on archaeological approaches to computer game preservation, I’ve worked with a plethora of online archives, from official centralised databases to ad hoc fan wikis. These are all invaluable resources for preserving ephemeral digital culture and making it accessible.
The Internet Archive (archive.org), founded in 1996, is the best known among them. The nonprofit library website provides free access to digitised collections of music and print materials, as well as archived web pages (via its ‘Wayback Machine’), which it captures through automated crawls. The Internet Archive has faced numerous legal disputes over its lifespan, especially over copyright. In September 2024, for example, the US Court of Appeals upheld an earlier ruling in Hachette v Internet Archive, in which a group of publishing companies alleged that the Archive’s lending of complete digital copies of books constituted copyright infringement. The most important finding in this complex case was that the Archive’s lending did not qualify as ‘fair use’.
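Those crawls can be consulted programmatically. As a minimal sketch, the Wayback Machine exposes a public ‘availability’ API that returns the archived snapshot closest to a requested date (the endpoint is publicly documented by the Internet Archive; example.com and the timestamp below are placeholders):

```python
import json
from urllib.request import urlopen

# Query the Wayback Machine's public availability API for the snapshot
# of a page closest to a given date. example.com and the timestamp are
# placeholders, not a real research query.
API = "https://archive.org/wayback/available?url=example.com&timestamp=20100101"

with urlopen(API) as response:
    data = json.load(response)

closest = data.get("archived_snapshots", {}).get("closest")
if closest:
    print(closest["timestamp"], closest["url"])  # e.g. a capture from around 2010
else:
    print("No snapshot archived for this URL.")
```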
Another entity also trawls the web at scale while invoking fair use in defence of its activity: the American artificial intelligence research organisation OpenAI. As recent events demonstrate, it’s one rule for OpenAI and its profit margins, another for research institutions like the Internet Archive.
AI boom
We’re currently in the middle of an AI boom that started to flourish in the late 2010s, with OpenAI in the vanguard. This is in part due to its development of the generative AI chatbot ChatGPT – now as ubiquitous as it is controversial. Companies like OpenAI cannot survive without the huge amount of data that the internet provides to train their AI models. The company even stated in a submission to the House of Lords communications and digital select committee: ‘Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment but would not provide AI systems that meet the needs of today’s citizens.’
Indeed, as a recent Data Provenance Initiative study concluded, ‘The web has acted as the primary “data commons” for general purpose AI’ – and in response, there has been a rapid increase in websites restricting AI organisations from scraping their content. While the study primarily frames this as a loss for AI systems, it points out that these increasing restrictions will also affect nonprofits and academic research.
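Much of that restriction happens through the decades-old Robots Exclusion Protocol: a site lists which crawlers may visit in a robots.txt file, and well-behaved bots check it before fetching anything. A minimal Python sketch, using a hypothetical policy that blocks OpenAI’s published GPTBot crawler and Common Crawl’s CCBot while leaving other visitors free:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site opting out of AI training crawls.
# GPTBot is OpenAI's published crawler user agent; CCBot is Common Crawl's.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("GPTBot", "CCBot", "ResearchBot"):
    allowed = parser.can_fetch(agent, "https://example.org/collection/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Compliance with robots.txt is voluntary, however, which is why some institutions are withdrawing material from open access altogether.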
As Bram Caers, an assistant professor of Middle Dutch literature at Leiden University, has pointed out, many libraries and archives have instigated projects to digitise their collections over the past decade in order to increase access. He advises: ‘We need to have a critical conversation about the open availability of visual heritage sources such as manuscripts and early printed books, and how this access may be exploited.’ Archives that have been digitised for research are paying the price for AI exploitation twice over, being removed from open online access for fear of being ingested into training models – if that hasn’t occurred already.
Black box archive
The lack of transparency around AI algorithms has created a black box problem: we have a broad understanding of a system’s input and output, but not of the process that connects the two. In the case of ChatGPT, users may not even know all the data that was input to the system, let alone how and from what sources the model drew its output.
Despite the steps being taken to limit the grasp of AI models, OpenAI itself has claimed that it is ‘impossible’ for it to create a tool like ChatGPT without access to copyrighted material. Its crawlers trawl the web for vast amounts of data, mimicking archival efforts – but the data is collected not for public reference, but for incorporation into a vast model that extrapolates at scale, producing results without easily verifiable context. This is why generative AI can be considered an archive – and a black box archive specifically.
In Archive Fever, Jacques Derrida observes: ‘What is no longer archived in the same way is no longer lived in the same way.’ We may ask now: what does it mean if our understanding of the human experience becomes increasingly refracted through the black box archive of generative AI?
[Image captions: presentation at the AI for Good summit in 2023; Internet Archive servers in San Francisco; user-designed buildings in Wurm Online]
The internet is dead, long live the internet?
The content generated by AI models is now swamping the internet, even as those models continue to need new human-created data. If models are trained recursively on their own generated content, their quality degrades – a failure researchers call ‘model collapse’. Generative AI is nonetheless moulding the internet into its own image, making the sources of information even more obfuscated.
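A toy statistical sketch (my construction, not any lab’s experiment) shows why recursion is corrosive: if each generation is trained only on samples of the previous generation’s output – modelled here as drawing with replacement from a finite pool of documents – rare material vanishes first and diversity steadily drains away:

```python
import random

# Toy sketch of recursive training, not a real pipeline: each 'generation'
# of a model is trained only on samples of the previous generation's
# output, modelled as drawing with replacement from a finite pool.
# Rare documents disappear first, and the pool's diversity collapses.
random.seed(1)  # reproducible run
pool = list(range(1000))  # generation 0: 1,000 distinct human-made documents

for generation in range(1, 11):
    pool = random.choices(pool, k=len(pool))  # train on the previous output
    print(f"generation {generation}: {len(set(pool))} distinct documents left")
```

Real training pipelines are far more elaborate, but the underlying dynamic – the tails of the distribution disappearing generation by generation – is the one the model-collapse literature describes.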
This picture of an AI ouroboros eating its own tail also brings to mind a conspiracy theory born of the online world: dead internet theory. According to its adherents, since 2016 the internet has been increasingly overtaken by bot activity and AI-generated content as a means of state control. While the latter part of this theory is extreme and unsubstantiated, it is true that the internet is being inundated with the output of generative AI models. Researchers at Cambridge and Oxford universities are even predicting that a large majority of internet content will be AI-generated by 2026. After all, people don’t just generate content using tools like ChatGPT, they also share that content on the internet. Wikipedia has already introduced its own ‘AI Cleanup’ project to combat the unsourced, poorly written AI-generated content that has proliferated across the site.
In October 2024, the Internet Archive was subject to three cyber attacks and went offline for several weeks. It was a chilling reminder of how its unique value for online history also makes it a target. During this period, I couldn’t use it to access my own research materials, which I have made publicly available there.
As part of that research, I conducted interviews with players of the massively multiplayer online game Wurm Online. Wurm is a fantasy ‘sandbox’ game, in which players can create and change things at will, rather than follow a predetermined narrative. I interviewed the curator of an in-game museum built to collect the art and culture of a gaming community that has existed for almost two decades. Because in-game objects, which bear the marks of their creators, degrade over time, the curator of Rockcliff Museum and other contributors have spent countless hours working digitally to reconstruct them. The care taken to preserve the provenance of these digital objects feels like anathema to the black box archive.