Twitter collection: Collecting the haystack

A person looking at web code on a screen. Photo: Josephine Kreutzer

How do you archive 3.5 billion tweets? The German National Library took up this challenge in 2023. In cooperation with its partner, the Science Data Center for Literature, it assembled an extensive collection of German-language tweets as part of a crowdsourcing project. After all, social media are part of Germany's hybrid media landscape, and their historical development also makes them part of our digital cultural heritage.


The German National Library's statutory collection mandate covers collecting, archiving and providing access to social media posts in the German language or relating to Germany. But when memory institutions add social media to their collections, they first face a fundamental decision: what do they want to collect, and how?

A choice has to be made between selected accounts and topics on the one hand, or sheer volume on the other. This largely depends on whether the posts are to be collected through the interface of a social media platform using classic web crawling technology or whether structured raw data are to be collected through a programming interface. Both methods are justifiable, but the method chosen will influence the possibilities for use.


Symbolic photo: A crowdsourcing initiative has added 220 million tweets to the German National Library's collection. Photo: Britta Woldering

A fundamental decision: needle or haystack?

Collecting social media posts through the web interface ensures that their look and feel are preserved. The archived pages can be viewed and read by humans, and users exploring the collection are more likely to study individual pages and visual aspects. Conversely, if structured data are collected, the look and feel of the web interface are lost, as are the visual design and the opportunities for interaction. That said, structured data are better suited to machine-based analytical processes such as text and data mining.

Volume is another consideration. Institutions usually opt for web crawling when collecting curated selections of data. However, if a large volume of posts is to be collected, preference is usually given to structured data, which can be accessed through an interface. Any collection of social media accordingly begins with a choice between the needle and the haystack.

Twitter and the research boom

The German National Library (Deutsche Nationalbibliothek, DNB) is no stranger to "archiving the web": it has been collecting websites in its web archive since 2012. Until now, however, social media had not been collected, not least because of their idiosyncrasies in terms of media technology. Social media are both a source of data and the object of various research strategies in the humanities, social sciences, IT, natural sciences and life sciences.

Twitter is a case in point. Until it was taken over by a consortium of investors led by Elon Musk, the platform was noted for its flexible programming interfaces and the options it offered for accessing the Twitter archive. Until the beginning of 2023, this accessibility led to a boom in research studies and the creation of extensive collections for research purposes. Access to the platform's archive is now largely subject to a fee, which has to all intents and purposes closed it for research and archiving purposes. The turbulent developments experienced by this and earlier platforms also show that the platforms themselves are not stable institutions.

Focus on German-language tweets

Given the urgency of the situation, we needed a selection criterion that was as broad as possible within the constraints of the collection mandate and that could be put into practice. We therefore decided to concentrate on German-language tweets. One disadvantage of this approach was that it did not cover English-language tweets by German politicians, for example. Even with this restriction to the German language, approximately 3.5 billion tweets remained to be archived.

To filter German-language tweets from the Twitter archive, we used the language code assigned by Twitter, the reliability of which had been confirmed by random sampling and analysis during the preparatory phase.

Call for donations: download quotas wanted

Twitter offered interfaces with various options for accessing the Twitter archive. The most extensive free access option was "Academic Research" access, which could be requested for research projects. This allowed users to download 10 million tweets from the archive every month. The project's core concept? To attract the largest possible number of researchers with "Academic Research" access who would be willing to donate part of their download quota to the initiative. We had calculated that it would take around 30 years to archive all German-language tweets with one account, but only about a month with 350 accounts.
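The estimate can be checked with a quick back-of-the-envelope calculation. The sketch below uses only the two figures given in the article (roughly 3.5 billion tweets and a quota of 10 million tweets per account per month). The function name is hypothetical, and the calculation assumes that every account donates its full quota each month, which a real initiative would be unlikely to achieve.

```python
TOTAL_TWEETS = 3_500_000_000   # approximate number of German-language tweets (from the article)
MONTHLY_QUOTA = 10_000_000     # tweets per "Academic Research" account per month (from the article)


def months_needed(accounts: int) -> int:
    """Whole months needed if every account uses its full quota each month."""
    per_month = accounts * MONTHLY_QUOTA
    # Integer ceiling division: round up to the next full month
    return (TOTAL_TWEETS + per_month - 1) // per_month


for accounts in (1, 350):
    months = months_needed(accounts)
    print(f"{accounts} account(s): {months} month(s), about {months / 12:.1f} years")
```

With these round numbers, a single account needs 350 months (a little over 29 years), while 350 accounts need one month. Since the total of 3.5 billion is itself approximate, the real figures could fall slightly either side.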

Crowdsourcing secures 220 million tweets

On 20 February 2023, we launched our call for individuals to become involved in the development of a German-language Twitter archive. The response was very enthusiastic and positive, but the number of active participants remained in the low double figures. The crowdsourcing initiative continued until the "Academic Research" access interface was closed in mid-April. We collected tweets with the language code "German". As well as the tweet texts, we stored the extensive metadata which Twitter also delivers. These include the number of retweets and likes, conversation IDs and hashtags.
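To illustrate what working with such structured data can look like, the sketch below filters tweets by language code and extracts the metadata fields mentioned above. It rests on several assumptions: that tweets are stored as one JSON object per line, and that the field names (`lang`, `public_metrics`, `conversation_id`, `entities.hashtags`) follow the tweet objects of Twitter's API v2. The storage layout of the actual collection is not described in this article, the function name is hypothetical, and the sample records are invented.

```python
import json
from typing import Iterable, Iterator


def german_tweets(lines: Iterable[str]) -> Iterator[dict]:
    """Yield a flat record for each tweet whose language code is "de"."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        tweet = json.loads(line)
        if tweet.get("lang") != "de":
            continue  # keep only tweets labelled as German
        metrics = tweet.get("public_metrics", {})
        hashtags = tweet.get("entities", {}).get("hashtags", [])
        yield {
            "id": tweet.get("id"),
            "text": tweet.get("text", ""),
            "conversation_id": tweet.get("conversation_id"),
            "retweets": metrics.get("retweet_count", 0),
            "likes": metrics.get("like_count", 0),
            "hashtags": [h.get("tag") for h in hashtags],
        }


# Invented sample records, for illustration only
sample = [
    json.dumps({
        "id": "1", "lang": "de", "text": "Guten Morgen #Leipzig",
        "conversation_id": "1",
        "public_metrics": {"retweet_count": 2, "like_count": 5},
        "entities": {"hashtags": [{"tag": "Leipzig"}]},
    }),
    json.dumps({"id": "2", "lang": "en", "text": "Good morning"}),
]

records = list(german_tweets(sample))
print(records)
```

Relying on the platform's own language label keeps the filter simple, but it also inherits any labelling errors, which is why checking the label against random samples beforehand matters.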

The Twitter data set collected during the crowdsourcing initiative spans a period in which it was not yet possible to upload images or videos. We collected around 220 million tweets from approximately 5.8 million accounts. These cover the period from March 2006 (the launch of Twitter) up to and including June 2011. The collection is 640 GB in size.

Use on site is possible

The tweets collected during the crowdsourcing initiative were supplemented by a large research corpus of German-language tweets. This corpus covers the period from 2014 to 2023 and encompasses some 2.8 billion tweets. This means that an extensive collection of tweets is now archived at the German National Library, with a gap covering the period from July 2011 to the beginning of 2014.

Since the Twitter collection consists of data which are subject to copyright and data protection legislation, it can only be used on site on the German National Library's premises in Leipzig and Frankfurt. Working with a data corpus of this kind requires certain technical environments and tools, e.g. for data analysis and visualisation, and users need to be skilled in handling structured raw data.

Twitter collection in use

The Twitter collection was first offered as part of a data sprint and the German National Library's digital humanities call for 2024, to which researchers with concrete projects can apply. In the future, it may be possible to use it on request outside the context of the call.

With its "haystack" of German-language tweets, the German National Library is breaking new ground in its services for users: social media as part of the publication world and an important aspect of contemporary life.

Last changes: 04.06.2024
