Expanding our digital indexing processes


2,569,764 physical and digitally accessible units were added to the DNB's holdings in 2022. In cataloguing, humans and machines work hand in hand. (Photo: Stephan Jockel)

Subject cataloguing: human and machine in duet

The German National Library (DNB) uses intellectual and machine-based processes to catalogue media works. Cataloguing means using metadata to describe a media work – a book, a journal, a website, a map or musical score – in such a way that it can be found in an online catalogue or database. The metadata generated from various sources (subject headings, classifications) should provide optimum support when carrying out information research.

Subject cataloguing (SC) pursues the goal of integrating intellectual – i.e. performed by humans – and machine-based subject cataloguing more closely. Here the cataloguer assumes the role of the "human in the loop". Humans have extensive experience and recognise errors, while machines catalogue quickly and can cope with large volumes of data. Combining the strengths of both optimises the system as a whole. The development and training of AI requires the librarians' intervention at many points.

Permanent use of "digital assistant" DA-3

Digital Assistant DA-3 has been providing support for intellectual cataloguing processes since 2020. This web-based platform aggregates data from various sources and makes them available as proposals for use in intellectual subject cataloguing. Existing machine-generated cataloguing data or data from other libraries can be efficiently re-used for this purpose. The success of the developmental work and tests enabled DA-3 to be put into routine operation in 2022.

More information about Digital Assistant DA-3 is available at the da-3.de website.

Workshop on computer-supported subject cataloguing

In November 2022, the DNB again co-organised the workshop "Computer-Supported Subject Cataloguing". The online event facilitated the exchange of information on new cataloguing tools in the German-speaking countries. This workshop, which is already the sixth in the series, was the scene of a lively discussion on the cataloguer's tasks – and on user expectations with regard to the optimisation of subject cataloguing to achieve satisfactory search results. The user community and interested visitors have been able to access a separate website for DA-3 as a communication platform since 2022. This website is operated by the Library Service Centre Baden-Württemberg (BSZ).

Quality assurance of machine-generated cataloguing data

Machine-generated subject headings and classifications were again subjected to quality assurance procedures in 2022. First, the Finnish software Annif, which is now in use at the DNB, generated the corresponding data. Cataloguers then reviewed random samples of these data intellectually. The human evaluations and training serve to adjust and improve the algorithm.
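The sampling step described above can be illustrated with a minimal sketch. It is not the DNB's actual procedure: the record structure, the reviewer judgements and the choice of precision as the quality measure are all assumptions made for illustration.

```python
import random

def sample_for_review(records, sample_size, seed=0):
    """Draw a reproducible random sample of machine-catalogued records."""
    rng = random.Random(seed)
    return rng.sample(records, min(sample_size, len(records)))

def precision(machine_terms, accepted_terms):
    """Share of machine-assigned terms that the reviewer accepted."""
    machine = set(machine_terms)
    if not machine:
        return 0.0
    return len(machine & set(accepted_terms)) / len(machine)

# Hypothetical records: machine-assigned subject headings per publication.
records = [
    {"id": "pub-1", "machine": ["Bibliothek", "Katalogisierung"]},
    {"id": "pub-2", "machine": ["Maschinelles Lernen"]},
    {"id": "pub-3", "machine": ["Musik", "Notendruck", "Barock"]},
]

# Hypothetical reviewer judgements: the headings a cataloguer accepted.
reviews = {
    "pub-1": ["Bibliothek", "Katalogisierung"],
    "pub-2": [],
    "pub-3": ["Musik", "Barock"],
}

sample = sample_for_review(records, sample_size=3)
scores = {r["id"]: precision(r["machine"], reviews[r["id"]]) for r in sample}
mean_precision = sum(scores.values()) / len(scores)
```

In this sketch the per-record scores and their mean are the kind of figures that could feed back into the adjustment of the algorithm.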

A blog post provides more information about the cataloguing machine.

GND-mul project successfully concluded

Which foundations are required for the generation of high-quality cataloguing data? Well-maintained authority data! This refers to a knowledge network of terms with thematic interrelations. Our colleagues in SC also devoted themselves to this editorial task. Due to the successful completion of the GND-mul project at the end of 2022, the ongoing linking of multilingual authority data will now become one of the decisive tasks involved in maintaining authority data.

The GND-mul project consisted of the creation of standardised structures and access points for the presentation and re-use of concordances (links) between the Integrated Authority File (GND) and other vocabularies (thesauri). Mappings to specialised terminology in English, French, Italian and Spanish and specialised thesauri such as the STW Thesaurus for Economics (STW), the Thesaurus for the Social Sciences (TheSoz) and Medical Subject Headings (MeSH) facilitate the linking of collected objects, multilingual metasearches and imports of external data.

Numerous virtual meetings were held with representatives of other libraries and information institutions during the year, mainly with the Swiss National Library, the Bibliothèque Nationale de France and the Biblioteca Nazionale Centrale di Firenze.

AI in libraries: machine-based cataloguing processes

The dynamic development of digital technologies is opening up new ways in which collections can be developed, expanded, catalogued and used for research tasks. The federal government's AI strategy aims to support the research, development and application of innovative technologies. The German National Library (DNB) is taking part in this initiative with its "Automatic Cataloguing System" research project. Its work involves the use of advances in AI for the subject cataloguing of collected online publications. The project is receiving funding from the Federal Government Commissioner for Culture and the Media.

Which current developments in the fields of machine learning and natural language processing are suitable for the reliable thematic classification of German-language publications using subject headings in the Integrated Authority File (GND)? This is the central question being explored by the researchers. The goal: cataloguing data that describes the content of the publication as completely and accurately as possible. The machine-based methods should be able to identify the themes, locations and persons involved in any given media work. Furthermore, they should also be able to link the publication with the appropriate GND subject headings as accurately as possible.

Even when dealing with identical terms that have different meanings, they should be able to correctly identify and classify the semantic context. The GND currently contains around 1.35 million so-called "semantic concepts" for subject cataloguing purposes. And it by no means contains all the relevant content – this is another factor that the software must be able to recognise. An Extreme Multi-Label Classification (XMLC) problem of this kind can only be solved with a mix of special methods.
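One simple way to picture such a "mix of methods" is to combine the label scores of several independent approaches and to keep only the best-scoring subject headings. The following sketch is an illustration under stated assumptions, not the method used in the project: the backend names, the scores, the weights, the threshold and the example headings are all hypothetical.

```python
from collections import defaultdict

def combine_scores(backend_outputs, weights):
    """Weighted average of per-label scores from several backends."""
    total = sum(weights.values())
    combined = defaultdict(float)
    for name, scores in backend_outputs.items():
        for label, score in scores.items():
            combined[label] += weights[name] * score / total
    return dict(combined)

def select_labels(combined, threshold=0.4, limit=5):
    """Keep at most `limit` labels whose combined score reaches the threshold."""
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return [label for label, score in ranked if score >= threshold][:limit]

# Hypothetical scores for a text about monetary policy. The word "Bank"
# is ambiguous: credit institution or bench.
backend_outputs = {
    "lexical": {"Bank (Kreditinstitut)": 0.6, "Bank (Sitzmöbel)": 0.5},
    "statistical": {"Bank (Kreditinstitut)": 0.8, "Geldpolitik": 0.9},
}
weights = {"lexical": 1.0, "statistical": 1.0}

combined = combine_scores(backend_outputs, weights)
suggested = select_labels(combined)
```

Here the purely lexical match on the bench sense is outvoted because the second backend does not support it, which is the kind of disambiguation by context described above.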

Oriented towards open technology and open source

The DNB has been using automated processes for the subject cataloguing of online publications for around ten years. Testing and introducing newer methods for processing and analysing natural-language texts is intended to improve the quality of the results further. A wide range of methods is being investigated as part of the project, with preference given to open-source tools. The DNB intends to integrate suitable methods into its cataloguing machine and will in turn make the results of its development work available to the community as open-source tools.

AI as an opportunity and challenge

A workshop addressing these themes was held in November 2022 under the auspices of the "Network for Machine-Based Cataloguing Processes". The main topic this year was the use of artificial intelligence (AI) and digital humanities (DH) in libraries. Participants from Berlin State Library (the Prussian Cultural Heritage Foundation), the Bavarian State Library, the TIB - Leibniz Information Centre for Science and Technology, the ZBW - Leibniz Information Centre for Economics and the German National Library all came to Frankfurt to discuss projects and developments relating to the machine-based processing and analysis of data, texts and images.

You will find more information about the themes of the workshop in our blog.

Linking data: expansion of services at the German National Library

The German National Library (DNB) operates the platform Culturegraph. Its aim: to strengthen the links between the holdings of the DNB and those of other libraries, library networks and cultural institutions and to enrich them with catalogue data. This makes the holdings more accessible. The new machine-based process has been in operation since 2022 – and is enjoying great success.

Culturegraph contains metadata from the German National Library and from library networks in Germany and Austria. Data are compared and linked using various methods of data analysis. Cataloguing data can thus be transmitted to other libraries, while information from external sources can be used to enrich the data stock. For example, when the titles belonging to one work are grouped into a cluster, an authority data link that is present in only one bibliographic data record of the cluster can be imported into all the other records in that cluster. The results of the analytical and linking work are disseminated in multiple networks.
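The idea of passing an authority data link on to every record in a work cluster can be sketched as follows. This is a simplified illustration, not Culturegraph's implementation: the record structure, the identifiers and the absence of any plausibility checks are assumptions.

```python
def propagate_links(cluster):
    """Copy authority links found on any record of a work cluster to all
    records in that cluster. Returns the number of links added."""
    all_links = set()
    for record in cluster:
        all_links |= record["authority_links"]

    added = 0
    for record in cluster:
        missing = all_links - record["authority_links"]
        record["authority_links"] |= missing
        added += len(missing)
    return added

# Hypothetical cluster: three bibliographic records describing one work.
# Only the first record carries a link to an authority record.
cluster = [
    {"id": "dnb-001", "authority_links": {"gnd:EXAMPLE-1"}},
    {"id": "network-a-042", "authority_links": set()},
    {"id": "network-b-317", "authority_links": set()},
]

added = propagate_links(cluster)
```

After the call, all three records in the example carry the same link, and `added` counts the two links that were newly created.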

300,000 links to the GND generated

In 2022, the German National Library automatically added around 300,000 links from its bibliographic data records to the authority data records for persons in the Integrated Authority File (GND). December 2022 saw the launch of a workflow that bundles the data in Culturegraph on a daily basis and adds around 1,000 links every day. The library networks are also regularly supplied with link lists for their data records.

For some time now, the DNB has been using Culturegraph to link data from the Open Researcher and Contributor ID (ORCiD) service with the GND whenever the respective data records for persons can be uniquely matched. An ORCiD is the unique identifier assigned to a researcher on the platform of the same name. ORCiD data records are created by the researchers themselves. In addition to the ORCiD and the person's name, they can contain other information, e.g. variants of the name, institutions with which the person is associated, external identifiers from other organisations, and publications.

New proposals process developed

Since the beginning of 2022, the DNB has been testing a machine-based process that generates proposals for new authority data records for persons in the Integrated Authority File – thus supporting the cataloguing system. The process uses the ORCiDs assigned to a specific bibliographic record. If there is as yet no GND data record for the person concerned, the ORCiD service is contacted and a proposed GND record generated from the ORCiD record. Additionally, a link is created between the bibliographic data record and the proposed record by entering the title in the proposed record. This results in proposals that can be converted into high-quality authority data records for persons with a minimum of manual effort.
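The steps of this process can be sketched in a few lines. The sketch rests on several assumptions: the record structures are invented, the ORCiDs are placeholders, and the ORCiD service is replaced by a local lookup (`fetch_orcid_record`) so that no real interface is implied.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    """A proposed authority record for a person (hypothetical structure)."""
    orcid: str
    name: str
    titles: list = field(default_factory=list)

def build_proposals(bib_records, gnd_by_orcid, fetch_orcid_record):
    """Create one proposal per ORCiD that has no GND person record yet and
    enter every title that carries this ORCiD in the proposal."""
    proposals = {}
    for record in bib_records:
        for orcid in record["orcids"]:
            if orcid in gnd_by_orcid:
                continue  # a GND person record already exists
            if orcid not in proposals:
                person = fetch_orcid_record(orcid)  # stand-in for the service
                proposals[orcid] = Proposal(orcid=orcid, name=person["name"])
            proposals[orcid].titles.append(record["title"])
    return proposals

# Hypothetical data with placeholder identifiers.
bib_records = [
    {"title": "Title A", "orcids": ["0000-0000-0000-0001"]},
    {"title": "Title B", "orcids": ["0000-0000-0000-0001", "0000-0000-0000-0002"]},
]
gnd_by_orcid = {"0000-0000-0000-0002": "gnd:EXAMPLE-2"}
orcid_service = {"0000-0000-0000-0001": {"name": "Example, Researcher"}}

proposals = build_proposals(bib_records, gnd_by_orcid, orcid_service.__getitem__)
```

In the example, only the first ORCiD leads to a proposal, because the second one is already matched to a GND record. The proposal lists both titles, which corresponds to the link between bibliographic record and proposed record described above.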

If a proposed data record for an author is found while cataloguing a title, for example, this can be converted into a GND authority data record for a person in just one click. The title in the catalogue is then automatically linked with the new GND data record created in this way. The systematic processing of proposals by the GND editorial team is also supported by the automated generation of rankings.

Last changes: 19.09.2023
