Launch of the cataloguing machine EMa
Diagram of the cataloguing system EMa (Figure: DNB)
The German National Library (DNB) catalogues its holdings both descriptively and by subject. This enables users to find what they are looking for by entering the title, publisher, subject heading or subject category. The DNB has been using machine-based processes and artificial intelligence (AI) for the subject cataloguing of media works since 2012. In April 2022, a newly developed system was rolled out: the "cataloguing machine" EMa. This system generates descriptive metadata to enrich the records in the DNB's catalogue. These data are also made available for external applications via the DNB's data services. The new system already generates DDC Subject Categories, subject headings from the Integrated Authority File (GND) and DDC Short Numbers in the subject category of medicine for media works in German and English. The previous cataloguing system will continue assigning Short Numbers for another 52 subject categories for the time being.
What exactly does subject cataloguing mean?
Subject cataloguing structures the DNB's extensive holdings by topic. The volume and proportion of digital media works are increasing steadily; digital publications already account for around two-thirds of the works collected every year. Machine-based subject cataloguing – alongside intellectual subject cataloguing – has therefore been common practice for about ten years. Subject cataloguing at the DNB is based on the assignment of media works to the respective Deutsche Nationalbibliografie series. The series go from A to T, each consisting of different types of media. Series A, for example, consists of monographs, series C of maps and series M of printed music and music publications. Serials from the publishers' book trade are catalogued intellectually as set out in the cataloguing concept of 1 July 2019. Series O, which encompasses all online publications, is mostly catalogued automatically.
Subject cataloguing encompasses verbal and classificatory cataloguing. Verbal cataloguing provides semantic contextualisation and networks data by linking the publications with the subject headings in the Integrated Authority File (GND). The DNB also carries out classificatory cataloguing using the Dewey Decimal Classification (DDC) system. All media are thus assigned to subject categories such as "Philosophy", "Medicine" or "Sport". The one hundred subject categories used by the DNB are based on the highest-level classes of the DDC. Some of the media works also undergo so-called "in-depth cataloguing" using DDC full numbers. The DDC offers a wide – almost unlimited – array of possible combinations. This is why full numbers have hitherto only been assigned intellectually. For machine-based processes, the DNB has developed a simplified classification scheme using DDC Short Numbers. This scheme encompasses a limited number of defined classes per subject category that can also be assigned automatically.
The modular architecture of EMa
The DNB started developing the new cataloguing system, the "cataloguing machine" EMa, in 2018 as part of an internal project. By the time the project was successfully concluded in 2022, the DNB's previous cataloguing system had already been largely replaced. As before, the new system operates within the DNB's own IT infrastructure. Fig. 1 is a schematic diagram of EMa's modular architecture. The main advantage of the modular concept is that EMa can be expanded flexibly and continually adapted to technological advances with little effort on the part of staff. Services and processes can easily be exchanged or expanded, and new functions can be added at any time.
Annif: an open-source toolkit from the National Library of Finland
The DNB now uses AI processes provided by the Annif toolkit to classify the publications it receives and to assign subject headings to them. This flexible set of tools for library applications was developed by the National Library of Finland. Annif supports machine learning and natural language processing and is available as an open-source software package. The processes were selected, prepared and tested for the cataloguing of publications. They are language-independent and can work with any specialised vocabulary in SKOS format, including the GND. More and more libraries in Germany and worldwide are using Annif or are interested in using it. As a result, communities are forming that work together and contribute new processes.
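As a rough illustration of how an application might talk to Annif, the sketch below builds a request for Annif's REST "suggest" endpoint and parses a response of the shape that endpoint returns. It is a minimal sketch under stated assumptions, not a description of how EMa calls Annif: the host, port and the project ID "gnd-ensemble-de" are hypothetical, and the request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a locally running Annif instance. Host, port and project IDs
# are hypothetical and do not describe the DNB's installation.
ANNIF_BASE_URL = "http://localhost:5000/v1"


def build_suggest_request(project_id, text, limit=10, threshold=0.0):
    """Build (but do not send) a POST request to Annif's suggest endpoint."""
    form = urllib.parse.urlencode(
        {"text": text, "limit": limit, "threshold": threshold}
    ).encode("utf-8")
    url = f"{ANNIF_BASE_URL}/projects/{urllib.parse.quote(project_id)}/suggest"
    return urllib.request.Request(url, data=form, method="POST")


def parse_suggestions(payload):
    """Turn a suggest response body into (label, uri, score) tuples, best first."""
    results = json.loads(payload)["results"]
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["label"], r["uri"], r["score"]) for r in ranked]


if __name__ == "__main__":
    request = build_suggest_request("gnd-ensemble-de", "Herzchirurgie bei Kindern", limit=5)
    print(request.get_method(), request.full_url)
    # To send the request against a running instance:
    # with urllib.request.urlopen(request) as response:
    #     print(parse_suggestions(response.read()))
```

Keeping request building and response parsing separate makes both parts easy to check without a running Annif server.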
Figure: DNB
The EMa process cycle
The entire machine-based cataloguing process is automated. Production operation is initiated, controlled and monitored by the EMa control system. Every day, the process starts with a list of the digital media works received the previous day. The text delivery service retrieves the digital media works from the text repository and the relevant metadata from the catalogue system, then creates the text basis for the subsequent analyses. First, a language recognition service identifies the language of the text. German- or English-language texts and their metadata are then passed to Annif by the classification and indexing service. Annif offers a wide range of options for the differentiated processing of the various media works to be catalogued.
The cataloguing service then converts the results of the machine-based cataloguing process (subject categories, DDC Short Numbers or GND subject headings) into the Pica+ format used by the DNB's cataloguing system. Finally, the EMa control system writes them to the metadata record for the media work (fig. 2). They can then be found immediately in searches in the DNB portal. The data record indicates whether the cataloguing data were generated intellectually or by machine.
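The process cycle described above can be sketched as a small pipeline: detect the language, skip anything that is not German or English, request suggestions, and write the results to the record with a provenance flag. This is only an illustrative sketch. All names are hypothetical, the language detector and the suggestion function are trivial stand-ins for the real services, and the conversion to Pica+ is replaced by a plain dictionary because the article does not specify the fields used.

```python
from dataclasses import dataclass, field

SUPPORTED_LANGUAGES = {"de", "en"}  # per the article: German and English texts only

# Hypothetical stand-in data for a real language recognition service.
LANGUAGE_MARKERS = {
    "de": {"der", "die", "das", "und"},
    "en": {"the", "and", "of"},
}


@dataclass
class MediaWork:
    identifier: str
    text: str
    subject_data: list = field(default_factory=list)


def detect_language(text):
    """Toy language detection: count marker words per language."""
    words = set(text.lower().split())
    hits = {lang: len(words & markers) for lang, markers in LANGUAGE_MARKERS.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else "unknown"


def suggest_subjects(text, language):
    """Hypothetical stand-in for the classification and indexing service."""
    return [("Example subject", 0.8)]


def catalogue(work):
    """Run one media work through the cycle and record machine provenance."""
    language = detect_language(work.text)
    if language not in SUPPORTED_LANGUAGES:
        return work  # not processed; the record stays unchanged
    for label, score in suggest_subjects(work.text, language):
        # Assumption: a plain dict replaces the real Pica+ conversion step.
        work.subject_data.append(
            {"label": label, "score": score, "provenance": "machine"}
        )
    return work


def run_daily(works):
    """Process the list of media works received the previous day."""
    return [catalogue(work) for work in works]


if __name__ == "__main__":
    batch = [
        MediaWork("w1", "Die Geschichte der Medizin und das Herz"),
        MediaWork("w2", "Lorem ipsum dolor"),
    ]
    for work in run_daily(batch):
        print(work.identifier, work.subject_data)
```

Storing the provenance with each entry mirrors the statement that the data record indicates whether cataloguing data were generated intellectually or by machine.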
Figure: DNB
Which processes are used?
The DNB is currently using configurations for the following AI processes:
- SVC (Linear Support Vector Classification): a learning process used to assign DDC Subject Categories.
- Omikuji-Bonsai (a tree-based process): a learning process used for the assignment of DDC Short Numbers to medical publications.
- Omikuji-Bonsai and MLLM (Maui-like Lexical Matching) as an ensemble: a learning process and a lexical process are combined in order to assign subject headings to online publications and university publications.
- Omikuji-Bonsai, Omikuji-Attention, stwfsa, fastText and MLLM as an ensemble: three learning processes (Omikuji-Bonsai, Omikuji-Attention, fastText) and two lexical processes (stwfsa, MLLM) are combined to assign subject headings from a section of the GND to literature for children and young adults.
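To show the general idea of an ensemble, the following sketch combines the suggestions of several processes by taking a weighted average of their scores per subject. It is a simplified illustration, not the DNB's configuration: the function name, the weights, the subject labels and the scores are all invented for the example.

```python
from collections import defaultdict


def combine_suggestions(backend_results, weights, limit=5):
    """Weighted average of per-subject scores across several processes.

    backend_results: {process_name: {subject: score}}
    weights:         {process_name: weight}
    A subject that a process does not suggest counts as score 0 for it.
    """
    total_weight = sum(weights[name] for name in backend_results)
    combined = defaultdict(float)
    for name, scores in backend_results.items():
        for subject, score in scores.items():
            combined[subject] += weights[name] * score
    averaged = {subject: value / total_weight for subject, value in combined.items()}
    ranked = sorted(averaged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]


if __name__ == "__main__":
    # Invented scores from one learning and one lexical process.
    results = {
        "omikuji-bonsai": {"Kardiologie": 0.8, "Chirurgie": 0.4},
        "mllm": {"Kardiologie": 0.6, "Kind": 0.5},
    }
    for subject, score in combine_suggestions(results, {"omikuji-bonsai": 1, "mllm": 1}):
        print(f"{subject}: {score:.2f}")
```

A subject suggested by both processes ends up ranked above one suggested by only a single process, which is the main reason for combining learning and lexical approaches.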
How is high quality guaranteed?
The technical workflows are reviewed every day. There are also regular inspections of random samples. Experience has shown that the quality of the cataloguing results depends on many different factors. These include the scope and quality of the training data, the suitability of the algorithms for the use cases, how well the models fit the data, and the homogeneity of the media works to be catalogued in each case group. The availability of sufficient up-to-date training data is particularly important for the learning processes. It is equally important for the quality of the machine-based cataloguing that the GND is complete and up to date.
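One common way to quantify the result of such a sample inspection is to compare the machine-assigned subjects with intellectually assigned reference subjects and to compute precision, recall and F1. The article does not say which measures the DNB uses, so the sketch below is an assumption for illustration only. The function name and the sample data are hypothetical.

```python
def evaluate_sample(sample):
    """Micro-averaged precision, recall and F1 over a sample.

    sample: list of (machine_subjects, reference_subjects) pairs, each a set.
    """
    true_pos = false_pos = false_neg = 0
    for machine, reference in sample:
        true_pos += len(machine & reference)   # assigned and correct
        false_pos += len(machine - reference)  # assigned but not in the reference
        false_neg += len(reference - machine)  # in the reference but missed
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Invented sample: two media works with machine and reference subjects.
    sample = [
        ({"Kardiologie", "Chirurgie"}, {"Kardiologie", "Kind"}),
        ({"Sport"}, {"Sport"}),
    ]
    for name, value in evaluate_sample(sample).items():
        print(f"{name}: {value:.2f}")
```

Micro-averaging pools all assignments in the sample before computing the ratios, so works with many subject headings weigh more than works with few.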
What comes next?
The DNB aims to increase its machine-based cataloguing activities step by step and to expand them to cover other groups of publications. The components of EMa are continually being developed, improved and supplemented to this end. Which new methods and technological advances can the DNB make use of in its applications? And can this improve the quality of the cataloguing results still further? These questions are being explored as part of the research project "Automatic Cataloguing System". The Federal Commissioner for Culture and Media is funding the DNB's project as part of its National AI Strategy.
The paramount goal? To achieve a high degree of reliability for the cataloguing data, irrespective of whether they were generated intellectually or automatically.
Last changes:
19.09.2023