Evaluating the benefits of a quality assurance framework

Anne Ferger & Daniel Jettka | Paderborn University, Germany

Slides at digitalhumanists.github.io/EADH2021

We integrated a quality assurance framework into an existing research data creation workflow, using the version control system git for the integration. We then used a range of automatically generated outputs to evaluate the benefits of these quality control measures.

Data creation workflow

How is the research data created?

  • our example case is linguistic transcriptions, annotations and translations of Dolgan, Kamas, Evenki and Selkup created in the project INEL
  • further information on the project

  • linguistic corpus creation with manual and automated workflows combined (mainly using the software ELAN, FLEx and EXMARaLDA)
  • created data consists of transcription files with annotations and translations (xml), metadata files (xml) and audio or manuscript files (binary)
  • using git as version control system to allow for collaborative work on the corpora

Quality Assurance and Workflow Optimization

What is data quality in this case and how can it be assured?

  • in our case, high data quality means scientifically robust data (e.g. statistics and searches on the data are reliable)
  • manual creation leads to technical inconsistencies
  • transcriptions, translations, annotations and metadata are inter-dependent, so inconsistencies can be checked or fixed automatically

➔ checks for and fixes of inconsistencies

How can data quality assurance be integrated into the existing workflows?

  • the checks and fixes tailored to the special use case of the linguistic corpora were added to an extendable open source framework called corpus services
  • the quality checks and fixes were integrated into the git routines, so that no change to the original workflow was needed
  • using recurring scripts, selected checks and fixes are run on the current (git) version of the corpus
  • the file changes and additionally created outputs are added to the corpus using git
  • to minimize problems with this workflow, the checks were run at a dedicated hour at night when everyone had agreed to stop working on the data (a one-hour window also worked across time zones)
  • the scripts used were also made available on GitHub
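
The nightly routine described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual scripts: the function names `git` and `nightly_run`, the commit message, and the idea of passing checks as plain Python callables are all assumptions. Only standard git subcommands are used.

```python
import subprocess
from pathlib import Path
from typing import Callable, Iterable, List

Runner = Callable[[List[str], Path], int]


def git(args: List[str], repo: Path) -> int:
    """Run one git command inside the repository and return its exit code."""
    return subprocess.run(["git", *args], cwd=repo).returncode


def nightly_run(repo: Path,
                checks: Iterable[Callable[[Path], None]],
                run_git: Runner = git) -> bool:
    """Pull, run checks and fixes, then commit and push if anything changed.

    Returns True when a commit was created and pushed.
    """
    if run_git(["pull", "--ff-only"], repo) != 0:
        return False  # local copy could not be updated; leave the corpus untouched
    for check in checks:
        check(repo)  # hypothetical check/fix: may edit files or write reports
    run_git(["add", "--all"], repo)
    # `git diff --cached --quiet` exits with 1 when staged changes exist
    if run_git(["diff", "--cached", "--quiet"], repo) == 0:
        return False  # nothing changed tonight
    run_git(["commit", "-m", "Nightly automatic checks and fixes"], repo)
    run_git(["push"], repo)
    return True
```

Such a function could be started by any scheduler at the agreed hour. Passing the git runner as a parameter keeps the sequence of steps testable without touching a real repository.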


Checks & fixes

Some types of checks:

  • File coverage - files in system = referenced files?
  • XML handling (validation)
  • Contextual string matching - multiple whitespaces or linebreaks
  • Naming conventions - files, metadata, structural units
  • Structure and content of transcriptions, annotations, and metadata - annotation tiers, schemas
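
To make three of these check types concrete, here is a small sketch using only the Python standard library. It is not the corpus services implementation: the function names are invented for illustration, the set of referenced file names is assumed to have been extracted from the metadata beforehand, and the XML check only tests well-formedness, which is a weaker stand-in for validation against a schema.

```python
import re
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import Dict, List, Set


def file_coverage(corpus_dir: Path, referenced: Set[str]) -> Dict[str, Set[str]]:
    """Compare files on disk with the file names referenced in the metadata."""
    on_disk = {p.relative_to(corpus_dir).as_posix()
               for p in corpus_dir.rglob("*") if p.is_file()}
    return {"unreferenced": on_disk - referenced,   # on disk, not in metadata
            "missing": referenced - on_disk}        # in metadata, not on disk


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML (no schema validation)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# two or more spaces/tabs in a row, or any line break
MULTI_WS = re.compile(r"[ \t]{2,}|[\r\n]")


def whitespace_problems(units: List[str]) -> List[int]:
    """Return the indices of text units with doubled whitespace or line breaks."""
    return [i for i, text in enumerate(units) if MULTI_WS.search(text)]
```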

Some types of fixes:

  • Synchronization of names - metadata / files
  • XML handling (indentation)
  • Synchronization of values - vocabularies / dictionaries / Geo
  • Update of regular tasks - segmentation / pre-processing
  • Removal of superfluous info - auto-save, logs, empty units
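
Two of the fix types, removing empty units and normalizing XML indentation, could look like the sketch below. The function names and the default element name `event` are assumptions for illustration and do not reflect the framework's own code. `ET.indent` requires Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def remove_empty(root: ET.Element, tag: str) -> int:
    """Delete elements named `tag` that have no text and no children.

    Returns the number of removed elements. Tail text after a removed
    element is dropped as well, so this suits data-style XML only.
    """
    removed = 0
    for parent in list(root.iter()):      # snapshot, since the tree is modified
        for child in list(parent):
            is_empty = len(child) == 0 and not (child.text or "").strip()
            if child.tag == tag and is_empty:
                parent.remove(child)
                removed += 1
    return removed


def normalize(xml_text: str, empty_tag: str = "event") -> str:
    """Remove empty units, then re-indent the document consistently."""
    root = ET.fromstring(xml_text)
    remove_empty(root, empty_tag)
    ET.indent(root, space="  ")           # Python 3.9+
    return ET.tostring(root, encoding="unicode")
```

Consistent indentation is useful with git in particular: when every tool run produces the same layout, diffs between versions show only real content changes.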

Details: List of corpus functions

Overview of the current (curation) state of corpora - daily auto-update


Reports etc.

  1. Reports - assistance for manual curation tasks
  2. Overviews - finding inconsistencies in structural infos and metadata
  3. Error lists - tool-specific curation routines (e.g. import in EXMARaLDA Partitur Editor)
  4. Log files - statistical information about developments in curation
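
As an illustration of how log files can yield statistical information about curation, the sketch below counts reported problems per day. The log format (one tab-separated line per problem with date, check name and file name) is a hypothetical assumption, not the format written by the framework.

```python
from collections import Counter
from typing import Dict, Iterable


def problems_per_day(lines: Iterable[str]) -> Dict[str, int]:
    """Count logged problems per day from `date<TAB>check<TAB>file` lines."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue                      # skip lines that do not fit the format
        day, _check, _file = parts
        counts[day] += 1
    return dict(sorted(counts.items()))   # chronological for ISO dates
```

A falling count over successive nights would indicate that curation is reducing the number of open inconsistencies.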

Assistance for continuous “curation”

How can the enhancements to the data quality assurance be measured and evaluated?

Further information and tools:

Corpus Data

Brykina, Maria; Orlova, Svetlana; Wagner-Nagy, Beáta. 2020. "INEL Selkup Corpus." Version 1.0. Publication date 2020-06-30. http://hdl.handle.net/11022/0000-0007-E1D5-A. Archived in Hamburger Zentrum für Sprachkorpora. In: Wagner-Nagy, Beáta; Arkhipov, Alexandre; Ferger, Anne; Jettka, Daniel; Lehmberg, Timm (eds.). The INEL corpora of indigenous Northern Eurasian languages.

Däbritz, Chris Lasse; Kudryakova, Nina; Stapert, Eugénie. 2019. "INEL Dolgan Corpus." Version 1.0. Publication date 2019-08-31. http://hdl.handle.net/11022/0000-0007-CAE7-1. Archived in Hamburger Zentrum für Sprachkorpora. In: Wagner-Nagy, Beáta; Arkhipov, Alexandre; Ferger, Anne; Jettka, Daniel; Lehmberg, Timm (eds.). The INEL corpora of indigenous Northern Eurasian languages.

Gusev, Valentin; Klooster, Tiina; Wagner-Nagy, Beáta. 2019. "INEL Kamas Corpus." Version 1.0. Publication date 2019-12-15. http://hdl.handle.net/11022/0000-0007-DA6E-9. Archived in Hamburger Zentrum für Sprachkorpora. In: Wagner-Nagy, Beáta; Arkhipov, Alexandre; Ferger, Anne; Jettka, Daniel; Lehmberg, Timm (eds.). The INEL corpora of indigenous Northern Eurasian languages.

Also on GitLab



Anne Ferger



Daniel Jettka

