Digital humanities

On this page you will find information about the corpus tools that I am developing and about the spoken language corpora that I have created.

Corpus tools

The following corpus tools are freely available.

act - Aligned Corpus Toolkit for R

The Aligned Corpus Toolkit (act) is a tool for corpus linguists who work with time-aligned transcription data. It is a cross-platform library for the statistical environment R. It offers search functions for transcriptions (full-text search, normalized search, concordances, etc.), import and export of Praat TextGrid files, export to ELAN files, the creation of batch lists for cutting audio and video files with FFmpeg, the creation of printable transcripts in the style of conversation analysis, and an integrated workflow with Praat. The package is itself written in R and can be extended by other users.

Binary package for R: Download
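To illustrate the idea behind a batch list for cutting media files, here is a minimal sketch in Python (not R, and not the act package's own code). It builds one FFmpeg command line per time-aligned segment. The function name `ffmpeg_cut_commands`, the segment labels, and the file names are hypothetical; only the FFmpeg options `-i`, `-ss` (start position) and `-t` (duration) are taken from FFmpeg itself.

```python
import shlex


def ffmpeg_cut_commands(media_file, segments, extension="wav"):
    """Return one FFmpeg command line per (label, start_sec, end_sec) segment.

    Hypothetical helper for illustration; it only builds the command
    strings and does not run FFmpeg.
    """
    commands = []
    for label, start, end in segments:
        if end <= start:
            raise ValueError(f"segment {label!r} has no positive duration")
        args = [
            "ffmpeg",
            "-i", media_file,              # input recording
            "-ss", f"{start:.3f}",         # start of the cut, in seconds
            "-t", f"{end - start:.3f}",    # duration of the cut, in seconds
            f"{label}.{extension}",        # output file named after the segment
        ]
        # shlex.join quotes arguments that contain spaces or shell characters
        commands.append(shlex.join(args))
    return commands


# Example: two search hits in one (made-up) recording
hits = [("hit_001", 12.5, 15.25), ("hit_002", 60.0, 61.5)]
for line in ffmpeg_cut_commands("interview.wav", hits):
    print(line)
```

The resulting lines could be written to a shell script or batch file and run in one go, which is the general idea of a batch list.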

The Transformer - A corpus tool on Windows

The Transformer is a corpus tool for researchers who work with time-aligned transcribed linguistic data. It is aimed at conversation analysts, phoneticians, anthropologists, and other social scientists who want to analyze language in digital audio or video data. The Transformer manages and converts transcribed, time-aligned linguistic data. It is not an annotation tool itself, but it allows you to change the format of your data and save it in a variety of output formats. In addition, The Transformer provides functions for searching and organizing corpora.

For more information, visit the separate web site.

TextGrid to Transcript - Converting TextGrids to print transcripts in Praat

"TextGrid to Transcript" is a tool to generate print transcripts in the style of conversation analysis based on Praat TextGrids. "TextGrid to Transcript" is a script that runs within Praat. It offers basic possibilities to modify the layout of a transcript, such as insertion of line numbers, selection of the tiers to be exported, formatting of the tier names/speakers and adjusting the width of the transcript.

  • Script (right click and select "Save as..."): Download
  • Instructions for using the script in English: Download
  • Instructions for using the script in German: Download
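The layout options described above (line numbers, speaker labels, transcript width) can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Praat script itself: the function name `format_transcript` and the tuple layout `(speaker, start, end, text)` are hypothetical, and real TextGrid parsing is left out.

```python
import textwrap


def format_transcript(intervals, width=60, number_lines=True):
    """Turn (speaker, start_sec, end_sec, text) intervals into printable lines.

    Hypothetical sketch: intervals are ordered by start time, every printed
    line is numbered, and long turns wrap onto unlabeled continuation lines.
    """
    ordered = sorted(intervals, key=lambda iv: iv[1])
    if not ordered:
        return []
    label_width = max(len(iv[0]) for iv in ordered) + 2  # room for ": "
    number_width = 4 if number_lines else 0              # e.g. "01  "
    text_width = width - number_width - label_width
    if text_width < 1:
        raise ValueError("width is too small for the speaker labels")
    lines = []
    for speaker, _start, _end, text in ordered:
        chunks = textwrap.wrap(text, text_width) or [""]
        for i, chunk in enumerate(chunks):
            label = f"{speaker}:" if i == 0 else ""       # label on first line only
            number = f"{len(lines) + 1:02d}  " if number_lines else ""
            lines.append(f"{number}{label:<{label_width}}{chunk}")
    return lines


# Example with two made-up intervals, given out of order
example = [("BEA", 2.1, 3.0, "sí"), ("ANA", 0.0, 2.0, "hola qué tal")]
print("\n".join(format_transcript(example, width=40)))
```

Numbering every printed line, rather than every turn, follows the common practice in conversation-analytic transcripts of citing line numbers; whether the actual script numbers lines this way is an assumption.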

[moca] multimodal oral corpus administration

I have been involved in the development and redesign of [moca], an online system for multimodal oral corpus administration. [moca] stores audio and/or video recordings and their accompanying transcription files. The transcription files are time-aligned: in addition to the transcribed text, they contain speaker information and the timing of each stretch of talk. This makes it possible to jump to the corresponding point in the media file from any position in a transcription, directly in a web browser.

For more information, visit the separate web site.

Corpora

ICAS - Instructing Corporeal Arts and Skills

ICAS is a corpus of spoken Spanish that documents authentic Instructions of Corporeal Arts and Skills. The corpus focuses on dance classes (Argentine tango, Latin dance), but also comprises sports classes (e.g. aikido, surfing), medical instruction (e.g. first aid, physical rehabilitation) and vocational training (e.g. construction, welding). All classes were recorded with a dual-camera set-up and body microphones on the teachers. The transcriptions are time-aligned and therefore compatible with tools like Praat, ELAN, EXMARaLDA, etc.

This corpus is currently being built within the project "Body knowledge. Multimodal practices for instructing corporeal-performative knowledge in interaction" (for further details, visit http://www.body-knowledge.org).

Size: to date ~130,000 words, total length ~76 hours, 60 recordings (the corpus is still growing)

Funding: Supported by a grant from the Ministry of Science, Research and the Arts of Baden-Württemberg and the Albert-Ludwigs-University of Freiburg.

cespla - Corpus de Conversaciones ESPontáneas PLAtenses

cespla is a linguistic corpus of everyday conversations from the region along the River Plate (Argentina and Uruguay). It consists mainly of dinner conversations among friends and family, recorded mostly in Buenos Aires and La Plata. Most of the recordings are audio only; some are video recordings. All transcriptions are time-aligned.

For more information, visit the separate web site.

Size: ~385,000 words transcribed, total length ~164 hours, 60 recordings

Funding: Wissenschaftliche Gesellschaft (Freiburg im Breisgau), Verein für Gesprächsforschung e.V. (Prize for the best PhD project)

TTI. Tango Teacher Interviews

This corpus of spoken Spanish is a collection of interviews with teachers of Argentine tango, video-recorded in Buenos Aires and La Plata (Argentina). All transcriptions are time-aligned.

Size: ~55,000 words transcribed, total length ~10 hours, 32 recordings

escucho. Radio call-in program "Te escucho" by Luisa Delfino

"Te escucho" is a well-known Argentine call-in radio program hosted by journalist Luisa Delfino. The program is oriented towards advice-giving and life coaching. The corpus consists of audio recordings of the internet broadcasts and time-aligned transcriptions.

Size: ~40,000 words transcribed, total length ~13 hours, 18 recordings

SCHALL - SprachCorpus Heutiger ALLtagsgespräche

SCHALL is a corpus of spontaneous everyday conversations in German. All recordings are audio only. Some transcriptions are time-aligned; the others are running text without alignment. The corpus was created in collaboration with Stefan Pfänder.

For more information, see www.sprachcorpus.de.