The Data Seal of Approval board hereby confirms that the Trusted Digital repository BAS CLARIN complies with the guidelines version 2010 set by the Data Seal of Approval Board.
The afore-mentioned repository has therefore acquired the Data Seal of Approval of 2010 on May 22, 2013.
The Trusted Digital repository is allowed to place an image of the Data Seal of Approval logo corresponding to the guidelines version date on their website. This image must link to this file which is hosted on the Data Seal of Approval website.
The Data Seal of Approval Board
|Guidelines Version:||2010 | June 1, 2010|
|Guidelines Information Booklet:||DSA-booklet_2010.pdf|
|All Guidelines Documentation:||Documentation|
|Seal Acquiry Date:||May 22, 2013|
|For the latest version of the awarded DSA for this repository please visit our website:||
|Previously Acquired Seals:||
|This repository is owned by:||
The BAS data repository (https://clarin.phonetik.uni-muenchen.de/BASRepository/; last accessed 27.02.2013) contains speech and multimodal corpora for research, education and technology development. The corpora have been created by BAS alone, in collaboration with other academic institutions or industrial partners, or entirely by external partners.
Each corpus contains a plain text description file naming the data producers, outlining the contents, means of data collection, and structure and documentation of the corpus. If corpora are provided by external partners, either these partners also provide this description file, or the BAS creates this description file in collaboration with the external partner.
Each corpus undergoes at least one internal validation. This validation checks the formal consistency of the corpus, but not the content. The validation report becomes part of the corpus documentation and is visible in the repository and corpus web site.
BAS corpus collections are subject to strict ethical standards: participation is voluntary; participants sign an informed consent form or, in the case of web-based data collection, confirm that they have read it; and, where required, the data collection procedure must have been approved by an ethics committee. BAS requests that data produced in collaboration with or by external partners be subject to comparable ethical standards. However, it cannot systematically verify whether the data it receives was collected according to these rules.
The data formats used depend on the type of data: signal data generally comes in binary formats, whereas annotation, documentation and metadata usually come in text formats. The BAS requires the use of well-documented and/or publicly specified or de facto standard data formats:
1.1) audio: WAV, AIFF, NIST, alaw, mlaw
1.2) video: mpeg2, mp4, QuickTime, AVI
1.3) sensor: device-dependent formats
2.1) data: plain text (ASCII for legacy corpora, UTF-8 otherwise), XML
2.2) annotation: TextGrid, BPF, EAF
2.3) metadata: CMDI (+ IMDI for legacy corpora)
2.4) documentation: plain text, HTML, PDF
When necessary, BAS converts data from proprietary to open formats and migrates data to accommodate technology development.
The file formats listed here are fully supported:
For file formats listed here but not in the above document, the BAS will try to offer support:
All BAS resources can be accessed and searched via the repository web interface.
BAS corpora in the data repository come with metadata in CMDI and DC format. External partners are requested to provide CMDI metadata along with their corpora. The BAS offers assistance in creating these metadata where necessary.
BAS has defined two CMDI profiles (media.corpus.profile and media.session.profile) for speech and multimodal corpora. These profiles are registered in the CMDI registry.
Furthermore, BAS has implemented a tool to facilitate the creation of new and update of existing CMDI metadata for media corpora and sessions.
The technical compliance of the submitted metadata to the IMDI schema is validated during ingest.
The BAS was founded in 1995 by the Bavarian Ministry of Science and Education and is hosted by the Institute of Phonetics and Speech Processing at LMU Munich. The Institute of Phonetics has charged two permanent staff members (Florian Schiel, Christoph Draxler) with running the BAS.
The main goal of the BAS is to create and make available high quality speech and multimodal corpora for research, education and industrial speech and multimodal technology development.
The BAS has been designated the official data repository and archive for a number of speech and multimodal corpora from national academic and/or industrial data collection projects, e.g. Verbmobil, SmartKom and SmartWeb, BITS synthesis, Ph@ttSessionz, and others. Wherever license terms allow it, the BAS has added corpora created during collaboration projects to its catalogue and has continually maintained and updated these corpora so that they remain accessible.
The BAS closely cooperates with the European Language Resources Association ELRA and the Linguistic Data Consortium (LDC), as well as with other resource providers, and it has organized and contributed to workshops and conferences on speech and multimodal corpora.
A document with the mission statement is available here:
The repository is not a legal entity in its own right; it is part of the Institute of Phonetics and Speech Processing, which in turn is part of the Ludwig Maximilian University Munich. The legal status of the Ludwig Maximilian University is “Körperschaft des Öffentlichen Rechts”. The repository is funded by the Institute of Phonetics and Speech Processing. The repository has agreements with external depositors about the right to archive the data. The depositors themselves are responsible for compliance with any legal regulations in the area where the data is collected. The repository enables the depositors to restrict access to their resources at various levels. Distributed copies elsewhere may not be made available to third parties.
Online template contract, code of conduct and terms of usage:
The repository stores its resources on its own server in its own local network. A backup is performed to the Leibniz Rechenzentrum (LRZ) on a daily basis. In addition to the backup, the BAS has archived all its corpora using the LRZ archive service on special archive nodes that are permanent, i.e. that do not expire (regular archive nodes expire 10 years after the original date of submission).
Furthermore, a subset of the corpora is also held on optical media in a separate location in the building of the repository.
The local storage hardware is replaced at irregular intervals, depending on the technical requirements.
The processes for ingesting new corpora, updating metadata, updating corpus content (including a full versioning system), moving the server location, and maintaining and moving the web services server, as well as the maintenance software used, are documented in text files in a working space accessible to the CLARIN employees.
Introduction to the LRZ backup storage and guidelines for usage:
In addition to the bit stream preservation steps described for the previous guideline, some measures are taken to improve the chance that the data remains interpretable in the future. The number of accepted file formats is limited, to make future conversions to other formats more feasible. As far as possible, open (non-proprietary) file formats are used. For textual resources, XML formats are used whenever possible, so that the files can still be interpreted even if the tool that was used to create them no longer exists. Text is encoded in Unicode to ensure future interpretability.
Would be good to mention this as part of a publicly accessible preservation policy.
At BAS, the workflow for archiving data consists of the following phases:
1) corpus creation: specification, recording, annotation
2) post processing: formatting, metadata generation, reporting
3) validation: formal consistency checks (completeness, technical validation)
4) distribution and exploitation (repository)
Phase 1) and, in part, phase 2) are outlined in the BAS cookbooks:
Corpora are either created by BAS according to the above workflow, or they are provided by external providers. Data produced externally enters the workflow either at phase 2) or at phase 3).
In phase 4), data enters the repository via one of two ways: ingest or update.
Ingest means that a new corpus is created in the repository. At BAS, ingest is an automatic process: a script retrieves primary data and metadata from the local file system and requests PIDs for the appropriate data items. This script is a proprietary Perl script, and it relies on a small set of human- and machine-readable configuration files. The corpus and session data receive the version number 1.
Update means that existing data in the repository is modified. Updates occur at irregular intervals, in general as the result of error corrections or extensions of an existing corpus. Again, this is an automatic process. The script uses the same configuration files as the ingest script. It retrieves all modified primary and meta data from the local file system and requests new PIDs for the appropriate data items. The version counter of the updated resources is incremented.
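The version and PID bookkeeping described for ingest and update can be illustrated with a minimal sketch. It is not the BAS Perl script: the names `Resource`, `Repository` and `make_fake_pid_service`, as well as the `hdl:EXAMPLE` prefix, are hypothetical and stand in for the real PID service and configuration files.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Resource:
    """A corpus or session with its current version and PID (hypothetical model)."""
    name: str
    version: int
    pid: str
    # Earlier (version, pid) pairs; old versions keep their original PID.
    history: list = field(default_factory=list)


class Repository:
    def __init__(self, request_pid):
        # request_pid is a callable returning a fresh PID string (assumed interface).
        self._request_pid = request_pid
        self.resources = {}

    def ingest(self, name):
        """Create a new resource; ingested data receives version number 1."""
        if name in self.resources:
            raise ValueError(f"{name!r} already exists; use update() instead")
        resource = Resource(name=name, version=1, pid=self._request_pid())
        self.resources[name] = resource
        return resource

    def update(self, name):
        """Modify an existing resource: increment the version, request a new PID."""
        resource = self.resources[name]
        resource.history.append((resource.version, resource.pid))
        resource.version += 1
        resource.pid = self._request_pid()
        return resource


def make_fake_pid_service(prefix="hdl:EXAMPLE"):
    """Stand-in for a real PID service; returns sequentially numbered fake PIDs."""
    counter = count(1)
    return lambda: f"{prefix}/{next(counter)}"


if __name__ == "__main__":
    repo = Repository(make_fake_pid_service())
    corpus = repo.ingest("demo-corpus")
    repo.update("demo-corpus")
    print(corpus.version, corpus.pid, corpus.history)
```

Passing the PID service in as a callable keeps the sketch independent of any particular Handle client.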
A public description of the workflow is given here:
The repository has signed agreements with external depositors. The agreements state the right of the BAS (represented by the Ludwig-Maximilians-Universität München) to archive, maintain and distribute the data to third parties (user licenses); the agreements also state that the purpose of storing a resource in the BAS repository is to make the resource available to the scientific community as far as is feasible. There is no guarantee that resources are distributed; that is, the BAS reserves the right to restrict the distribution for ethical or technical reasons. Access restrictions for certain groups (e.g. commercial enterprises) are defined by the depositors. In general it is the BAS' policy to only accept resources that are available for scientific usage.
The BAS policy for external resources is described in: http://www.phonetik.uni-muenchen.de/Bas/BasPolicyExternalResources_eng.pdf
A contract template can be found at: http://www.phonetik.uni-muenchen.de/Bas/BasTemplateContract.pdf
The repository provides various ways of using the archived data, both via online tools and by downloading the data in formats commonly used by the research communities. For very large resources where online access is not (yet) technically feasible, we can also distribute resources on standard media (such as DVD-R and/or hard disks). An advanced metadata search utility is provided, as well as a simple search tool for textual content. All metadata can be harvested via the OAI-PMH protocol. Unique persistent identifiers according to the Handle system are provided for each corpus and each session within the corpora.
BAS OAI-PMH endpoint:
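As an illustration of metadata harvesting via OAI-PMH, the following sketch builds a `ListRecords` request URL and extracts record identifiers from a response. The base URL `https://example.org/oai` is a placeholder and not the BAS endpoint; the function names are hypothetical, and `oai_dc` is used only because the OAI-PMH specification requires every repository to support it.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL.

    In OAI-PMH, resumptionToken is an exclusive argument: when it is
    given, metadataPrefix must not be sent along with it.
    """
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def record_identifiers(response_xml):
    """Return the header identifiers of all records in a ListRecords response."""
    root = ET.fromstring(response_xml)
    path = ".//oai:record/oai:header/oai:identifier"
    return [element.text for element in root.findall(path, OAI_NS)]


if __name__ == "__main__":
    # Placeholder base URL, for illustration only.
    print(list_records_url("https://example.org/oai"))
```

A real harvester would fetch the URL, call `record_identifiers` on the response, and repeat the request with the returned resumption token until none is left.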
The repository displays the latest version of all resources. Internally, all resources are governed by a versioning system.
MD5 checksums are calculated for all objects and checked periodically. The availability of files on the file system is checked automatically daily. The availability of the archive access tools is checked automatically several times a day. The availability of file, web and application servers is monitored continuously.
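A periodic checksum check of this kind could look like the sketch below. It is only an assumption about how such a check might be structured: the manifest format (a mapping from file path to expected MD5 hex digest) and the function names are hypothetical, not the repository's actual tooling.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest):
    """Check every file in the manifest against its stored checksum.

    manifest maps file paths to expected MD5 hex digests (assumed format).
    Returns a list of (path, reason) tuples for files that are missing
    or whose current checksum differs from the stored one.
    """
    problems = []
    for path, expected in manifest.items():
        if not Path(path).is_file():
            problems.append((str(path), "missing"))
        elif md5_of_file(path) != expected:
            problems.append((str(path), "checksum mismatch"))
    return problems
```

Reading in chunks keeps memory use constant, which matters for large audio and video files.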
The repository in principle makes the original deposited objects available in an unmodified way, if the objects were in one of the accepted file types and encodings. Additionally, lower quality distribution copies of audio and video recordings may be made available. New versions of archived resources can be deposited, in which case the old versions will be moved to a version archive. Different versions of the same resource are not compared; we assume the depositor has good reasons for depositing a newer version. A new version of a resource will get a new persistent identifier; the old version will keep the original persistent identifier. Metadata can change if the depositor or archivist sees the need for that, in the case of errors or missing information. Changes to the metadata are currently not logged. All archived objects are linked to their metadata descriptions and are organized in hierarchical (or multi-rooted) tree structures to indicate relationships between objects and sets of objects. The tree structures can change if the depositors decide that this is necessary. Provenance metadata as to who made changes to the repository is currently not available.
Would be good to have this information publicly available online.
The repository aims to support the OAIS reference model’s tasks and functions. However, due to the complexity of the OAIS reference model, the repository cannot guarantee that all tasks and functions are (or will be) implemented.
We treat each archival object separately and maintain relational links to metadata and other objects.
The data consumer has direct access to the archived objects via the web, provided that access requirements have been met.
Resource data in the repository is protected, while metadata are openly accessible; an account is necessary to get access to the content data. For some data sets, explicit permission (a license) from the depositor is needed; the license has to be filed at the BAS and the data consumer must have an AAI federation user account. For a large part of the data, the data consumer needs to agree to a code of conduct, which also contains licensing terms. If the data consumer does not comply with the access regulations, the only practical measures are to deny him/her further access and to make the research community aware of the misuse.
Users of the repository will only be granted the requested access credentials if they are a) members of institutions that have agreed to the codes of conduct, or b) if the users have agreed to the codes of conduct before they get access to the data.
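The access rules stated above can be summarised in a small decision sketch. The class and field names are hypothetical and the logic is a simplified reading of the text, not the repository's actual authorization code.

```python
from dataclasses import dataclass


@dataclass
class User:
    """Hypothetical user record; field names are illustrative only."""
    institution_agreed_to_code: bool = False
    user_agreed_to_code: bool = False
    has_aai_account: bool = False
    license_on_file: bool = False


def may_receive_credentials(user):
    """Rule a) or b): the institution or the user has agreed to the codes of conduct."""
    return user.institution_agreed_to_code or user.user_agreed_to_code


def may_access_dataset(user, requires_license):
    """Content data needs credentials; some data sets also need a filed
    license and an AAI federation account."""
    if not may_receive_credentials(user):
        return False
    if requires_license:
        return user.license_on_file and user.has_aai_account
    return True
```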
Terms of usage:
If applicable, the data consumer is made aware of usage restrictions for the data she or he has gotten access to. Generally the usage restrictions are already described in the codes of conduct. For some data, explicit statements need to be made by the data consumer about the usage of the data before he/she gets access. The depositor then decides on whether access is granted or not. In case of misuse, the only thing that can be practically done is to deny the user further access to the repository and to make the research community aware of the misuse.