The Data Seal of Approval board hereby confirms that the Trusted Digital repository The Language Archive - Max Planck Institute for Psycholinguistics complies with the guidelines version 2010 set by the Data Seal of Approval Board.
The aforementioned repository has therefore acquired the Data Seal of Approval of 2010 on March 1, 2011.
The Trusted Digital repository is allowed to place an image of the Data Seal of Approval logo corresponding to the guidelines version date on their website. This image must link to this file which is hosted on the Data Seal of Approval website.
The Data Seal of Approval Board
|Guidelines Version:||2010 | June 1, 2010|
|Guidelines Information Booklet:||DSA-booklet_2010.pdf|
|All Guidelines Documentation:||Documentation|
|Repository:||The Language Archive - Max Planck Institute for Psycholinguistics|
|Seal Acquiry Date:||Mar. 01, 2011|
|For the latest version of the awarded DSA for this repository please visit our website:||
|Previously Acquired Seals:||
|This repository is owned by:||
The repository is divided into different sub-repositories. There is a part for research conducted by researchers affiliated with the Max Planck Institute for Psycholinguistics, a part for DOBES endangered language documentation projects, and parts for other related projects. There is also a part for unaffiliated researchers or projects. On the basis of the originating institution or organization, a data consumer can make some judgments about the level of trust or about the reputation of the depositor. Depositors within the Max Planck Institute are bound by the ethical rules of the Max Planck Society regarding human subject data. Depositors in a DOBES project are bound by the ethical rules of the DOBES code of conduct. The archive does not (and cannot) systematically verify whether the data it receives has been collected according to these rules.
The repository has a list of accepted file formats. Only these formats are accepted by the ingest tool, which checks the validity of the ingested resources. Other file formats need to be converted; the repository offers advice on how to do this or, in some cases, performs the conversion for the depositor.
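The format gate described above can be illustrated with a minimal sketch. It assumes a purely extension-based allowlist; the extensions listed and the function name `check_ingest` are hypothetical, and the actual ingest tool also validates file contents, which this sketch does not attempt.

```python
from pathlib import Path

# Hypothetical allowlist: the archive's real list of accepted formats is not given here.
ACCEPTED_EXTENSIONS = {".wav", ".mpg", ".xml", ".txt"}


def check_ingest(filenames):
    """Split filenames into those accepted as-is and those needing conversion."""
    accepted, needs_conversion = [], []
    for name in filenames:
        # Compare case-insensitively so "a.WAV" and "a.wav" are treated alike.
        suffix = Path(name).suffix.lower()
        if suffix in ACCEPTED_EXTENSIONS:
            accepted.append(name)
        else:
            needs_conversion.append(name)
    return accepted, needs_conversion
```

Files in the second list would be the ones for which the depositor is given conversion advice before a new ingest attempt.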
The data producer is required to provide metadata in the IMDI format. Metadata descriptions are generally created for bundles of resources that belong together, e.g. an audio recording with its transcription. The repository offers tools for the creation of these metadata descriptions and offers training on metadata creation. There is a recommendation for a minimum set of metadata fields that need to be filled in, but this is not enforced. The technical compliance of the submitted metadata to the IMDI schema is validated during ingest.
We have an explicit mission to archive language resources from all around the world, collected both by associated researchers and by researchers who are not affiliated with us. We promote this mission as much as possible at international conferences and during training courses that we organize ourselves or are asked to take part in. The mission goes together with the official possibility to store full copies at two computer centers at different locations, for which the president of the Max Planck Society gives an institutional backing of 50 years of bit-stream preservation. We are working on duplicating the archive access framework in those backup locations as well, so that access to the data can be provided even if our institute were to cease to exist.
The repository is not a legal entity in its own right but is part of the Max Planck Institute for Psycholinguistics, which in turn is not a legal entity but part of the Max-Planck-Gesellschaft zur Förderung der Wissenschaften e.V.; its legal status is that of an eingetragener Verein ("registered association"). The repository is funded by the MPI for Psycholinguistics, the Max Planck Society (MPG), the Berlin-Brandenburg Academy of Sciences (BBAW) and the Royal Netherlands Academy of Arts and Sciences (KNAW). The repository has agreements with its external depositors about the right to archive the data. The depositors themselves are responsible for compliance with any legal regulations in the area where the data is collected. Where required by national regulations, the archive also signs contracts with national or regional institutions. All ethical issues are dealt with through codes of conduct, such as the DOBES Code of Conduct for the DOBES part of the archive. The repository enables depositors to restrict access to their resources at various levels. All copies distributed elsewhere are stored under the agreement that, if they are made available, they are subject to the same access restrictions.
Two copies of every resource are stored within the MPI and at least 4 additional copies are stored in different physical locations in Germany. The storage hardware is replaced at regular intervals with current state-of-the-art equipment. Archival content is checked regularly for file and format integrity. The Sun SAM-FS HSM system used for storage also checks file integrity upon file access. The repository will have 2 identical archive access setups at the backup sites in Göttingen and Munich, so that in case of an emergency the data can be accessed via one of these sites.
In addition to the steps described under the previous guideline for bit-stream preservation of the resources, some measures are taken to improve the chance that the data can be interpreted in the future. The number of accepted file formats is limited, to make future conversions to other formats more feasible. Open (non-proprietary) file formats are used as much as possible. For textual resources, XML formats are used whenever possible, so that the files can still be interpreted even if the tool that was used to create them no longer exists. Text is encoded in Unicode to ensure future interpretability.
There is an abstract standard workflow, but with technological advances there is considerable variation in how it is applied. The online archive management tool LAMUS defines a workflow to a certain extent, because no resources can be archived without metadata and a corpus hierarchy being present. The depositor mainly decides what material is archived; the archive only has technical criteria about file formats and encodings. The depositor determines who can access the material and is also responsible for protecting the privacy of any subjects appearing in the recordings or texts.
There are no formal criteria in place to decide on when to apply data transformations to the current archival formats. More documentation is required to describe various workflow scenarios.
The archive has signed agreements with external depositors. For DOBES depositors there is the following agreement:
Agreements with other external depositors are based on this.
Depositors within the MPI for Psycholinguistics are contractually obliged to archive their data, so no agreements are necessary with them. All archived resources are available online; the access permissions are defined by the depositors. The repository will have 2 identical archive access setups at the backup sites in Göttingen and Munich, so that in case of an emergency the data can still be accessed via one of these sites.
The repository provides various ways of utilizing the archived data via online tools as well as by downloading the data in formats commonly used by the research communities. An advanced metadata search utility is provided, as well as a deep search tool for textual content. All metadata can be harvested via the OAI-PMH protocol. Unique persistent identifiers according to the Handle system are provided for each archived object.
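As an illustration of the OAI-PMH harvesting mentioned above, the following minimal sketch builds a `ListRecords` request and extracts record identifiers from a response. The base URL passed in is the caller's responsibility and is not taken from this document; `oai_dc` is the metadata prefix that every OAI-PMH repository must support, and whether this archive exposes other prefixes is not stated here. The function names are hypothetical.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace defined by the OAI-PMH 2.0 specification.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: no metadataPrefix alongside it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def record_identifiers(response_xml):
    """Return the identifier of every record header in an OAI-PMH response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.findall(".//oai:header/oai:identifier", OAI_NS)]
```

A harvester would fetch the URL, collect the identifiers, and repeat with the returned resumption token until the repository stops issuing one.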
MD5 checksums are calculated for all objects and checked periodically. The availability of files on the file system is checked automatically every day. The availability of the archive access tools is checked automatically multiple times a day. The availability of file, web and application servers is monitored continuously. New versions of archived resources can be deposited, in which case the old versions are moved to a version archive. In the future these old versions will also be made available to end users, but this is not yet the case.
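A periodic checksum check of this kind can be sketched as follows. This is a minimal illustration, not the archive's actual tooling: it assumes a manifest that maps file paths to previously recorded MD5 digests, and the function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large recordings need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_fixity_failures(manifest):
    """Return paths that are unreadable or whose digest differs from the manifest.

    `manifest` is assumed to map each file path to its recorded MD5 hex digest.
    """
    failures = []
    for path, expected in manifest.items():
        try:
            actual = md5_of_file(path)
        except OSError:
            # A missing or unreadable file counts as an availability failure.
            failures.append(path)
            continue
        if actual != expected:
            failures.append(path)
    return failures
```

Running `find_fixity_failures` on a schedule covers both checks described above in one pass: files that have disappeared and files whose content has changed.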
More documentation should be written for this guideline.
The statement of compliance is good, but I am not sure what the last sentence means:
"More documentation should be written for this guideline."
The repository in principle makes the original deposited objects available in an unmodified way, if the objects were in one of the accepted file types and encodings. Additionally, lower quality distribution copies of audio and video recordings are made available. New versions of archived resources can be deposited, in which case the old versions will be moved to a version archive. Different versions of the same resource are not compared; we assume the depositor has good reasons for depositing a newer version. A new version of a resource will get a new persistent identifier; the old version will keep the original persistent identifier. Metadata can change if the depositor or archivist sees the need for that, in the case of errors or missing information. Changes to the metadata are currently not logged. All archived objects are linked to their metadata descriptions and are organized in hierarchical (or multi-rooted) tree structures to indicate relationships between objects and sets of objects. The tree structures can change if the depositors decide that this is necessary. The identities of the depositors are checked by means of a login and password when they deposit material online. Provenance metadata as to who made changes to the repository is currently only stored in log files and not shown to the data consumer.
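The versioning rule described above (a new version receives a new persistent identifier, while the old version keeps its original identifier and moves to a version archive) can be sketched with a small in-memory model. The class, its fields, and the `hdl:EXAMPLE/...` identifier scheme are hypothetical and only illustrate the rule; they do not reflect the repository's implementation.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Archive:
    current: dict = field(default_factory=dict)          # resource name -> PID of current version
    objects: dict = field(default_factory=dict)          # PID -> stored content
    version_archive: dict = field(default_factory=dict)  # resource name -> superseded PIDs
    _ids: count = field(default_factory=lambda: count(1))

    def _mint_pid(self):
        # Hypothetical identifier scheme standing in for a real Handle.
        return f"hdl:EXAMPLE/{next(self._ids)}"

    def deposit(self, name, content):
        """Store content under a fresh PID; move any previous version aside."""
        new_pid = self._mint_pid()
        old_pid = self.current.get(name)
        if old_pid is not None:
            # The old version keeps its PID and stays resolvable in the version archive.
            self.version_archive.setdefault(name, []).append(old_pid)
        self.objects[new_pid] = content
        self.current[name] = new_pid
        return new_pid
```

Because versions are not compared, depositing identical content twice would still mint a second identifier in this model, which matches the behaviour described in the text.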
The repository supports the tasks and functions of the OAIS reference model insofar as they do not conflict with the Live Archives idea:
We do not create data packages for Ingest, Archiving and Dissemination for example but we treat each archival object separately while maintaining relational links to metadata and other objects.
The data consumer has direct access to the archived objects via the web, provided that access requirements have been met.
I am accepting this evidence because the guideline refers to "internationally accepted archival standards like OAIS," and not just OAIS, and because the archive is on the whole so responsive to data preservation and access. In general, though, I have the view that there is a very high bar to full compliance with OAIS and that it is a goal that few archives have reached.
Most of the data in the repository is protected; an account is necessary to get access to the data. For some data sets, explicit permission from the depositor is needed. For a large part of the data, the data consumer needs to agree with a code of conduct, which also contains licensing terms. Some corpora have Creative Commons licenses applied to them. If the data consumer does not comply with the access regulations, the only thing that can be practically done is to deny him/her further access and to make the research community aware of the misuse.
There are a number of specific codes of conduct that are applicable to parts of the repository, e.g. the DOBES code of conduct. The codes of conduct are in line with generally accepted codes of conduct for research data in the Netherlands. Users need to agree with the codes of conduct before they get access to the data.
If applicable, the data consumer is made aware of usage restrictions for the data she/he has gotten access to. Generally the usage restrictions are already described in the codes of conduct. For some data, explicit statements need to be made by the data consumer about the usage of the data before he/she gets access. The depositor then decides on whether access is granted or not. In case of misuse, the only thing that can be practically done is to deny the user further access to the repository and to make the research community aware of the misuse.