The Data Seal of Approval Board hereby confirms that the Trusted Digital Repository CLARIN Center BBAW complies with the guidelines version 2010 set by the Data Seal of Approval Board.
The afore-mentioned repository has therefore acquired the Data Seal of Approval of 2010 on May 21, 2013.
The Trusted Digital Repository is allowed to place an image of the Data Seal of Approval logo corresponding to the guidelines version date on its website. This image must link to this file, which is hosted on the Data Seal of Approval website.
The Data Seal of Approval Board
|Guidelines Version:||2010 | June 1, 2010|
|Guidelines Information Booklet:||DSA-booklet_2010.pdf|
|All Guidelines Documentation:||Documentation|
|Repository:||CLARIN Center BBAW|
|Seal Acquiry Date:||May 21, 2013|
|For the latest version of the awarded DSA for this repository please visit our website:||
|Previously Acquired Seals:||
|This repository is owned by:||
The repository will include resources provided by CLARIN-D member institutions and by other institutions and/or organizations that belong to the extended CLARIN-D community. The data in our repository contains sufficient information for others to assess the scientific and scholarly quality of the research data in compliance with disciplinary and ethical norms. We specifically rely on the DFG ethical codes of conduct (e.g. as laid down in the DFG Rules of Good Scientific Practice). Thus, our repository supports a quality assessment: the data consumer can judge the level of trust in, or the reputation of, the depositor on the basis of the metadata about the source institution/organization associated with any given resource. Our repository does not (and cannot) systematically verify whether the data received have been collected according to these quality standards.
Ethical rules:
ALLEA (ALL European Academies) European Science Foundation, The European Code of Conduct for Research Integrity. http://www.allea.org/Content/ALLEA/Scientific%20Integrity/Code_Conduct_ResearchIntegrity.pdf
DFG, Rules of Good Scientific Practice http://www.dfg.de/en/research_funding/legal_conditions/good_scientific_practice/index.html
BBAW, Richtlinien zur Sicherung guter wissenschaftlicher Praxis http://www.bbaw.de/die-akademie/aufgaben-und-ziele/sicherung-guter-wissenschaftlicher-praxis/RichtlinienundAusfuehrungsbestimmungen.pdf
The repository provides a list of accepted formats, including common multimedia-document formats as well as formats for binaries. For other file formats, we provide advice for conversion.
Lists of recommended formats:
CLARIN, standards recommendations. http://www.clarin.eu/recommendations
CMDI metadata (see www.clarin.eu/cmdi) is uploaded or created during the archiving process. This step is mandatory, since the system technically does not accept data without metadata. The front-end of the archiving system includes software to assist the depositor in creating valid CMDI metadata.
The mission of the repository is to ensure the availability and long-term preservation of German text corpora, lexical and other resources.
This mission is supported by the infrastructure of the Berlin-Brandenburg Academy of Sciences and Humanities and by the integration of the repository into the national and international CLARIN infrastructures.
As part of the CLARIN infrastructure, the repository is included in all promotional activities carried out at the national level by CLARIN-D as well as at the European level by CLARIN.
The repository is not a legal entity in its own right. It is run by the Berlin-Brandenburg Academy of Sciences and Humanities, which is an institution governed by public law. Deposits are handled on a case-by-case basis. There are individual contracts and different licences for each resource we have archived. Access to the items is also handled case by case, ranging from open access, through restricted access requiring a contract, to restricted on-site access. The depositors themselves are responsible for compliance with any legal regulations in the area where the data is collected. Where required by national regulations, the archive also signs contracts with national/regional institutions.
Backups are performed whenever the data in the repository changes and are stored as disaster-recoverable virtual machine images as well as file system and database dumps. The backups are copied to tape storage, which is kept in a locked safe in a separate fire safety zone of the building (in German: 'Brandschutzabschnitt'). They are made with open source software so that they remain recoverable in the long term.
For software backups, we dump databases to local storage and sync those dumps, together with locally installed software, to another server daily via rsync. Weekly backups are written to a Quantum LTO5 tape library using the backup software Amanda (see www.amanda.org), which decides independently when incremental and full dumps have to be made (full dumps are made at least once per month). Amanda is open source software built on standard tools such as GNU tar, gzip and dump, which ensures that backups can still be recovered in the distant future.
In addition, the virtual machines are backed up in full as virtual machine image snapshots via Proxmox vzdump (see http://pve.proxmox.com/wiki/Backup_-_Restore_-_Live_Migration). These snapshots are then copied to tape storage, which ensures fast disaster recovery times as well as live migration of virtual machines to another virtualization cluster node. Proxmox uses the open source Kernel-based Virtual Machine (KVM) software internally, which again ensures that snapshots can be recovered or converted in the distant future. The snapshots are taken prior to configuration updates on the machines.
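The software backup routine described above (a daily rsync of database dumps, weekly tape runs, and a full dump at least once per month) can be sketched roughly as follows. This is only an illustration, not the repository's actual configuration: Amanda plans dump levels by itself, so `dump_level` merely models the "at least monthly" guarantee, the 30-day ceiling is an assumption, and the paths and host name in the usage note are hypothetical.

```python
from datetime import timedelta

# Assumption: "at least once per month" is modelled as a 30-day ceiling.
MAX_FULL_DUMP_AGE = timedelta(days=30)


def dump_level(last_full, today):
    """Return "full" or "incremental" for a weekly backup run.

    last_full is the date of the most recent full dump, or None if
    no full dump has been made yet.
    """
    if last_full is None or today - last_full >= MAX_FULL_DUMP_AGE:
        return "full"
    return "incremental"


def rsync_command(dump_dir, remote):
    """Build the argument list for the daily sync of database dumps.

    -a (archive mode) preserves permissions and timestamps. The trailing
    slash makes rsync copy the directory contents, not the directory itself.
    """
    return ["rsync", "-a", dump_dir.rstrip("/") + "/", remote]
```

A call such as `rsync_command("/var/backups/db", "backup.example.org:/srv/dumps/")` yields a list that could be passed to `subprocess.run`; both arguments are placeholders.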
In addition to the measures mentioned under §6 above to ensure the preservation of the raw resource data, measures are taken to ensure the future interpretability of the data. The number of accepted file formats is limited, to make future conversions to other formats more feasible. Open (non-proprietary) file formats are used whenever possible. For textual resources, XML formats are used whenever possible, to ensure future interpretability of the files independent of the tool used to create them. Text is encoded in Unicode to ensure future interpretability.
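As an illustration of the format requirements for textual resources, an ingest check could confirm that a file is valid Unicode and well-formed XML. The sketch below is an assumption about how such a check might look, not the repository's actual validation code; in particular, it assumes UTF-8 as the Unicode encoding, which the text above does not specify, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def check_text_resource(raw):
    """Return a list of problems found in a textual resource given as bytes.

    An empty list means the content decodes as UTF-8 (assumed encoding)
    and parses as well-formed XML.
    """
    problems = []
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append("not valid UTF-8: " + exc.reason)
        return problems  # parsing undecodable bytes would only add noise
    try:
        ET.fromstring(raw)
    except ET.ParseError as exc:
        problems.append("not well-formed XML: " + str(exc))
    return problems
```

Well-formedness is a weaker property than validity against a schema; a real ingest pipeline would presumably also validate against the relevant format definition.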
Many parts of the CLARIN infrastructure address the migration of data from one resource center/repository to another. Since the use of these infrastructure services (e.g. a PID system, CMDI) is obligatory, every CLARIN center is, to a certain extent, ready to move its digital assets to another center. This is of paramount importance should a center/repository become unable to continue offering its services. The virtual machines can be hosted by other centers.
The online archive management tool Fedora Commons defines a workflow to a certain extent, because no resources can be archived without metadata being present. The depositor mainly decides what material is archived; the archive only has technical requirements with regard to file formats and encodings. The depositor determines who can access the material and is also responsible for protecting the privacy of any subjects appearing in the recordings or texts. Additionally, quality checks of data and metadata, including PID (Persistent Identifier) assignment, are performed by the repository software.
We would hope that during the implementation process documentation is developed which can be referenced in future DSA submissions.
In general, it is BBAW policy to accept only resources that are available for scientific use (preferably under a Creative Commons license). All archived resources are available online; the access permissions are defined by the depositors. Crisis management is addressed on a technical level. Since a PID system is used in CLARIN, moving resources from one CLARIN resource center to another is possible without affecting the validity of references (e.g. the PID reference of a resource used in a research paper). Our setup consists of virtual machines that run on a high-availability failover cluster.
The repository provides various ways of using the archived data, via online tools as well as by downloading the data in formats commonly used by the research communities. An advanced metadata search utility is provided, as well as a simple search tool for textual content. All metadata can be harvested via the OAI-PMH protocol. Unique persistent identifiers according to the Handle system are provided for each corpus and each session within the corpora. Additionally, CLARIN provides search facilities such as the VLO (http://www.clarin.eu/vlo/).
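To illustrate the harvesting option, the sketch below builds OAI-PMH `ListRecords` request URLs. The endpoint is a placeholder, since the repository's actual OAI-PMH address is not given here. `oai_dc` is used as the default metadata prefix because the protocol requires every OAI-PMH repository to support it; whether and under which prefix CMDI records are offered is not stated above.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute the repository's real OAI-PMH base URL.
BASE_URL = "https://repository.example.org/oai"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint.

    In OAI-PMH, resumptionToken is an exclusive argument: when continuing
    a partial list, it replaces metadataPrefix and all other arguments.
    """
    if resumption_token is not None:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)
```

A harvester would fetch the first URL, read the `resumptionToken` element from the response if one is present, and request further pages until the token comes back empty.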
The integrity of the data is ensured by version control in the Fedora Commons back-end and by MD5 checksums. Checksum tests are run regularly, especially before performing backups. Metadata is a data stream within the digital object and, as such, is version-controlled like object data. The availability of file, web, and application servers is monitored continuously. We consider all objects deposited in our repository as fixed and immutable. We create new digital objects for updates and keep the old versions in our repository. However, the metadata of existing resources can be updated without the result being considered a new version.
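A checksum test of the kind mentioned above can be sketched in a few lines. This is a generic fixity check using Python's standard `hashlib` module, not the Fedora Commons implementation; the function names are hypothetical, and the recorded checksum is assumed to be available as a hexadecimal string.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks
    so that large resources do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_fixity(path, recorded_md5):
    """Return True if the file still matches the checksum recorded at ingest."""
    return md5_of(path) == recorded_md5.lower()
```

Running such a check before each backup helps avoid copying an already corrupted file over a good backup copy. MD5 detects accidental corruption but offers no protection against deliberate tampering.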
In principle, the repository makes the originally deposited objects available unmodified, provided the objects are delivered in one of the accepted file types and encodings. New versions of archived resources can be deposited, in which case the old versions are moved to a version archive. Different versions of the same resource are not compared; we assume the depositor has good reasons for depositing a newer version. A new version of a resource gets a new persistent identifier; the old version keeps its original persistent identifier. Metadata can be changed if the depositor or archivist sees the need, for example in the case of errors or missing information. Changes to the metadata are currently not logged. All archived objects are linked to their metadata descriptions and are organized in hierarchical (or multi-rooted) tree structures to indicate relationships between objects and sets of objects. The tree structures can change if the depositors decide that this is necessary. The identities of the depositors are checked by the repository staff when they hand over their data. Provenance metadata recording who made changes to the repository is currently only stored in log files and not shown to the data consumer.
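The versioning rule described above (a new version receives a new persistent identifier, the old version keeps its identifier and moves to a version archive) can be modelled with a small in-memory sketch. The class, its method names, and the `hdl:EXAMPLE/...` identifier format are illustrative assumptions and do not reflect the Fedora Commons data model or the real Handle prefix.

```python
from dataclasses import dataclass, field


@dataclass
class Repository:
    current: dict = field(default_factory=dict)          # name -> (pid, content)
    version_archive: dict = field(default_factory=dict)  # pid -> content
    _counter: int = 0

    def _mint_pid(self):
        # Hypothetical PID scheme; a real system would register a Handle.
        self._counter += 1
        return "hdl:EXAMPLE/" + str(self._counter)

    def deposit(self, name, content):
        """Store content under name and return its new PID.

        An existing version is moved to the version archive under its
        original PID, so earlier references stay resolvable.
        """
        if name in self.current:
            old_pid, old_content = self.current[name]
            self.version_archive[old_pid] = old_content
        pid = self._mint_pid()
        self.current[name] = (pid, content)
        return pid

    def resolve(self, pid):
        """Return the content a PID refers to, whether current or archived."""
        for current_pid, content in self.current.values():
            if current_pid == pid:
                return content
        return self.version_archive[pid]
```

Because a PID never changes the object it points to, a citation of the first version continues to resolve to exactly that version after a newer one has been deposited.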
For metadata we rely on the group of emerging standards around CMDI (ISO-CD 24622-1). With the use of the Fedora-Commons system and the defined workflow supported by the repository’s interface, the repository aims to be as conformant to OAIS as possible.
Most of the data in the repository have Creative Commons licenses applied to them. If the data consumer does not comply with the access regulations, the only measure that can be taken in practice is to deny him/her further access and to make the research community aware of the misuse. For some data sets, explicit permission from the depositor is needed. In that case a login is necessary.
There are a number of specific codes of conduct that are applicable to parts of the repository, e.g. the DFG code of conduct. The codes of conduct are in line with generally accepted codes of conduct for research data in Germany. Any data user is bound by the terms and conditions of use of the repository, as soon as repository services or data deposited in the repository are used.
If applicable, the data consumer is made aware of usage restrictions for the data to which she/he has received access. Generally, the usage restrictions are already described in the codes of conduct. For some data, explicit statements need to be made by the data consumer about the use of the data before he/she receives access. The depositor then decides whether or not access is granted. In case of misuse, the only thing that can be done in practice is to deny the user further access to the repository and to make the research community aware of the misuse.