The Data Seal of Approval board hereby confirms that the Trusted Digital repository CLARIN-D Resource Center Leipzig complies with the guidelines version 2014-2017 set by the Data Seal of Approval Board.
The afore-mentioned repository has therefore acquired the Data Seal of Approval of 2013 on April 30, 2015.
The Trusted Digital repository is allowed to place an image of the Data Seal of Approval logo corresponding to the guidelines version date on their website. This image must link to this file which is hosted on the Data Seal of Approval website.
The Data Seal of Approval Board
Guidelines Version: 2014-2017 | July 19, 2013
Guidelines Information Booklet: DSA-booklet_2014-2017.pdf
All Guidelines Documentation: Documentation
Repository: CLARIN-D Resource Center Leipzig
Seal Acquiry Date: Apr. 30, 2015
For the latest version of the awarded DSA for this repository please visit our website:
Previously Acquired Seals:
This repository is owned by:
CLARIN-D Resource Center Leipzig (http://clarin.informatik.uni-leipzig.de/repo/) is part of CLARIN-D (Common Language Resources and Technology Infrastructure Deutschland), a web and centres-based research infrastructure for the social sciences and humanities. The aim of CLARIN-D and its service centres is to provide linguistic data, tools and services in an integrated, interoperable and scalable infrastructure for the social sciences and humanities. The research infrastructure is rolled out in close collaboration with expert scholars in the humanities and social sciences, to ensure that it meets the needs of users in a systematic and easily accessible way. CLARIN-D is funded by the German Federal Ministry for Education and Research.
CLARIN-D is building on the achievements of the preparatory phase of the European CLARIN initiative as well as CLARIN-D's Germany-specific predecessor project D-SPIN. These previous projects have developed research standards to be met by the CLARIN services centres, technical standards and solutions for key functions, a set of requirements which participants have to provide, as well as plans for the sustainable provision of tools and data and their long-term archiving.
This repository offers resources such as a set of corpora of the Leipzig Corpora Collection (http://wortschatz.uni-leipzig.de/), based on newspaper, Wikipedia and Web text. Furthermore, several REST-based webservices are provided for a variety of NLP-relevant tasks.
Within CLARIN-D this resource centre is a certified centre of type B. CLARIN distinguishes a number of centre types that differ in their impact on the language resources and tools infrastructure. Type B centres offer services that include access to the resources they store and to tools deployed at the centre, via specified and CLARIN-compliant interfaces, in a stable and persistent way.
Within CLARIN-D the following requirements hold for centres of type B (https://www.clarin.eu/node/3542) and are fulfilled by this resource centre:
A short overview of all requirements for centres of type B is also given in the form of a checklist (https://www.clarin.eu/content/checklist-clarin-b-centres).
List of outsource partners:
1) Gesellschaft für Wissenschaftliche Datenverarbeitung mbH Göttingen (GWDG)
The repository makes use of a common CLARIN PID service (https://www.clarin.eu/files/pid-CLARIN-ShortGuide.pdf) based on the Handle System (http://www.handle.net/) and in cooperation with the European Persistent Identifier Consortium (EPIC). The usage of PIDs is mandatory for resources in CLARIN thus all resources added to the repository may be referenced using PIDs.
CLARIN-D has a contractual relationship with GWDG concerning the provision of PID-services via EPIC API v2. The following document lists the services which were stipulated: http://de.clarin.eu/mwiki/images/0/0b/GWDG_PID.pdf
This outsource partner offers relevant functionality for guideline 10: "The data repository enables the users to utilize the research data and refer to them.".
Good, comprehensive context-setting information.
The minimal requirements for data/tools to be deposited in the repository are:
The data that is put into the repository is checked for compliance with internal and CLARIN guidelines concerning scientific and scholarly quality. Only data that:
will be added to the repository. The data itself, the metadata and additional documentation are obligatory parts of each repository entry. Currently these guidelines are not in a fixed state and are subject to minor changes. A preliminary version is available on the repository website (http://clarin.informatik.uni-leipzig.de/repo/). Until a final version is released, no data created by external data producers will be added to the repository. The data stored in the repository is limited to well-known and documented content created by our own institution as the result of a long-running research project. According to these guidelines, metadata on data/tools always contains information on the resource producer (name and URL of the institution, and information on contact persons that allows interested users to obtain further information). Adding references to publications on how the data was created and in which scenarios it is intended to be used to the metadata (or adding the papers to the repository) is encouraged but not enforced. Data sharing and reuse is promoted by providing free access to the data (download, webservices) and metadata (via the OAI-PMH protocol). The CLARIN infrastructure contains software components such as the VLO (http://www.clarin.eu/vlo/), which enables users to browse and search through combined catalogs (metadata of all CLARIN repositories).
It is recommended to use formats listed in the CLARIN standard recommendations (http://www.clarin.eu/recommendations). In addition relevant standards and formats in the context of CLARIN are listed (http://www.clarin.eu/content/standards-and-formats). Manual checks are performed by CLARIN members before data is added to the repository. Usage of standardized formats is encouraged but not enforced.
In case no recommended/well-known and documented format is used, exhaustive documentation on the syntax and semantics of the data (e.g. database dumps: names of tables and columns; specifications and examples on the contents of each column; examples on how to retrieve different types of data) has to be provided by the data producer. This documentation (English, PDF) is stored in the repository along with the data and metadata and is provided to everyone who wishes to download/access the resource. The repository maintainers keep track of all formats already used by the depositors and commit themselves to work on updates of the CLARIN standard recommendation if new formats gain in popularity.
Metadata for all CLARIN repositories has to be provided in the CMDI format. There is exhaustive documentation (http://www.clarin.eu/cmdi) available on how to create CMDI-compliant metadata profiles and instances. Additionally, a set of tools is provided that allows data producers to easily create new metadata or adapt existing metadata to the CMDI standard.
Resources must be accompanied with valid CMDI metadata in order to be considered for deposit. Metadata is checked for compliance according to CMDI standards in the following way:
The granularity of CMDI metadata is up to the (meta)data producer. The repository itself is able to handle a high granularity of metadata. The creation of metadata files (instances) is supported via the XML editor ARBIL (https://tla.mpi.nl/tools/tla-tools/arbil/), which comes with CMDI support. Metadata elements need to be compliant with the standards set in CMDI. Since CMDI is a component-based approach which allows (meta)data producers to create custom-tailored metadata profiles, there is no limit to the usage of established standards. In order to be visible and usable in the CLARIN infrastructure, CMDI metadata added to the repository needs to contain a minimum set of attributes (linked to data categories stored in the CLARIN Concept Registry), which is enforced by the quality checks described above. The usage of metadata elements that are accepted by a research community is encouraged and technically supported via re-use of existing metadata components, but is not enforced.
This information is part of the resource depositor guide which is available on the repository website (http://clarin.informatik.uni-leipzig.de/repo/).
The mission of the repository is to serve as the repository of a CLARIN-D resource center of type B (http://www.clarin.eu/files/centres-CLARIN-ShortGuide.pdf). The mission of CLARIN-D is to provide "linguistic data, tools and services in an integrated, interoperable and scalable infrastructure for the social sciences and humanities" (http://de.clarin.eu/en/home-en.html). Therefore a repository in which data, tools and associated metadata are archived on a long-term basis must be operated by such a resource center.
The repository is part of the CLARIN infrastructure and thus does not carry out promotional activities on its own, but is embedded into such activities at the CLARIN-D and the European CLARIN level. These activities include but are not limited to:
Many parts of the CLARIN infrastructure address the migration of data from one resource center / repository to another. Since the usage of these infrastructure services (e.g. a PID system, CMDI) is obligatory, every CLARIN center is, to a certain extent, ready to move its digital assets to another center.
The repository is not a legal entity on its own. It is run by the University of Leipzig, which is an institution governed by public law. Depositors need to sign an agreement stating that they own all necessary rights required to deposit the data and that during the creation of the resource the data producer respected IPR (Intellectual Property Rights) and privacy issues. Data depositors are themselves responsible for compliance with any national or international legal regulations. Since no data with disclosure risk will be added to the repository, depositors also have to state that the deposited resource does not contain any data with disclosure risk. The repository staff maintains a checklist of cases in which resources containing data with disclosure risk have previously been rejected or modified (and if so, how they were modified) in order to be compliant with the repository regulations. This list is intended to help in cases in which the depositors are unsure about the status of their resource regarding disclosure risk.
In case a violation of conditions is observed, the original data provider is contacted. In case the violator can be identified, further access by this person/institution will be prevented if technically possible (for access via Shibboleth, the home institution will be informed).
Data is stored on a RAID system and backups are created on a regular basis (every time the content of the repository changes, since ingests are done by the repository maintainers). These backups are held on separate hardware. Deterioration of storage media is monitored via Nagios probes which regularly check the hardware in use (e.g. S.M.A.R.T. data; Self-Monitoring, Analysis and Reporting Technology) and report drastic changes or imminent failures. In case of failures or other problems, the administrators of the repository are notified and will take appropriate action. For further information please refer to the preservation policy provided on the repository website (http://clarin.informatik.uni-leipzig.de/repo/).
By encouraging data depositors to use standardized formats (UTF-8, documented XML formats, ...) we try to minimize the cases in which obsolescence of file formats will occur in the near future. By requiring detailed and exhaustive documentation whenever proprietary / "custom" formats are used, we ensure that the information needed to interpret the data is available under all circumstances. Thus it will, at least, be possible to specify and implement data converters.
Long term data usability is ensured by the following measures:
For further information please refer to the preservation policy provided on the repository website (http://clarin.informatik.uni-leipzig.de/repo/).
Currently there are no established workflows that define how to integrate/archive data provided by external data providers. We are currently working on documentation on how to archive some types of resources that we will add to the repository on our own. Based on this work, the documentation will be extended in order to address similar kinds and other types of resources we expect to be added by external depositors. Once all currently open questions mentioned below are covered by the documentation, these documents will be available on the repository website (http://clarin.informatik.uni-leipzig.de/repo/).
Currently there is no documentation or process for transformations on archival data.
Up to now there is no established and documented selection process. While some questions still need to be answered (e.g. up to which scale are we able to handle big data of external depositors) some outlines are already clear from a CLARIN perspective:
The handling of requests to deposit data that does not fall within the (CLARIN) mission will be decided on a case by case basis. Data that supports the CLARIN mission will be prioritized.
On a technical level access to the data can be limited to users working in research institutions (CLARIN-AAI, DFN-AAI). According to the CLARIN rules access to metadata is not limited. Only data that comes with licenses that fit these rules (e.g. texts/audio/video must not be shared freely on the web but are free to be used in research and teaching) will be added to the repository. In case privacy of subjects is a concern, this needs to be addressed by contracts signed by those subjects (e.g. interviewed people explicitly state that the data may be provided freely to researchers/teaching purposes). In case new usage scenarios are supported by the CLARIN-AAI we will adapt these rules.
In a first stage only data that is available for free (and states this in a license) or comes with “compatible” licenses (free for research/teaching) will be added to the repository. Thus, currently there are no licences / contractual agreements with data producers since only data created by the institution which runs the repository is present. In the future external depositors will have to sign a depositor agreement. These contracts contain statements on:
A preliminary version is already available on the repository website (http://clarin.informatik.uni-leipzig.de/repo/).
In case of misuse, licenses are enforced against data users by the property rights owner. Crisis management concerning the availability of the digital objects is addressed on a technical level. Since a PID system is used in CLARIN, moving resources from one CLARIN resource center to another one is possible without affecting the validity of references (e.g. the PID of a resource used in a paper). Our setup consists of virtual machines which may be moved to other CLARIN partners. In case virtual machines are moved internally (inside the CLARIN-D center in Leipzig), this will be possible without severe impact on user experience (live migration is supported). In case the machines need to be moved to other CLARIN partners, a limited downtime will occur.
As described previously in chapter 2 the decision which formats are used is up to the data provider. Since usage of standardized formats is encouraged, data usually will be available in formats used by the research community.
Harvesting of metadata is possible via OAI-PMH. Search facilities are currently not provided by the repository itself. Instead CLARIN operates OAI-PMH harvesters which collect CMDI metadata from all repositories run by CLARIN centers. The collected metadata is used in the back-end of web applications such as the VLO (http://www.clarin.eu/vlo/), which provide a central starting point when searching for resources in the CLARIN infrastructure. For some resources “deep search” is supported by means of the CLARIN Federated Content Search (http://www.clarin.eu/fcs) interface.
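As an illustration of the harvesting step, the following minimal sketch builds an OAI-PMH request URL. The six verbs are those defined by the OAI-PMH protocol; the endpoint URL is a placeholder, and the metadata prefix `cmdi` is an assumption, since the text names neither the repository's OAI-PMH base URL nor the prefix it advertises.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the repository's real OAI-PMH base URL is not given in the text.
BASE_URL = "https://repository.example.org/oai"

# The six request verbs defined by the OAI-PMH protocol.
OAI_VERBS = {
    "Identify", "ListMetadataFormats", "ListSets",
    "ListIdentifiers", "ListRecords", "GetRecord",
}


def build_oai_request(base_url, verb, **arguments):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    query = {"verb": verb}
    query.update(arguments)
    return f"{base_url}?{urlencode(query)}"


# "cmdi" is an assumed metadata prefix; a harvester would first call
# ListMetadataFormats to find out which prefixes the endpoint supports.
list_records_url = build_oai_request(BASE_URL, "ListRecords", metadataPrefix="cmdi")
print(list_records_url)
```

A harvester would fetch this URL and follow any resumption tokens in the response; that part is omitted here to keep the sketch free of network access.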
The repository uses the common CLARIN PID service (https://www.clarin.eu/files/pid-CLARIN-ShortGuide.pdf) based on the Handle System (http://www.handle.net/) and in cooperation with the European Persistent Identifier Consortium (EPIC). The usage of PIDs is mandatory for resources in CLARIN thus all resources added to the repository may be referenced using PIDs.
CLARIN-D has a contractual relationship with GWDG concerning the provision of PID-services via EPIC API v2 as mentioned in section 0 on repository context.
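To illustrate how a Handle-based PID can be used as a reference, the sketch below turns a handle into a URL on the public Handle System proxy (hdl.handle.net). The handle shown is made up for illustration and is not a PID issued by this repository; the helper name `handle_to_url` is likewise hypothetical.

```python
from urllib.parse import quote

# Public proxy of the Handle System; it redirects a handle to its current location.
HANDLE_PROXY = "https://hdl.handle.net/"


def handle_to_url(handle):
    """Turn a handle of the form 'prefix/suffix' into a resolvable proxy URL."""
    handle = handle.strip()
    if handle.startswith("hdl:"):
        handle = handle[len("hdl:"):]
    prefix, separator, suffix = handle.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a valid handle: {handle!r}")
    # Percent-encode characters that are not safe in a URL path.
    return HANDLE_PROXY + quote(prefix) + "/" + quote(suffix, safe="/")


# Made-up handle, for illustration only.
example_url = handle_to_url("hdl:12345/example-0000-0001")
print(example_url)
```

Because the reference goes through the resolver rather than to a server address, the target behind the handle can change (for example after moving a resource to another center) without invalidating citations.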
Currently, we are implementing the use of checksums in order to monitor the integrity of digital objects stored in the repository. We are still in the process of checking if the software stack that is already in place allows this kind of monitoring based on existing functionality or if an external solution needs to be implemented. Since we are in the implementation phase there also is no fixed guideline on when to perform these kinds of checks. Currently we plan to do checksum-tests when:
For this, Apache Subversion (SVN) has been set up as part of our repository. The data of the repository is already stored in this SVN as a backup mechanism. These backups are created once the integrity of the data in the repository has been ensured after ingestion. In addition, a checksum of the original data is created and stored. A mechanism for regularly comparing the state of the resources in the SVN and the repository against the checksums created upon insertion into the version control system still has to be implemented.
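The comparison that is still to be implemented could look roughly like the following sketch. It is not the repository's actual mechanism: SHA-256 is an assumed algorithm (the text does not name one), and the function names and the sample file are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_changed(stored_checksums, root):
    """Return relative paths that are missing or whose checksum differs from the stored one."""
    changed = []
    for relative_path, expected in sorted(stored_checksums.items()):
        candidate = Path(root) / relative_path
        if not candidate.is_file() or sha256_of(candidate) != expected:
            changed.append(relative_path)
    return changed


# Demonstration on a temporary directory standing in for the archive.
with tempfile.TemporaryDirectory() as root:
    sample = Path(root) / "corpus.txt"
    sample.write_bytes(b"example corpus content\n")
    stored = {"corpus.txt": sha256_of(sample)}  # checksum recorded at ingest
    unchanged_result = find_changed(stored, root)
    sample.write_bytes(b"silently corrupted content\n")  # simulate corruption
    corrupted_result = find_changed(stored, root)

print(unchanged_result, corrupted_result)
```

Storing the checksums separately from the data they describe, as the SVN setup suggests, keeps a single storage fault from damaging both at once.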
Additionally the integrity of the data is ensured by the version control capabilities that are part of the Fedora Commons repository which is operated in the backend. Metadata is a data stream within the digital object, and as such is version controlled like object data.
Access to data and metadata is provided via webservice interfaces. The availability of these webservices is monitored via Nagios (http://www.nagios.org/) / Icinga (https://www.icinga.org/) probes. Some of these probes are run in local installations at the center while others are operated by CLARIN-D (http://de.clarin.eu/images/ap3/ap3-005-monitoring.pdf). The frequency of checks depends on the type of service that is monitored.
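For readers unfamiliar with Nagios-style probes, the sketch below shows the reporting convention such checks follow: a status line plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN). The thresholds, the service name and the function names are assumptions for illustration; the actual probes used at the center are not described in the text.

```python
# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify(http_status, response_seconds, warn_after=2.0):
    """Map the outcome of an HTTP request to a Nagios state and a detail message.

    http_status is None when no response was received at all.
    warn_after is an assumed threshold, not a value taken from the repository setup.
    """
    if http_status is None:
        return CRITICAL, "no response received"
    if http_status >= 400:
        return CRITICAL, f"HTTP {http_status}"
    if response_seconds > warn_after:
        return WARNING, f"slow response: {response_seconds:.1f}s"
    return OK, f"HTTP {http_status} in {response_seconds:.1f}s"


def format_plugin_output(service, http_status, response_seconds):
    """Return (exit_code, status_line) in the form a monitoring server expects."""
    code, detail = classify(http_status, response_seconds)
    return code, f"{service} {LABELS[code]} - {detail}"


print(format_plugin_output("OAI-PMH", 200, 0.3))
print(format_plugin_output("OAI-PMH", 503, 0.1))
```

A real plugin would perform the HTTP request itself, print the status line and terminate with `sys.exit(code)`; the request is left out here so the example runs without network access.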
Multiple versions of data are valid. CLARIN promotes the idea of reproducible research. Thus updates/new versions of existing data are handled like any other resource, with the exception of setting and storing a reference to the previous version. Access to metadata and data of all versions is provided at the same time and is handled in the same way:
However, updates of metadata for existing resources are possible without considering the result to be a new version.
Part of the archiving workflow is the integrity check of the data and the metadata by the archive manager. This is done both manually and automatically. The metadata is parsed for syntactic correctness and manually evaluated for completeness and soundness.
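The automatic part of this check, parsing metadata for syntactic correctness, can be sketched as a well-formedness test with Python's standard XML parser. This is only an illustration: it does not validate against a CMDI profile schema, and the sample documents use simplified element names without namespaces.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, None) if xml_text is well-formed XML, else (False, error message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as error:
        return False, str(error)
    return True, None


# Simplified stand-ins for metadata files; real CMDI instances are namespaced and larger.
good = "<CMD><Header/><Resources/><Components/></CMD>"
bad = "<CMD><Header></CMD>"  # <Header> is never closed

good_result = check_well_formed(good)
bad_result = check_well_formed(bad)
print(good_result)
print(bad_result[0])
```

Completeness and soundness still require the manual evaluation described above; a parser can only tell that a file is syntactically valid XML, not that its content is meaningful.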
In case data that is present in the repository “changes” this data is considered to be a new version of the existing data. Thus data producers need to provide the same type and scale of information (metadata, documentation) that was provided for the previous version (at least in case changes occurred).
The metadata provided for each resource to be added to the repository needs to contain basic information on the data depositor (e.g. name of the institution, contact address) and the provided data (e.g. name, date or version, description of the resource itself and of the data format being used, obligatory links to papers). Adding further information (e.g. change logs) is encouraged but not enforced. In case multiple versions of a resource are present in the repository, at least references to previous/newer versions need to be present in the metadata.
Data and metadata are essential and mandatory parts of the digital objects that represent a resource in the repository. This can be considered to be an implicit link between data and metadata. In CMDI metadata is explicitly linked to data and additional metadata via the ResourceProxy-section (https://www.clarin.eu/faq/3462) in a CMDI file.
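The explicit link can be made concrete with a short sketch that lists the ResourceProxy entries of a CMDI file. The sample document is a hand-written, heavily reduced illustration (the handles in it are made up), and elements are matched by local name so that the sketch does not depend on a particular CMDI namespace version.

```python
import xml.etree.ElementTree as ET

# Reduced, hand-written example; the handles are made up for illustration.
SAMPLE = """<CMD xmlns="http://www.clarin.eu/cmd/">
  <Resources>
    <ResourceProxyList>
      <ResourceProxy id="rp1">
        <ResourceType>Resource</ResourceType>
        <ResourceRef>http://hdl.handle.net/12345/example-data</ResourceRef>
      </ResourceProxy>
      <ResourceProxy id="rp2">
        <ResourceType>Metadata</ResourceType>
        <ResourceRef>http://hdl.handle.net/12345/example-metadata</ResourceRef>
      </ResourceProxy>
    </ResourceProxyList>
  </Resources>
</CMD>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree prepends to tag names."""
    return tag.rsplit("}", 1)[-1]


def resource_proxies(cmdi_text):
    """Return (id, type, reference) for every ResourceProxy element in a CMDI document."""
    root = ET.fromstring(cmdi_text)
    proxies = []
    for element in root.iter():
        if local_name(element.tag) != "ResourceProxy":
            continue
        fields = {local_name(child.tag): (child.text or "").strip() for child in element}
        proxies.append((element.get("id"), fields.get("ResourceType"), fields.get("ResourceRef")))
    return proxies


proxies = resource_proxies(SAMPLE)
for proxy in proxies:
    print(proxy)
```

In this sample, the entry of type `Resource` points at data and the entry of type `Metadata` points at a further metadata file, which mirrors the two kinds of links mentioned in the text.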
Currently we do not intend to compare essential properties of different versions of the same file/resource. Keeping track of changes that occurred between different versions of the same file/resource will be up to the data producers. In order to improve usability, we will encourage but not enforce data producers to provide change logs in case new versions of already existing data are ingested into the repository.
Currently there is no explicit check of the identity of depositors since especially in the first phase of CLARIN only data that is provided by known partners will be added to the repository. Once this changes an explicit procedure for the check of depositor identities and “ownership” of the ingested data needs to be specified. External deposits will only be accepted after a due diligence process involving a check of the identity of the depositor and a clarification of all legal issues.
For metadata we rely on the group of emerging standards around CMDI (ISO-CD 24622-1). With the use of the Fedora Commons system the repository aims to be as conformant to OAIS as possible. The repository complies with the OAIS reference model’s tasks and functions. Moreover, the repository is based on the Fedora Commons software, which is compliant with the Reference Model for an Open Archival Information System (OAIS). Besides the integration of the repository into the CLARIN infrastructure, there are currently no further plans for infrastructural development. For this repository the OAIS
Submission Information Package (SIP) consists of:
Archival Information Package (AIP) consists of:
Metadata is available in CMDI via OAI-PMH. The CMDI file of a resource contains links to documents stored in the repository, interfaces (usually webservices in CLARIN) or web applications that facilitate usage of the resource. The CMDI file tied together with these resources can be seen as a representation of a Dissemination Information Package (DIP).
Currently the repository only stores data that is available for free when used for research or teaching purposes. In a first phase, only this kind of data will be added to the repository. This will change in the near future. Template contracts will be used which require resource depositors to specify an appropriate licence (free, free for academic use/research; see http://weblicht.sfs.uni-tuebingen.de/Reports/D-SPIN_R7.3.pdf for details).
Implicitly the regulations of the DFN-AAI (https://www.aai.dfn.de/ and https://www.aai.dfn.de/en/der-dienst/degrees-of-reliance/) need to be mentioned since access to resources stored in the repository that shall only be available for academic use/research will be granted using this AAI infrastructure (based on Shibboleth).
Data providers need to make sure that IPR and personal rights (e.g. mentioning of people in context with personal information/events/... in texts) are respected in their deposited data. Access-restricted resources (limited to academic use/research; see http://weblicht.sfs.uni-tuebingen.de/Reports/D-SPIN_R7.3.pdf for details) are protected via Shibboleth and are only available to persons who are able to log in through IDPs operated at institutions taking part in the DFN-AAI or similar AAI federations that are part of CLARIN.
The system does not allow the integration of data into the repository without the specification of access criteria and without providing an appropriate license. These license conditions are available to the users via CMDI metadata. In case of misuse, the only thing that can be practically done is to deny the user further access to the repository and to make the research community aware of the misuse.
Background information: The repository is part of a CLARIN-D center. CLARIN aims at "Providing linguistic data, tools and services in an integrated, interoperable and scalable infrastructure for the social sciences and humanities." (http://de.clarin.eu/en/home-en). The resources provided by the repository are intended to be used for scientific purposes (teaching, research, …). An example of misuse may be the usage of these resources in a commercial context without knowledge/consent of the resource owner.