Implementation of the Data Seal of Approval

The Data Seal of Approval board hereby confirms that the Trusted Digital repository HZSK Repository complies with the guidelines version 2010 set by the Data Seal of Approval Board.
The afore-mentioned repository has therefore acquired the Data Seal of Approval of 2010 on May 3, 2013.

The Trusted Digital repository is allowed to place an image of the Data Seal of Approval logo corresponding to the guidelines version date on their website. This image must link to this file which is hosted on the Data Seal of Approval website.

Yours sincerely,

 

The Data Seal of Approval Board

Assessment Information

Guidelines Version: 2010 | June 1, 2010
Guidelines Information Booklet: DSA-booklet_2010.pdf
All Guidelines Documentation: Documentation
 
Repository: HZSK Repository
Seal Acquiry Date: May 3, 2013
 
For the latest version of the awarded DSA
for this repository please visit our website:
http://assessment.datasealofapproval.org/seals/
 
Previously Acquired Seals:
  • Seal date: May 3, 2013
    Guidelines version: 2010 | June 1, 2010
 
This repository is owned by:
  • Hamburger Zentrum für Sprachkorpora
    Max-Brauer-Allee 60
    22765 Hamburg
    Hamburg
    Germany

    T 0049 (40) 42838-6425
    E corpora@uni-hamburg.de
    W http://www.corpora.uni-hamburg.de/

Assessment

1. The data producer deposits the research data in a data repository with sufficient information for others to assess the scientific and scholarly quality of the research data and compliance with disciplinary and ethical norms.

Minimum Required Statement of Compliance:
3. In progress: We are in the implementation phase.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

The HZSK repository contains mainly spoken language corpora. For projects just starting to compile corpora, the HZSK provides assistance in all areas related to data formats, legal issues and compliance with the regulations and requirements given in CLARIN.
For projects that are finished with their data collection already, the HZSK always conducts a maintenance and curation process to ensure that data and metadata comply with the repository requirements.

The minimal requirements for corpora BEFORE that curation process are:

- data provided in a standardized format or a comprehensive documentation of a proprietary format
- Metadata in a well documented and accessible format
- Contact information on the data depositor / data producer
- a statement on the legal status of the resource
- a contract about the means and procedure of accessing the data
- compliance with internal and CLARIN guidelines concerning scientific and scholarly quality
- information on how the data was originally created

The depositor is required to deliver metadata according to the HZSK core metadata schema. In cooperation with the depositor, the HZSK then creates a description of the corpus. We encourage depositors to provide further references to work describing the resource and its creation in detail.

The depositor alone is responsible for the consent of the participants in the data and for compliance with ethical codes of conduct and national and international legal regulations.

In the curation process,

- data is converted into standardized formats (EXMARaLDA, ELAN/EAF, TEI)
- metadata is converted into standardized formats (EXMARaLDA corpus metadata, CMDI metadata)

Data sharing and reuse is promoted by providing access to the data (download, webservices) and metadata (via the OAI-PMH protocol and the repository itself) free of charge. The CLARIN infrastructure contains software components such as the VLO (http://www.clarin.eu/vlo/) which enable users to browse and search through combined catalogs (metadata of all CLARIN repositories). New resources are promoted through announcements on relevant mailing lists.

For most resources, it is necessary to request (free) access to the corpus and state the nature of the intended work.
Example: http://www.corpora.uni-hamburg.de/sfb538/bipode_nutzungsvereinbarung.pdf
For these resources, citations are requested in publications using these resources. All resources are described in a comprehensive way with references to publications describing the resources.
Example: http://www.corpora.uni-hamburg.de/sfb538/en_h9_hacaspa.html

Since the HZSK hosts very heterogeneous resources, no claim can be made about its general reusability. Many of the resources are requested regularly.
See: http://www.corpora.uni-hamburg.de/?Ressourcen_und_Projekte::Korpora

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

2. The data producer provides the research data in formats recommended by the data repository.

Minimum Required Statement of Compliance:
3. In progress: We are in the implementation phase.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

All spoken language corpora hosted at the repository have been converted into best practice formats during the curation process outlined in DSA guideline 1.

The XML-based EXMARaLDA format is widely used for transcriptions with (mainly manually created) annotations and can be considered best practice. For all EXMARaLDA transcription files in a hosted corpus, the HZSK provides corresponding files in the best practice formats TEI and EAF and, depending on the original transcription guidelines used, sometimes also in the CHAT format. These files were generated automatically during the corpus curation process. Corpus metadata and metadata on communications (sessions), speakers and all physical files are stored in the XML-based EXMARaLDA metadata format, which is also part of the EXMARaLDA system. We store audio recordings in the uncompressed WAV format. All main corpus files are stored in open, standard or best practice formats.

Before accepting further corpora into the repository, we will require the corpus to be in the EXMARaLDA format or to be convertible into the above-mentioned formats in a corpus curation process as described in DSA guideline 1. Before the HZSK decides on conducting a curation process, detailed information about the file formats and the tools and methods by which the files were created is requested.

The cost-benefit analysis leading to the decision on whether to perform the curation is based on a publicly available guideline.

The HZSK does not allow corpora to be deposited into the repository without previous curation. Since the EXMARaLDA software tools require valid files, we ensure compliance with the format requirements by using them in the curation process.
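
The idea of rejecting invalid files before ingest can be illustrated with a minimal sketch. It only checks XML well-formedness with the Python standard library; it is not the EXMARaLDA tooling and does not validate against any EXMARaLDA schema. The function name `malformed_xml` and the file names are hypothetical.

```python
import xml.etree.ElementTree as ET


def malformed_xml(documents):
    """Return the names of documents whose XML text fails to parse.

    `documents` maps a (hypothetical) file name to its XML content.
    This checks well-formedness only, not validity against a schema.
    """
    bad = []
    for name, text in documents.items():
        try:
            ET.fromstring(text)
        except ET.ParseError:
            bad.append(name)
    return bad


if __name__ == "__main__":
    candidates = {
        "session1.exb": "<transcription><event>hello</event></transcription>",
        "session2.exb": "<transcription><event>hello</transcription>",
    }
    # Only the second document has a mismatched tag.
    print(malformed_xml(candidates))
```

A real curation step would additionally rely on the format-specific checks built into the EXMARaLDA tools, as stated above.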

References:
The EXMARaLDA format: http://www.exmaralda.org/
The corpora hosted at the HZSK (information page, not part of the repository): http://www.corpora.uni-hamburg.de/sfb538/en_overview.html
Guideline for the cost-benefit analysis for corpora curation (German draft): http://www.corpora.uni-hamburg.de/documents/leitfaden_draft.pdf

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

3. The data producer provides the research data together with the metadata requested by the data repository.

Minimum Required Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

Since all corpora that are to be ingested into the HZSK repository undergo a manual curation process carried out by HZSK staff, there is little automated guidance for data depositors.
As part of that curation process, we try to obtain as much of the missing metadata as possible from the data depositors.

A cost-benefit analysis leading to the decision on whether to perform the curation is based on a publicly available guideline.

For guidance for projects just starting with their data compilation, please refer to DSA guideline 1.

We also provide a metadata schema with recommended metadata elements, the EXMARaLDA core metadata schema.

For depositors, metadata can be created on a file level using EXMARaLDA's Corpus Manager (Coma).

Coma datasets - which include OLAC metadata for resource description - are then converted into CLARIN-compliant CMDI metadata.

OLAC metadata: http://www.language-archives.org/OLAC/metadata.html
CMDI component metadata: http://www.clarin.eu/cmdi
Guideline for the cost-benefit analysis for corpora curation (German draft): http://www.corpora.uni-hamburg.de/documents/leitfaden_draft.pdf
EXMARaLDA core metadata schema: http://www.corpora.uni-hamburg.de/documents/HZSKcoremetadataset.pdf
EXMARaLDA Corpus Manager: http://www.exmaralda.org/en_coma.html

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

4. The data repository has an explicit mission in the area of digital archiving and promulgates it.

Minimum Required Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
This guideline can be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

The mission of the repository is to serve as the repository of a CLARIN-D resource center of type B. The mission of CLARIN-D is to provide “linguistic data, tools and services in an integrated, interoperable and scalable infrastructure for the social sciences and humanities” (http://de.clarin.eu/en/home-en.html). Such a resource center therefore has to operate a repository in which data, tools and the corresponding metadata are archived on a long-term basis.

The "Satzung" (articles of association) of the HZSK include explicit references to the tasks of archiving as well as publishing language resources and to the underlying methodology.

§2 of the Satzung states:

"§2 Goals and Mission

1. The HZSK promotes and coordinates computer based research and teaching in linguistics and related disciplines at the University of Hamburg. Its main aims are:

a. Ensuring sustainability, i.e. long-term usability and availability of empirical digital linguistic data, created and used at the University of Hamburg for research and teaching purposes. (...)"

The HZSK regularly presents its activities and organizes workshops and training courses to introduce people to the underlying methodology.

References:
HZSK-Satzung: http://www.corpora.uni-hamburg.de/downloads/Satzung.pdf

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

5. The data repository uses due diligence to ensure compliance with legal regulations and contracts including, when applicable, regulations governing the protection of human subjects.

Minimum Required Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

The repository is part of the Hamburger Zentrum für Sprachkorpora, a research centre within the University of Hamburg, which is (like all public German universities) a "Körperschaft des öffentlichen Rechts" (corporation under public law).

The repository uses model contracts with data producers. The contracts are adapted regarding access to the resource as requested by the rights holders. Depending on the contract with the rights holder, one of four levels of access control applies to a given resource, ranging from an email request to a signed contract sent by mail to the HZSK. The contracts used with the data consumers are all based on one model contract and altered as requested by the rights holder.

The conditions of use are not repository-wide but corpus-specific and therefore published on each corpus description page and further specified in the contract/conditions of use signed by the data consumer.

If the conditions are not complied with, the user is denied further access to the repository. Further legal measures remain reserved to the data depositors.

Essentially all data hosted at the HZSK repository can be considered data with disclosure risk. Therefore, all data is managed, stored and distributed with great care. Access to the repository is only possible for those corpora for which access has been granted as described above. The means of distribution of sensitive data is at the discretion of the rights holder, who is also responsible for the consent of the participants in the data and for compliance with ethical codes of conduct and national and international legal regulations. Our model contract for the conditions of use contains a clause in which the data consumer agrees not to reveal the identity of any participant, nor to publish data or parts thereof in a manner that would make the reconstruction of a person's identity possible. As a part of the curation process it is also possible to anonymize files if requested by the rights holder.

The HZSK staff has received training and guidelines to handle requests for access according to our four-level access system.

References:
The HZSK corpus access guidelines (in German): http://www.corpora.uni-hamburg.de/documents/Korpusfreigaberichtlinien.pdf

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

6. The data repository applies documented processes and procedures for managing data storage.

Minimum Required Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
This guideline can be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

The repository runs on a virtual server hosted by the computing centre of the University of Hamburg ("Regionales Rechenzentrum", RRZ).
There are nightly backups of the whole server. In case of a failure, the virtual machine can be restored within 24 hours.

Bitstream preservation is also provided by the computing centre of the University of Hamburg, for a period of not less than 10 years.

Documentation (in German):

- Virtualization: http://www.rrz.uni-hamburg.de/fileadmin/virtueller-server/virtuelle-Server-Ordnung-RRZ220307.pdf
- Backup: http://www.rrz.uni-hamburg.de/serversysteme/unix-server/adsm-backup.html
- Backup-Policy of the computing centre: http://www.corpora.uni-hamburg.de/documents/TSM-Policy.pdf

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

7. The data repository has a plan for long-term preservation of its digital assets.

Minimum Required Statement of Compliance:
3. In progress: We are in the implementation phase.
This guideline can be outsourced.

Applicant Entry

Statement of Compliance:
3. In progress: We are in the implementation phase.
Evidence:

The digital assets of the HZSK (as well as the actual repository) are stored at the computing centre of the University of Hamburg (Regionales Rechenzentrum, RRZ). See also guideline 6.
The RRZ commits itself to ensuring bitstream preservation for 10 years from the moment assets are first stored*.
All main corpus files in the HZSK repository are XML-based, allowing for easy conversion into other formats.

References:
[1] Requirements of the computing centre for the long-term archiving of database applications (in German): http://www.gwiss.uni-hamburg.de/Datenbanken.html

* As long as the requirements in [1] are met, a long-term (10 years) archiving commitment is given. The HZSK repository meets the requirements in [1] by using a virtual machine for the repository. However, the requirements would also be met for the plain data in the repository (possibility to export the data, Unicode encoding, XML-based modeling, documentation).
In practice, the ten-year period starts from the moment when there is a change in the terms and regulations that is not met by the repository, or when the HZSK goes out of service.

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

8. Archiving takes place according to explicit work flows across the data life cycle.

Minimum Required Statement of Compliance:
3. In progress: We are in the implementation phase.
This guideline can be outsourced.

Applicant Entry

Statement of Compliance:
3. In progress: We are in the implementation phase.
Evidence:

While the goal of the curation process for the resources at the HZSK is always the same (creating an EXMARaLDA corpus for ingestion into the repository), the process is highly individual depending on the source data.
The procedural documentation for archiving data is not published. As many steps are performed (semi)automatically, documentation of the curation process and the ingest into the repository is sometimes only available in the form of commented data processing programs or scripts.
The cost-benefit analysis leading to the decision on whether to perform the curation is based on a publicly available guideline (see answers to DSA guidelines 2 and 3).
We plan to further develop and standardize our workflows and will document these versions of our archiving procedures accordingly.

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

9. The data repository assumes responsibility from the data producers for access and availability of the digital objects.

Minimum Required Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

The repository has contracts with the rights holder/data producer of each corpus. The contracts are all based on one model contract with details on access adapted according to the requirements of the rights holder. The repository will not allow any deposit of data without a signed agreement specifying the handling of the data and access to it in detail.
There is a backup procedure for the virtual server hosting the repository. The stored version can be used in case of any severe issues with the current system (see also DSA guideline 6).

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

10. The data repository enables the users to utilize the research data and refer to them.

Minimum Required Statement of Compliance:
2. Theoretical: We have a theoretical concept.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

Metadata for the corpora hosted at the HZSK can be (and is) harvested via OAI-PMH. The collected metadata is used in the back-end of web applications such as the VLO (http://www.clarin.eu/vlo/), which provide a central starting point when searching for resources in the CLARIN infrastructure.
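
As an illustration of what such harvesting involves, the following minimal sketch builds an OAI-PMH `ListRecords` request and extracts record identifiers from a response. The endpoint URL, the `cmdi` metadata prefix and the sample response are assumptions made for the example; they are not taken from the HZSK repository. No network access is performed.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespace defined by the OAI-PMH 2.0 protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def build_list_records_url(base_url, metadata_prefix="cmdi"):
    """Return a ListRecords request URL for an OAI-PMH endpoint.

    The default prefix "cmdi" is an assumption; a harvester would
    normally discover the supported prefixes via ListMetadataFormats.
    """
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_identifiers(response_xml):
    """Extract the header identifiers from a ListRecords response."""
    root = ET.fromstring(response_xml)
    return [el.text for el in root.iterfind(".//oai:header/oai:identifier", OAI_NS)]


# Hand-written, heavily shortened sample response (hypothetical identifiers).
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:corpus-1</identifier></header></record>
    <record><header><identifier>oai:example.org:corpus-2</identifier></header></record>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("https://repository.example.org/oai"))
    print(record_identifiers(SAMPLE_RESPONSE))
```

A production harvester would also follow `resumptionToken` elements to page through large result sets.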

To use the actual corpus data, users have to sign an end-user-agreement and will receive password-protected access to a corpus. All corpora are provided in different tool formats that are common in the research community dealing with spoken language corpora (EXMARaLDA, Folker, Praat, ELAN) as well as exchange and presentation formats (TEI, MS-Word, PDF).

The EXMARaLDA system provides the search and analysis tool EXAKT that allows deep search in EXMARaLDA corpora (transcription data, annotations and metadata).

An interface for CLARIN's federated search facility is being developed and should be ready for use by Q3/2013.

The repository itself does not offer a persistent identifier service on its own but makes use of a common CLARIN PID service based on the handle system.
The usage of PIDs is mandatory for resources in CLARIN, thus all resources added to the repository can be referenced using PIDs.
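
For illustration, a handle-based PID can be turned into a resolvable link through the public Handle.Net proxy at hdl.handle.net. The sketch below assumes that proxy is used; the handle value shown is a made-up placeholder, and the function name `handle_to_url` is hypothetical.

```python
from urllib.parse import quote

# Public proxy server of the Handle System.
HANDLE_PROXY = "https://hdl.handle.net/"


def handle_to_url(handle):
    """Turn a handle of the form 'prefix/suffix' into a resolver URL.

    An optional leading 'hdl:' scheme is stripped first.
    """
    if handle.startswith("hdl:"):
        handle = handle[len("hdl:"):]
    prefix, separator, suffix = handle.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError(f"not a handle of the form prefix/suffix: {handle!r}")
    # Percent-encode unusual characters but keep the slash separators.
    return HANDLE_PROXY + quote(handle, safe="/")


if __name__ == "__main__":
    # Placeholder handle, not a real PID of the repository.
    print(handle_to_url("hdl:12345/EXAMPLE-0001"))
```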

References:
Example metadata record: http://goo.gl/p4vD2
Same metadata record as found in the Virtual Language Observatory (VLO): http://goo.gl/TtDV7
Screenshot: http://goo.gl/gBSIF
EXMARaLDA search and analysis tool (EXAKT): http://www.exmaralda.org/en_exakt.html
CLARIN PID-Service description: https://www.clarin.eu/files/pid-CLARIN-ShortGuide.pdf
Handle-System: http://www.handle.net/
Handle-System Implementation: http://handle.gwdg.de:8080/pidservice/

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

11. The data repository ensures the integrity of the digital objects and the metadata.

Minimum Required Statement of Compliance:
3. In progress: We are in the implementation phase.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
3. In progress: We are in the implementation phase.
Evidence:

The HZSK repository uses Fedora Commons' ability to automatically generate checksums for ingested resources.

The repository will only allow manual versioning of corpora. The HZSK controls what gets deposited and will release new versions if major changes such as major corrections or completions, further annotation layers etc. are made to the data.

We are about to implement Nagios monitoring for the repository, consistent with the other CLARIN-D centers.

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

Are checksums of resources regularly compared? Generating them upon ingest is one thing but in order to detect data corruption, they would have to be generated periodically and compared to previously generated checksums.

Are previous versions still available?
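
The periodic comparison asked about here could look roughly like the following sketch. It is not Fedora Commons functionality or its API; it assumes a manifest that maps file paths to the SHA-256 checksums recorded at ingest, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(manifest):
    """Return the paths that are missing or whose checksum has changed.

    `manifest` maps a file path to the checksum recorded at ingest
    (an assumed bookkeeping format for this example).
    """
    problems = []
    for path, recorded in manifest.items():
        if not Path(path).is_file() or sha256_of(path) != recorded:
            problems.append(path)
    return problems
```

Run on a schedule (for example from a nightly job), an empty result means every file still matches the checksum stored at ingest.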

12. The data repository ensures the authenticity of the digital objects and the metadata.

Minimum Required Statement of Compliance:
3. In progress: We are in the implementation phase.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

The repository does not allow any uncontrolled changes to deposited data. New versions will only be created and ingested manually by the HZSK staff when data changes require a new version. Data producers of the current corpora have deposited their data in the repository as an archive and did not wish to further develop their data. For future data collections, versioning issues will be discussed and individual strategies agreed upon when data producers sign the contract.
The repository maintains information on data provenance and versions ingested into the repository.
The Coma format is the part of the EXMARaLDA system used to manage metadata on corpora, communications/sessions, speakers and physical files of the corpus. Since all spoken corpora are based on a Coma file, the existing metadata is always included in the resources.
Since versioning is controlled and conducted by the HZSK, all changes to the data can be documented for each version of the data. Since EXMARaLDA data is XML, more detailed information on differences between versions can be obtained using common XML editors or tools.
The repository does not allow anonymous depositing of resources. Before a corpus is added to the collection, the HZSK will meet with the rights holder in person to discuss the curation process and the details on the access to the deposited resources.
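
As a minimal sketch of how differences between two versions of an XML file can be inspected, the following uses a plain line-based diff from the Python standard library. It is an illustration only, not the procedure used by the HZSK; the element names in the sample are illustrative, and a line-based diff is sensitive to formatting changes that an XML-aware tool would ignore.

```python
import difflib


def xml_version_diff(old_text, new_text, old_label="v1", new_label="v2"):
    """Return a unified diff (list of lines) between two XML texts."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=old_label,
            tofile=new_label,
            lineterm="",
        )
    )


if __name__ == "__main__":
    version_1 = "<transcription>\n  <event>hello</event>\n</transcription>"
    version_2 = "<transcription>\n  <event>hallo</event>\n</transcription>"
    for line in xml_version_diff(version_1, version_2):
        print(line)
```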

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

13. The technical infrastructure explicitly supports the tasks and functions described in internationally accepted archival standards like OAIS.

Minimum Required Statement of Compliance:
3. In progress: We are in the implementation phase.
This guideline can be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

The repository complies with the OAIS reference model’s tasks and functions.

The repository builds on the Fedora Commons software, which is compliant with the Reference Model for an Open Archival Information System (OAIS) due to its ability to ingest Submission Information Packages (SIPs) and disseminate Dissemination Information Packages (DIPs) in standard container formats.
The data consumer has direct access to the archived objects via the web, provided that access requirements have been met.

The repository is part of the CLARIN infrastructure and will fulfill current and future requirements decided on by the CLARIN board.

References:
- Reference Model for an Open Archival Information System (OAIS), Recommended Practice, CCSDS 650.0-M-2 (Magenta Book) Issue 2, June 2012 http://public.ccsds.org/publications/archive/650x0m2.pdf
- Fedora Commons: http://fedora-commons.org/

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

14. The data consumer complies with access regulations set by the data repository.

Minimum Required Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
3. In progress: We are in the implementation phase.
Evidence:

The End User Licence(s) used with data consumers depend on the rights holders' requirements. If desired, rights holders can require data consumers to sign contracts specifying the conditions of use for a particular resource in detail as defined by the rights holder.
Other resources are free to use for academic and teaching purposes after registration. By accessing these resources with the access data supplied by the repository, users acknowledge these restrictions.
In case of misuse, the user is denied further access to the repository. Further legal measures remain reserved to the data depositors.

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

15. The data consumer conforms to and agrees with any codes of conduct that are generally accepted in higher education and scientific research for the exchange and proper use of knowledge and information.

Minimum Required Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

There are a number of specific codes of conduct that are applicable to parts of the repository, e.g. the DFG code of conduct. The codes of conduct are in line with generally accepted codes of conduct for research data in Germany. Any data user is bound by the terms and conditions of use of the selected resource as soon as they agree to the license agreement of that resource.

In case of misuse, the user is denied further access to the repository. Further legal measures remain reserved to the data depositors.

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments:

16. The data consumer respects the applicable licenses of the data repository regarding the use of the research data.

Minimum Required Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
This guideline cannot be outsourced.

Applicant Entry

Statement of Compliance:
4. Implemented: This guideline has been fully implemented for the needs of our repository.
Evidence:

For all corpora, specific restrictions are given by their respective license.
According to our four-level access guidelines, the data consumer might need to make explicit statements about the intended usage of the data before access is given. The depositor then decides whether access is granted.
In case of misuse, the user is denied further access to the repository. Further legal measures remain reserved to the data depositors.

References:
The HZSK corpus access guidelines (in German): http://goo.gl/x0Tet

Reviewer Entry

Accept or send back to applicant for modification:
Accept
Comments: