diff --git a/C14-Storage-Integrity.md b/C14-Storage-Integrity.md
deleted file mode 100644
index f4df10f..0000000
--- a/C14-Storage-Integrity.md
+++ /dev/null
@@ -1,74 +0,0 @@
-# C14.1. Processes and documents to ensure that the repository staff have a clear understanding of all storage locations and how they are managed.
-
-Tethys RDR (Research Data Repository) is a data archiving service that aims to provide secure, sustainable, and long-term storage of research data. This means that Tethys RDR is designed to store research data in a way that preserves its integrity, security, and accessibility over the long term, even as technology and data formats change. To achieve this, various conditions must be met. The following processes and documents have been developed for the repository staff to ensure that data is stored securely, backed up regularly, and preserved for the long term.
-
-## **Disaster recovery plan**
-
-The [disaster recovery plan for TETHYS](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/DisasterManagement) specifies how the repository will be recovered from a data loss or system failure. The plan includes procedures for restoring data files from backups, recovering the database, and restoring access to the repository. The recovery plan covers scenarios for recovering specific data files (if a checksum test fails), recovering whole folders containing all files of a specific dataset, and restoring a file in all available backup versions. In addition, the restoration of the entire IT system using Docker containers is described in detail.
-
-## **Database Model**
-
-The [TETHYS database model](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/Database) is based on a relational database model (PostgreSQL). The database model includes data constraints and validation rules that help to ensure data integrity and to prevent errors, duplication, and inconsistencies in the data.
The model includes several tables that store information about different types of research data, with related tables for storing metadata such as licenses, authors, contributors, abstracts, titles, subjects, collections, data files, projects, and users with permissions.
-
-## **Data Architecture Diagram**
-
-With the help of the [Data Architecture Diagram](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/DataArchitectureDiagram), the whole repository staff has a clear understanding of the management of all storage locations. TETHYS collects and manages scanned geological maps, spatial data sets, and other types of research outputs. The Data Architecture Diagram provides information on how to prepare data sources for deposit, such as licenses, correct use of keywords, file formats, and file upload limits.
-
-**Data storage:** The Data Architecture Diagram also describes the virtual infrastructure ([data storage](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/DataArchitectureDiagram#3-storage-infrastructure)) used to store the data and metadata in the repository. By using PostgreSQL, TETHYS is able to manage large volumes of metadata and provide fast and secure access to this information. The data files are stored on an Ubuntu 22.04 file server with an ext4 partition. The corresponding md5 and sha512 file checksums are also stored in the database.
-
-**Data Discovery:** TETHYS supports data discovery in various ways. The datasets can always be found through the Data Frontend, https://tethys.at/search, by browsing by subject, author, language, or year, or by searching inside the metadata attributes title, author, or keywords. All visible metadata are indexed and searchable by [Solr](https://tethys.at/solr/rdr_data/select?q=*%3A*).
Tethys metadata and file downloads can be queried through a REST API (Representational State Transfer Application Programming Interface), which allows repository staff to interact with the Tethys system and retrieve metadata and data files programmatically.
-
-For **metadata management**, the Data Architecture Diagram describes the specific types of metadata that should accompany the data, including the format and structure of the metadata. Tethys RDR supports three metadata standards for the metadata export (**Dublin Core, DataCite and ISO19139**).
-
-**Security and Access Control:** To protect sensitive data from unauthorized access, TETHYS provides an [Access Control List](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/Database#the-acl-tables-used-by-tethys-are) (ACL) system that is used to manage users, user roles, and permissions.
-
-**System Integration** covers connecting the research repository to other systems, such as project management tools, data platforms, and reporting tools. TETHYS exposes its metadata through the Open Archives Initiative Protocol for Metadata Harvesting [(OAI-PMH)](https://www.tethys.at/oai?verb=Identify), so any other data provider can harvest it. An example is the Bielefeld Academic Search Engine (BASE): <https://www.base-search.net/Search/Results?q=coll:fttethysrdr&refid=dctablede>. Matomo, an open-source web analytics platform for tracking user behavior on a website, is used to collect usage statistics for TETHYS.
-
-# C14.2. The repository's strategy for multiple copies.
-
-IBM Spectrum Protect (formerly known as Tivoli Storage Manager or TSM) is used to protect the data stored in the Tethys Repository.
IBM Spectrum Protect provides a comprehensive backup and recovery solution for research data, which helps ensure data availability, integrity, and recoverability in case of a disaster or data loss. For TETHYS, up to 90 incremental versions of each data file are backed up and available for recovery. This means that there are up to 90 recoverable instances of each file. These versions are typically created based on the backup schedule and retention policies defined by the computer center of GeoSphere Austria.
-
-Database backup and recovery is also described in the [disaster management plan](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/DisasterManagement#create-db-backup).
-
-# C14.3. The risk management techniques used to inform the strategy.
-
-Risk management techniques for a research data repository involve identifying potential risks to the data, assessing the likelihood and impact of those risks, and implementing strategies to mitigate or manage those risks. The following risk management techniques are used for TETHYS:
-
-- **Access Controls** to protect against unauthorized access: Access to the administrative backend is limited to authorized users only. To provide a secure connection, only HTTPS is allowed. Fail2ban protects the repository server from brute-force attacks, denial-of-service attacks, and other malicious activities. It works by monitoring log files and dynamically updating firewall rules to block IP addresses that exhibit suspicious behavior.
-- **Back up data regularly**: Up to 10 incremental backups of the data are maintained to ensure that it can be recovered in case of data loss or corruption.
-- **Encrypt sensitive data**: Sensitive data, such as personally identifiable information and passwords, are encrypted inside the PostgreSQL database.
-- **Monitor activity logs**: Activity logs of the web server are monitored via fail2ban to detect suspicious activities, such as unauthorized access attempts or data exfiltration.
-- **Implement data retention policies**: Policies for the data retention and data deletion processes are implemented.
-- **Conduct regular security assessments**: The security of the repository is regularly assessed by the repository staff to identify potential vulnerabilities and implement strategies to address them.
-
-The potential disruptive physical threats, which can occur at any time and affect the normal business process, are listed below:
-
-- **Fire**: Fire suppression systems are installed, and there are fire and smoke detectors on all floors.
-- **Electric power failure**: Redundant UPS systems with standby generators are available. (Monitoring: 24/7)
-- **Communication Network loss**: Unfortunately, there is no redundant repository server in case of network loss. By monitoring the network in real time and receiving alerts when network loss is detected, the IT department can quickly investigate and resolve issues before they impact end users.
-- **Flood**: All critical equipment is located on the 2nd floor.
-- **Sabotage**: Only authorized IT personnel have access to the server room.
-
-# C14.4. Procedures for handling and monitoring deterioration of storage media.
-
-TETHYS RDR calculates internal checksums during the ingestion workflow. These checksums ensure that ingested data has not been altered or damaged. A checksum is a short sequence of bits or bytes that is calculated from the data using an algorithm. If even a single bit of the data changes, the checksum will also change, indicating that the data has been changed unintentionally on the file store. During the file upload, TETHYS calculates and stores **md5** and **sha512** checksums for each file.
-
-# C14.5. Procedures to ensure that data and metadata are only deleted as part of an approved and documented process.
-
-Tethys Repository assigns a DOI to each published dataset, which means that once the data is published, it cannot be changed or deleted. In exceptional cases (violation of legal rights or subject of a justified complaint), access to the datasets in question may be blocked. In this case, only the DOI can be cited; it then redirects to a landing page that explains why the dataset had to be removed. This ensures that the citability and scientific traceability of the dataset are maintained even if access to the dataset itself is restricted. See also the general guideline for publishing research data in the [manual p.14](https://data.tethys.at/docs/HandbuchTethys.pdf#page=15).
-
-The deletion of a record initiates the following process:
-
-1. Conduct an investigation: The repository may conduct an investigation to determine whether the dataset or metadata record is in fact in violation of legal rights or is the subject of a legitimate complaint.
-1. Notify the submitter: If the dataset or metadata record is found to be in violation, the repository may attempt to notify the depositor of the issue and the reason for removal.
-1. Remove the dataset or metadata record: If the violation is confirmed, the repository administrator removes the dataset or metadata record from its collection.
-1. Document the removal: The repository administrator documents the removal of the dataset or metadata record, including the reason for removal and any communications with the submitter.
-1. Review and revise policies: The repository may review and revise its policies and procedures to prevent similar violations from occurring in the future.
-
-![Metadata_deletion_process](./images/metadata_deletion_process.drawio.svg)
-
-# C14.6. Any checks (i.e. fixity checks) used to verify that a digital object has not been altered or corrupted from deposit to use.
-
-For internal fixity checks, TETHYS Repository operates an automated cron job that routinely tests all the md5 and sha512 checksums for data stored by the TETHYS Repository and produces a regular report with appropriate warnings if silent data corruption is detected in the storage layer. The corresponding code of the cron job can be downloaded from the [TETHYS code repository](https://gitea.geologie.ac.at/geolba/tethys.backend/src/branch/master/commands/ValidateChecksum.ts).
-
-If the web backend is in production mode, the logger sends the error messages by email to the administrator. Depending on these warnings, the administrators will investigate the cause of the changes, and the corrupted files will be restored from the backup (IBM Spectrum Protect).
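
The fixity check described in C14.6 can be sketched roughly as follows. This is a minimal Python illustration, not the repository's actual `ValidateChecksum.ts` command: the function names `file_digests` and `verify_fixity` are hypothetical, and it assumes the stored checksums are available as a plain dictionary of hex strings rather than being read from the PostgreSQL database.

```python
import hashlib
from pathlib import Path


def file_digests(path: Path, chunk_size: int = 1 << 20) -> dict[str, str]:
    """Return the hex md5 and sha512 digests of a file, reading it in chunks."""
    md5 = hashlib.md5()
    sha512 = hashlib.sha512()
    with path.open("rb") as handle:
        # Reading in chunks keeps memory use flat for large data files.
        while chunk := handle.read(chunk_size):
            md5.update(chunk)
            sha512.update(chunk)
    return {"md5": md5.hexdigest(), "sha512": sha512.hexdigest()}


def verify_fixity(path: Path, stored: dict[str, str]) -> list[str]:
    """Return the names of the algorithms whose stored digest no longer matches.

    `stored` is assumed to map an algorithm name ("md5", "sha512") to the
    hex digest recorded at upload time. An empty list means the file is intact.
    """
    current = file_digests(path)
    return [
        name
        for name, value in stored.items()
        if current.get(name) != value.lower()
    ]
```

A scheduled job could call `verify_fixity` for every stored file and report any non-empty result to the administrators, who would then restore the affected file from backup. Checking both digests mirrors the two checksums the document says are stored for each file.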