diff --git a/C14-Storage-Integrity.md b/C14-Storage-Integrity.md
new file mode 100644
index 0000000..f4df10f
--- /dev/null
+++ b/C14-Storage-Integrity.md
@@ -0,0 +1,74 @@
+# C14.1. Processes and documents to ensure that the repository staff have a clear understanding of all storage locations and how they are managed.
+
+Tethys RDR (Research Data Repository) is a data archiving service that aims to provide secure, sustainable, and long-term storage of research data. Tethys RDR is designed so that research data is stored in a way that preserves its integrity, security, and accessibility for the long term, even as technology and data formats change over time. To achieve this, various conditions must be met. The following processes and documents have been developed for the repository staff, ensuring that data is stored securely, backed up regularly, and preserved for the long term.
+
+## **Disaster recovery plan**
+
+The [disaster recovery plan for TETHYS](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/DisasterManagement) specifies how the repository will be recovered from a data loss or system failure. The plan includes procedures for restoring data files from backups, recovering the database, and restoring access to the repository. It covers scenarios for recovering individual data files (if a checksum test fails), restoring whole folders containing all files of a specific dataset, and restoring a file in all available backup versions. In addition, the restoration of the entire IT system using Docker containers is described in detail.
+
+## **Database Model**
+
+The [TETHYS database model](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/Database) is based on a relational database model (PostgreSQL). The database model includes data constraints and validation rules that help to ensure data integrity. This helps to prevent errors, duplication, and inconsistencies in the data. The model includes several tables that store information about different types of research data, with related tables for storing metadata such as licenses, authors, contributors, abstracts, titles, subjects, collections, data files, projects, and users with permissions.
+
+## **Data Architecture Diagram**
+
+With the help of the [Data Architecture Diagram](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/DataArchitectureDiagram), the whole repository staff has a clear understanding of the management of all storage locations. TETHYS collects and manages scanned geological maps, spatial data sets, and other types of research outputs. The Data Architecture Diagram provides information on how to prepare data sources for deposit, such as licenses, correct use of keywords, file formats, and file upload limits.
+
+**Data storage:** The Data Architecture Diagram also describes the virtual infrastructure ([data storage](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/DataArchitectureDiagram#3-storage-infrastructure)) used to store the data and metadata in the repository. By using PostgreSQL, TETHYS is able to manage large volumes of metadata and provide fast and secure access to this information. The data files are stored on an Ubuntu 22.04 file server with an ext4 partition. Corresponding md5 and sha512 file checksums are also stored in the database.
+
+**Data Discovery:** TETHYS supports data discovery in various ways. The datasets can always be found through the Data Frontend, https://tethys.at/search, by browsing by subject, author, language, or year, or by searching the metadata attributes title, author, and keywords. All visible metadata are indexed and searchable by [Solr](https://tethys.at/solr/rdr_data/select?q=*%3A*). Tethys metadata and file downloads can also be queried via a REST API (Representational State Transfer Application Programming Interface), which allows repository staff to interact with the Tethys system and retrieve metadata and data files programmatically.
+
+For the **metadata management**, the Data Architecture Diagram provides information on the specific types of metadata that should be included with the data, such as their format and structure. Tethys RDR supports three metadata standards for the metadata export (**Dublin Core, DataCite and ISO19139**).
+
+**Security and Access Control:** To protect sensitive data from unauthorized access, TETHYS provides an [Access Control List](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/Database#the-acl-tables-used-by-tethys-are) (ACL) system that is used to manage users, user roles, and permissions.
+
+**System Integration:** The research repository is integrated with other systems, such as project management tools, data platforms, and reporting tools. By providing the Open Archives Initiative Protocol for Metadata Harvesting ([OAI-PMH](https://www.tethys.at/oai?verb=Identify)), any other data provider can harvest TETHYS metadata. An example is the Bielefeld Academic Search Engine BASE: https://www.base-search.net/Search/Results?q=coll:fttethysrdr&refid=dctablede. Matomo, an open-source web analytics platform that tracks user behavior on a website, is used to collect statistics data for TETHYS.
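+
+As an illustration of OAI-PMH harvesting, the endpoint linked above can be queried directly; a minimal sketch (oai_dc is the standard Dublin Core metadata prefix of the protocol):
+
+```bash
+# Identify the repository (same endpoint as linked above):
+curl -s "https://www.tethys.at/oai?verb=Identify"
+# Harvest metadata records in Dublin Core format:
+curl -s "https://www.tethys.at/oai?verb=ListRecords&metadataPrefix=oai_dc"
+```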
+
+# C14.2. The repository's strategy for multiple copies.
+
+IBM Spectrum Protect (formerly known as Tivoli Storage Manager or TSM) is used to protect the data stored in Tethys Repository. IBM Spectrum Protect provides a comprehensive backup and recovery solution for research data, which helps ensure data availability, integrity, and recoverability in case of a disaster or data loss. For TETHYS, up to 90 incremental versions of each data file are backed up and available for recovery. This means that there are up to 90 instances of the data, captured at different points in time. These versions are created based on the backup schedule and retention policies defined by the computer center of GeoSphere Austria.
+
+Database backup and recovery is also described in the [disaster management plan](https://gitea.geologie.ac.at/geolba/tethys.backend/wiki/DisasterManagement#create-db-backup).
+
+# C14.3. The risk management techniques used to inform the strategy.
+
+Risk management techniques for a research data repository involve identifying potential risks to the data, assessing the likelihood and impact of those risks, and implementing strategies to mitigate or manage those risks. The following risk management techniques are used for TETHYS:
+
+- **Access Controls** to protect against unauthorized access: Access to the administrative backend is limited to authorized users only. To provide a secure connection, only HTTPS is allowed. Fail2ban protects the repository server from brute-force attacks, denial-of-service attacks, and other malicious activities. It works by monitoring log files and dynamically updating firewall rules to block IP addresses that exhibit suspicious behavior.
+- **Back up data regularly**: Up to 10 incremental backups of the data are maintained to ensure that it can be recovered in case of data loss or corruption.
+- **Encrypt sensitive data**: Sensitive data, such as personally identifiable information and passwords, are encrypted inside the PostgreSQL database.
+- **Monitor activity logs**: Activity logs of the web server are monitored via Fail2ban to detect suspicious activities, such as unauthorized access attempts or data exfiltration.
+- **Implement data retention policies**: Policies for data retention and the data deletion process are implemented.
+- **Conduct regular security assessments**: The security of the repository is regularly assessed by the repository staff to identify potential vulnerabilities and implement strategies to address them.
+
+The potential disruptive physical threats, which can occur at any time and affect the normal business process, are listed below:
+
+- **Fire**: Fire suppression systems are installed, and there are fire and smoke detectors on all floors.
+- **Electric power failure**: Redundant UPS systems with standby generators are available (monitoring: 24/7).
+- **Communication network loss**: Unfortunately, there is no redundant repository server in case of network loss. By monitoring the network in real time and receiving alerts when network loss is detected, the IT department can quickly investigate and resolve issues before they impact end users.
+- **Flood**: All critical equipment is located on the 2nd floor.
+- **Sabotage**: Only authorized IT personnel have access to the server room.
+
+# C14.4. Procedures for handling and monitoring deterioration of storage media.
+
+TETHYS RDR calculates internal checksums during the ingestion workflow. These checksums ensure that ingested data has not been altered or damaged. A checksum is a short sequence of bits or bytes that is calculated from the data using an algorithm. If even a single bit of the data changes, the checksum also changes, indicating that the data has been changed unintentionally on the file store. During the file upload, TETHYS calculates and stores **md5** and **sha512** checksums for each file.
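+
+For illustration, these digests can be recomputed on the file server with standard tools and compared against the values stored in the database; a minimal sketch (the file path below is a hypothetical example):
+
+```bash
+# Recompute both fixity checksums for a stored data file.
+FILE=/storage/app/public/files/268/map_sheet.tif   # hypothetical path
+md5sum "$FILE"      # md5 digest, as stored in the database
+sha512sum "$FILE"   # sha512 digest, as stored in the database
+```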
Notify the submitter: If the dataset or metadata record is found to be in violation, the repository may attempt to notify the depositor of the issue and the reason for removal. +1. Remove the dataset or metadata record: If the violation is confirmed, the repository administrator removes the dataset or metadata record from its collection. +1. Document the removal: The repository administrator documents the removal of the dataset or metadata record, including the reason for removal and any communications with the submitter. +1. Review and revise policies: The repository may review and revise its policies and procedures to prevent similar violations from occurring in the future. + +![Metadata_deletion_process](./images/metadata_deletion_process.drawio.svg) + +# C14.6. Any checks (i.e. fixity checks) used to verify that a digital object has not been altered or corrupted from deposit to use. + +For internal fixity checks, TETHYS Repository operates an automated cron job that routinely tests all the md5 and sha512-checksums for data stored by the TETHYS Repository and produces a regular report providing appropriate warnings if a silent data corruption is detected in the storage layer. Corresponding code of the cron job can downloaded via [TEHYS Code repository](https://gitea.geologie.ac.at/geolba/tethys.backend/src/branch/master/commands/ValidateChecksum.ts). + +If the web backend is in production mode, the logger writes the error messages as a mail to the administrator. Depending on these warnings, the administrators will investigate the cause of the changes and the corrupted files will be restored from the backup (IBM Spectrum Protect). diff --git a/DisasterManagement.md b/DisasterManagement.md new file mode 100644 index 0000000..83e8d26 --- /dev/null +++ b/DisasterManagement.md @@ -0,0 +1,219 @@ +
diff --git a/DisasterManagement.md b/DisasterManagement.md
new file mode 100644
index 0000000..83e8d26
--- /dev/null
+++ b/DisasterManagement.md
@@ -0,0 +1,219 @@
+# File Restore with the command line tool
+
+"IBM Spectrum Protect" is a data protection and recovery software solution that helps to manage backup and recovery operations.
+
+For TETHYS, up to **90 incremental versions** of each data file are backed up and available for recovery. This means that there are up to 90 instances of the data that have been backed up at different points in time. These versions are typically created based on the backup schedule and retention policies defined by the computer center of GeoSphere Austria.
+
+## Recovery Tool
+
+Command line restore examples for version 8.1.17 can be found here:\
+https://www.ibm.com/docs/en/spectrum-protect/8.1.17?topic=data-command-line-restore-examples
+
+How to:
+
+- change into the directory of the IBM backup-archive client:
+  ```bash
+  cd /opt/tivoli/tsm/client/ba/bin
+  ```
+- start the backup client with admin rights:
+  ```bash
+  sudo dsmc
+  ```
+- general help:
+  ```bash
+  Protect> help restore
+  ```
+
+## Examples
+
+### **1. Fixity check fails for one file:**
+
+**Restoring a single file from a selection of existing versions**
+
+```bash
+Protect> restore /etc/hosts -pick -inactive /tmp/restored/
+```
+
+This means: restore the file /etc/hosts to a new target directory "/tmp/restored/" with manual selection (-pick) of the desired versions. The "-inactive" option is necessary so that already deleted or changed versions of the file are displayed as well.
+
+The command produces the output shown below. The column A/I means:
+
+- A = last valid version (active)
+- I = inactive versions that are no longer on the productive disk
+
+You select the required version by entering the selection number at the front. In the example, the older of the two versions (2) is to be restored.
+
+```bash
+Protect> restore /etc/hosts -pick -inactive /tmp/restored/
+Restore function invoked.
+
+TSM Scrollable PICK Window - Restore
+
+     #    Backup Date/Time        File Size  A/I  File
+   --------------------------------------------------------------------------------
+   1. | 12/05/2023 13:18:21       1.69 KB    A   /etc/hosts
+   2. | 11/21/2023 10:16:19        726  B    I   /etc/hosts
+      |
+    0---------10--------20--------30--------40--------50--------60--------70--------80------
+<U>=Up  <D>=Down  <T>=Top  <B>=Bottom  <R#>=Right  <L#>=Left
+<G#>=Goto Line #  <#>=Toggle Entry  <+>=Select All  <->=Deselect All
+<#:#+>=Select A Range <#:#->=Deselect A Range  <O>=Ok  <C>=Cancel
+
+pick> 2
+```
+
+### **2. Fixity check fails for a whole dataset (all files in the dataset folder):**
+
+**Restore a directory with the backup state of a certain date to a new target directory**
+
+```bash
+Protect> restore /storage/app/public/files/268/ -sub=yes -inactive -pitd=10/20/21 /tmp/268/
+```
+
+This restores the complete directory "/storage/app/public/files/268/", including all subdirectories (parameter -sub=yes), as it was on October 20, 2021, to the directory "/tmp/268/" (which must be created beforehand). If no new folder is defined, the folder "/storage/app/public/files/268/" will be overwritten. There is no prompt; the restore starts immediately.
+
+### **3. Restore a file in different/all versions to different result files**
+
+First, **query which versions** exist:
+
+```bash
+Protect> q backup -inactive -subdir=yes /home/testuser/xxxx.txt
+```
+
+Response:
+
+```bash
+     Size      Backup Date           Mgmt Class  A/I  File
+     ----      -----------           ----------  ---  ----
+    447  B  10/20/2021 11:04:15      DEFAULT      A   /home/testuser/xxxx.txt
+    270  B  10/16/2021 11:18:37      DEFAULT      I   /home/testuser/xxxx.txt
+    418  B  10/17/2021 11:16:20      DEFAULT      I   /home/testuser/xxxx.txt
+```
+
+There are 3 versions of the file "xxxx.txt"; the backup date and time are displayed for each. You can therefore restore the respective versions individually, if necessary also under a different name, e.g.:
+
+```bash
+Protect> restore -inactive -subdir=yes -pitd=10/17/2021 /home/testuser/xxxx.txt /restore_folder_/xxxx_vom17oct2021.txt
+
+Protect> restore -inactive -subdir=yes -pitd=10/16/2021 /home/testuser/xxxx.txt /restore_folder_/xxxx_vom16oct2021.txt
+```
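+
+Before restoring repository data, the same query can be used to list which backup versions exist for a TETHYS dataset folder; a minimal sketch using the dataset path from example 2:
+
+```bash
+# List all active and inactive backup versions of the files of dataset 268.
+Protect> q backup -inactive -subdir=yes "/storage/app/public/files/268/*"
+```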
+
+# Metadata (Database) Backup
+
+A common backup and restore configuration is an environment with two database servers: the database is backed up on the first server, transferred, and restored on the second one (see "Restore DB on second DB-Server" below).
+
+![Image](./db_backup.drawio.svg)
+
+## Create DB backup
+
+To create a backup of a research database using pg_dump, you can follow these steps:
+
+1. Open a terminal or command prompt and navigate to the directory where you want to store the backup file.
+2. Type the following command to create a backup of the database:
+
+   ```bash
+   pg_dump -U <username> -E UTF8 -F c -b -v -f <backup_file_name>.backup <database_name>
+   ```
+
+   Here, replace **username** with the username of the database, **backup_file_name** with the name you want to give to the backup file, and **database_name** with the name of the research database you want to back up.
+
+   -E sets the encoding, e.g. UTF8 \
+   -F c specifies the custom backup format, which is more flexible than the plain SQL text format. \
+   -b includes large objects (blobs) in the backup. \
+   -v enables verbose mode, which displays the progress of the backup process.
+
+   Usually the timestamp is integrated into the backup_file_name:
+
+   ```bash
+   pg_dump -U <username> -E UTF8 -F c -b -v -f <backup_file_name>_$(date +"%Y%m%d_%H%M%S").backup <database_name>
+   # e.g.:
+   pg_dump -U tethys_user -E UTF8 -F c -b -v -f /home/user/backups/tethys_$(date +"%Y%m%d").backup tethys
+   ```
+
+3. When you run the command, you will be prompted to enter the password for the database user. Enter the password and press Enter.
+4. The pg_dump command will start creating the backup file. Wait until it completes, which may take several minutes depending on the size of the database.
+5. Once the backup is complete, you will see a message indicating that the backup has been created successfully.
+
+It's important to note that the backup file may contain sensitive information, so it should be stored securely and only accessible to authorized personnel.
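+
+The dump can also be scheduled; a minimal sketch of a crontab entry (the schedule, paths, and the 30-day retention are assumptions, and % must be escaped as \% in crontab syntax):
+
+```bash
+# Hypothetical crontab entry: nightly dump at 02:30 with a simple 30-day retention.
+30 2 * * * pg_dump -U tethys_user -E UTF8 -F c -b -f /home/user/backups/tethys_$(date +\%Y\%m\%d).backup tethys && find /home/user/backups -name 'tethys_*.backup' -mtime +30 -delete
+```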
+
+## Restore DB on second DB-Server
+
+To restore a backup of the research database using pg_restore, you can follow these steps:
+
+1. Open a terminal or command prompt and navigate to the directory where the backup file is located.
+2. Type the following command to restore the backup to the database:
+
+   ```bash
+   pg_restore -U <username> -d <database_name> <backup_file_name>
+   ```
+
+   Here, replace **username** with the username of the database, **database_name** with the name of the research database you want to restore the backup to, and **backup_file_name** with the name of the backup file.
+3. When you run the command, you will be prompted to enter the password for the database user. Enter the password and press Enter.
+4. The pg_restore command will start restoring the backup file to the database. Wait until it completes, which may take several minutes depending on the size of the database.
+
+e.g.:
+
+```bash
+pg_restore -h localhost -p 5432 -U tethys_user -d tethys_db -v /tmp/tethys_20230404.backup
+```
+
+Once the restore process is complete, the research database should be fully recovered with all its data and objects.
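+
+As a quick plausibility check after the restore, the tables and row counts can be inspected; a minimal sketch ("documents" stands in for any real table of the schema):
+
+```bash
+# List the restored tables, then spot-check a row count.
+psql -h localhost -p 5432 -U tethys_user -d tethys_db -c "\dt"
+psql -h localhost -p 5432 -U tethys_user -d tethys_db -c "SELECT count(*) FROM documents;"
+```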
+
+
+# Restore API and Web Frontend from Docker Images
+
+All Tethys Docker images are stored securely in our Gitea instance using the built-in Docker Registry functionality.
+
+...
\ No newline at end of file