DAP will provide you with a central search repository
DAP will provide you with long-term data protection against damage
DAP will allow you to open old data even in the far future
DAP's logical architecture offers a selection of modules based on your requirements for the final solution. You can select modules to create a repository with a fast and customizable search feature, an LTP archive with a dedicated format database and tools, a web archive with automated web harvesting, or all of these modules together as one product. There is also a module supporting semi-automated operations with an LTO tape library when working with large amounts of data. The modularity goes deeper still: some modules are themselves built as a combination of configurable plug-ins. DAP is also scalable in terms of its physical architecture (infrastructure): you can build it as your local archive, as a server farm offering data archiving and data services to others, or as a confederated archive serving the needs of several institutions in your country or corporation.
The workflow implementation, named Framework, is a core system module that uses a specific configuration for each process. This configuration is composed of several steps that run one after another. Each step is implemented as a standalone executable, a script or a plug-in. This approach makes it possible to add any third-party software or utility and integrate it into an existing workflow process. We use this method in file format processes, where format identification is executed by the DROID application and format validation runs through the JHOVE API. During format validation JHOVE uses a specific plug-in for each format. The list of allowed or forbidden formats is defined by your own format strategy and can be controlled globally, at the process level (ingestion, dissemination, web harvesting, upload) or even at the user profile level.
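The step-by-step idea behind such a workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Framework configuration format: the names `run_workflow`, `tag_format` and `ingest_steps` are hypothetical, a command list stands in for a standalone executable or script, and a Python function stands in for a plug-in.

```python
import subprocess
import sys
from typing import Callable, List, Union

# A step is either a command line (standalone executable or script)
# or a Python callable standing in for a plug-in. Both forms are hypothetical.
Step = Union[List[str], Callable[[dict], None]]


def run_workflow(steps: List[Step], context: dict) -> dict:
    """Run the configured steps one by one; the first failure stops the run."""
    for step in steps:
        if callable(step):
            step(context)  # plug-in style step, updates the shared context
        else:
            # External executable or script; check=True raises on a non-zero exit code.
            done = subprocess.run(step, capture_output=True, text=True, check=True)
            context.setdefault("outputs", []).append(done.stdout.strip())
    return context


def tag_format(context: dict) -> None:
    # Placeholder plug-in: a real step would call a format identification tool here.
    context["format"] = "unknown"


ingest_steps: List[Step] = [
    [sys.executable, "-c", "print('checksum ok')"],  # stands in for a third-party utility
    tag_format,
]

result = run_workflow(ingest_steps, {"package": "example-sip"})
print(result)
```

Because every step has the same shape, a third-party tool can be added by appending one more entry to the list, which is the property the paragraph above describes.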
DAP logical architecture is based on OAIS ISO 14721. This standard covers the architecture and also provides guidance on how to design archival processes, data management, staff or user tasks and also provides the requirements for describing an archive as an Open Archive. DAP offers tools and user interfaces to fulfil these needs and so it is fully compliant with this standard.
DAP contains tools for long-term preservation (LTP). The most important component of LTP functionality in DAP is the Format Database. This database contains the entire history of file formats used in the system, all their versions and their relation to files (data) in the system. The database is populated mostly by the format identification and format validation processes. This module also supports the semi-automated format conversion process. The system or the operator can flag formats as risky by reference to the international format registry (PRONOM), and the operator should then assess the impact on the archived data.
The DAP module responsible for all data communication between archival storage and input/output channels is called ImpEx (Import-Export). It serves as an interconnection between live storage and cold (media) storage and therefore is used in the synchronization process. It is also used during ingestion, dissemination processes or even during restoration of corrupted data processes. The module currently uses physical and network disk drives, LTO tape library tools and the WebDAV method, but the functionality of this module allows its integration into any other type of media/storage.
The DAP Repository module offers advanced features for cataloguing, curating and metadata enrichment to achieve the best user experience combined with the best practices in digital repository systems. All metadata about objects are stored in the MARC21 standard and bibliographic records use the WEMI structure inspired by the FRBR model. The repository is also fully integrated with the web archive and contains records of all web harvest runs, the live web catalogue and metadata about collected web sites. This approach provides the complete history (life cycle) of a website as its content, metadata or even owner changed during a specific time period. Indexing these data makes it possible to search and browse archived web content quickly and precisely.
DAP can be used as your central repository thanks to its ease of integration with other systems via web services. Identifiers for external systems are stored using the MARC21 standard and linked to an internal URN:NBN identifier. This creates a relationship between the two objects without the need to make changes in both systems. Updating is done via the synchronization process. DAP uses the OAI-PMH protocol for metadata exchange, but others such as Z39.50 could be easily implemented.
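As an illustration of what metadata exchange over OAI-PMH looks like from the client side, here is a minimal sketch that builds a `ListRecords` request URL. The verb and the `metadataPrefix`, `from` and `set` arguments come from the OAI-PMH protocol itself; the endpoint `https://dap.example.org/oai` and the helper name `list_records_url` are assumptions made for the example, and the metadata prefixes a given DAP installation exposes are not stated here.

```python
from typing import Optional
from urllib.parse import urlencode


def list_records_url(
    base_url: str,
    metadata_prefix: str = "oai_dc",  # the prefix every OAI-PMH repository must support
    from_date: Optional[str] = None,
    set_spec: Optional[str] = None,
) -> str:
    """Build an OAI-PMH ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # only records changed since this date
    if set_spec:
        params["set"] = set_spec  # restrict harvesting to one set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only to show the shape of the request.
url = list_records_url("https://dap.example.org/oai", from_date="2024-01-01")
print(url)
```

An external system would fetch this URL periodically and use the `from` argument to pick up only records changed since its last synchronization.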
DAP has a web interface for curators to grant and manage access for external subjects for cooperative storage and sharing of their publications. According to OAIS principles this implies an open system for a designated community and the ability to grant and manage access to stored data. Collaboration between a curator and the rights holder (the publisher or the author) is managed through a register of contracts. A contract contains permission to store and share publications with a defined type of access: in-house or public. Every contract needs to be signed. Ownership is implemented as a link between stored objects and the system user. This allows owners to cooperate during the archival process of their publications.
The web archive module uses the same procedure to cooperate with content owners but, in addition, it contains an archival policy for web harvesting that respects robots.txt by default. Robots.txt gives web site owners the option to be excluded from any automated (robot-driven) web crawling. This policy can be overridden for a specific site, but the override must be approved by the owner and included in the contract. DAP is implemented to respect authors’ rights protected by EU copyright law and is prepared for use in legal deposit.
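The policy described above can be illustrated with Python's standard `urllib.robotparser` module. This is a sketch of the decision logic only, not how DAP or Heritrix implement it: the function `may_harvest`, the user agent name `dap-harvester` and the `contract_override` flag are hypothetical.

```python
from typing import List
from urllib.robotparser import RobotFileParser


def may_harvest(
    url: str,
    robots_lines: List[str],
    user_agent: str = "dap-harvester",  # hypothetical crawler name
    contract_override: bool = False,
) -> bool:
    """Respect robots.txt by default; skip the check only with a contract override."""
    if contract_override:
        # Assumption: the site owner approved the override in a signed contract.
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)  # parse robots.txt content that was already downloaded
    return parser.can_fetch(user_agent, url)


ROBOTS_TXT = ["User-agent: *", "Disallow: /private/"]

print(may_harvest("https://example.org/news/", ROBOTS_TXT))
print(may_harvest("https://example.org/private/report.html", ROBOTS_TXT))
print(may_harvest("https://example.org/private/report.html", ROBOTS_TXT, contract_override=True))
```

Keeping the override as an explicit per-site flag mirrors the requirement that any exception is tied to a contract rather than to a global setting.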
The DAP ImpEx (import-export) module allows data transfer from any type of storage that the system recognizes as a network or local drive. ImpEx is used during ingestion into an archive, dissemination from an archive and LTP processes. The online method uses the WebDAV protocol.
DAP is fully scalable both horizontally and vertically. This means that it is scalable by adding more computer power or hardware components and also by reconfiguring its software components to accept either large amounts of data (TB) in one package or a lot of small data packages at once while keeping its data throughput rate. When physical infrastructure components are shared for both input and output DAP can be configured to specify which will be used for input and output when the system needs to operate in both states at the same time.
An LTP archive is a living archive that keeps up with trends in file formats, monitors the validity of its data and performs format conversions to guarantee readable end-user content for its community. During the ingestion process, the system runs format identification to identify the version of the file format and then, depending on the identification output, selects the appropriate validator plug-in and runs format validation. If the file is valid, it is stored in the archive. Similar steps are used during LTP checks, which are run periodically to assure the validity of archived data over several years. Data about file formats and their validators are stored in a component called the Format Database (FMT DB), which keeps the entire history of each file format used and also provides a GUI for the operator. It records which formats are currently supported and which are risky (unsupported or being replaced). Every LTP archive should have its own format strategy and, in addition to supported and risky formats, maintain its own list of accepted formats. The FMT DB component also allows proprietary file formats and their validators to be added manually by the operator. There is also an option to plug in your own format identification to obtain a proprietary file format's ID during the format identification step.
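The identify, select validator, validate, store sequence can be sketched as follows. This is a toy model under stated assumptions: the magic-byte lookup stands in for DROID, the two validator functions stand in for validator plug-ins, the identifiers `demo/pdf` and `demo/png` are placeholders rather than real PRONOM PUIDs, and checking the accepted-formats list before validation is an assumed ordering.

```python
from typing import Callable, Dict, Optional, Set

# Toy signature table; a real system would get identifiers from format identification.
SIGNATURES: Dict[bytes, str] = {
    b"%PDF-": "demo/pdf",
    b"\x89PNG\r\n\x1a\n": "demo/png",
}


def identify(data: bytes) -> Optional[str]:
    """Return a placeholder format ID based on the leading magic bytes."""
    for magic, fmt_id in SIGNATURES.items():
        if data.startswith(magic):
            return fmt_id
    return None


def validate_pdf(data: bytes) -> bool:
    return b"%%EOF" in data  # toy check: an end-of-file marker is present


def validate_png(data: bytes) -> bool:
    return b"IEND" in data  # toy check: the closing chunk is present


# One validator per format ID, mirroring "one plug-in per format".
VALIDATORS: Dict[str, Callable[[bytes], bool]] = {
    "demo/pdf": validate_pdf,
    "demo/png": validate_png,
}

# Hypothetical format strategy: only these formats are accepted for ingestion.
ACCEPTED: Set[str] = {"demo/pdf"}


def ingest(data: bytes) -> str:
    fmt_id = identify(data)
    if fmt_id is None:
        return "rejected: unidentified format"
    if fmt_id not in ACCEPTED:
        return f"rejected: {fmt_id} not accepted by the format strategy"
    if not VALIDATORS[fmt_id](data):
        return f"rejected: {fmt_id} failed validation"
    return f"stored: {fmt_id}"


print(ingest(b"%PDF-1.4\n...\n%%EOF"))
print(ingest(b"%PDF-1.4 truncated"))
```

The same `ingest` checks could be rerun against stored files on a schedule, which is the idea behind the periodic LTP checks mentioned above.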
DAP architecture is based on a study of other solutions worldwide and takes their experience into account. The web archive solution combines leading open source software for web harvesting (Heritrix) and for viewing archived content (Open Wayback). Both are integrated into a single solution that extends their core functionality while keeping their source code unmodified. DAP uses its own web catalogue to create lists for web harvesting. This web catalogue is filled by an automated process called Discovery. Every web harvest run consists of several Heritrix instances, and each instance collects data from one specific URL so that all data from this URL are separated into standalone web archive packages (WARCs). DAP keeps metadata about each Heritrix harvest of a URL, so there is a complete harvest history for every URL. Every URL in the catalogue has its own dedicated storage space (folder) for archiving and, when it reaches a predefined quota, the system creates a package (SIP) for the LTP archive. De-duplication is also applied at the URL level, so data stored in the web archive or the LTP archive depend only on previous harvest runs of that particular URL. This gives data curators the option to manage all archived content at the URL level: they can restore data for a specific URL, create collections based on a list of URLs, or build such collections from a metadata search in the archive itself or through the web catalogue. The extension for Open Wayback (OWB) uses its native black list and white list functionality in standalone OWB instances covering three areas: public, in-house and curator access. These three areas are used when providing access based on contracts, licenses (CC, GPL, etc.) or respect for authors’ rights protected by EU copyright law.
The system automatically sends end users to a specific area based on their network location (IP address). The module (GUI) for managing this access uses the relationship between the catalogue record, the archived content (harvest runs) and the agreement with the author (via contract or public license).
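Routing by IP address can be sketched with Python's standard `ipaddress` module. The network ranges below are invented for the example, and treating curator access as a network range is an assumption; the text above states only that the area is chosen from the user's IP address.

```python
import ipaddress

# Hypothetical network ranges; a real installation would configure its own.
CURATOR_NETWORKS = [ipaddress.ip_network("10.20.30.0/24")]
IN_HOUSE_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def access_area(client_ip: str) -> str:
    """Pick the access area for a client, checking the most specific range first."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in CURATOR_NETWORKS):
        return "curator"
    if any(address in network for network in IN_HOUSE_NETWORKS):
        return "in-house"
    return "public"  # everyone else sees only publicly licensed content


for ip in ("10.20.30.7", "10.1.2.3", "203.0.113.9"):
    print(ip, "->", access_area(ip))
```

The curator range is tested before the in-house range because it is a subset of it; checking in the other order would never return the curator area.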
The Digital Archive Platform (DAP) is a modular solution developed by TEMPEST that lets you build a new system or integrate with any existing one you have. You can choose modern tools for long-term preservation, digital or web archiving, cataloguing, user and access management, or for providing content to end users on a web portal. Data curators have their own GUI for collecting, archiving, metadata enrichment, cooperative work, managing access, planning automated tasks and creating reports. Administrators can configure and scale performance, define top-level parameters and use tools for monitoring and auditing system components.
The Central Repository solution is suitable for customers looking for an information system for central evidence, management and search for digital objects or electronic documents. The Repository can be integrated with existing evidence to achieve a central access point and data uniformity and integrity between existing records.
The Archive for document preservation, or the LTP archive solution, is suited for customers looking for a living archive with data searchable and accessible over decades.
The Central Repository solution is suitable for customers looking for an information system for central evidence, management and search for digital objects or electronic documents. The Repository can be integrated with existing evidence to achieve a central access point and data uniformity and integrity between existing records.
A core component of this solution is a catalogue with all objects, their metadata, identifiers and references. Metadata are stored using the MARC21 standard. The repository's physical architecture is composed of a relational database for metadata, disk storage for digital objects, an application server for the web applications that work with the repository, and a portal web application that provides advanced search and access to content for end users.
When building digital archives some systems are designed to protect data via disk arrays or to store one or more copies on storage media. But there are some institutions that care more about protecting their data for a longer period than just the lifespan of disk drives or storage media. These customers care also about the readability of content over several years. These requirements lead them to focus on file formats, their standards and support in the future. To cover all these areas within your institution it is necessary to implement Long Term Preservation (LTP) processes.
In terms of long-term preservation, DAP implements principles for physical and logical data protection. Physical data (bit) protection is realized by a combination of specific hardware and firmware components. Logical protection is realized by preserving file formats. This type of preservation is divided into three separately run and evaluated processes: format identification, format validation and format conversion. DAP has a special component supporting these three processes called the Format Database. The Format Database contains a full register of currently supported file formats and the version history of every file format used since the date when the LTP archive went into production. These data are important for managing format preservation processes.
DAP uses software named DROID for running format identification. This application uses PUID identifiers as reference to the international format registry named PRONOM (nationalarchives.gov.uk). Format identification is one of the first steps of the ingestion process of SIP packages into the archive.
Format validation is a process that begins right after format identification. Based on the detected PUID, the system selects a format validation plug-in and proceeds to validate the file. The output of this process is stored as a PREMIS event in a METS file describing the AIP package structure.
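To give an idea of what such a record can look like, the sketch below builds a simplified PREMIS event element with Python's standard XML library. The element names follow the PREMIS 3 data dictionary, but which PREMIS version and which fields DAP actually writes is not stated here, so the helper `validation_event`, the identifier values and the placeholder format ID `demo/pdf` are assumptions. In a METS file the event would normally be wrapped in a provenance metadata section, which is omitted for brevity.

```python
import xml.etree.ElementTree as ET

PREMIS_NS = "http://www.loc.gov/premis/v3"
ET.register_namespace("premis", PREMIS_NS)


def _child(parent: ET.Element, name: str, text: str = "") -> ET.Element:
    """Append a namespaced PREMIS child element, optionally with text."""
    node = ET.SubElement(parent, f"{{{PREMIS_NS}}}{name}")
    if text:
        node.text = text
    return node


def validation_event(event_id: str, fmt_id: str, outcome: str, when: str) -> ET.Element:
    """Build a simplified PREMIS event describing one format validation run."""
    event = ET.Element(f"{{{PREMIS_NS}}}event")
    identifier = _child(event, "eventIdentifier")
    _child(identifier, "eventIdentifierType", "local")
    _child(identifier, "eventIdentifierValue", event_id)
    _child(event, "eventType", "validation")
    _child(event, "eventDateTime", when)
    detail = _child(event, "eventDetailInformation")
    _child(detail, "eventDetail", f"format validation for {fmt_id}")
    outcome_info = _child(event, "eventOutcomeInformation")
    _child(outcome_info, "eventOutcome", outcome)
    return event


# Hypothetical values, for illustration only.
event = validation_event("evt-0001", "demo/pdf", "pass", "2024-01-01T00:00:00Z")
print(ET.tostring(event, encoding="unicode"))
```

Recording the outcome as a structured event rather than a log line is what allows later LTP checks to compare results across the life of the package.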
Format conversion is implemented as a semi-automated process. The automated part of this process is the detection and recording of risky formats. A format is put into a risk state when it is no longer supported or has been replaced by a newer version or by another format. When a risky format is detected, a notification is sent to the operators. Then comes the manual part of the process, in which the operator decides whether or not to proceed with a format conversion. Format conversion is done outside the DAP platform. Newly converted files are stored as new versions of the previous files and are archived as new AIP packages with a reference to the original ones.
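The automated half of this process, detecting risky formats and producing notifications, can be sketched as below. The data model is hypothetical: `FormatRecord`, `risk_notifications` and the placeholder format IDs are invented for the example and do not reflect the actual Format Database schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FormatRecord:
    """Hypothetical Format Database entry for one file format version."""
    fmt_id: str
    supported: bool = True
    superseded_by: Optional[str] = None

    @property
    def at_risk(self) -> bool:
        # Risky means no longer supported, or replaced by a newer version or format.
        return not self.supported or self.superseded_by is not None


def risk_notifications(registry: Dict[str, FormatRecord], holdings: Dict[str, str]) -> List[str]:
    """Return one message per archived file whose format is in a risk state."""
    messages = []
    for file_name, fmt_id in sorted(holdings.items()):
        record = registry[fmt_id]
        if not record.at_risk:
            continue
        if not record.supported:
            reason = "no longer supported"
        else:
            reason = f"superseded by {record.superseded_by}"
        messages.append(f"{file_name}: {fmt_id} is {reason}")
    return messages


registry = {
    "demo/doc-1": FormatRecord("demo/doc-1", superseded_by="demo/doc-2"),
    "demo/doc-2": FormatRecord("demo/doc-2"),
    "demo/old": FormatRecord("demo/old", supported=False),
}
holdings = {"report.doc": "demo/doc-1", "minutes.doc": "demo/doc-2", "scan.old": "demo/old"}

for message in risk_notifications(registry, holdings):
    print(message)
```

The function only reports; deciding whether to convert stays with the operator, which matches the manual half of the process described above.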
Saving a document in the archive
Searching and opening a document
File format change identification
Conversion of a document to a new format
Searching and opening a document
Digitization is one of the key processes in document archiving and preservation. Have a look at our competences and experience in a short video.