Research projects

KEEP was born from an R&D platform and remains active in the production of scientific knowledge, as shown by the numerous publications and scientific events in which KEEP SOLUTIONS has been involved.

KEEP SOLUTIONS strategically embraces research and scientific development by promoting research and actively participating in national and international R&D projects.

We have collaborated in research projects with national and international institutions, including the Technical University of Vienna, the Austrian Institute of Technology, Microsoft Research, the Technical University of Berlin, the University of Manchester, the University Pierre and Marie Curie, the British Library, the Austrian National Library, the National Library of Denmark and the Portuguese National Archives.

European projects

The CEF eArchiving Building Block is based on the outcomes of the E-ARK project (2014 – 2017). The project, which involved a number of European national archives, e-government agencies, digital preservation software developers and research institutions, aimed to synthesise best practices from across Europe and develop a core set of interoperability specifications for archival operations. The project also updated a number of archival software components to meet the specifications, and carried out several operational pilots to verify the cross-border validity and interoperability of the specifications.

After the successful conclusion of the E-ARK project in 2017, the European Commission incorporated its outcomes into the CEF programme as the basis of the eArchiving Building Block. Within CEF, the initial specifications and open-source software components are being enriched with training and implementation guidance services. The mid-term vision of eArchiving is that within the next five years its specifications will be implemented throughout Europe across all sectors.

Type: European Union project – CEF-TC-2019-3
Year: 2019-2021

E-ARK4ALL is supporting the development of the CEF eArchiving Building Block. The eArchiving Building Block is a solution to long-term information assurance. It provides the specifications, reference software, training and service desk support for digital archiving. This benefits both the design of repositories as digital archiving and preservation systems and the enabling of business systems to send data to those repositories.

Data drives the global economy, but to remain accessible over time, it frequently needs to be migrated between successive generations of software. That can occur because software needs to be updated, replaced or decommissioned. Such migrations always incur costs and bring risks concerning data integrity and information assurance. This is where eArchiving can help.

The foundation of eArchiving is the set of Information Package specifications. These describe platform-independent formats for storing information assets as bulk data and metadata that remain authentic and understandable over the long term. These specifications are based on the accumulated knowledge and experience of a large international community of researchers and practitioners. They are thus well suited to long-term information assurance, supporting:

  • migrating data between generations of business information systems;
  • transferring data to dedicated archival repositories;
  • managing data in repositories concerning its long-term preservation;
  • reusing data over the long term, independently of the business systems.

eArchiving also provides a set of sample software components for several scenarios and business environments. With these components any organisation can develop its own institutional archiving and preservation ecosystem, or develop standardised workflows for delivering content from its business systems to external repositories. Using eArchiving guidelines guarantees that information will still be available and reusable for as long as it is required.

Type: European Union project – CEF-TC-2018-15 eArchiving
Year: 2018-2019

veraPDF

Designed to meet the needs of digital preservationists, and supported by leading members of the PDF software developer community, veraPDF is a purpose-built, open source, permissively licensed file-format validator covering all PDF/A parts and conformance levels.

Led by the Open Preservation Foundation (OPF) and the PDF Association, and assisted by the Digital Preservation Coalition, the consortium’s mission is to develop the definitive, open-source validator for PDF/A. The veraPDF consortium has retained two subcontractors to provide and quality-control software and test files. Lead developer Dual Lab specializes in technology-intensive application development, while KEEP Solutions focuses on open source solutions for archival institutions.

veraPDF is funded by the PREFORMA project. PREFORMA – PREservation FORMAts for culture information/e-archives, is a Pre-Commercial Procurement (PCP) project co-funded by the European Commission under its FP7-ICT Programme. The project’s main aim is to address the challenge of implementing standardised file formats for preserving digital objects in the long term, giving memory institutions full control over the acceptance and management of preservation files into digital repositories.

Type: European Union project – ICT-2013.11.2
Year: 2014-2017

E-ARK

Archives provide an indispensable component of the digital ecosystem by safeguarding information and enabling access to it. Harmonisation of currently fragmented archival approaches is required to provide the economies of scale necessary for general adoption of end-to-end solutions. There is a critical need for an overarching methodology addressing business and operational issues, and technical solutions for ingest, preservation and re-use.

In co-operation with commercial systems providers, E-ARK will create and pilot a pan-European methodology for electronic document archiving, synthesising existing national and international best practices, that will keep records and databases authentic and usable over time.

The methodology will be implemented in an open pilot in various national contexts, using existing, near-to-market tools, and services developed by the partners. This will allow memory institutions and their clients (public- and private-sector) to assess, in an operational context, the suitability of those state-of-the-art technologies.

Our objective is to provide a single, scalable, robust approach capable of meeting the needs of diverse organisations, public and private, large and small, and able to support complex data types. E-ARK will demonstrate the potential benefits for public administrations, public agencies, public services, citizens and business by providing simple, efficient access to the workflows for the three main activities of an archive – acquiring, preserving and enabling re-use of information.

The practices developed within the project will reduce the risk of information loss due to unsuitable approaches to keeping and archiving of records. The project will be public facing, providing a fully operational archival service, and access to information for its users. The project results will be generic and scalable in order to build an archival infrastructure across the EU and in environments where different legal systems and records management traditions apply. E-ARK will provide new types of access for business users.

E-ARK will pilot an end-to-end OAIS-compliant e-archival service covering ingest, vendor-neutral archiving, and reuse of structured and unstructured data, thus covering both databases and records, addressing the needs of data subjects, owners and users. The pilot and methodology will also focus on the essential pre-ingest phase of data export and normalisation in source systems. The pilot will integrate tools currently in use in partner organisations, and provide a framework for providers of these and similar tools ensuring compatibility and interoperability. A core component of the project is the integration platform which uses the existing ESSArch Preservation Platform (EPP) application as an Archival Information System, which is already in productive deployment at the National Archives of Norway and Sweden. In order to achieve scalability, E-ARK will adopt a data management and storage layer for this tool on top of the proven open-source Cloudera CDH4 distribution of Apache Hadoop, enabling storage and computational power to be seamlessly added to the system.

The pilot will run in several national archives, each of which will provide data to run in the pilot instance by agreement from an associated government data owner (e.g. national or regional / federal).

Project partner the DLM Forum, comprising 22 national archives and associated commercial and technical providers, is well placed to sustain the outputs of our project. Using the open Apache licensing model, commercial suppliers will be able to incorporate the project outputs (particularly the open interfaces for pre-ingest, ingest, archival, access and re-use) into their own systems, enhancing their longevity. National archives running E-ARK pilot instances will serve as exemplars for others wanting to adopt the new open e-archiving system.

In addition, project partner the Digital Preservation Coalition will promote best practices in this area, as will our dedicated government institution partners.

Type: European Union project – FP7 CIP-ICT-PSP-2013-7
Year: 2014-2017

The Collaboration to Clarify the Costs of Curation (4C) project will help organisations across Europe to more effectively invest in digital curation and preservation. Making an investment inevitably involves a cost and existing research on cost modelling provides the starting point for the 4C work. But the point of an investment is to realise a benefit, so work on cost must also focus on benefit, which must then encompass related concepts such as ‘risk’, ‘value’, ‘quality’ and ‘sustainability’. Organisations that understand this will be more able to effectively control and manage their digital assets over time, but they may also be able to create new cost-effective solutions and services for others.

Existing research into cost modelling is far from complete. There has been little uptake of the tools and methods that have been developed, and very little integration into other digital curation processes. The main objective of the 4C project is, therefore, to ensure that, where existing work is relevant, stakeholders realise this and understand how to employ those resources. An additional aim of the work is to closely examine how these resources might be made more fit-for-purpose, relevant and usable by a wide range of organisations operating at different scales in both the public and the private sector.

These objectives will be achieved by a coordinated programme of outreach and engagement that will identify existing and emerging research and analyse user requirements. This will inform an assessment of where there are gaps in the current provision of tools, frameworks and models. The project will support stakeholders to better understand and articulate their requirements and will clarify some of the complexity of the relationships between cost and other factors. The outputs of this project will include various stakeholder engagement and dissemination events (focus groups, workshops, a conference), a series of reports, the creation of models and specifications, and the establishment of an international Curation Costs Exchange framework. All of this activity will enable the definition of a research and development agenda and a business engagement strategy which will be delivered to the European Commission in the form of a roadmap.

The consortium undertaking this project includes organisations with extensive domain expertise and experience with curation cost modelling issues. It includes national libraries and archives, specialist preservation and curation membership organisations, service providers, research departments and SMEs. It will be coordinated by a national funding organisation that specialises in supporting the innovative use of ICT methods and technologies.

Type: European Union project – FP7 ICT-2011.4.3
Year: 2013-2015

The SCAPE project will develop scalable services for planning and execution of institutional preservation strategies on an open source platform that orchestrates semi-automated workflows for large-scale, heterogeneous collections of complex digital objects. SCAPE will enhance the state of the art of digital preservation in three ways: by developing infrastructure and tools for scalable preservation actions; by providing a framework for automated, quality-assured preservation workflows and by integrating these components with a policy-based preservation planning and watch system. These concrete project results will be validated within three large-scale Testbeds from diverse application areas.

SCAPE approaches digital preservation through research and development sub-projects: Testbeds, Preservation Components, Platform, and Planning and Watch.

The SCAPE Testbeds are the primary driver for the rest of the project, in that they define use case scenarios, create preservation workflows, and assess the large scale applicability of the SCAPE Preservation Platform and the preservation components developed within the project. Using these software components, test environments are created for the different scenarios and the complex large scale preservation workflows.

SCAPE Preservation Components address known limitations of digital preservation systems on three levels: scalability, functional coverage and quality. This sub-project improves and extends existing tools, develops new ones where necessary, and applies proven approaches to the problem of ensuring quality in digital preservation.

Building on the state of the art and focusing on formats and tools that are considered most important by the Testbeds sub-project, SCAPE investigates methods to parallelise and embed components in robust and scalable workflows. A major focus is the ability to capture relevant provenance and contextual information and metadata, and to provide usable outputs for automated policy-driven preservation.

The SCAPE Platform will provide an extensible infrastructure for the execution of digital preservation processes on large volumes of data. It will include a flexible mechanism for the integration of existing digital repository systems and provide a reference implementation. The Preservation Platform will also provide the underlying environment for large-scale testing and evaluation performed by the Testbeds and the Preservation Component providers in the project. The computational layer of the Preservation Platform system will make use of Hadoop, with the underlying distributed storage layer being based on HBase, which provides high performance and scalable data storage on top of Hadoop’s Distributed File System (HDFS).

The Planning and Watch Components developed in SCAPE address the bottleneck of decision processes and processing information required for decision making. Work on these components started with a conceptual analysis, based on extensive real-world application experience. A set of essential policy elements is being defined and modelled. These elements will make use of the SCAPE Policy Catalogue. Building on SCAPE’s machine-understandable policy representation and the first release of the automated planning component, core watch services will be implemented. In the final phase the policy-aware planning component will be fully integrated with the platform and repository operations.

The Cross-project Activities in SCAPE include project management and coordination as well as the investigation of Open Research Challenges and a Research Roadmap. These activities provide administrative control and technical coordination for the project as well as focused research on innovative and emerging technologies having the potential to improve SCAPE’s capabilities.

The project’s Take-up Activities aim to coordinate the communication and dissemination of project results both within and beyond the project. A number of training activities, which will also incorporate Best Practice guidelines, are aimed at fostering the take-up of project outputs at technical, operational and strategic levels. Furthermore, they will ensure that SCAPE has a long-term and sustained impact beyond the runtime of the project.

Type: European Union project – FP7 ICT-2009.4.1
Year: 2011-2014

PhD projects

Automated Watch for Digital Preservation

The current exponential growth in digitally created documents is an obvious effect of the global trend towards digital technology. Replacing paper with digital documents has become common practice in all kinds of institutions, and many have already eliminated the use of paper entirely. Even European policies, such as eGovernment, urge public administrations to cease using paper and to provide all services and documentation in digital form.

But documents in digital form are much more perishable than their paper counterparts, and it is not obvious to the ordinary user that keeping a digital document accessible for several decades is a very difficult task. Furthermore, some aspects that an ordinary user would consider preserved when keeping the physical paper do not behave the same way when the information is in digital form. Authenticity is one of these aspects, and it is crucial, as the information has no value worth keeping if its power to serve as evidence is lost. The digital preservation field tries to tackle all these problems and is currently one of the main concerns of European research efforts, such as the Seventh Framework Programme (FP7).

The main difficulty of digital preservation lies in the ever-changing technological environment with which the documents must remain compatible. Part of the solution must involve detecting these changes by continuously monitoring the environment, the users and the documents for preservation risks. This PhD project focuses on creating automatic and systematic ways to monitor the environment and provide valuable input for risk detection and assessment.

Author: Luís Faria
Year: 2011-2017

Long-term preservation of digital information in the context of a historical archive

During the second half of the 20th century, mankind has passively witnessed the worldwide proliferation of digital technologies. These technologies are currently present in every aspect of today’s civilized life and natively support a great deal of human activities. Distinct actions such as telling the time or planning a mission to Mars are now entirely supported by digital technologies. This growth has been accompanied by an overwhelming expansion of digital information.

Digital information has many advantages over traditional analogue information. However, it carries a structural problem that may hinder its accessibility in the long run. Digital information requires the presence of a technological environment (hardware and/or software) in order to be adequately rendered for human consumption. This technological dependency makes it vulnerable to the rapid evolution of digital technologies as well as to technological ruptures caused by developments that are not backward compatible.

To ensure continuous access to digital information, several strategies have been proposed: emulation, format migration, encapsulation, etc. However, there is still a great deal of work to be done to make these processes more automatic and user-friendly. Moreover, issues regarding the authenticity of digital materials have always been a concern for information science professionals.

This thesis aims at solving the previously outlined issues, focusing especially on the automation of migration-based preservation strategies. In order to accomplish this goal, we have developed a Service Oriented Architecture (SOA) specially designed to assist cultural heritage institutions in the implementation of preservation interventions. The proposed SOA delivers a recommendation service and a method to carry out complex format migrations. The recommendation service is supported by three evaluation components that assess the quality of every migration intervention in terms of its performance, suitability of involved formats and data loss. The proposed system is also able to produce preservation metadata that can be used by any client institution to document preservation interventions and retain objects’ authenticity.
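
As a minimal sketch of how a recommendation service of this kind might rank candidate migration services, the Python example below combines the three evaluation criteria named above (performance, suitability of involved formats and data loss) into a single weighted score. The text does not say how the thesis combines these components, so the weighted sum, the `MigrationReport` class, the `rank_services` function, the service names and all numbers are illustrative assumptions, not the actual system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MigrationReport:
    """Hypothetical evaluation result for one migration service (all scores in 0..1)."""
    service: str
    performance: float          # higher means faster / cheaper
    format_suitability: float   # higher means a better target format
    data_preservation: float    # 1 minus the measured data loss


def rank_services(reports, weights):
    """Order reports by a weighted sum of the three criteria (an assumed scoring rule)."""
    def score(report):
        return (
            weights["performance"] * report.performance
            + weights["format_suitability"] * report.format_suitability
            + weights["data_preservation"] * report.data_preservation
        )
    return sorted(reports, key=score, reverse=True)


# Made-up reports for three hypothetical services.
reports = [
    MigrationReport("fast-converter", 0.9, 0.5, 0.6),
    MigrationReport("careful-converter", 0.6, 0.8, 0.95),
    MigrationReport("archival-converter", 0.4, 0.9, 0.7),
]

# A client institution that cares most about avoiding data loss.
weights = {"performance": 0.2, "format_suitability": 0.3, "data_preservation": 0.5}

ranked = rank_services(reports, weights)
for report in ranked:
    print(report.service)
```

Changing the weights models a different client institution: one that values speed over fidelity would see the ranking shift accordingly.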

The system has been evaluated for its ability to suggest migration services that best satisfy the preservation requirements of any given client institution. The evaluation process also focused on the system’s ability to determine the level of degradation imposed on a digital object during a migration process, especially regarding its subjective significant properties, i.e., pixel correctness and embedded metadata.

The system was evaluated using datasets of raster images encoded in several formats. The results of this research show that the proposed system is capable of effectively calculating the similarity between digital images, revealing a correlation above 0.81 between automatic similarity algorithms and the mean opinion scores provided by human evaluators. Regarding the system’s ability to determine the level of degradation of the image metadata, the system showed correction values above 0.96 while using a modified version of the Jaccard similarity metric.
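
For readers unfamiliar with the Jaccard metric, the following Python sketch shows the standard, unmodified index applied to the metadata of an image before and after migration. The thesis used a modified version whose details are not given here, so this is only the textbook form; the `jaccard_similarity` function and the sample metadata fields are hypothetical.

```python
def jaccard_similarity(before: dict, after: dict) -> float:
    """Standard Jaccard index over the (field, value) pairs of two metadata records."""
    pairs_before = set(before.items())
    pairs_after = set(after.items())
    if not pairs_before and not pairs_after:
        return 1.0  # two empty records are treated as identical
    shared = pairs_before & pairs_after
    combined = pairs_before | pairs_after
    return len(shared) / len(combined)


# Made-up embedded metadata: the migration dropped the "Artist" field.
original = {
    "Make": "ExampleCam",
    "Model": "X100",
    "DateTime": "2007:05:01 10:00:00",
    "Artist": "J. Doe",
}
migrated = {
    "Make": "ExampleCam",
    "Model": "X100",
    "DateTime": "2007:05:01 10:00:00",
}

score = jaccard_similarity(original, migrated)
print(score)  # 3 shared pairs out of 4 distinct pairs
```

A score of 1.0 means every field and value survived the migration unchanged; lower scores indicate fields that were lost, added or altered.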

The recommendation system showed a level of correlation of 0.68 to 0.85 (with a maximum precision of 34.9%) when suggestions based on previously executed migrations were compared with the ideal rankings of migration services calculated specifically for a given object.

The main contributions of this research are: the ability to preserve digital information using a format migration strategy without having to deploy complex migration systems; the ability to obtain detailed migration reports that document the entire preservation intervention and can be used as preservation metadata to ensure information authenticity; and the possibility of comparing and assessing different migration options and objectively choosing the one that best satisfies a client institution.

Author: Miguel Ferreira
Year: 2005-2008