Separate research databases successfully federated in what is believed to be a UK first

Researchers can securely work with multiple remote healthcare databases without any source data needing to be moved

Published: 28th September 2022

A Cambridge-based project has successfully federated the analysis of genomic healthcare data, in what is believed to be a first for the UK.

Genomic medicine has the potential to provide new diagnoses and treatment solutions for patients who have complex or rare diseases, faster and more efficiently.

However, genomic data is often securely stored in different places, within what are known as Trusted Research Environments (TREs). Moving data from one TRE to another can be costly, complicated and time-consuming, or impossible because of the laws and regulations governing sensitive data. These obstacles are amplified where the TREs are in separate organisations. Moving and processing the data also demands substantial computing resources, as a single sequenced human genome takes up around 150 gigabytes (GB) of storage.

A single whole genome sequence can be around 150GB in size. Streaming music consumes around 1GB of data every eight hours. It would take 1,200 hours (nearly five months, listening for eight hours a day) to listen to a single genome’s worth of data.
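The arithmetic behind this comparison can be checked with a short Python sketch. It uses only the approximate figures quoted above; the variable names are illustrative, and the length of a month (365 / 12 days) is an assumption made for the conversion.

```python
# Approximate figures quoted in the article.
GENOME_SIZE_GB = 150          # one whole genome sequence
STREAMING_HOURS_PER_GB = 8    # music streaming: about 1GB every eight hours
LISTENING_HOURS_PER_DAY = 8   # listening for eight hours a day

# Assumption for illustration: an average month is 365 / 12 days.
DAYS_PER_MONTH = 365 / 12

total_hours = GENOME_SIZE_GB * STREAMING_HOURS_PER_GB
total_days = total_hours / LISTENING_HOURS_PER_DAY
total_months = total_days / DAYS_PER_MONTH

print(f"{total_hours} hours, {total_days:.0f} days, about {total_months:.1f} months")
```

Under these assumptions the result is 1,200 hours, or 150 days of listening, which is just under five months.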

 

A team comprising local healthcare and research organisations in Cambridge were granted funding to demonstrate how analysis can be undertaken across separate secure and remote databases simultaneously as if they were one, a process known as federation.

The funding came from UK Research & Innovation as part of Phase 1 of the DARE UK (Data and Analytics Research Environments UK) programme, which is delivered in partnership with Health Data Research UK (HDR UK) and Administrative Data Research UK (ADR UK).

This Sprint project involved the University of Cambridge, the National Institute for Health and Care Research Cambridge Biomedical Research Centre (NIHR Cambridge BRC) and Genomics England, with platform and technology innovation from UK enterprise Lifebit and project management support from Health Innovation East and Cambridge University Health Partners (CUHP). Patients and the public were involved throughout.

Following the eight-month project, the team has achieved what we believe to be the UK’s first demonstration of genomic data federation. By bridging the TREs of the NIHR Cambridge BRC and Genomics England, it enables researchers to safely access and work with both databases without moving the original data: only the combined analysis results leave the TREs. The APIs (Application Programming Interfaces, which enable TREs to talk to each other) used were developed by the Global Alliance for Genomics and Health (GA4GH) and are open source, meaning they are available free of charge to the global research community.
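The principle of federation described above, where the analysis travels to the data and only summary results travel back, can be illustrated with a minimal Python sketch. This is not the project’s code and does not use the GA4GH APIs: the `ToyTRE` class, the `Aggregate` type, the variant labels and the records are all hypothetical, synthetic stand-ins chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Aggregate:
    """The only thing that leaves a TRE here: summary counts, no records."""
    carriers: int
    participants: int


class ToyTRE:
    """Hypothetical stand-in for a Trusted Research Environment."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # individual-level data stays inside the TRE

    def count_carriers(self, variant):
        """Run the query locally and return summary counts only."""
        carriers = sum(1 for r in self._records if variant in r["variants"])
        return Aggregate(carriers=carriers, participants=len(self._records))


def federated_carrier_frequency(tres, variant):
    """Combine per-TRE aggregates as if the datasets were one."""
    results = [tre.count_carriers(variant) for tre in tres]
    carriers = sum(r.carriers for r in results)
    participants = sum(r.participants for r in results)
    return carriers / participants if participants else 0.0


# Synthetic toy records, invented purely for this example.
tre_a = ToyTRE("TRE A", [
    {"variants": {"var1"}},
    {"variants": {"var2"}},
    {"variants": set()},
])
tre_b = ToyTRE("TRE B", [
    {"variants": {"var1", "var2"}},
    {"variants": {"var1"}},
])

frequency = federated_carrier_frequency([tre_a, tre_b], "var1")
print(f"Combined carrier frequency for var1: {frequency:.2f}")
```

In this sketch the coordinating function never sees an individual record, only the counts each environment chooses to return. A real deployment would also need authentication, researcher approval and checks on what results may leave each TRE, none of which is modelled here.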

Professor Serena Nik-Zainal, NIHR Research Professor and Honorary Consultant in Clinical Genetics, University of Cambridge said:

“This technology has the potential to remove the geographical, logistical, and financial barriers associated with moving exceptionally large datasets. For genomics research, the potential to undertake research across multiple datasets means access to much greater and more diverse data. Applied at scale, this means huge potential for new discoveries, particularly for research into rare diseases and for reducing health inequalities.”

A key consideration in designing the technology was enabling approved researchers to access data whilst protecting the privacy of individuals’ data and ensuring the correct information governance structures are in place to keep data secure.

Professor Tim Hubbard, Professor of Bioinformatics, Head of Department of Medical & Molecular Genetics at King’s College London, Associate Director, Health Data Research UK London Site and Senior Advisor at Genomics England said:

“Increasing numbers of organisations, including NHS England, are adopting the TRE model to provide research access to health data while not allowing its distribution, thereby increasing oversight and transparency while protecting privacy. Developing methods to jointly analyse data held in separate TREs is therefore critical to maximise research insights. The output from this DARE UK Sprint project is an important practical demonstration of federated analysis of genomic data with freely available code that can be built on by the global community.”

Ensuring co-production with patients

Giving patients a voice in the project from the beginning has helped ensure the patient community are confident with their understanding of federation and data research and therefore empowered to take a lead in decision-making about how their data are used for patient or societal benefit.

“I have been delighted to be involved with this exciting project since the beginning,” said Rosanna Fennessy, Patient Partner for the project. “This has enabled me to better understand the potential that federation brings, both in terms of opportunities for researchers using health data and for patients and the public in terms of ensuring the safety of their data. Many patients and members of the public have worked in collaboration with the team to shape this project, both to the stage it is at now and with how it plans to move forward in the future. Involving us at this early stage will undoubtedly benefit both researchers and the wider public so that we can ensure the safe and fair use of health data to maximise improved outcomes for all.”

The team has published a report, Multi-party trusted research environment federation: Establishing infrastructure for secure analysis across different clinical-genomic datasets, which evaluates the impact of the project and shares learning which can be applied to future healthcare data programmes.

The consortium for this DARE UK sprint project, Multi-party trusted research environment federation, includes the University of Cambridge, NIHR Cambridge BRC, Genomics England, Health Innovation East, Cambridge University Health Partners and Lifebit.


Dr Serena Nik-Zainal talks about the potential for research using different datasets without having to move the data
