Data sanitisation tool plugs privacy gap

Functional genomics data is vulnerable to de-anonymising attacks...
17 December 2020

Interview with Mark Gerstein, Yale University




A big issue in functional genomics studies - which look at gene functions and interactions - is privacy. These studies often require many thousands of genomes to make a statistically strong case, and the best way to get that many is collaboration, but there are valid privacy concerns around sharing your participants’ genetic information. A new paper has validated those concerns: researchers combined genetic databases with DNA samples taken from used coffee cups to show that even innocuous-seeming data can identify someone when it is combined with a second source of information. They’ve come up with a solution to this, though. Phil Sansom heard from author Mark Gerstein...

Mark - We've created a method for sanitising functional genomics data so it can be shared without impinging on patient privacy.

Phil - What situation are we talking about here? Is this if I've sent my DNA off to 23andMe, or is it something else?

Mark - This is different from 23andMe. This is a subset of that type of genomic information called functional genomics information, that has to do with the activity of genes. Now when someone does a functional genomics experiment, the useful bits aren't the variants associated with you, usually; they're levels of activity of genes, the degree to which various things bind on the chromosome.

Phil - I'm surprised this kind of data wasn't secure already.

Mark - The leakage of information can be very subtle. And one of the ways of ascertaining this is to do a very subtle attack on the data called a linking attack, where you look at the data's ability to link to another source of information. And so we've used these linking attacks to measure the very small amount of leakage from the experiments, and to calibrate our cleaning methods to be sure we really scrubbed all the variants away.
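To make the idea of a linking attack concrete, here is a toy Python sketch. It is an illustration only, not the method from the paper: the data format (variant position mapped to an allele), the names `match_counts` and `link`, and the example individuals are all hypothetical assumptions. It shows how a few variants leaked from an experiment could be compared against a second, named data source, and how the link disappears once no variants are left to compare.

```python
from typing import Dict, Optional

# Hypothetical toy format: variant position -> observed allele.
Genotype = Dict[int, str]


def match_counts(leaked: Genotype, database: Dict[str, Genotype]) -> Dict[str, int]:
    """Count, for each named person, how many leaked variants agree with their record."""
    return {
        name: sum(1 for pos, allele in leaked.items() if known.get(pos) == allele)
        for name, known in database.items()
    }


def link(leaked: Genotype, database: Dict[str, Genotype]) -> Optional[str]:
    """Return the single best-matching name, or None if no unambiguous link exists."""
    counts = match_counts(leaked, database)
    if not counts:
        return None
    ranked = sorted(counts.values(), reverse=True)
    top = ranked[0]
    # No link if nothing matched, or if two people tie for the best score.
    if top == 0 or (len(ranked) > 1 and ranked[1] == top):
        return None
    return max(counts, key=counts.get)


# Second data source: genotypes with names attached (made-up values).
database = {
    "person_A": {101: "A", 250: "G", 377: "T"},
    "person_B": {101: "C", 250: "G", 377: "A"},
}

leaked_before = {101: "C", 377: "A"}  # variants recovered from unsanitised data
leaked_after: Genotype = {}           # nothing left to compare after scrubbing

print(link(leaked_before, database))
print(link(leaked_after, database))
```

In this toy setting the strength of the link is just a count of matching variants; a real measurement of leakage would be statistical and considerably more careful.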

Phil - Okay, say I'm a genetics researcher then. How would I take your method and use it to sanitise, as you put it, my data? So that there's a balance between... I've not got rid of the crucial stuff, but I've got rid of enough so it doesn't identify someone or link to someone's identifying data.

Mark - This is less targeted at the individual researcher and more targeted at data repositories, where people would upload the data to be shared with the community. But an individual researcher could do this. It's a piece of software that you would run on the reads - these little bits of sequence that you get from the sequencing machine, and these things - and what it will basically do is go through the reads and find where you have variants in the reads, and remove a small number of those variants, a very measurable number, so as to disable any sort of linking attack or privacy breach from those variants. And then once you do that, you can essentially share the reads from the experiment publicly and without worrying about privacy.
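The step Mark describes, going through the reads and removing variants, can be sketched in a few lines of Python. This is a minimal illustration under strong assumptions, not the authors' software: the `Read` class, the `sanitise_read` and `coverage` helpers, and the toy reference sequence are hypothetical, reads are assumed to align without insertions or deletions, and every mismatch is simply overwritten with the reference base. The point it shows is that the variants vanish while the read positions, which carry the gene-activity signal, stay the same.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Read:
    start: int     # 0-based position on the reference where the read aligns
    sequence: str  # bases reported by the sequencer (toy model: no indels)


def sanitise_read(read: Read, reference: str) -> Tuple[Read, int]:
    """Overwrite every mismatching base with the reference base.

    Returns the cleaned read and the number of bases that were changed.
    """
    ref_slice = reference[read.start:read.start + len(read.sequence)]
    if len(ref_slice) != len(read.sequence):
        raise ValueError("read extends beyond the end of the reference")
    changed = sum(1 for r, g in zip(read.sequence, ref_slice) if r != g)
    return Read(read.start, ref_slice), changed


def coverage(reads: List[Read], length: int) -> List[int]:
    """Per-position read depth, a stand-in for the activity signal."""
    depth = [0] * length
    for read in reads:
        for pos in range(read.start, read.start + len(read.sequence)):
            depth[pos] += 1
    return depth


reference = "ACGTACGTAC"  # made-up reference sequence
reads = [
    Read(0, "ACGTA"),
    Read(2, "GTTCG"),  # carries one variant: T where the reference has A
    Read(5, "CGTAC"),
]

cleaned = []
total_changed = 0
for read in reads:
    clean, changed = sanitise_read(read, reference)
    cleaned.append(clean)
    total_changed += changed

print("bases changed:", total_changed)
print("coverage unchanged:", coverage(reads, 10) == coverage(cleaned, 10))
```

Counting the changed bases mirrors the "very measurable number" of variants removed: a repository could report how much of the data was altered alongside the sanitised reads.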

Phil - Of course the other thing is that, if someone gets hold of information like my street address, they can come find me or rob me or whatever. If someone gets hold of my genome, what have they got? Well, if I haven't got a condition that I'm worried about... kind of bupkes, right?

Mark - We might not know what to do with genomic information now, but 50 years in the future from now we might know a lot more. And so there you have a potential avenue for people unearthing a lot of private information about one's offspring and future generations. Another scenario is that even with the information today, potentially people can find subtle statistical hints that you might have a predilection to one condition or another, which... a political candidate for office or someone in the public space, imagine if their genome got out and people could easily mine it, they might say, "oh geez, this person has a predilection to alcoholism". Or, "the person has a predilection to this," and whether it's exactly true or not is not really that important. The fact is it still can be very damaging information. There's a final scenario too, which is even more farfetched, but I'll just say it because it gives you a sense of the bizarre things that people think about in this. If someone got a bit of your genomic information, they could go to a crime scene and put a little bit down, right? And then they could implicate you in a crime. And again, this isn't something that the normal person thinks about, but it's something you could imagine, say, a state actor might be thinking about in relation to some foreign adversary or something.

