Has someone stolen your genome?

If you've sequenced your DNA, be careful what you do with it...
05 March 2020

Interview with Michael Edge, University of California Davis


DNA databases containing the sequence data of millions of individuals


A common trend at the moment is to get your DNA sequenced; often you learn how much Neanderthal is in you, that you’ve got Irish, Welsh and Cossack ancestry, and some of these resources also let you know about things like disease risk genes. Some people go further and upload their genome to a range of online platforms that use the sequence data to do things like reunite you with your long lost relatives. But, as UC Davis researcher Michael Edge explains to Chris Smith, many of these platforms are vulnerable to some simple exploits, so there’s a very real danger that you could end up with your genome getting hacked and stolen…

Michael - What some of these websites do is they allow people to upload their own genetic data, so people will get their genetic data from someplace like 23andMe or Ancestry.com and then they're allowed to upload that information to one of these other websites like GEDmatch, FamilyTreeDNA, MyHeritage. And one reason those have been really interesting is that law enforcement have wanted to upload genetic information they have from a crime scene and use that information to find people who they're interested in. So one very high profile case was this serial killer in California called "The Golden State Killer", and about two years ago he was caught using this kind of method. So somebody uploaded genetic information to GEDmatch, was able to find some second and third cousins and then ultimately track him down. And what we were interested in is if somebody uploaded the right kind of information to these datasets, could they actually compromise the privacy of a lot of different people who are in those datasets? So could you actually figure out genetic information of the people who are in the databases by just uploading particularly well-chosen datasets?

Chris - Well, give me an example then of a way in which something could happen that would concern you...

Michael - Well, one reason could be insurance. In the United States, people are protected in terms of their health insurance by an act called GINA. But that doesn't apply to all kinds of insurance. So you could imagine people being interested in this information and maybe using it to discriminate against certain people. Or you could imagine somebody maybe stealing part of somebody's genome, learning part of their genome, and then using that to construct a fake relative and then gain their trust. So maybe if I know part of your genome, I can build a genome that looks like it's your long lost second cousin or something, and then try to connect with you on one of these services. And maybe that gives me an "in" for some kind of phishing scam or something like that.

Chris - So what question did you ask of the system? How did you approach this in the first place?

Michael - We kind of had two broad ideas for how somebody could attack this dataset. One idea was maybe they can just take advantage of the fact that all of our genomes are like a mosaic built from pieces of our ancestors' genomes. And what that means is that we all share little pieces of genome with each other. If I upload my genome, and I'm shown all the places where I match somebody in the dataset, then I know something about all those people's genomes where I match them. And I'll match my close relatives and, depending on how short of a piece of genome the database is willing to show me, I might match people who I'm not even particularly closely related to. So that's one kind of attack: it's just by looking at sort of tiles of my genome that match other people's genomes and in particular not just my genome, but many people's genomes that I could download from the internet, and looking at where all of them match with other people, can I figure out other people's genetic information? And then the other broad method we thought about was attacking the algorithm that these websites use - in particular GEDmatch uses - to identify matching segments. And we found that there's a way you could trick the algorithm into basically showing you somebody's genotype at a site that you care about.
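The first kind of attack described above can be illustrated with a small sketch. It is not the researchers' code: it simply assumes that a database reports, for each uploaded genome, the start and end positions where it matches a target, and it measures how much of the target's chromosome those "tiles" cover once they are combined. The names `merge_segments` and `covered_fraction`, the coordinates, and the chromosome length are all made up for illustration.

```python
from typing import List, Tuple

# A matching segment as (start, end) positions on one chromosome.
# Hypothetical representation; real services report matches differently.
Segment = Tuple[int, int]


def merge_segments(segments: List[Segment]) -> List[Segment]:
    """Merge overlapping or touching segments into non-overlapping tiles."""
    merged: List[Segment] = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous tile: extend it if this one reaches further.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def covered_fraction(segments: List[Segment], chromosome_length: int) -> float:
    """Fraction of the chromosome covered by at least one matching segment."""
    covered = sum(end - start for start, end in merge_segments(segments))
    return covered / chromosome_length


# Made-up matches between one target and three different uploaded genomes.
matches = [(0, 30), (20, 50), (70, 90)]
print(merge_segments(matches))
print(covered_fraction(matches, 100))
```

The more genomes an attacker uploads, and the shorter the segments the service is willing to display, the more of the target this kind of tiling could cover.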

Chris - Have you put these vulnerabilities to the owners of these sites?

Michael - Yes, we did. So three months before we published anything, we went and talked to representatives of all these different companies, told them basically what we were planning to do, and gave them a sense of what our results were going to be roughly.

Chris - And what did they say?

Michael - Many of them assured us that they didn't think that these problems would really affect their website, for one reason or another. In different cases, that was either convincing or not. We know a bit about what GEDmatch has done, but in the other cases we don't know whether they've done much.

Chris - Given that you've put this out in the public domain, one would hope that they would tighten up, but is there a way of tightening up without destroying the raison d'etre behind the whole initiative?

Michael - Yeah, that's a good question. And we think there are lots of ways. So we suggested a bunch in our paper and two pretty easy ones for them to do would be first to only show either long matching segments, or just not show the location of segments at all. So some companies follow this model and that protects them from one of the major types of attack that we propose. And the other method is using more up-to-date methods for finding segments that match between two people. There's an older class of methods that's vulnerable to a type of hacking that we talk about. Most of the companies have assured us that they use a newer method that wouldn't be vulnerable in the same way, but GEDmatch we think is still using the older method. And then a third thing that would be very effective but is a bit harder - because it requires cooperation - is these sites that allow uploads could actually start only accepting uploads that are tagged with a digital signature that ensures that they're from a trusted source. So that would mean that I, as just an independent person, couldn't go and make a text file that's formatted the way 23andMe does it, and then upload it to one of these places. That I think is a big hole that all of them have now.
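Two of these suggestions can be sketched in a few lines. This is a rough illustration under stated assumptions, not a description of how any of the companies work. The threshold `MIN_SEGMENT_LENGTH` and the function names are hypothetical. The tag check uses an HMAC from Python's standard library as a simple stand-in: a real scheme of the kind described would more likely use public-key digital signatures, so that upload sites could verify files without holding the testing company's secret.

```python
import hashlib
import hmac
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) of a matching segment; hypothetical units

MIN_SEGMENT_LENGTH = 40  # hypothetical threshold, chosen only for this example


def reportable_segments(segments: List[Segment],
                        min_length: int = MIN_SEGMENT_LENGTH) -> List[Segment]:
    """Keep only matches long enough to show; short tiles are withheld."""
    return [(start, end) for start, end in segments if end - start >= min_length]


def tag_file(key: bytes, data: bytes) -> str:
    """Tag a genotype file. HMAC-SHA256 stands in for a real digital signature."""
    return hmac.new(key, data, hashlib.sha256).hexdigest()


def is_trusted_upload(key: bytes, data: bytes, tag: str) -> bool:
    """Accept an upload only if its tag matches the file contents."""
    return hmac.compare_digest(tag_file(key, data), tag)


# Made-up example data.
key = b"demo-key-held-by-the-testing-company"
genuine = b"rs0000001\t1\t1000\tAG\n"
tag = tag_file(key, genuine)

print(is_trusted_upload(key, genuine, tag))            # file as issued
print(is_trusted_upload(key, genuine + b"fake", tag))  # hand-edited file
print(reportable_segments([(0, 10), (0, 50)]))
```

With a check like this in place, a hand-made text file would fail verification, because whoever wrote it could not produce a valid tag for it.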

Chris - Do you think this really is a big risk though? Because if someone was that hell bent on discovering, say my DNA, they really wouldn't struggle to track me down, shake my hand a few times, make friends with me at work, invent some story to gain a sample from me, which they could do without me even realizing they'd done it. So is there much to be gained really from going to all this trouble of hacking one of these sites?

Michael - Yeah, so I think if you have a specific target you're interested in, you're right - and actually the laws on that in the United States are pretty loose in terms of being able to follow somebody around and use found DNA. The difference with these kinds of methods that we're talking about in this paper is you'd get all this information at the same time. So rather than having to track one individual down or a set of individuals down, you'd get some information from potentially everybody in one of these databases. GEDmatch, for example, has over a million people in it.

Chris - What would you say then is the bottom line here? What's the advice you give to people then - just don't upload your DNA?

Michael - I wouldn't say that. I mean, people come to this with a lot of different concerns and a lot of different interests, so I wouldn't say don't do it, but I would say to be very careful when you do it and to realize what you're sharing. This is information that's not just one's own information. It's about one's relatives, even very distant relatives, even unborn relatives. And once it's out there, you can't retract it in the way that, you know, if my credit card information gets stolen, it's a pain, but if I have to, I can change it. And that's not true with our DNA, right?

