Cleaning gigabytes of data

How do you check billions of data points for mistakes?
14 October 2019

Interview with 

Clare Bycroft, Genomics PLC


A magnifying glass in front of binary code.


The gene sequencer's camera isn’t struggling to capture every base at once - it’s that the DNA bases are added one at a time. It requires some complex chemistry. And - lest we forget - there are one and a half quadrillion bases to be sequenced. When Phil Sansom spoke to Mark Effingham from the UK Biobank, Mark explained how tricky it is just to deal with all that information...

Mark - Historically UK Biobank has made available its data for download by approved researchers. But these datasets are now going to be so large that you simply cannot do that; around about 20 gigabytes per genome sequence, that's probably the equivalent of four high definition movies that you might download from iTunes or similar. So if you then times that by 500,000, that is a substantial amount of data.
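(For the curious, Mark's figures multiply out as follows - a quick back-of-envelope sketch in Python, with the numbers taken straight from the interview.)

```python
# Sanity-checking the interview's figures: ~20 GB per genome sequence,
# 500,000 participants in the UK Biobank.
gb_per_genome = 20
participants = 500_000

total_gb = gb_per_genome * participants   # total raw data volume in GB
total_pb = total_gb / 1_000_000           # 1 petabyte = 1,000,000 GB

print(f"{total_gb:,} GB = {total_pb:g} petabytes")
# -> 10,000,000 GB = 10 petabytes
```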

With the sheer scale of this project comes extra, unexpected jobs - jobs like what Clare Bycroft used to do. She’s a scientist who once worked on the genotype data from the UK Biobank. Before they started the whole genome sequencing, they analysed the DNA on a more limited scale. Although, as she tells Phil Sansom, more limited doesn't mean small...

Clare - I was part of a team that took the raw genotyping data that came out of the laboratory machines, and our job was to turn this wealth of data into a carefully curated resource. We were a bit like curators in a museum or an art gallery, but for data.

Phil - What do you mean by that? Because isn't the data just the data, and then you just have to put it on a computer or something?

Clare - Yes, perhaps naively you might think that, but it turns out that in such a large experiment like this it's inevitable that occasionally some errors will occur; in the processing, perhaps how the samples were handled, or something that happened in the computers. And our job was to find these errors, and either remove them from the data or communicate these errors to the people who were going to be using it. And if we count how many of those points we actually ended up removing it's about... less than 1%.

Phil - What exactly does a mistake look like?

Clare - All sorts of things. So sometimes the machine or the algorithm fails to actually get enough information to make a call, to be able to decide whether it's a G or a C or a T or an A. But in other cases the letter is actually probably the wrong one.

Phil - How do you tell when a letter’s the wrong one though? Because surely it could just be that letter?

Clare - Yeah. So usually what we do is we think about what kind of biology that we are trying to capture here, and other things that we know about how the genome is inherited. And if we happen to see in the dataset that half of the people carry two Gs and half of the people carry two As, and very few carry an A and a G, then that's an indication that probably some of those are incorrect.
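(The check Clare is describing can be sketched in a few lines of Python. Under random mating - Hardy-Weinberg equilibrium - genotype counts at a site should fall near the ratio p² : 2pq : q² for allele frequencies p and q; a site with lots of GG and AA carriers but almost no A/G heterozygotes violates that badly. The counts below are invented to match her example.)

```python
# Hypothetical genotype counts at one site: many GG and AA carriers,
# almost no heterozygous A/G carriers - just as in Clare's example.
n_GG, n_AG, n_AA = 500, 5, 495
n = n_GG + n_AG + n_AA

p = (2 * n_GG + n_AG) / (2 * n)   # frequency of allele G
q = 1 - p                         # frequency of allele A

# Number of A/G heterozygotes expected under Hardy-Weinberg equilibrium
expected_het = 2 * p * q * n

print(round(expected_het), "expected vs", n_AG, "observed")
# -> 500 expected vs 5 observed: this site looks suspect
```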

Phil - How many little bits of data were you looking at again?

Clare - There were around 800,000 positions in the genome that were looked at for all of the individuals in the UK Biobank. So that's 800,000 by 500,000.

Phil - My maths isn't that good but that must be in the billions, right?

Clare - Something like that.
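(Phil's maths is fine - the multiplication from the figures above comes to 400 billion genotype calls.)

```python
# The arithmetic Phil is reaching for: positions examined per person,
# times the number of participants.
positions = 800_000
participants = 500_000

data_points = positions * participants
print(f"{data_points:,}")   # -> 400,000,000,000 - i.e. 400 billion
```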

Phil - So you can't exactly go in by hand, so you had algorithms to go through and...

Clare - Absolutely right. We developed metrics and statistical approaches to try and clean up the data.
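(One standard metric of the kind Clare mentions is the per-site call rate: the fraction of samples for which the assay successfully made a genotype call at a given position. A minimal sketch, with invented data and an invented threshold - real pipelines use many such filters together.)

```python
# Two hypothetical sites; None marks a "no-call", where the machine or
# algorithm couldn't decide which base a sample carries.
sites = {
    "rs0001": ["GG", "AG", "GG", "AA", "GG"],
    "rs0002": [None, None, None, "AG", "GG"],
}

def call_rate(calls):
    """Fraction of samples with a successful genotype call at a site."""
    return sum(c is not None for c in calls) / len(calls)

# Flag sites where fewer than 90% of samples got a call (threshold invented)
flagged = [site for site, calls in sites.items() if call_rate(calls) < 0.9]
print(flagged)   # -> ['rs0002'] - only 40% of its samples got a call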

Phil - And do you think that you got it basically, that you got a clean set? Or there's still like one in a million, there's an A instead of a C?

Clare - Yeah, that's a really, really important question. And I think what's useful for people to get their heads around is this: how do you create a resource that's going to be used by hundreds of different people, with many, many different types of questions that they want to ask of this data, and how do you make that clean for everybody? We thought about the major ways that people were likely to use this dataset - one example is learning about the associations between genotypes and common disease - and thought about the kinds of things that would be useful for those researchers as well as others. And we also took the approach that sometimes people might be really interested in the unusual aspects of this kind of data, and we didn't want to remove everything that we thought had some minute possibility of being incorrect, so that people could make their own decisions.

Phil - Right, because every person is going to have some rare thing....

Clare - You mean the people in the UK Biobank? Or the researchers?

Phil - The UK Biobank.

Clare - Exactly. In a dataset of this size we expect to see quite a lot of unusual things, in particular with the human genome. Sometimes you can compare sex determined from looking at the genetic data to the sex reported by the individual, and sometimes that is not the same. And historically people have often used this in smaller samples as a way of identifying potential mishandling in the laboratory. But it turns out that in the UK Biobank, it's large enough to be able to see really interesting instances where some individuals have, say, two X chromosomes and one Y chromosome; and it turns out standard ways of looking at this data and inferring sex from this are just not appropriate.
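(The check Clare describes - and its failure mode - can be sketched like so. The records are invented, and real pipelines infer chromosome dosage from genotyping-array intensities rather than neat integer counts; this just illustrates why the naive "any Y means male" rule is not enough on its own.)

```python
# Invented records: self-reported sex plus sex-chromosome counts.
records = [
    {"id": "A", "reported": "F", "n_X": 2, "n_Y": 0},
    {"id": "B", "reported": "M", "n_X": 1, "n_Y": 1},
    {"id": "C", "reported": "M", "n_X": 2, "n_Y": 1},  # XXY karyotype
]

def naive_inferred_sex(n_X, n_Y):
    """The 'standard' rule: any Y chromosome -> male. Breaks for XXY etc."""
    return "M" if n_Y >= 1 else "F"

mismatches = [r["id"] for r in records
              if naive_inferred_sex(r["n_X"], r["n_Y"]) != r["reported"]]
unusual = [r["id"] for r in records if r["n_X"] + r["n_Y"] != 2]

print(mismatches)  # -> [] - the naive rule 'agrees' even for XXY...
print(unusual)     # -> ['C'] - ...so karyotypes need a separate check
```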

Phil - Now that they're doing whole genome sequencing are they going to have to do this whole thing all over again?

Clare - In some senses yes. But I think what's really helpful with the experiments we've already done is that we've now learned a lot about the properties of the data, and aspects of the genetics, of the individuals in the UK Biobank. So for example one of the things that we had to find out when we were looking at the genotype data is who in this dataset are related to each other. So who in the UK Biobank has a mother, or a child, or a cousin, or a sibling. And so I suspect that they will use that kind of information to help them figure out the needle in the haystack problem of where there are errors in the data.
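(A toy version of the relatedness detection Clare mentions: close relatives share identical genotypes at far more sites than unrelated pairs, a quantity called identity-by-state. Real tools - KING-style kinship estimators, for instance - are far more careful; the genotypes below are invented, coded as counts of the alternative allele.)

```python
from itertools import combinations

# Invented genotypes at 8 sites, coded 0/1/2 = copies of the alt allele.
genotypes = {
    "p1": [0, 1, 2, 1, 0, 2, 1, 1],
    "p2": [0, 1, 2, 1, 0, 2, 1, 0],   # differs from p1 at 1 of 8 sites
    "p3": [2, 0, 0, 2, 2, 0, 0, 2],   # matches neither at any site
}

def ibs_fraction(a, b):
    """Fraction of sites where two individuals carry identical genotypes."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

for (i, gi), (j, gj) in combinations(genotypes.items(), 2):
    print(i, j, ibs_fraction(gi, gj))
# p1-p2 score high (likely relatives); p1-p3 and p2-p3 score low
```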

Phil - Roughly how many people do you think in the course of analysing the data, based on your experience, are going to be bashing their heads into walls?

Clare - I think there will be moments of frustration. There always is in science. But I think it will be overall hugely exciting for the people who are involved.

