Quadrillions: Sequencing the UK Biobank
Half a million genomes. That's how many the UK Biobank has, stored as blood samples in freezers up in Manchester. And in September 2019 they announced a project to sequence every single one of them. It's the obvious next step for the UK Biobank, the research study that began in 2006 and now consists of an enormous biological database: the personal and medical information of its 500,000 volunteers. That data is available to any researcher who applies to use it. But how is this, the biggest whole-genome sequencing project ever, going to work? Who's coughing up the hundreds of millions of pounds it's going to take? And what exactly is the point of generating so much data? This month on the show, genetics reaches a new scale. We're talking in the quadrillions.
In this episode
00:00 - What's the UK Biobank?
Mark Effingham, UK Biobank
When it comes to genetics, sometimes it seems like data is king. Especially, as Phil Sansom found out, when you’re talking to someone like Mark Effingham...
Mark - I'm Mark Effingham, I'm the Chief Operating Officer for UK Biobank.
Phil - The UK Biobank, if you haven’t heard of it, is this enormous scientific project that’s sort of part research study, part charity, part big Excel spreadsheet. Around a decade ago they managed to get half a million volunteers between the ages of 40-69 to come in, give a bit of blood and urine, and answer just a load of questions about themselves. The idea is that any scientist can cross-reference and study that data however they want. And now the Biobank is adding a pretty significant column to that spreadsheet: whole-genome sequencing.
Mark - I guess this is starting to understand the human genome. That genome effectively contains three billion letters. What we’re trying to do is look at that genetic code across half a million people.
Phil - Hang on a moment. Three billion letters… half a million people… I’m gonna need a calculator… [typing]... OK. That’s a one, and a five, and then fourteen zeroes. One and a half quadrillion little bits of DNA. I’m no expert... but I think that’s quite a lot? After talking to Mark I had a few questions. So I went to find some answers.
Mark - It really is exciting.
Clare - I think there will be moments of frustration - there always is…
Serena - You can have a complete world map...
Hannah - It just takes so long, man!
Phil - Today on the show, we're talking in the quadrillions. I'm Phil Sansom, and this is Naked Genetics.
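Phil's calculator sum from a moment ago can be checked in a few lines. This is a minimal sketch: the constant names are illustrative, and both inputs are the rounded figures quoted in the conversation rather than exact counts.

```python
# Rounded figures quoted in the episode, not exact counts.
LETTERS_PER_GENOME = 3_000_000_000  # "three billion letters"
PARTICIPANTS = 500_000              # "half a million people"

total_letters = LETTERS_PER_GENOME * PARTICIPANTS

# Short-scale quadrillion: a one followed by fifteen zeroes.
QUADRILLION = 10**15

print(f"{total_letters:,} letters in total")
print(f"{total_letters / QUADRILLION} quadrillion")
```

Written out in full, the total is a one, a five and fourteen zeroes, which matches what Phil reads off his calculator.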
01:49 - The curious chemistry of gene sequencing
Vince Smith, Illumina
Each bit of DNA that leaves the UK Biobank over the next two years will head to either Iceland, to the company deCODE, or to the Wellcome Sanger Institute in Cambridge. But both places will sequence the genes of their samples in the same way - using tech from sequencing company Illumina. If you've listened to the last couple episodes, you heard how their machines can do a whole genome in forty hours. But we didn’t cover a crucial part: the chemistry, how the machine can look at only one letter at a time. Phil Sansom went to visit Vince Smith, who leads their chemical team...
Vince - We sequence the DNA inside something called a flow cell, which I have in my hand here. Just a couple of pieces of glass glued together with a layer in between them, and there's a chamber between those two pieces of glass.
Phil - It’s probably a lot more complicated than it looks, right?
Vince - Yeah, yeah, superficially it looks like a relatively small piece of glass. It's got a slightly rainbow colour, there's light refracting through it.
Phil - Yeah, it does.
Vince - And that's because of these very tiny patches that we load the DNA molecules into, billions of very tiny cylindrical wells - they're just a few hundred nanometres across.
Phil - Where does the DNA go on that?
Vince - Inside a DNA sample there'll be billions of fragments of DNA that have come from a patient, or from someone in the Biobank project. What we've done is attached short pieces of chemically-synthesised DNA to each end of those DNA fragments.
Phil - And it’s always the same two sequences?
Vince - It's always the same two sequences. We call them adaptor DNAs. They will then bind very specifically to a complementary piece of DNA on the surface of the flow cell, and that's the mechanism by which we attach DNA to the surface so that it's ready to sequence. You remember that DNA base pairs bind to each other, and we're making use of that property to attach the DNA very specifically to the flow cell surface.
Phil - And then what?
Vince - Then what we need to do is that we need to make copies of that one piece of DNA in each well. We end up with about a thousand pieces of DNA in each well. And having a thousand pieces of DNA in that well means that you get a lot more signal.
Phil - How do you get them to multiply?
Vince - So we use a special trick, it's a process called bridge amplification. The first piece of DNA does something called bridging, it bridges over, bending over to form this bridge on the surface, and binds to another complementary strand of DNA on the surface of the flow cell. We then make another copy. That process repeats, and these molecules of DNA continue to walk around the surface with this bridging process. At each step when they do that we make a copy of them, and very quickly you end up with, you know, it's an exponential process, with a thousand or so copies of that one DNA molecule, all with the same sequence.
Phil - So to clarify, what’s going on is that at high temperatures in the machine, DNA - normally a paired double-strand - separates into two single strands. When each strand then bends over to form a bridge on the flow cell, they apply an enzyme that uses the bridge shape to create that strand’s complementary pair. And the pair then stands back up, and because it’s still hot, they separate again. Both single strands bend over, and you keep going. The enzymes that they use are really cool because they need to work in high temperatures, so Illumina often gets them from little bacteria that live in the deep ocean around thermal vents. Using them, this process makes thousands of DNA copies in each of the billions of wells on the flow cell.
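To see why "exponential" gets you to a thousand or so copies quickly, here is a rough sketch. It assumes every strand is copied perfectly in every cycle, which is an idealisation and not something stated about the real chemistry; the function name is hypothetical.

```python
def cycles_to_reach(target_copies: int, start_copies: int = 1) -> int:
    """Count ideal doubling cycles needed to reach at least target_copies.

    Assumes every strand is copied once per cycle (perfect doubling).
    """
    cycles = 0
    copies = start_copies
    while copies < target_copies:
        copies *= 2   # each existing strand gains one new copy
        cycles += 1
    return cycles


cycles = cycles_to_reach(1000)
print(cycles, "cycles give", 2 ** cycles, "copies")
```

Under that assumption, one starting molecule passes the thousand mark after ten rounds of doubling.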
Vince - So the next step is that we have these little chemical building blocks of DNA that we've modified in the lab. And what we've done to them is two things. First of all we've added a fluorescent dye to them. The other thing we've done is added something called a terminator, or actually a reversible terminator.
Phil - That's not Arnold Schwarzenegger but he can do a three point turn?
Vince - Yeah, nothing to do with Arnie. It's a chemical modification to the natural building block of DNA that stops you adding more than one base at a time to a growing DNA strand. If we didn't have it there what would happen is the enzyme would just make a copy of the strand in its entirety in a single step. You wouldn't be able to see the DNA letter by letter.
Phil - What’s the enzyme?
Vince - So the enzyme is something called a DNA polymerase. In nature, you add a DNA polymerase to copy DNA and it will very rapidly just zip along the DNA strand and make a copy. In our system we add one building block of DNA at a time. We then image the whole flow cell to see which colour of base is lighting up. Then what we can do using another chemical process is remove the dye, the fluorescent dye, and also remove this terminator. And the process repeats…
Phil - Rinse, repeat.
Vince - Exactly.
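The add, image, remove cycle Vince describes can be sketched as a toy loop. This is only an illustration under stated assumptions: the dye colours are made up, strand direction is ignored, and nothing here reflects Illumina's actual software.

```python
# Base-pairing rules: each template base pairs with one complementary base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Hypothetical dye colours, one per base (the real colour scheme is not given).
DYE_COLOUR = {"A": "green", "C": "blue", "G": "yellow", "T": "red"}
BASE_FOR_COLOUR = {colour: base for base, colour in DYE_COLOUR.items()}


def sequence_by_synthesis(template: str) -> str:
    """Read a template one base per cycle, as in the add/image/remove loop."""
    read = []
    for template_base in template:
        # 1. Add: the polymerase attaches exactly one modified building block.
        #    The reversible terminator is what stops a second one being added.
        added_base = COMPLEMENT[template_base]
        # 2. Image: the camera only sees a colour, which identifies the base.
        colour_seen = DYE_COLOUR[added_base]
        read.append(BASE_FOR_COLOUR[colour_seen])
        # 3. Remove: dye and terminator are cleaved off, so the next loop
        #    iteration can add the following base.
    return "".join(read)


print(sequence_by_synthesis("GATTACA"))
```

The read that comes out is the complement of the template, built up one letter per cycle.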
06:53 - Cleaning gigabytes of data
Clare Bycroft, Genomics PLC
The gene sequencer's camera isn’t struggling to capture every base at once - it’s actually that the DNA bases are added one at a time. It requires some complex chemistry. And - lest we forget - there are one and a half quadrillion bases to be sequenced. When Phil Sansom spoke to Mark Effingham from the UK Biobank, he explained how tricky it is just to deal with all that information...
Mark - Historically UK Biobank has made available its data for download by approved researchers. But these datasets are now going to be so large that you simply cannot do that; around about 20 gigabytes per genome sequence, that's probably the equivalent of four high definition movies that you might download from iTunes or similar. So if you then times that by 500,000, that is a substantial amount of data.
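Mark's "times that by 500,000" works out as follows. A minimal sketch using his rounded figures; the constant names are illustrative, and the conversion assumes decimal units (a million gigabytes to the petabyte).

```python
GB_PER_GENOME = 20        # "around about 20 gigabytes per genome sequence"
GENOMES = 500_000         # half a million participants
MOVIES_PER_GENOME = 4     # Mark's high-definition movie comparison

total_gb = GB_PER_GENOME * GENOMES
total_pb = total_gb / 1_000_000   # assumes decimal units: 1 PB = 1,000,000 GB
total_movies = MOVIES_PER_GENOME * GENOMES

print(f"{total_gb:,} GB, or about {total_pb:g} PB")
print(f"roughly {total_movies:,} HD movies")
```

On those numbers the finished dataset is in the region of ten petabytes, which is why downloading it is no longer practical.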
With the sheer scale of this project come extra, unexpected jobs - jobs like what Clare Bycroft used to do. She’s a scientist who once worked on the genotype data from the UK Biobank. Before they started the whole genome sequencing, they analysed the DNA on a more limited scale. Although, as she tells Phil Sansom, more limited doesn't mean small...
Clare - I was part of a team that was to take the raw genotyping data that came out of laboratory machines, and our job was to turn this wealth of data into a carefully curated resource. We were a bit like curators in a museum or an art gallery, but for data.
Phil - What do you mean by that? Because isn't the data just the data, and then you just have to put it on a computer or something?
Clare - Yes, perhaps naively you might think that, but it turns out that in such a large experiment like this it's inevitable that occasionally some errors will occur; in the processing, perhaps how the samples were handled, or something that happened in the computers. And our job was to find these errors, and either remove them from the data or communicate these errors to the people who were going to be using it. And if we count how many of those points we actually ended up removing it's about... less than 1%.
Phil - What exactly does a mistake look like?
Clare - All sorts of things. So sometimes the machine or the algorithm fails to actually get enough information to make a call, to be able to decide whether it's a G or a C or a T or an A. But in other cases the letter is actually probably the wrong one.
Phil - How do you tell when a letter’s the wrong one though? Because surely it could just be that letter?
Clare - Yeah. So usually what we do is we think about what kind of biology that we are trying to capture here, and other things that we know about how the genome is inherited. And if we happen to see in the dataset that half of the people carry two Gs and half of the people carry two As, and very few carry an A and a G, then that's an indication that probably some of those are incorrect.
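One standard way to put numbers on Clare's example is the Hardy-Weinberg expectation, which predicts how many people should carry each pair of letters given how common each letter is. The episode does not say which test her team used, so treat this choice as an assumption; the counts below are invented for illustration.

```python
def expected_genotype_fractions(n_aa: int, n_ag: int, n_gg: int) -> dict:
    """Hardy-Weinberg expected fractions of AA, AG and GG carriers."""
    total_people = n_aa + n_ag + n_gg
    # Each person carries two copies, so count letters, not people.
    freq_a = (2 * n_aa + n_ag) / (2 * total_people)
    freq_g = 1 - freq_a
    return {
        "AA": freq_a ** 2,
        "AG": 2 * freq_a * freq_g,
        "GG": freq_g ** 2,
    }


# Invented counts: about half AA, about half GG, very few AG.
n_aa, n_ag, n_gg = 4950, 100, 4950
expected = expected_genotype_fractions(n_aa, n_ag, n_gg)
observed_ag = n_ag / (n_aa + n_ag + n_gg)

print(f"expected AG fraction: {expected['AG']:.2f}")
print(f"observed AG fraction: {observed_ag:.2f}")
```

With both letters equally common, about half of people would be expected to carry one of each, so seeing almost none is the kind of mismatch that flags a position for closer inspection.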
Phil - How many little bits of data were you looking at again?
Clare - There was around 800,000 positions in the genome that were looked at for all of the individuals in the UK Biobank. So that's 800,000 by 500,000.
Phil - My maths isn't that good but that must be in the billions, right?
Clare - Something like that.
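For the record, the product Phil is estimating can be worked out directly. A small sketch using Clare's rounded figures; the names are illustrative.

```python
POSITIONS = 800_000      # genome positions looked at per person
INDIVIDUALS = 500_000    # people in the UK Biobank

data_points = POSITIONS * INDIVIDUALS
billions = data_points // 10**9

print(f"{data_points:,} data points, i.e. {billions} billion")
```

So "in the billions" is right, at around four hundred of them.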
Phil - So you can't exactly go in by hand, so you had algorithms to go through and...
Clare - Absolutely right. We developed metrics and statistical-based approaches to try and clean up the data.
Phil - And do you think that you got it basically, that you got a clean set? Or there's still like one in a million, there's an A instead of a C?
Clare - Yeah, that's a really, really important question. And I think useful for people to get their heads around is that: how do you create a resource that was going to be used by hundreds of different people, with many, many different types of questions that they want to ask of this data, and how do you make that clean for everybody? We thought about the major ways that people were likely to use this dataset - one example is learning about the associations between genotypes and common disease - and think about the kinds of things that would be useful for those researchers as well as others. And we also took the approach that sometimes people might be really interested in the unusual aspects of this kind of data, and we didn't want to remove everything that we thought had some minute possibility of being incorrect, so that people could make their own decisions.
Phil - Right, because every person is going to have some rare thing....
Clare - You mean the people in the UK Biobank? Or the researchers?
Phil - The UK Biobank.
Clare - Exactly. In a dataset of this size we expect to see quite a lot of unusual things, in particular with the human genome. Sometimes you can compare sex determined from looking at the genetic data to the sex reported by the individual, and sometimes that is not the same. And historically people have often used this in smaller samples as a way of identifying potential mishandling in the laboratory. But it turns out that in the UK Biobank, it's large enough to be able to see really interesting instances where some individuals have, say, two X chromosomes and one Y chromosome; and it turns out standard ways of looking at this data and inferring sex from this are just not appropriate.
Phil - Now that they're doing whole genome sequencing are they going to have to do this whole thing all over again?
Clare - In some senses yes. But I think what's really helpful with the experiments we've already done is that we've now learned a lot about the properties of the data, and aspects of the genetics, of the individuals in the UK Biobank. So for example one of the things that we had to find out when we were looking at the genotype data is who in this dataset are related to each other. So who in the UK Biobank has a mother, or a child, or a cousin, or a sibling. And so I suspect that they will use that kind of information to help them figure out the needle in the haystack problem of where there are errors in the data.
Phil - Roughly how many people do you think in the course of analysing the data, based on your experience, are going to be bashing their heads into walls?
Clare - I think there will be moments of frustration. There always is in science. But I think it will be overall hugely exciting for the people who are involved.
14:51 - The whole genome of breast cancer
Serena Nik-Zainal, University of Cambridge
The UK Biobank is a massive database of half a million people’s family histories, diets, and medical data… you name it. And starting now, they’re sequencing the whole genome - every single bit of DNA - for every single person. A big job, with a big price tag - and half of that investment is coming from big pharmaceutical companies. £25 million each. Why? The UK Biobank’s Mark Effingham explains...
Mark - This wealth of genetic data will enable those pharma companies to develop new treatments, and they will get a short period of exclusive access of nine months before those data become available to all UK Biobank researchers.
AstraZeneca’s Catherine Priestley agrees, saying that they’ll be looking at complex diseases such as IPF, idiopathic pulmonary fibrosis, and CKD, chronic kidney disease. “Defining the ‘right target’ is key to any drug discovery. To be able to do this well, you need to do this at scale to uncover rare variants.” As an example of how scale can help uncover the genes behind a complex disease, you can look at breast cancer. It’s the UK’s most common type of cancer - one in eight women gets it at some point. Serena Nik-Zainal from the University of Cambridge has been researching it for years, and in a new study, she tells Phil Sansom about how she looked at whole genomes and found something surprising...
Serena - We've studied a cancer type that’s quite aggressive called triple negative breast cancers. And we have used the genomics to see whether we can identify patients who are going to respond to treatments vs patients who don't respond to treatments. We take a DNA sample from the tumour itself and we also take a sample of DNA from the blood of the patient. And a tumour usually has a highly mutated genome, and so you can do the comparison and find all the mutations that have arisen that have contributed towards cancer development.
Phil - That's what makes it a tumour basically.
Serena - That's exactly what makes it a tumour. That's right.
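The tumour-versus-blood comparison Serena describes boils down to a subtraction: anything seen in the tumour but not in the blood arose in the tumour. Here is a toy sketch of that idea; the variant records are invented, and real pipelines are far more involved than a set difference.

```python
from typing import Set, Tuple

# A variant is (chromosome, position, reference letter, altered letter).
Variant = Tuple[str, int, str, str]


def tumour_only_variants(tumour: Set[Variant], blood: Set[Variant]) -> Set[Variant]:
    """Variants present in the tumour sample but absent from the blood sample."""
    return tumour - blood


# Invented example data for illustration only.
blood_sample = {("chr1", 1000, "A", "G"), ("chr2", 5000, "C", "T")}
tumour_sample = blood_sample | {("chr17", 4300, "G", "A"), ("chr13", 3200, "T", "C")}

for variant in sorted(tumour_only_variants(tumour_sample, blood_sample)):
    print(variant)
```

The variants shared with the blood are inherited ones the patient was born with, so they drop out and only the tumour's own mutations remain.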
Phil - Who were you looking at in the study?
Serena - In the south of Sweden they've been recruiting patients with breast cancer since 2010. Every woman with a breast cancer has been included in this study. The recruitment rate is in the order of 85%, over eleven and a half thousand patients recruited already.
Phil - Does it matter that they're all Swedish?
Serena - It certainly has got its own biases because it's a Swedish population. Having said that, there was no bias in who was included, so anybody in the whole of the south of Sweden was included.
Phil - What did you find?
Serena - Patterns of mutations. Specifically we had used an algorithm that we developed, a computational algorithm called HRDetect. HRDetect was trained using machine learning methods to identify tumours that had BRCA1 or BRCA2 genetic defects. These two genes are really important in fixing damage that happens to your DNA. So our DNA is always coming under attack from the environment, from within our cells. BRCA1 and BRCA2 are proteins that are involved in DNA repair. So when you have a mutation in BRCA1 and BRCA2 you can't fix damage and you get a lot of mutations in your genome. HRDetect is trained to identify those patterns and tell you what the likelihood is of any tumour having a BRCA1 or BRCA2 type of defect.
Phil - Here's the stupid question. Why can't you just look at BRCA1 or BRCA2? Because those are genes that you know where they are, right?
Serena - Yes. So if you could just be sure that by sequencing just a gene alone you would find the tumours, then yes that would be the cheapest way. But what we found is that actually about a third of the tumours, we cannot find the genetic cause or an alternative cause. We can't find it in about a third of the tumours. So we can see the patterns, and the patterns look identical to BRCA1 or BRCA2 tumours, but we can't find the genetic defect.
Phil - That's so weird that it's a BRCA1 tumour but it doesn't have the BRCA1 thing!
Serena - That's right. And we don't fully understand why that is. But we don't fully understand the whole genome either. And I suspect there are ways of turning off the gene that we don't fully understand yet.
Phil - Now what does that mean if you find these tumours that people didn't realise had these patterns, but actually they do - what does that mean for people?
Serena - For those women who have tumours that look like the BRCA1 and BRCA2 tumours, currently they're not getting the same treatments.
Phil - How many of them are there again?
Serena - More than 59% had a high score. And these tumours are believed to be sensitive to specific drugs, in particular drugs that were developed for BRCA1 and BRCA2 tumours called PARP inhibitors. Currently actually most women still don't get PARP inhibitors in this country. But even if they did it would have only been 1 to 5 percent, not 50-something percent. That increase is massive. So this drug was initially created for 1 to 5 percent of the population, not huge numbers, not 20-something percent, 50-something percent. So that’s a lot of women who potentially could be getting drugs that they're not getting.
Phil - And because it's got that pattern it might work on them.
Serena - That's right. But we don't know whether that's definitely going to be the case. So we now need to do the clinical trials.
Phil - Why do you need whole genome sequencing for that?
Serena - The other ways of examining the genome are called exon sequencing or targeted sequencing. And exon sequencing captures about 1 to 2 percent of the genome, and targeted sequencing captures even less, 0.1 percent or less. If you don't look at 98 percent of the genome you're going to miss a lot of information. So I kind of think of it as going on a voyage and using very limited landmarks to try to get where you want to go. But today with whole genome sequencing you can have a complete world map.
Phil - Is it like going from “here there be dragons” to GPS?
Serena - Yeah, pretty much it is. Yeah.
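To put Serena's percentages into letters, here is a rough sketch. It assumes the rounded three-billion-letter genome quoted earlier in the episode, and the function name is hypothetical.

```python
GENOME_LETTERS = 3_000_000_000  # rounded genome length quoted in the episode


def letters_covered(fraction_of_genome: float) -> int:
    """Approximate number of letters read for a given fraction of the genome."""
    return round(GENOME_LETTERS * fraction_of_genome)


print(f"exon sequencing, low end:  {letters_covered(0.01):,}")
print(f"exon sequencing, high end: {letters_covered(0.02):,}")
print(f"targeted sequencing:       {letters_covered(0.001):,}")
print(f"whole genome:              {letters_covered(1.0):,}")
```

Even at the high end, exon sequencing reads tens of millions of letters and leaves billions unread, which is the gap the "complete world map" is meant to close.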
21:16 - Gins & Genes: It takes so long!
Eva Higginbotham, University of Cambridge; Hannah Thompson, Cambridge Cancer Genomics
We've been talking about the UK Biobank's genome sequencing project - now it's time to take the edge off with a nice gin & tonic. This month, Eva Higginbotham and Hannah Thompson join Phil Sansom at the Cambridge Distillery...
Hannah - I’m Hannah Thompson, I’m the Chief Product & People Officer at a startup called Cambridge Cancer Genomics.
Eva - I'm Eva Higginbotham and I'm a student at the Department of Zoology.
Phil - Thanks both for joining me. We're going to be drinking some gin, as usual provided by the Cambridge Distillery - courtesy of our friend Mr. Will Lowe. Will, what are we drinking today?
Will - So today I have bought you the latest in our hyper-local gins: this is called the Curator's Gin. All of the flavours you see in this glass came out of the Cambridge University Botanic Garden less than two miles from here.
Phil - So this month we're talking about this big project to sequence all the whole genomes in the UK Biobank. Is this everything it's cracked up to be? Is it a real big milestone in the way we understand genetics?
Eva - I think it will be when it's done - for the UK, at least.
Phil - Is that important, by the way? “For the UK”?
Eva - I think it is important. I mean really... we can sequence all of the genomes in the UK, but actually the Biobank is 94.6% white. Which at the time that they collected participants was actually representative of the population, but we're not talking about, you know, all humans when we're talking about this. We’re actually talking about a very… it’s quite a small subset of types of people. I mean, not only is the Biobank segregated by age, so they recruited people between age 40 and 69; but also because of... it's called the healthy volunteer effect. So the kind of person who's likely to respond to a letter in the post that says, “do you want to be a part of our project,” et cetera, tends to be someone who's already quite healthy. And actually the statistics show that they have a higher percentage of women; less likely to smoke; they're less likely to be obese than the general population; they're more likely to be healthy in lots of ways. But, you know, you have to start somewhere.
Hannah - You make really, really good points, and I think it's great to know what's going on at a genetic level, but actually some of those things don't present themselves if they're not put in certain environments. If you really, really want to untangle everything you have to monitor people right from birth up ‘til they die.
Eva - The other thing is that it's, you know, I was talking to my friend who's a doctor and she was saying, “what are we going to do?” As a scientist I'm kind of like, “more information is interesting!” And from a doctor standpoint they're like, “great. So it's this gene. And how can I help this family?”
Hannah - Yeah, I think that will be the next bottleneck.
Phil - Does that seem like something to do with the reason that these pharmaceutical companies are getting nine months’ early access?
Eva - Overall they've put money in - it's 100 million pounds between four companies - they're putting money in, so they're getting something out. And in a way that is a sign of things ticking along and working as they should. One of the questions about pharmaceutical companies having early access to the data comes under the question of patents and patenting genes. So I didn't realise until recently that up until 2013 in the US, it was 100 percent legal to patent genes. And actually...
Phil - Patent genes?!
Hannah - Madness!
Eva - Yeah. And then there were... 4,300 human genes were patented up until 2013. And not just, this isn't just a question of like, “oh, we took a bit of DNA and we messed around with it and now we’ve patented it.” This is, like, a gene that I could have in my body that my body made and used. Where that has really serious consequences is when you think about testing for specific genetic disorders. So the famous one is the BRCA, the breast cancer genes. Those were under patent until 2013. So if you wanted to get tested to see if you had a much higher risk of breast cancer, you had to pay that pharmaceutical company to do the testing because they had patented those genes.
Phil - But you have the gene!
Eva - Yeah, you have the gene! I mean, that's... it brings up all these interesting questions about who owns nature. I'm not sure about the laws in the UK, and they did change them in the US in 2013, but yeah. One of the things that they report from the Biobank for this project is that the pharmaceutical companies will have to report on any patents that they're going to file from information they gather from this mission.
Hannah - I would say also that nine months isn't that much of a head start in the grand scheme of science. They are actually the ones with the most resources able to action any data points that come out of it. I talk to oncologists almost weekly and they say we're getting more genomic information and, “I don't know, I'm not a genomic expert so I don't know how to handle it.”
Phil - Is that the real challenge then? Analysing this stuff?
Hannah - Yeah, I think there's huge amounts of data to understand. And if we get the links between them wrong that has huge implications as well.
Phil - What do you guys think about this idea that, you know, data seems to be the big, hot new commodity nowadays?
Hannah - Data is definitely the new gold; or petrol, maybe. Gas. Who knows. Luckily there are quite strict laws on data privacy, but it's very hard to police exactly what's going on so you can't be looking at what everyone's doing all the time.
Eva - It's interesting to me because I'm not a big data scientist; I take lots of images of fly brains using a microscope. And over the course of the last four years I've taken an unbelievable number of images. But the fact is that only a small portion of that is actually useful. We've reached a point in biology where it's relatively easy to make a huge amount of data. Getting what's good out of it is going to take much longer than actually producing it in the first place.
Phil - Two years down the line when it's actually done - maybe a couple of years after that when a bunch of people have done studies on it - do you think someone's life might be measurably different?
Hannah - I think the problem I have with this is: lots of science news is very exciting and very explosive, and like, “woo, this is going to happen!” But actually it doesn't happen for years and years and years, and we all know the pain of how long science actually takes. So I think there's just a management issue, so managing expectations and the like.
Phil - So manage my expectations.
Hannah - You might die from a genetic disease and not know why still! We're not going to see the true impact of the data for about... maybe 50 years, if I want to be very cautionary?
Phil - Fifty years?
Hannah - Yeah, I think to be able to research something, find a drug target, put it through clinical trials, make sure it works in that subset of patients that you're interested in; that’s a lot of time.
Eva - And it also costs a lot of money. I mean, one of the things that's good about this kind of project is that hopefully it will make making new drugs cheaper, because part of the cost of getting a new drug to market is the number of drugs that turn out to be useless. You can put years and years and years - and a horrible amount of money - into developing a drug, and then you find out, “oops, it doesn't actually do what you want it to do, never mind!” And so that means that the cost of developing the drug that does actually work goes way, way up. So one of the hopes actually with this kind of project is that, because we'll understand more details about complex regulatory links between genes, we might be able to rule out drugs earlier in the process. So in that sense it's going to take a long time to see the full effects of this work, absolutely. But hopefully it will also speed up some of those processes overall.
Hannah - It just takes so long, man! There are new, really exciting things coming out, but just... yeah. When you look back it takes a lot of time to get places.
Phil - Right, that's enough about the genetics. I think we've earned some gin. Cheers everyone!
All - Cheers!
Hannah - Just what I needed on a Monday.