Genes, genomes and junk DNA
Dr. Hubbard:: Nowadays, most of what we're doing here is re-sequencing. So we have the reference sequence of human and some of the other vertebrates, and we're interested in looking at how the sequence varies across large numbers of individuals. And so, when we sequence another individual, for the most part their sequence is going to be the same as the reference sequence, and so it's a question of comparison and spotting the differences. And so, that's how you deal with a new genome.
Now, there will be unique parts of that. There's a process called assembly where you basically put these fragments together just like the bricks in a wall. They overlap with each other and you hope that they fit together uniquely so you can make a unique sequence. As the length of the fragments gets longer, it becomes easier to build a unique assembly.
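As a rough illustration of the "comparison and spotting the differences" step, here is a minimal Python sketch. It assumes the two sequences are already aligned and of equal length, so it only finds single-letter substitutions; the function name and the example sequences are made up for illustration and are not from any real pipeline.

```python
def find_variants(reference, sample):
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumes the two sequences are already aligned and the same length,
    so insertions and deletions are not handled.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i, ref_base, alt_base)
        for i, (ref_base, alt_base) in enumerate(zip(reference, sample))
        if ref_base != alt_base
    ]


# Hypothetical toy sequences: the sample differs from the reference at two sites.
reference = "ACGTTGCAAC"
sample = "ACGTAGCAAT"
variants = find_variants(reference, sample)
print(variants)
```

Real re-sequencing works on short reads that first have to be placed on the reference, which is a much harder problem than this letter-by-letter comparison.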
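The idea of fitting overlapping fragments together can be sketched in a few lines of Python. This is a toy greedy approach, assuming error-free fragments from a single strand and a hypothetical minimum overlap of three letters; it is not how production assemblers work, and all names here are illustrative.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # nothing left that overlaps; return the separate pieces
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags


# Hypothetical fragments cut from the toy sequence ATGGCGTACGTTA.
contigs = greedy_assemble(["ATGGCGT", "GCGTACG", "TACGTTA"])
print(contigs)
```

With short fragments and repeated stretches of sequence, several merges can look equally good, which is why longer fragments make a unique assembly easier.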
Kat:: So you get all these fragments out, you assemble them all together, you line them all up and you get a whole genome. How many different species have had their whole genomes read in this way?
Dr. Hubbard:: I've lost count. We have in the ENSEMBL database around 50 vertebrates but there are projects being planned to sequence, say, 10,000 vertebrates. I'm only talking here about vertebrates. There's many, many more genomes of pathogens, of bacteria. Basically, it's become so cheap to sequence now, so there's enormous amounts of sequencing going on. It's still quite hard though to make the first version of any particular species. It's much harder to build a reference.
Kat:: So you've got this genome, this book of letters that makes up an organism's DNA. That's very nice. I'm sure it's a nice thing to have. The real challenge I guess is understanding it. How do you figure out what's in there? What makes this species the species it is?
Dr. Hubbard:: Once you have an assembly, the first thing you're interested in is the genes. The genes are the fragments of that genome that are expressed to make physical proteins, which are a part of every cell, and the mechanisms for regulating those genes, turning them on and off. Say for human, how many genes are there? So for quite a while, we thought the main definition of a gene was something that makes a protein, and original estimates were that there might be as many as 100,000. This is all from year 2000 kind of analysis and gradually, that number came down.
The shock analysis at the beginning said that there were only 30,000, and it has come down now to be more like 20,000. It looked like a small set of genes but recently, it's become clear that there are other classes of genes that don't make proteins, that just make RNAs.
So, the process of using a gene is, there's a DNA sequence, you make a copy of a region of that sequence, that's RNA, and then you translate that and manufacture protein. There's a significant set of genes where you never make the protein. You just make the RNA and then the RNA itself is a functional unit. Sometimes it's part of regulation; sometimes, in fact, there's some sort of structural function. But there seem to be a lot more of those than we thought a few years ago. Exactly how many there are is actually quite hard to say, because they are harder to identify than protein-coding genes: there's a signature that makes it relatively easy to identify protein sequences. RNA is rather more difficult.
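The DNA to RNA to protein steps described here can be sketched in Python. This is a simplified illustration: it treats the input as the coding strand (so copying to RNA is just swapping T for U), and it uses only a small, deliberately partial codon table chosen for the made-up example sequence.

```python
# Partial codon table, enough for the toy example only; None marks a stop codon.
CODON_TABLE = {
    "AUG": "M",  # methionine, the usual start
    "GCC": "A",  # alanine
    "UUU": "F",  # phenylalanine
    "AAA": "K",  # lysine
    "UAA": None,
    "UAG": None,
    "UGA": None,
}


def transcribe(dna):
    """Copy a coding-strand DNA sequence into RNA (simplification: T becomes U)."""
    return dna.upper().replace("T", "U")


def translate(rna):
    """Read the RNA three letters at a time until a stop codon is reached."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "X")  # "X" = not in this partial table
        if amino_acid is None:
            break  # stop codon ends the protein
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical toy gene: ATG GCC TTT AAA TGA
rna = transcribe("ATGGCCTTTAAATGA")
protein = translate(rna)
print(rna, protein)
```

The start codon, the three-letter reading frame and the stop codon are part of the "signature" that makes protein-coding sequence comparatively easy to recognise; genes that only make RNA have no such frame to look for.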
Kat:: I guess these are kind of the regions of the genome that were almost dismissed as junk DNA before. We couldn't figure out what they did. Do you think the concept of junk DNA still exists? Do you think any parts of the genome are junk or will we find that they are all functional?
Dr. Hubbard:: The definition of junk kind of came with whether a region of the genome was conserved or not, and that conservation is kind of based on comparing sequences between species. Is this region similar in mouse, similar in zebrafish, similar in chimpanzee or similar in other humans? We've got better at doing that, but similarity is still based around proteins.
It may be that there are other things that are conserved, such as the spacing between elements on the genome. It's conservation, but you don't score it in the same sort of way. And it's suddenly clear now that, as well as using this DNA technology to find the whole transcriptome, everything that's transcribed in a genome, we can also use it as another form of assay to spot a whole other layer of modification of the genome, which is the epigenome.
It's a kind of modification layer on top, which seems to at least capture the state of a cell. And so in that sense, most of the DNA is in some sort of state for the epigenome and so you could say, well, how much junk is really left? How much is there that could be classed as junk? Because in fact, most of it seems to be labelled in some sort of way.
Kat:: And we'll be covering epigenetics in a future edition of the Naked Genetics podcast and exploring that a little bit more. But in terms of all the data that we have now, this data about the genes that are active, that are transcribed, the genes in an organism's genome, all these species, all these pathogens, how do you organize it, how do you store it and how do you make it available to researchers?
Dr. Hubbard:: There's pretty much a policy that's come out - partly from the human genome project and the other projects around at that time - of releasing that data immediately. There's been a database for DNA for 30 years now, I think, and it's become policy that when you sequence and publish, it is expected that the data is deposited.
But that's a huge pile of raw data and that's rather hard for researchers to use and so databases have grown up to organize that data. And as the whole genomes have been sequenced, the whole genome has become the organizing principle for DNA.
And so, these projects like ENSEMBL have become a place where you can go and see a genome, and everything that's known about its annotation, about where the genes are, where factors bind, regulatory patterns. In the beginning ENSEMBL was based on vertebrates, but because data is now being collected around different cell types, you're beginning to be able to display that information as well, about the epigenetic state of different cell types.
Kat:: That was Tim Hubbard, from the Wellcome Trust Sanger Institute.