Dr Tim Hubbard, Wellcome Trust Sanger Institute
Dr. Hubbard: Nowadays, most of what we're doing here is re-sequencing. So we have the reference sequence of human and some of the other vertebrates, and we're interested in looking at how the sequence varies across large numbers of individuals. And so, when we sequence another individual, most of their sequence is going to be the same as the reference sequence, and so it's a question of comparison and spotting the differences. And so, that's how you deal with a new genome.
Now, there will be unique parts of that. There's a process called assembly, where you basically put these fragments together just like the bricks in a wall. They overlap with each other and you hope that they fit together uniquely so you can make a unique sequence. As the length of the fragments gets longer, it becomes easier to build a unique assembly.
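To make the brick-wall picture concrete, here is a minimal sketch of greedy overlap assembly in Python. It is a toy illustration, not the method used at the Sanger Institute: the function names `overlap` and `greedy_assemble`, the `min_len` threshold and the example fragments are all assumptions introduced here, and real assemblers also have to cope with sequencing errors, repeats and reads from either strand.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len` (hypothetical threshold).
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the longest overlap.

    Returns the list of contigs left when no pair overlaps by at least `min_len`.
    """
    contigs = list(fragments)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # nothing left that fits together
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]
    return contigs


# Three made-up fragments that overlap end to end
print(greedy_assemble(["ATGCGT", "CGTACC", "ACCTGA"]))
```

With longer fragments the overlaps are longer too, so fewer pairs fit together by chance, which is one way to see why longer reads make a unique assembly easier.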
Kat: So you get all these fragments out, you assemble them all together, you line them all up and you get a whole genome. How many different species have had their whole genomes read in this way?
Dr. Hubbard: I've lost count. We have in the ENSEMBL database around 50 vertebrates, but there are projects being planned to sequence, say, 10,000 vertebrates. I'm only talking here about vertebrates. There are many, many more genomes of pathogens, of bacteria. Basically, it's become so cheap to sequence now that there's an enormous amount of sequencing going on. It's still quite hard, though, to make the first version for any particular species. It's much harder to build a reference.
Kat: So you've got this genome, this book of letters that makes up an organism's DNA. That's very nice. I'm sure it's a nice thing to have. The real challenge I guess is understanding it. How do you figure out what's in there? What makes this species the species it is?
Dr. Hubbard: Once you have an assembly, the first thing you're interested in is the genes. The genes are the fragments of the genome that are expressed to make physical proteins, which are a part of every cell, and then there are the mechanisms for regulating those genes, turning them on and off. Say, for human, how many genes are there? So for quite a while, we thought the main definition of a gene was something that makes a protein, and original estimates were that there might be as many as 100,000. This is all from year-2000 kind of analysis and gradually, that number came down.
The shock analysis at the beginning said that there were only 30,000, and it has come down now to be more like 20,000. It looked like a small set of genes but recently, it's become clear that there are other classes of genes that don't make proteins, that just make RNAs.
So, the process of using a gene is: there's a DNA sequence, you make a copy of a region of that sequence, that's RNA, and then you translate that and manufacture protein. There's a significant set of genes where you never make the protein. You just make the RNA and then the RNA itself is a functional unit. Sometimes it's part of regulation; sometimes, in fact, it has some sort of structural function. But there seem to be a lot more of those than we thought a few years ago. Exactly how many there are is actually quite hard to say, because they are harder to identify than protein-coding genes: there's a signature that makes it relatively easy to identify protein sequences. RNA is rather more difficult.
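The DNA-to-RNA-to-protein route described here can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the helper names `transcribe` and `translate` are made up for this example, the input is taken to be the coding strand, the standard genetic code is used, and translation simply starts at the first AUG and stops at the first stop codon. Splicing and regulation are ignored entirely.

```python
from itertools import product

# Standard genetic code, with codons enumerated in U, C, A, G order.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Copy a stretch of coding-strand DNA into RNA (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(rna):
    """Translate from the first AUG up to the first stop codon.

    Returns one-letter amino acid codes; unknown codons become 'X'.
    Returns an empty string if there is no start codon.
    """
    start = rna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(rna) - 2, 3):
        aa = CODON_TABLE.get(rna[i:i + 3], "X")
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)


rna = transcribe("GGGATGTTTTGGTGA")  # made-up example sequence
print(rna, translate(rna))
```

The start codon, the reading frame of triplets and the stop codon are part of the "signature" that makes protein-coding sequence comparatively easy to recognise. A gene whose product is the RNA itself has no such frame to look for, which is one reason it is harder to find.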
Kat: I guess these are kind of the regions of the genome that were almost dismissed as junk DNA before. We couldn't figure out what they did. Do you think the concept of junk DNA still exists? Do you think any parts of the genome are junk or will we find that they are all functional?
Dr. Hubbard: The definition of junk kind of came with whether a region of the genome was conserved or not, and that conservation is kind of based on comparing sequences between species. Is this region similar in mouse, or similar in zebrafish, similar in chimpanzee or similar in other humans? We've got better at doing that, but similarity is still based around proteins.
It may be that there are other things that are conserved, such as the spacing between elements on the genome. It's conservation but you don't score it in the same sort of way. And it's suddenly clear now that, as well as using this DNA technology to find the whole transcriptome, everything that's transcribed in a genome, we can also use it as another form of assay to spot a whole other layer of modification of the genome, which is the epigenome.
It's a kind of modification layer on top, which seems to at least capture the state of a cell. And so in that sense, most of the DNA is in some sort of state for the epigenome and so you could say, well, how much junk is really left? How much is there that could be classed as junk? Because in fact, most of it seems to be labelled in some sort of way.
Kat: And we'll be covering epigenetics in a future edition of the Naked Genetics podcast and exploring that a little bit more. But in terms of all the data that we have now, this data about the genes that are active, that are transcribed, the genes in an organism's genome, all these species, all these pathogens, how do you organize it, how do you store it and how do you make it available to researchers?
Dr. Hubbard: There's pretty much a policy that's come out, partly from the Human Genome Project and the other projects around at that time, of releasing that data immediately. There's been a database for DNA for 30 years now, I think, and it's become policy that when you sequence and publish, it is expected that the data is deposited.
But that's a huge pile of raw data and that's rather hard for researchers to use, and so databases have grown up to organize that data. And as whole genomes have been sequenced, the whole genome has become the organizing principle for DNA.
And so, these projects like ENSEMBL have become a place where you can go and see a genome, and everything that's known about its annotation, about where the genes are, where factors bind, regulatory patterns. In the beginning ENSEMBL was based on vertebrates, but because data is now being collected around different cell types, you're beginning to be able to display that information as well, about the epigenetic state of different cell types.
Kat: That was Tim Hubbard, from the Wellcome Trust Sanger Institute.