The Death of Biostatistics?
Chris - Almost every aspect of science relies on statistics to decide whether results are valid or not, but when statistics were invented, there was no such thing as social media - Twitter and Facebook didn't exist. Now, given that people are increasingly using this sort of crowd-sourced data to do their research, are old-style stats still up to the job? Professor Arnoldo Frigessi is the Director of Statistics for Innovation at the University of Oslo in Norway and he's got a perspective on this...
Arnoldo - That's true, so old statistics is dead. Two big things are happening. One is that the data basis is changing dramatically. Before, we had these nicely built-up case-control studies, taking people that had the disease and people that didn't have the disease and comparing them carefully, maybe 20 in one group, 55 in the other. Now we have Twitter and Facebook. Now we have millions of people out there who tell us about their diseases, their bio facts and their symptoms, and we have to use this data to find out, as we did before in case-control studies, whether a drug works or not.
Chris - With that level of user engagement, where before we had small trials and we had to make sure we controlled everything very, very tightly, now we can do effectively massive trials. The noise is actually enormous, yet you can still extract very meaningful data of very high statistical significance from that because of the size of the sample.
Arnoldo - Exactly. So we're trading sample size against accuracy, or bias, and we have to correct things because, of course, not everybody is using the internet, right? We have to correct to get the right population, and this we can do; that's not difficult. But today we can monitor Google, for example, and find out where the flu is. There are papers out there showing that by looking at how many people are checking for symptoms of flu, and where those people are, we can predict the wave of flu in the world two weeks before the World Health Organisation does. Or we can, for example, use the happiness of people to find out if stock values are going up or down.
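The correction "to get the right population" that Arnoldo mentions can be sketched in a few lines. The sketch below assumes post-stratification weighting, a standard technique that the interview does not name; the function name, the age groups, the population shares and the sample values are all hypothetical and chosen only for illustration.

```python
def poststratified_mean(sample, population_share):
    """Reweight a biased sample so each stratum counts by its population share.

    sample: list of (stratum, value) pairs, e.g. ("young", 1) for "reports symptom".
    population_share: dict mapping stratum -> share of the real population.
    Both the strata and the shares are assumptions supplied by the analyst.
    """
    totals, counts = {}, {}
    for stratum, value in sample:
        totals[stratum] = totals.get(stratum, 0.0) + value
        counts[stratum] = counts.get(stratum, 0) + 1

    missing = set(population_share) - set(counts)
    if missing:
        # A stratum with no observations cannot be corrected by reweighting.
        raise ValueError(f"no observations for strata: {sorted(missing)}")

    # Weighted average of the per-stratum means.
    return sum(share * totals[s] / counts[s] for s, share in population_share.items())


# Hypothetical online sample: young users are over-represented (8 of 10).
sample = [("young", 1)] * 6 + [("young", 0)] * 2 + [("old", 1), ("old", 0)]
raw_mean = sum(value for _, value in sample) / len(sample)
corrected = poststratified_mean(sample, {"young": 0.5, "old": 0.5})
print(raw_mean, corrected)
```

The raw average is dominated by the over-represented group; the corrected figure weights each group by its assumed share of the real population. It only works if every group appears in the sample at all, which is why the sketch refuses to proceed otherwise.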
Chris - Just like a paper in Science that came out in the last couple of months using Twitter messages to look at dysphoria and happiness, and how this evolves as the day goes on. We're all happier in the morning, miserable by evening, except for one little surge before bed and it's true regardless of creed and culture. I was amazed by that and the fact that you can do these sorts of experiments now with high statistical significance and you don't actually have to leave your lab or recruit people. You just look at data that's in the public domain very easily.
Arnoldo - This is really completely changing how statistics has to work. Now, we have to use enormous data sets which are very bad, where we need models to filter out the noise. We need to correct biases. We need to find out if things are independent. Of course, if we check out how many people like Silvio Berlusconi today...
Chris - Well he likes himself presumably...
Arnoldo - ...Yes, exactly, but not many others! So you need to find people that are a bit independent from each other, right? So think of Facebook as an enormous network where you and your friends are connected by edges, and now we have to find independent people. We can't take you and your friends. You think mostly the same, so that's very boring. We have to essentially split up this network into independent clusters. This is a completely new story. We have to sample networks to find independent units so that we can, in my words, reduce variances.
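One crude way to picture "sampling a network to find independent units" is to pick people so that no two of them are direct friends. The sketch below is only an illustration under that assumption: it uses a greedy selection on a small hypothetical friendship graph, and it is not the clustering method Arnoldo refers to, which the interview does not specify. The function name and the example network are made up.

```python
import random


def sample_unconnected_users(friends, seed=0):
    """Greedily pick users such that no two picked users are direct friends.

    friends: dict mapping user -> set of that user's friends.
    The friendship relation is assumed to be symmetric.
    """
    rng = random.Random(seed)
    users = sorted(friends)
    rng.shuffle(users)  # visit users in random order

    chosen, blocked = [], set()
    for user in users:
        if user in blocked:
            continue  # already chosen, or a friend of someone chosen
        chosen.append(user)
        blocked.add(user)
        blocked.update(friends[user])  # exclude this user's friends from now on
    return chosen


# Hypothetical network: two friend groups joined by one link (carol - dave).
network = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "carol"},
    "carol": {"alice", "bob", "dave"},
    "dave": {"carol", "erin"},
    "erin": {"dave"},
}
picked = sample_unconnected_users(network)
print(picked)
```

Not being direct friends is a weak stand-in for statistical independence, since friends of friends can still think alike. It does show the basic trade: fewer units in the sample, but less duplicated information among them.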
Chris - But how is statistics responding to this? How are people trying to control for this new domain that we find ourselves in, this new regime of doing research?
Arnoldo - I think this is coming. It's not yet a daily thing that we do, no. It's difficult. I mean, Facebook, they're not so happy to give you all the data, right? So there is still much to do here to get this power out there. But let's move slightly and take an easier situation. Take, for example, a financial institution that has millions of credit cards out there, and they're checking, trying to find out if there is fraud. So suddenly they also have an enormous quantity of data, millions of clients, each of them with a little data, not very much, and we have to find out which of these credit cards are doing something strange. So now we statisticians, again in this situation, are trying to find surprises among millions of possibilities, millions of tests in some sense.
Again, [this is] a dramatic change from what was before. Before, you had your own nice hypothesis. You had your gene that you liked very much and you would spend all your life studying that gene. Now we have millions of genes and you're checking them all to find out if there's anything interesting. So the scientist doesn't have any hypothesis anymore. They say, "Oh! I have data. Here are my data. Find something useful please." So suddenly, the statistician is playing a completely different role. We have to find things, not just check if they're true or not.
Chris - That's the genome-wide association study where it's "changes in search of a disease" now, rather than "[we have] a disease, now let's find the genetic underpinnings"?
Arnoldo - That's right. So it's really changing the way we are building up statistics as a mathematical instrument, because statistics is mathematics, right? And so, we have to compute things that are called p-values, probabilities that a certain gene is in some sense important for you to describe your disease. And now, we are producing lists of candidates, lists of hypotheses that may be useful. If I give [my biologist colleague] my list of 100 genes that I found after a careful statistical study, I can't tell him that these 100 genes are really all important, but I can tell him, by using something called the false discovery rate, that 20% of them are wrong.
Chris - We don't know which 20.
Arnoldo - No, we don't know which 20, but that's alright. He can cope with that. I mean, he came with only data and I gave him quite an interesting answer.
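The idea of handing over a list of candidates with a controlled share of false leads can be sketched as follows. The interview names the false discovery rate but not a specific method; this sketch assumes the Benjamini-Hochberg step-up procedure, a common way to control it. The function name is made up and the p-values are invented for illustration, not taken from any study.

```python
def benjamini_hochberg(p_values, q=0.2):
    """Return the indices of hypotheses kept as discoveries at FDR level q.

    Benjamini-Hochberg step-up rule: sort the m p-values, find the largest
    rank k with p_(k) <= (k / m) * q, and keep the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            k_max = rank  # keep updating: the LARGEST passing rank wins

    return sorted(order[:k_max])


# Invented p-values for ten hypothetical genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
discoveries = benjamini_hochberg(p_values, q=0.2)
print(discoveries)
```

With q set to 0.2, the procedure aims for a list in which, on average, no more than about 20% of the entries are false leads. As in the exchange above, it says nothing about which entries those are.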
Chris - So what's the next step then? Will we see bioinformatics being taught very differently at university? When I was doing biometry at medical school, we were taught about the t-test, Student's t-test, and Gaussian distributions, and we were taught "this is the circumstance when you apply this test". It sounds to me that we're going to have to go back to kindergarten, statistically, for researchers basically to understand how to use the tools we have in this whole new research setting.
Arnoldo - You're right. Statistics is more difficult now and statistics is more present in the core of medical science. So before, it was something that came at the end of your laboratory work and when everything was done, you had your nice Excel file, your nice table, then you would do your statistics. Now statistics comes from the beginning as a core instrument of discovery and this is more difficult so we have to do partial differential equations and networks, and all this stuff that is a bit more difficult.
Chris - It's good for you though, isn't it, because it means you're not going to be out of a job?
Arnoldo - No.
Chris - It just means that researchers like me are going to find it much harder to...
Arnoldo - You have to find a statistician.
Chris - Well, we have to find a statistician, but also, when we try and read papers, picking out whether or not the numbers are meaningful and solid is going to be much more difficult. I guess most papers will have to float past someone like you for you to unpick them and decide whether or not what they're saying is really a valid conclusion.
Arnoldo - I think in this world where competition is very strong, where you need that little advantage to make progress, we're looking at second-order effects, at small things, 1%, 2%, and these things are much more difficult to find. In a different world, in industry for example, [they] also use statistics to make progress, of course. They need to find those small differences or small incremental advantages to beat the competition. And now, you need statistics at a much more refined level. You need to extract much information that is hidden in your data: interactions, dependencies and things that happen together and therefore give an advantage.