Mistaken identity: contaminated cell lines
Up to 20% of the papers that have used cultured cells to study basic biology and disease processes like cancer may have drawn incorrect conclusions, new research is showing. This is because the cell lines used by the authors of these publications may not have been what they thought they were. UCSD scientist Anita Bandrowski, speaking to Chris Smith, has been using a machine learning approach to sift through millions of publications. In the process she’s found potentially hundreds of thousands of papers that may now need looking at…
Anita - We want to make sure that what we're doing is what we think we're doing. But, sometimes, we can be fooled by bad reagents; and this is one of those things we can't really control. If the real cell line that we're using is one type of cancer and we think that it's another type of cancer, because that's what we've been told by the stock centre, it's a big deal, because we're actually testing the wrong cancer!
Chris - Does that seriously happen though, Anita?
Anita - Yes, absolutely! I mean, the HepG2 cell line is one of the most commonly used liver cell lines. And if you just use it as, you know, some kind of a liver cell, great, you can do that. But if you don't know that it's a hepatocarcinoma versus a hepatocytoma, you're actually looking at the wrong kind of cancer!
Chris - And how common are these sorts of mistakes? If you look at the field, how often do you think this is happening?
Anita - So, before our paper, there have been multiple estimates, as high as 60 to 70 per cent at certain stock centres, of cell lines that were actually contaminated or misidentified. Our estimate, I believe, is the most accurate estimate - it's certainly the largest sample size. And we looked at it and we said OK, sixteen point one per cent of the papers out there that use these kinds of cell lines are using one that happens to be either misidentified - so people think it's one cell line when it's actually another - or contaminated, either partially or fully.
Chris - So that means, if you're right, that between one in six and one in five papers to have used cell lines have potentially drawn invalid or incorrect conclusions, because the cells are not what the authors think they've used?
Anita - That is correct. And so Christopher Korch, one of the leaders in this field, has called for - you know - tens of thousands of papers to be looked at again. He was one of the reviewers on our paper and was very harsh, trying to ask us "what are you gonna do with this data? How are we going to address this field moving forward?"
Chris - So how did you do this? What was your method?
Anita - We actually used a method called "text mining". So we took the papers and we looked only at the method sections, which is where authors say "this is the recipe of what I did", including the reagents that are being used and the cell line. So we looked at all the sentences that said "I used a HepG2 cell line", or some other cell line, such as the HeLa cell line.
Chris - Was this a machine that was doing this for you? And how many papers did you feed into it?
Anita - So we fed in about 2 million papers. So this is everything in the entire open access corpus, including all of the eLife papers, including all of the PLoS papers; the algorithm was actually trained to look for sentences that had these different cell lines. And we simply matched them. We said, OK, here is a nice list of the things that could be potentially bad - potentially contaminated. And we matched them up, and it turns out that it's about sixteen point one per cent of papers that are potentially affected by this problem.
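Editor's note: the matching step described above can be illustrated with a minimal sketch. It is not the study's trained algorithm; a plain regular-expression match stands in for it. The watch-list, the paper IDs, the sample sentences and the function names are all invented for illustration. HepG2 is included only because it is the example named in the interview.

```python
import re

# Hypothetical watch-list: "HepG2" is the example from the interview;
# "CellX-1" is a made-up placeholder, not a real cell line.
WATCH_LIST = ["HepG2", "CellX-1"]


def name_pattern(name):
    """Build a case-insensitive regex that tolerates spaces or hyphens
    between characters, so 'HepG2', 'hep G2' and 'Hep-G2' all match."""
    chars = [re.escape(c) for c in name if c.isalnum()]
    body = r"[\s-]?".join(chars)
    # Lookarounds stop the name matching inside a longer alphanumeric token.
    return re.compile(r"(?<![A-Za-z0-9])" + body + r"(?![A-Za-z0-9])", re.IGNORECASE)


PATTERNS = {name: name_pattern(name) for name in WATCH_LIST}


def flag_paper(methods_text):
    """Return the watch-list cell lines mentioned in a methods section."""
    return sorted(name for name, pat in PATTERNS.items() if pat.search(methods_text))


def flagged_fraction(papers):
    """papers maps a paper ID to its methods text.
    Returns the flagged papers and the fraction of all papers flagged."""
    hits = {pid: flag_paper(text) for pid, text in papers.items()}
    flagged = {pid: names for pid, names in hits.items() if names}
    return flagged, len(flagged) / len(papers)


# Toy corpus, invented for illustration only.
papers = {
    "paper-A": "Cells were cultured in DMEM. We used the hep G2 cell line.",
    "paper-B": "Primary neurons were isolated from mouse cortex.",
    "paper-C": "CellX-1 cells were transfected as described.",
    "paper-D": "Samples were imaged on a confocal microscope.",
}
flagged, fraction = flagged_fraction(papers)
print(flagged, fraction)
```

On a real corpus the hard part is recognising cell line names reliably in free text, which is presumably why a trained algorithm was used rather than a fixed pattern like this one.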
Chris - Now did you - when you were testing your algorithm - did you go and scrutinise to see how sensitive and specific the filter was, to make sure that when you subjected these 2 million journal papers to this filter it wasn't mis-calling them?
Anita - Yes. So we actually tested it and found an accuracy rate of about 95 per cent.
Chris - This puts you in an interesting position, doesn't it, because you're sitting on a huge amount of data with which you can point the finger at certain papers and say "well, these might not be reliable". What are you going to do with this data?
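Editor's note: the interview does not say how the 95 per cent figure was measured. One common way to check a filter like this, assumed here, is to compare its calls against a hand-checked sample and compute accuracy, sensitivity and specificity. The labels below are invented toy values, not data from the study.

```python
def evaluate(predicted, actual):
    """Compare the filter's calls with hand-checked labels.
    Both arguments are lists of booleans: True means 'mentions a watch-list cell line'."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)          # correctly flagged
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly left alone
    fp = sum(1 for p, a in pairs if p and not a)      # flagged by mistake
    fn = sum(1 for p, a in pairs if not p and a)      # missed
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of true mentions that were caught
        "specificity": tn / (tn + fp),  # share of non-mentions correctly ignored
    }


# Invented toy labels for eight papers, for illustration only.
predicted = [True, True, False, False, True, False, False, True]
actual = [True, True, False, False, False, False, True, True]
metrics = evaluate(predicted, actual)
print(metrics)
```

A single accuracy figure can hide an imbalance between missed papers and wrongly flagged ones, which is why Chris's question asks about sensitivity and specificity separately.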
Anita - That is exactly what Christopher Korch asked us. And that's a really tricky question, because I don't have enough manpower - enough muscle - to kind of go through tens of thousands of papers that are generally problematic. Certainly I would not want to drag the reputations of any scientists through the mud unnecessarily. However, it should be one of those things that, at least moving forward, people should be able to look at this.
Chris - Presumably, you've got the raw data from your analysis and that will be available. So, if I were to come along with a paper I'm thinking of publishing and basing my results and interpretations on other people's findings, I could go and have a look on your list, presumably, to see if the papers I'm thinking of citing, or people I'm thinking of collaborating with, are on your watch-list?
Anita - Yeah. And actually one of the other things that we will look at in the near future is actually putting that out into an easily-digestible public place. So as people are thinking about looking up these cell lines, they should be able to find these references.
Chris - But how do you think the scientific community who might be on your list of "naughty cells" will react to this? Because there could be some really important papers in there, and some people with very big reputations and even bigger egos, who actually could have quite a lot of beef with what you're about to say about their work?
Anita - Well, that's what makes this very, very tricky, right? Because we can't tell very easily, just from knowing that somebody is using a cell, whether they're doing it appropriately or inappropriately, and therefore whether their conclusions are up to snuff. So this is a very tricky thing. We're going to proceed with a lot of caution. One of the other things that can be done, using text mining, is something called "sentiment analysis". So sentiment analysis is a method in text mining that allows you to start to understand things like "does the paper know that they are using a potentially contaminated cell line?" Christopher Korch's paper will actually be flagged with - you know - with "hey, there's a bunch of contaminated cell lines that appear here!" But, of course, his paper says "hey, these are contaminated!" So, if we can tell that a particular valence of the language is actually correct, then we should not bother those people, right? We should only bother the ones that probably don't know that their cell lines are contaminated, or maybe trim down this larger list of tens of thousands of papers to looking at sort of the most egregious potential problems...
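Editor's note: the triage idea above can be sketched very roughly. This is not real sentiment analysis and not the authors' method; a keyword cue stands in for a trained classifier. The cue words, the function names and the sample papers are assumptions made for illustration.

```python
import re

# Assumed cue words suggesting the authors already know about the problem.
AWARENESS_CUES = re.compile(r"\b(contaminat\w*|misidentif\w*)", re.IGNORECASE)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def seems_aware(text, cell_line):
    """True if any sentence mentions the cell line together with an awareness cue."""
    for sentence in split_sentences(text):
        if cell_line.lower() in sentence.lower() and AWARENESS_CUES.search(sentence):
            return True
    return False


def triage(flagged_papers):
    """flagged_papers maps a paper ID to (cell_line, text).
    Returns the IDs that still look worth a closer read."""
    return [
        pid
        for pid, (cell_line, text) in flagged_papers.items()
        if not seems_aware(text, cell_line)
    ]


# Invented examples: one paper uses the line without comment,
# the other is reporting on misidentification itself.
flagged_papers = {
    "paper-A": ("HepG2", "HepG2 cells were grown in DMEM. Viability was measured after 24 h."),
    "paper-B": ("HepG2", "Several stocks labelled HepG2 are known to be misidentified. We list them here."),
}
to_review = triage(flagged_papers)
print(to_review)
```

A keyword cue like this would be fooled by phrasing such as "free of contamination", so it shows only the shape of the filtering step, not a usable classifier.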