Big data just got bigger

The Square Kilometre Array shows just how big data can get, and the challenges that come with it.

17 November 2015

Interview with

Peter Quinn, University of Western Australia

Part of the show Big Data, Big Deal?

So far we've heard what big data is and how we can use it, but what happens when it gets REALLY big? Peter Quinn is the director of the International Centre for Radio Astronomy Research in Perth, Western Australia. He's part of a project that is developing a telescope so powerful that it will generate more data in a day than the entire world will pump out during 2020, when it switches on, as he explained to Connie Orbach...

Peter - The Square Kilometre Array is a radio telescope, so it looks at the sky in frequencies that correspond to radio waves, the sort of waves that radio stations or TV stations use. The thing about the SKA is that it's a transformational change in our ability to collect data from the sky. So if you look over the history of astronomy, say for the last 400 years, we started with the naked eye, and we went to Galileo's telescope, and we went to slightly larger telescopes, you know, space telescopes etc. Every time we have built one of those telescopes, about every 20 years or so, the telescope is kind of twice as good as the one we built before. With the SKA, we are building in one generation a telescope which is 10,000 times better. This is something that hasn't happened before, and it's got all the challenges that go along with things that haven't happened before.

Connie - And what is it you're looking for?

Peter - Essentially where the edge of the observable universe is. The universe began, we think, as a big bang, full of glowing gas. All hot glowing gases cool when they expand, and eventually they cool to the point where things start to condense out, just like water drops condensing out of steam. Those first things that condensed out in the universe were the first sources of light, the first things to shine. This happened probably less than a billion years after the beginning of the universe, but we don't quite know when. Once we find that point in cosmic history, we know what the seeds, if you like, were of all the other things that have happened in the universe. So it's an incredibly significant quest.

Connie - So how much is this going to cost?

Peter - We expect the first phase of construction to cost 650 million Euros, and that's contributed by the eleven member countries in the SKA consortium. The second phase of construction will probably cost several times that much again.

Connie - Wow. This is a huge telescope. I am assuming it's going to be collecting a lot of data. How much are we talking about?

Peter - On a typical day of operation, it'll produce a stream of science data, that's data that's ready to do science with, of around about one exabyte. Now an exabyte is a one with 18 zeros after it, so that's a billion gigabytes. It's basically the kind of data volume that we would expect the whole planet to produce in a year.
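As a rough check on the figures Peter quotes, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes decimal (SI) units, so one exabyte is 10^18 bytes, and the function names are made up for this example rather than taken from any SKA software.

```python
# Assumption: decimal (SI) units, matching "a one with 18 zeros after it".
EXABYTE = 10**18      # bytes
GIGABYTE = 10**9      # bytes
TERABYTE = 10**12     # bytes
SECONDS_PER_DAY = 24 * 60 * 60


def gigabytes_per_exabyte() -> int:
    """How many gigabytes fit in one exabyte."""
    return EXABYTE // GIGABYTE


def sustained_rate_tb_per_s(exabytes_per_day: float = 1.0) -> float:
    """Average rate, in terabytes per second, needed to move the given
    number of exabytes in one day (hypothetical helper for illustration)."""
    return exabytes_per_day * EXABYTE / SECONDS_PER_DAY / TERABYTE


if __name__ == "__main__":
    print(f"1 exabyte = {gigabytes_per_exabyte():,} gigabytes")
    print(f"1 exabyte/day is about {sustained_rate_tb_per_s():.1f} TB/s sustained")
```

Under those assumptions, one exabyte a day works out to a little under twelve terabytes every second, around the clock.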

Connie - What are the main challenges with this level of data?

Peter - The main challenges are really the technology: how the computer is put together, whether it has the right ratio of input/output to storage to processing. We need a particular ratio for the kinds of data that we're using. Whether it has the right algorithm inside it, so an algorithm that can scale from where we are today up to the SKA scale. Some algorithms will just break. Some systems, some pieces of mathematics that we have used for many, many years just won't work anymore when we go up to the SKA scale. So we more or less have to start again and figure out the maths, and figure out the algorithms to go and analyse this data, and that's already started. There's cost, and this is a real serious one, when we build these special computing environments. I mean, they are very big. We are talking about petaflops of processing power and petabytes of storage; they are very expensive things. Can they be afforded by research projects? So we need cost-effective computing, and maybe the way we do that is not by owning all the computing we would like to have, but by using some of the new technologies like cloud computing. Basically we just grab a piece of computing and use it for a little while, then give it back, if you like, because it's computing on demand. We don't have to own big computer centres and big supercomputers for a long period of time. And then people. I really worry a lot about the training of the people to actually do the data analysis for things like the SKA. It's a bit troubling because we are not training enough people to do this stuff, so it's a long-term future issue.

Connie - Once you've got all this data, where are you going to put it? Will you be able to store it all?

Peter - The answer is no. When we do radio astronomy we take different kinds of data at different points in the observing process, so when we first look at the sky, we collect what we call raw data. This is the stuff that's not yet an image. That raw data is probably between ten and a hundred times bigger than the exabyte of science data, so there's no way we could afford to do that storage, and so what we do is we try to keep the data on the move. We take it straight from the telescope and pipe it right into the back of the supercomputer for processing, and that processing reduces the data volume, a step called data reduction. So we can't store that raw data, but we certainly want to store the final scientific images that the astronomers are going to use. Astronomers tend to keep data for a very long time. If you look in observatories around the world you can probably find photographic images of the sky that were taken more than a hundred years ago. The reason why we keep those things is that sometimes we'll discover a new asteroid in the sky, or a new planet, and we can go back and look at where it was a hundred years ago and figure out its orbit. So this historical data is very important, and keeping it there and keeping it available for people to do other things with it, other than the original purpose of the data, multiplies up its usefulness. That data curation problem for data sets of the size of the SKA is a very serious one, and we're going to have to pool our resources globally to figure out how to do that.
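The idea of keeping data "on the move" can be illustrated with a toy sketch in Python. It is not how SKA processing works; real radio astronomy data reduction is far more involved. The function names, the random test signal and the simple averaging step are all assumptions made for this example. The point it shows is that raw chunks are consumed one at a time and only a small running result is kept, so the full raw stream is never stored.

```python
import random
from typing import Iterable, Iterator, List


def raw_chunks(n_chunks: int, chunk_size: int, seed: int = 0) -> Iterator[List[float]]:
    """Hypothetical stand-in for the telescope: yields raw samples one
    chunk at a time, so the whole stream never sits in memory."""
    rng = random.Random(seed)
    for _ in range(n_chunks):
        yield [rng.gauss(1.0, 0.5) for _ in range(chunk_size)]


def reduce_stream(chunks: Iterable[List[float]]) -> List[float]:
    """Average all chunks element by element. Only the running sum and
    the current chunk are held at any moment (a toy 'data reduction')."""
    total: List[float] = []
    count = 0
    for chunk in chunks:
        if not total:
            total = [0.0] * len(chunk)
        for i, value in enumerate(chunk):
            total[i] += value
        count += 1
    if count == 0:
        return []
    return [t / count for t in total]


if __name__ == "__main__":
    reduced = reduce_stream(raw_chunks(n_chunks=100, chunk_size=8))
    print(f"100 raw chunks of 8 values reduced to {len(reduced)} values")
```

In this toy case a hundred chunks shrink to the size of one, a hundredfold reduction chosen purely for illustration.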