IDC Numbers (Sometimes) Make Me Twitchy

 

A common refrain in many modern storage systems research papers is that data growth rates will cause (or are already causing) problems, and that we desperately need some new research/system/practice/goat-sacrifice to handle the oncoming data tsunami. Usually there is a perfunctory citation of the yearly IDC numbers about the looming data growth tsunami apocalypse firestorm, and how data growth is vastly outstripping our ability to store it.

Now, before I really start digging in: I think we absolutely are already stressing some storage systems, and will continue to. Similarly, Data Growth Is A Thing. That’s not really in question.

Anyway, what I do question, at least when I see another reflexive citation of the IDC’s numbers to justify the latest storage systems research project, is whether those numbers are actually meaningful to storage researchers as a rationale for ever-increasing scale.

The reason I question this is that the per-unit cost of storage is continuing its precipitous fall. If data growth were so massively outstripping available storage, we wouldn’t expect prices to be dropping as rapidly as they are. In fact, we’ve had something of a glut of NAND-based storage this last year, further eroding prices. This suggests the supposedly explosive growth in storage demand isn’t actually there, at least not at the level the marketing folks would like us to believe.

This, to me, suggests a few different, non-exclusive possibilities.

  1. The data growth numbers are inflated. This is sort of admitted even by the IDC itself, as they don’t differentiate where a piece of data is in its lifetime. So a 1 megabyte video of your cat getting brain freeze, watched 1 million times, counts as one terabyte of data, even if none of it is ever persistently stored outside of its CDN source (the arithmetic is sketched after this list).
  2. The data aren’t actually valuable. As a silly example, I had a flip phone up until 2014 (yeah, yeah, I’m a bit late to the game). It had the annoying feature of letting the camera be triggered by an external button, which means I produced about a bajillion pictures of the inside of my pocket. Data generated? Yes. Useful data anyone gives a rip about? Absolutely not.
  3. We can’t actually capture most of the data, nor would we want to. Let’s be honest, most data isn’t actually that interesting or useful. Frankly, most of it is noise that can be filtered trivially at its source, e.g. video cameras in empty rooms. Without that filtering, even a small number of cameras running continuously generates a staggering amount of data. Personal experience at my day job shows that a dozen cameras at a modest resolution, with good compression, can generate upwards of half a terabyte per day, pointed at an entirely empty room with no activity (see the sketch after this list). To say nothing of the bandwidth we take up slurping up that junk data.
  4. Even if we could get it all, it’s damn hard to make use of the data, so why bother with the expense? This, to me, is where we really need to be thinking. Storage researchers (well, systems researchers in general) tend to be overly infatuated with mechanisms and forget that we’re ultimately supporting players. We make infrastructure. We need to not just serve our data quickly and efficiently, we need to make it easy for that data to be useful. It does no good to move garbage around quickly when nobody wants the, erm, garbage.
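
For the curious, here is the back-of-envelope arithmetic behind items 1 and 3, as a quick sketch. The figures are my own rough assumptions from above (1 MB video, a million views, a dozen cameras, half a terabyte per day), not anything pulled from the IDC.

```python
# Back-of-envelope arithmetic for items 1 and 3 above.
# All inputs are rough assumptions from this post, not IDC figures.

MB = 10**6   # decimal megabyte, in bytes
TB = 10**12  # decimal terabyte, in bytes

# Item 1: a 1 MB cat video watched 1 million times.
video_size = 1 * MB
views = 1_000_000
counted_as_created = video_size * views  # what a "data created" tally counts
actually_stored = video_size             # what persistently lives at the CDN source
print(f"Counted: {counted_as_created / TB:.1f} TB, stored: {actually_stored / MB:.1f} MB")

# Item 3: a dozen cameras producing roughly half a terabyte per day
# of footage of an empty room.
cameras = 12
total_per_day = 0.5 * TB
per_camera_per_day = total_per_day / cameras
per_camera_mbps = per_camera_per_day * 8 / 86_400 / 10**6  # megabits per second
print(f"Per camera: {per_camera_per_day / 10**9:.1f} GB/day, ~{per_camera_mbps:.1f} Mbit/s")
```

In other words, a few megabits per second per camera of literally nothing, around the clock, which is exactly the kind of "data created" that never needs to be stored anywhere.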

Now, with all this kvetching, what would my actual suggestion be? Realistically, we’re not going to stop citing the IDC to justify our work. But I would like to see some more nuance beyond “Uhhhh, more data means more storage!” There’s no shortage of valuable storage research problems to work on, but let’s stop justifying them with sketchy numbers.

 

<steps down from soapbox>

 


Author: Ian F. Adams

Scientist. Woodworker. First your pants, then your shoes.
