Ubiquitous computing enabled by massive data centres already influences our daily lives in ways we take for granted. But these systems sometimes fail with serious consequences. Professor Bianca Schroeder is exploring how to make these data centres more reliable and efficient.

While Hollywood likes to portray the relentless takeover of humanity by pitiless machines, computers are in fact a little more human than we may have thought. More precisely, they err. Such fallibility could be considered positively endearing if the consequences weren’t quite so alarming—complex computer systems underpin everything from our infrastructure and economy to our health-care system.

Using failure to fix the problem

Professor Bianca Schroeder studies these systems and how to make them more reliable and energy efficient. “When I started this work I really got the sense that people are working blindly,” she says. “They’re trying to fix a problem that we really don’t understand. So my goal was to get real data from large-scale computers to understand how computers fail so we can then figure out how to fix the problem.”

Schroeder released a research paper based on a two-year study of data from Google’s vast fleet of servers. It found that computer memory error rates are dramatically higher than previously thought. The memory modules she studied experienced correctable error rates of between 25,000 and 75,000 failures per billion hours. To put that into context, one previous oft-cited study had pegged the number in the 200 to 5,000 range.
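To give a feel for what "failures per billion hours" means in practice, the rates quoted above can be converted into expected errors per year of continuous operation. The following is a minimal sketch, not taken from the study: the function name `errors_per_year` is illustrative, and it assumes each figure applies to a single unit running around the clock (the article does not say what unit of memory the rates are normalised to).

```python
# Assumption: one unit running continuously for a non-leap year.
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def errors_per_year(failures_per_billion_hours: float) -> float:
    """Convert a rate in failures per billion hours into expected failures per year."""
    return failures_per_billion_hours * HOURS_PER_YEAR / 1e9


# Figures quoted in the article; the labels are illustrative.
rates = {
    "Schroeder study, low end": 25_000,
    "Schroeder study, high end": 75_000,
    "Earlier study, low end": 200,
    "Earlier study, high end": 5_000,
}

for label, rate in rates.items():
    print(f"{label}: {errors_per_year(rate):.4f} expected errors per unit per year")
```

Under these assumptions, the study's range works out to roughly 0.2 to 0.7 expected correctable errors per unit per year, against roughly 0.002 to 0.04 for the earlier estimate.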

Another surprise was that errors were more often caused by hardware problems than by software problems, contrary to much previous thinking.

Why then, one may ask, does Google seem to work just fine? “For the most part, a lot of work is going on at a lot of different levels of the computer system to hide those errors… but it doesn’t always work,” says Schroeder. “Gmail went down for some time, for example.”

Making industry take notice

The industry has taken these findings to heart, especially smaller companies that don’t boast the fail-safe infrastructure of Google. Schroeder says many companies now buy only machines with memory systems that feature strong error-correcting codes. She also says many are putting in extra effort to avoid errors in the first place, such as following the report’s recommendation to update hardware more frequently.

After these findings, Schroeder made waves a second time with another paper that challenged industry assumptions. This time, she showed that data centres could operate at higher temperatures without significantly compromising performance. Most data centres typically operate at between 20 °C and 22 °C, with some kept as cold as 13 °C, mostly in response to very conservative guidelines from manufacturers. The study, however, found that higher temperatures didn’t affect reliability nearly as much as previously thought and that high usage rates were a much more likely cause of errors.

Raising the thermostat a few degrees is a big deal because the world’s data centres are incredible energy hogs, accounting for an estimated one percent of global energy use. Put another way, they use up the entire output of 17 power plants each generating 1,000 MW, and they collectively emit as much carbon dioxide as all of Argentina. More than a third, and sometimes up to half, of this giant electricity bill is devoted to air conditioning. It is estimated that a one-degree increase in temperature alone could save two to five percent of the energy the centres consume.
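The scale of that saving can be worked out from the figures above. This is a back-of-the-envelope sketch only: the function names are illustrative, and it assumes the 17 plants run continuously at their full 1,000 MW and that the two to five percent saving applies to total data-centre consumption.

```python
# Figures quoted in the article.
PLANTS = 17
MW_PER_PLANT = 1_000
SAVING_LOW, SAVING_HIGH = 0.02, 0.05  # saving per one-degree increase


def total_demand_mw(plants: int = PLANTS, mw_per_plant: int = MW_PER_PLANT) -> int:
    """Total continuous power demand implied by the power-plant comparison."""
    return plants * mw_per_plant


def savings_range_mw(total_mw: float,
                     low: float = SAVING_LOW,
                     high: float = SAVING_HIGH) -> tuple[float, float]:
    """Low and high estimates of power saved by a one-degree increase."""
    return total_mw * low, total_mw * high


total = total_demand_mw()
low, high = savings_range_mw(total)
print(f"Implied total demand: {total:,} MW")
print(f"Saving from one degree: {low:,.0f} to {high:,.0f} MW")
```

On these assumptions a single degree is worth somewhere between 340 and 850 MW, which at the upper end approaches the output of one of the 17 plants.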

Are we ready for a data-driven future?

Schroeder says the significance of these findings will only increase in the years ahead as the demand for data escalates. Indeed, the amount of data that humanity generates doubles every two years. By 2020, experts predict we will reach 44 zettabytes, or 44 trillion gigabytes, as personal smart devices proliferate and everyday objects are embedded with computational devices sending and receiving data through the so-called Internet of Things.

Modern data centres, which already consume extremely high levels of energy and are prone to hardware failure, will have to shoulder this data deluge. Through her research, Professor Schroeder hopes to make the massive data systems we increasingly rely on more stable, less prone to error, and more environmentally responsible.