Google, Seagate AI Identifies Problem Hard Drives Before They Fail
Google and Seagate have announced they’re building a machine learning model intended to predict when a hard drive is likely to die. The question of when a drive will fail, which we’ve all asked at one time or another, is surprisingly hard to answer, even for a company like Google, with access to reams of data about the behavior of millions of hard drives in its data centers over the past 20 years.
The Google blog post announcing this effort doesn’t do the best job illustrating the complexity of the task at hand. There’s a 2016 blog post from Backblaze discussing the SMART attribute system for hard drives that offers some valuable additional information on the scope of this problem.
Back in 2016, Backblaze tracked five SMART attributes for predicting hard drive failure: SMART 5, 187, 188, 197, and 198, which the company had found correlated well with drive failure. Of the HDDs that failed over the relevant period, 76.7 percent had at least one SMART failure in these five attributes. Only 4.2 percent of operational hard drives reported a failure in one or more of them.
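To make the five-attribute check concrete, here is a minimal Python sketch. It assumes that "reporting a failure" means a raw value above zero for that attribute, which is an assumption on our part and not something spelled out above. The function names and the sample readings are hypothetical.

```python
# The five SMART attributes Backblaze tracked in 2016, with their common names.
WATCHED_ATTRIBUTES = {
    5: "Reallocated Sectors Count",
    187: "Reported Uncorrectable Errors",
    188: "Command Timeout",
    197: "Current Pending Sector Count",
    198: "Offline Uncorrectable Sector Count",
}


def flagged_attributes(raw_values):
    """Return the watched attribute IDs whose raw value is above zero.

    ASSUMPTION: a raw value > 0 counts as a "SMART failure" for that
    attribute. A missing attribute is treated as zero.
    """
    return [attr for attr in WATCHED_ATTRIBUTES if raw_values.get(attr, 0) > 0]


def is_flagged(raw_values):
    """True if at least one of the five watched attributes is above zero."""
    return bool(flagged_attributes(raw_values))


# Hypothetical readings, invented for illustration only.
healthy = {5: 0, 187: 0, 188: 0, 197: 0, 198: 0}
suspect = {5: 0, 187: 3, 188: 0, 197: 8, 198: 8}

print(is_flagged(healthy))          # False
print(flagged_attributes(suspect))  # [187, 197, 198]
```

A rule like this is cheap to run, but as the percentages above show, it misses roughly a quarter of failing drives and flags a small share of healthy ones.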
Attempts to find strong correlations between the five attributes, however, proved tricky.
Backblaze’s chart shows the chance that a failure in any given SMART attribute corresponds to a failure in another of the five attributes. Only two attributes correlate well: SMART 197 and SMART 198. SMART 187 and SMART 188 have almost no correlation at all.
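One plausible way to build a table like this is to ask, for each pair of attributes, what share of drives flagged on the first are also flagged on the second. The Python sketch below does that. It is our reading of the idea, not Backblaze's published method, and the drive data is made up so that it mirrors the pattern described above.

```python
from itertools import permutations

WATCHED = (5, 187, 188, 197, 198)


def co_occurrence(drives):
    """Map each ordered pair (a, b) to the fraction of drives flagged on
    attribute a that are also flagged on attribute b.

    `drives` is a list of sets, each holding the attribute IDs flagged
    on one drive. Pairs where no drive is flagged on `a` are skipped.
    """
    table = {}
    for a, b in permutations(WATCHED, 2):
        with_a = [flags for flags in drives if a in flags]
        if with_a:
            table[(a, b)] = sum(b in flags for flags in with_a) / len(with_a)
    return table


# Hypothetical fleet: each set lists the attributes flagged on one drive.
drives = [{197, 198}, {197, 198}, {187}, {188}, {5, 197, 198}, {187, 5}]

table = co_occurrence(drives)
print(table[(197, 198)])  # 1.0
print(table[(187, 188)])  # 0.0
```

Note that the table is not symmetric in general: the share of 197-flagged drives that also show 198 need not equal the reverse.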
One thing Backblaze notes in its report, however, is that the error patterns are different if you examine drives where errors accrued slowly over time versus drives where errors appeared suddenly. Backblaze’s overall discussion makes it clear that juggling even a modest handful of SMART attributes was difficult back in 2016.
Today, Google and Seagate collect an unspecified amount of SMART data, combined with data from host systems made up of multiple drives, HDD logs (OVD and FARM), and manufacturing data from the drives, including model and batch numbers. While we can’t say for certain, it looks as though Google and Seagate are collecting far more information than Backblaze was working with five years ago.
According to Google, it evaluated two different approaches: an AutoML Tables classifier and a custom “deep Transformer-based” model. The AutoML model actually worked better, with a precision of 98 percent and a recall of 35 percent.
Here’s what that means: Imagine running a Google search for a given topic. Precision measures how many of the links the search engine coughs up actually matter for the purposes of your search. Recall, in contrast, measures how many relevant links were retrieved out of all the relevant documents that potentially exist. Google’s documentation suggests thinking of the difference this way:
Precision: “What proportion of positive identifications was actually correct?” (98 percent, in this case).
Recall: “What proportion of actual positives was identified correctly?” (35 percent, in this case).
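In terms of raw counts, the two definitions work out as follows. This is a minimal Python sketch. The drive counts are hypothetical, chosen only so that they reproduce the 98 percent and 35 percent figures. Google has not published the underlying numbers.

```python
def precision(true_pos, false_pos):
    """Share of drives flagged as failing that really did fail."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos, false_neg):
    """Share of drives that really failed that the model flagged."""
    return true_pos / (true_pos + false_neg)


# Hypothetical example: 1,000 drives fail. The model flags 357 drives,
# and 350 of those flagged drives are among the ones that fail.
true_pos = 350   # flagged and failed
false_pos = 7    # flagged but stayed healthy
false_neg = 650  # failed without being flagged

print(round(precision(true_pos, false_pos), 2))  # 0.98
print(round(recall(true_pos, false_neg), 2))     # 0.35
```

Put another way, when a model like this raises a flag it is almost always right, but about two out of three failing drives never get flagged.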
There is a tradeoff between precision and recall. The two are sometimes combined into a metric known as an F-score, which summarizes both in a single number. We don’t know what kind of F-score weights Google might apply, but an F1 score would be the harmonic mean of the precision and the recall. If we punch Google’s claimed values in, the AI it built scores 0.5158, where a 1.0 indicates perfect precision and recall, and a 0 indicates you have a real problem with your graduate thesis. That looks middling, but an F1 of 0.5 is not the same thing as a coin flip: the score a random guesser would earn depends on how rare failures are, and it falls well below 0.5 when failures are uncommon. The low score here comes almost entirely from the 35 percent recall. The default model with 20-25 percent recall scores lower still, at 0.3984 if we plug in 25 percent recall alongside the same 98 percent precision.
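The F-score arithmetic is easy to check. The Python sketch below uses the standard F-beta formula, where beta = 1 gives the F1 harmonic mean, beta below 1 leans toward precision, and beta above 1 leans toward recall. The helper name `f_beta` is our own, and the beta values other than 1 are shown only to illustrate how weighting changes the result. Nothing here reflects a weighting Google has said it uses.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score; beta=1.0 is the harmonic mean (F1)."""
    b2 = beta * beta
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# F1 from the figures quoted above.
print(round(f_beta(0.98, 0.35), 4))  # 0.5158
print(round(f_beta(0.98, 0.25), 4))  # 0.3984

# The same 98/35 figures under different weights.
print(round(f_beta(0.98, 0.35, beta=0.5), 4))  # 0.7206 (favors precision)
print(round(f_beta(0.98, 0.35, beta=2.0), 4))  # 0.4016 (favors recall)
```

The spread between 0.4016 and 0.7206 shows how much the headline number depends on whether missed failures or false alarms are considered more costly.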
Google’s blog post implies that the company’s results were better than random chance, however. The company writes that the new AI model allowed it to identify the top reasons behind drive failures, “enabling ground teams to take proactive actions to reduce failures in operations before they happened.”
Google doesn’t provide any additional contextual information on what recall rate it wants, or if 35 percent is sufficient. It ends with: “We already have plans to expand the system to support all Seagate drives—and we can’t wait to see how this will benefit our OEMs and our customers!”
Indeed. Anything that can help manufacturers detect hard drive failures before they happen is going to be a popular product.
Credit: Patrick Lindenberg on Unsplash