Smarter than smart

 


How long will your hard drives last? New reports suggest that estimates aren't reliable and that life cycles might not be as long as you think.

It's a miracle that hard drives work at all. The read-write heads fly only nanometers above the disk surface, which spins at 7,200 revolutions per minute or faster. If the heads fly a little too high, the magnetic domains become inaccessible and the data is unreadable. If they fly just a little too low, it's crash city.

Despite this precarious state of affairs, agencies continue to entrust their most valuable data to hard drives. It's a breathtaking leap of faith.

And now this faith is being tested as never before, through studies that show hard drives don't last nearly as long as previously imagined.

Hundreds of millions of hard drives are already in use in agency data centers, and millions more are sold and installed every year. These hard drives handle current data for applications and backed-up data for archiving. Hard drive failure can mean not only temporary data unavailability but also permanent data loss.

With such large numbers of hard drives deployed, managing them consumes a significant part of information technology budgets and effort. Agencies need to keep the data stored and flowing. This means anticipating hard drive failure, moving data off risky drives and replacing failing drives before they give up the ghost completely.

And anticipating these failures may be trickier than previously assumed.

Luckily, it's not necessary to rely completely on faith to make such predictions. We can analyze hard drive failures the same way we do for any other electromechanical device.

There are basically two classes of failures: predictable and unpredictable. Unpredictable failures, such as circuits burning out, occur suddenly and randomly. There's no warning or advance notice, so there's no strategy for anticipating them. All you can do is mop up after one occurs.

The situation is more hopeful for predictable failures, which include most mechanical failures. Typically, parts age and wear out gradually.
As a drive's performance degrades, or simply changes, over time, we can anticipate the ultimate failure of the drive well before it happens.

A landmark study by Seagate in 1999 indicated that some 60 percent of drive failures are predictable. Any handle we can get on such failures will clearly be helpful in managing vast numbers of hard drives.

Fortunately, we have technologies in place to help monitor the condition of hard drives, make repairs and adjustments, and predict the onset of failure. Self-Monitoring, Analysis and Reporting Technology (SMART) reports on hard drive conditions, including the driver, disk heads, surface state and electronics (see sidebar, 'Hard drive smarts').

The goal of SMART is to warn systems administrators of impending drive failure while there's still time to take preventive action, such as copying threatened data to another storage device.

All the major hard drive manufacturers subscribe to SMART, and system and operating system vendors incorporate various combinations of SMART attributes.

The original version of SMART functioned by monitoring certain online hard drive attributes. The next version included off-line attributes, which gave more information and thus improved failure prediction. The latest version, SMART III, adds the ability to detect and repair sector errors, using more off-line data acquired during periods of hard drive inactivity.

Industry estimates suggest that SMART can predict about 30 percent of hard drive failures. However, two recent independent studies indicate that failure prediction is not so simple, and SMART may not be as helpful as once thought.

The USENIX Conference on File and Storage Technologies in February included two studies of hard drive failures that reached similar conclusions.

The first, 'Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?' was written by Bianca Schroeder and Garth Gibson of Carnegie Mellon University's computer science department.
These researchers looked at about 100,000 hard drives, some over their entire five-year lifetime.

The hard drives in this study had nominal mean times to failure (MTTF) ranging from 1 million hours to 1.5 million hours, which is typical for the industry.

Those levels suggest a failure rate of at most 0.88 percent per year: a year has 8,760 hours, and 8,760 divided by an MTTF of 1 million hours is about 0.88 percent. However, the researchers found that in the field, annual replacement rates (ARR) routinely exceeded 1 percent, with 2 percent to 4 percent common, and some were as high as 13 percent. The weighted average ARR was 3.4 times higher than 0.88 percent. They concluded that such failure rates were not what we should expect based on the rated MTTFs.

The researchers used ARR rather than MTTF because in actual practice, administrators replace drives that may not have failed yet. 'Drive replacements include drives that have not failed, commonly referred to as no-trouble-found drives, which can make up as much as 40 percent of that replacement population,' said David Szabados, a spokesman at Seagate Technology.

Zeroing in on specific age ranges of hard drives, the researchers found that for older systems (five to eight years of age), MTTFs underestimated actual replacement rates by a factor of as much as 30. Even for young drives (less than three years old), the difference was as large as a factor of 6.

Contrary to industry expectations, they observed that replacement rates grew steadily with age. Clearly, all these results have significance for planning hard drive acquisitions and managing hard drive populations.

What is the lesson here? First, hard drives may require replacement more frequently than stated MTTFs suggest; in fact, several times more frequently. In addition, even after a shakedown year establishes that a drive isn't a dud, you cannot trust that the drive will be good for the next three to four years, as is commonly assumed.
Because drive failure increases with time, older drives are more suspect.

The other study from the USENIX conference was 'Failure Trends in a Large Disk Drive Population' by Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso, all researchers at Google, an enterprise with considerable experience with hard drives.

In many ways, Google is in the perfect position to study hard drives, because it runs so many of them in its data centers.

Like the Carnegie Mellon work, this study involved more than 100,000 hard drives. The researchers observed annual failure rates (AFRs) from 1.7 percent for drives in their first year to more than 8.6 percent for those at least three years old. These results closely match the Carnegie Mellon outcomes.

But the Google study went one step further. It also included an analysis of SMART parameters and their correlation to hard drive failure.

Although other studies, and common sense, suggest that higher temperature or activity levels would contribute to hard drive failure, this study found little correlation. However, certain SMART attributes were found to have a large impact on failure probability. For example, after their first scan error occurs, hard drives are 39 times more likely to fail within 60 days than drives with no such errors. Similarly, first errors in reallocations, off-line reallocations and probational counts also strongly correlate with higher failure probability.

'Methods of predicting hard drive longevity are mostly age-related, with modifiers such as the four SMART data we found,' said Pinheiro, a software engineer at Google.

Yet perhaps the most remarkable finding of this study was the lack of impact of SMART attributes on most failures. Out of all the failed drives, more than 56 percent had no count in any of those four best SMART predictors. So even models based on those good predictors can never anticipate more than half of drive failures.
Even including all SMART attributes, more than 36 percent of all failed drives had zero counts. In other words, many failed drives gave no advance warning that they were failing.

The researchers concluded that 'given the lack of occurrence of predictive SMART signals on a large fraction of failed drives, it is unlikely that an accurate predictive failure model can be built based on these signals alone.' Better predictions will need to use more information than SMART provides.

What does this mean? Try not to rely too heavily on SMART. The first occurrence of certain SMART events (namely, scan errors, reallocation counts, off-line reallocation counts and probational counts) can be an important clue that hard drive failure is imminent. However, most hard drive failures will occur with no meaningful warning whatsoever. 'The accuracy of longevity prediction is very poor for any single drive,' Pinheiro said.

Industry experts have different perspectives on the Google and Carnegie Mellon studies. For example, Aloke Guha, chief technology officer of storage solution provider COPAN Systems, points out that the results of both papers are based on drives that are always spinning and thus do not provide insights into how AFRs depend on actual power-on hours (POHs).

'It's no surprise that drives meant to be used in low-duty-cycle or low POH are exhibiting high failure rates when used in transactional storage systems,' Guha said. He suggests that drives using Massive Array of Idle Disks (MAID) technologies experience lower AFRs, starting from about 0.22 percent. COPAN is the exclusive provider of MAID systems.

Similarly, David Lethe, president of diagnostic software vendor SANtools, suggests that proper burn-in testing can weed out bad disks early. 'Some storage and subsystem manufacturers invest millions of dollars developing appropriate testing methodology and algorithms,' Lethe said.
He also recommends monitoring SMART attributes continuously, not just on start-up, as many BIOSes do.

The lessons from both studies, and from industry experts, translate readily into strategies for acquiring and managing hard drives, especially for large agency installations. First, administrators should select the proper drive for the task, not the cheapest drive. 'It costs more, considering downtime, replacement costs, labor and data lost, to buy consumer-class disks and use them for demanding tasks,' Lethe said. 'You get what you pay for.'

Because stated hard drive MTTFs may not accurately reflect the replacement rates of hard drives, administrators may need to requisition more drives than they normally would. Administrators should investigate the wording of current supply arrangements: Are they tied to specific numbers of hard drives or levels of actual storage?

For large-scale hard drive use, agencies may want to formulate burn-in strategies to eliminate dud drives. Because burn-in testing is complex, it may be simpler to work with vendors on their burn-in procedures.

In addition, administrators should not expect that hard drive failures will peak in the newest and oldest drives, with ages in the middle exhibiting lower failure rates (the so-called bathtub curve). Instead, they should anticipate that failure rates will increase with age and schedule spare drives in advance to replace failing drives.

Because several SMART attributes seem to have some predictive power, administrators should ensure that they are monitoring those attributes. This may require configuring BIOS, operating system, Simple Network Management Protocol reporting, network and management software to pass the necessary indicators along. SMART monitoring should be continuous.

Administrators should take warnings seriously. Hard drives that give a clue to a possible failure should be retired and their data transferred to other devices.
'Strategies for dealing with hard drive failure include moving data, replicating data, masking failures by diverting accesses' and using a Redundant Array of Independent Disks (RAID), Pinheiro said.

Although most hard drive failures may have no prior warning from SMART, administrators should be on the lookout for trends in their population of hard drives. 'It is important to not only understand the kind of drive being used but the system or environment in which it was placed and its workload,' Szabados said. It may be that your hard drive failures involve the characteristics of your local installation: power, workload, temperature or other conditions.

Regardless of the reason for hard drive failure, administrators should be prepared to restore data from backup and resume operations. No predictive mechanism is ever going to replace the insurance these simple systems can provide.

Hard drive research will continue, and it will be relevant to systems administrators. Hard drive manufacturers may respond to such research with new predictive attributes, different estimates of MTTF or new strategies for dealing with hard drive failure.

Regardless of where the industry goes, agencies should evolve their own strategies for dealing with hard drive failure, and keep the faith.
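
For administrators who want to turn the studies' numbers into a rough spare-drive estimate, the arithmetic can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the 10,000-drive fleet are hypothetical, the conversion from MTTF to an annual rate uses the simple hours-per-year approximation, and "3.4 times higher" is read here as 3.4 times the 0.88 percent figure.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def nominal_annual_failure_rate(mttf_hours: float) -> float:
    """Fraction of drives expected to fail per year, given a rated MTTF.

    Uses the simple approximation (hours per year) / MTTF, which assumes
    a constant failure rate over the drive's life.
    """
    if mttf_hours <= 0:
        raise ValueError("MTTF must be positive")
    return HOURS_PER_YEAR / mttf_hours


def expected_replacements(drive_count: int, annual_rate: float) -> float:
    """Expected number of drives replaced in one year at the given rate."""
    return drive_count * annual_rate


if __name__ == "__main__":
    fleet_size = 10_000  # hypothetical fleet, not a figure from either study

    # Rate implied by a 1-million-hour MTTF (about 0.88 percent per year).
    rated = nominal_annual_failure_rate(1_000_000)

    # Assumption: the weighted average replacement rate is 3.4 x the rated figure.
    observed = rated * 3.4

    for label, rate in (("rated", rated), ("observed", observed)):
        drives = expected_replacements(fleet_size, rate)
        print(f"{label}: {rate:.2%} per year, about {drives:.0f} drives")
```

For this hypothetical fleet, the rated MTTF implies about 88 replacements a year, while the observed average implies about 298, which is the gap a requisition plan would need to cover.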

For further research:

'Get S.M.A.R.T. for Reliability' (Seagate, 1999)

This is the paper that first defined the benefits of hard drive SMART technology, the industry standard that could be used to predict hard drive failures. www.seagate.com/docs/pdf/whitepaper/enhanced_smart.pdf

'Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?'

In this paper, two Carnegie Mellon University researchers analyzed how often disk drives were replaced in large data centers. They found that drives were replaced more often than predicted by manufacturers' estimates for how long the drives should last. www.cs.cmu.edu/~bianca/fast07.pdf

'Failure Trends in a Large Disk Drive Population'

Google researchers studied hundreds of thousands of disks on their own server farms and found the SMART technology did little to help predict failures. labs.google.com/papers/disk_failures.pdf

Hard drive smarts

A feature of new hard drives, the Self-Monitoring, Analysis and Reporting Technology (SMART) reports on hard drive conditions, including the driver, disk heads, surface state and electronics. All the major hard drive manufacturers subscribe to the SMART system, and system and operating system vendors incorporate various combinations of SMART attributes.

SMART keeps track of dozens of hard drive attributes. Some critical SMART attributes are:


  • Read error rate: The rate of hardware errors when reading data from a disk surface.
  • Reallocated sectors count: The number of sectors reallocated, meaning their data was transferred to another sector after a read/write/verification error.
  • Reallocation event count: The number of attempts to transfer data from reallocated sectors.
  • Current pending sector count: The number of unstable sectors waiting to be remapped.
  • Uncorrectable sector count: The number of uncorrectable errors when reading/writing sectors.
  • Disk shift: The distance the disk has shifted relative to the spindle.

SMART also reports various measures of temperature and the flying height of the heads above the disk.
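
As a rough illustration of how the count-type attributes above might be watched across a fleet, the following Python sketch flags any drive that reports a nonzero count. It assumes the attribute values have already been collected by some monitoring tool into a dictionary; the key names and the function name are hypothetical, not part of any SMART standard or vendor interface.

```python
# Hypothetical keys mirroring the count-type attributes listed in the sidebar.
WATCHED_COUNTS = (
    "reallocated_sectors_count",
    "reallocation_event_count",
    "current_pending_sector_count",
    "uncorrectable_sector_count",
)


def drives_with_warnings(fleet):
    """Map each flagged drive ID to the watched attributes with nonzero counts.

    `fleet` maps a drive ID to a dict of attribute name -> raw count.
    Missing attributes are treated as zero.
    """
    flagged = {}
    for drive_id, attributes in fleet.items():
        nonzero = [name for name in WATCHED_COUNTS if attributes.get(name, 0) > 0]
        if nonzero:
            flagged[drive_id] = nonzero
    return flagged


if __name__ == "__main__":
    # Made-up sample readings for illustration only.
    sample_fleet = {
        "drive-a": {"reallocated_sectors_count": 0, "current_pending_sector_count": 0},
        "drive-b": {"reallocated_sectors_count": 3, "uncorrectable_sector_count": 1},
    }
    for drive_id, names in drives_with_warnings(sample_fleet).items():
        print(drive_id, "->", ", ".join(names))
```

An empty result should not be read as a clean bill of health: in the Google study, a large share of failed drives showed no SMART counts at all, so a check like this complements backups and spares rather than replacing them.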

By Edmund X. DeJesus











