
RAID Rebuild: Why It Can Lead to Loss of All Data
You replaced the failed disk and started the rebuild. The progress indicator shows 47%. And then... another disk fails. All data is lost.
This is not a nightmare. It's a real scenario that happens more often than it should. RAID rebuild, which is supposed to restore data, is paradoxically one of the riskiest processes for your data.
What is RAID Rebuild
Definition
RAID rebuild is the process of restoring redundancy after a disk failure. The controller reads data and parity from the remaining disks, calculates the missing data, and writes it to the new disk.
How It Works
RAID 5:
- Controller reads all sectors from the healthy disks
- For each stripe, it calculates: New sector = Disk1 XOR Disk2 XOR ... XOR Parity
- Writes the result to the new disk
RAID 6: Similar principle, but using two independent parities.
RAID 1/10: Simpler – plain copy from mirror disk.
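The RAID 5 XOR step above can be illustrated with a minimal Python sketch. The block contents and the helper name `xor_blocks` are made up for illustration; a real controller works on full stripes in hardware or in the kernel.

```python
from functools import reduce
from operator import xor


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(xor, column) for column in zip(*blocks))


# Three data blocks of one stripe (illustrative values only).
d1 = b"\x11\x22\x33\x44"
d2 = b"\x0f\xf0\x0f\xf0"
d3 = b"\xaa\x55\xaa\x55"

# Parity is the XOR of all data blocks in the stripe.
parity = xor_blocks([d1, d2, d3])

# The disk holding d2 fails: XOR of the surviving blocks and the
# parity gives back the missing block.
rebuilt_d2 = xor_blocks([d1, d3, parity])
print(rebuilt_d2 == d2)
```

The same property also shows the weakness: if any one of the surviving inputs is unreadable, the XOR cannot be completed for that stripe.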
Duration
| RAID capacity | Approximate rebuild time |
|---|---|
| 1 TB | 2-4 hours |
| 4 TB | 8-16 hours |
| 12 TB | 24-48 hours |
| 24 TB+ | 2-4 days |
The actual time depends on disk speed, the controller, and the load on the array during rebuild.
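As a rough cross-check, the figures in the table match a sustained rebuild rate of about 70-140 MB/s. That rate range is our assumption, inferred from the table, not a controller specification. A short sketch of the arithmetic (the function name `rebuild_hours` is illustrative):

```python
def rebuild_hours(capacity_tb, rate_mb_s):
    """Estimate rebuild time in hours for a capacity and a sustained rate."""
    capacity_bytes = capacity_tb * 1e12       # decimal terabytes
    seconds = capacity_bytes / (rate_mb_s * 1e6)
    return seconds / 3600


# Assumed rate range: 140 MB/s on an idle array, 70 MB/s under load.
for capacity_tb in (1, 4, 12, 24):
    best = rebuild_hours(capacity_tb, 140)
    worst = rebuild_hours(capacity_tb, 70)
    print(f"{capacity_tb:>2} TB: {best:.0f}-{worst:.0f} hours")
```

Production load during rebuild can push the effective rate well below this range, which is why real rebuilds sometimes take longer than the table suggests.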
Why Rebuild is Risky
Stress Test for Remaining Disks
During rebuild, the controller must read every sector of the remaining disks. This is a complete read of the entire capacity, something that rarely happens during normal operation.
What it means:
- 100% I/O utilization for hours or days
- Increased disk temperature
- Mechanical stress (for HDD)
- Exposure of latent defects
Discovery of Hidden Problems
Some sectors haven't been read for months or years. They may have degraded, but normal operation doesn't detect it because nobody opens the files stored in those sectors.
Rebuild reads everything. And finds problems you didn't know about.
URE – Unrecoverable Read Error
This is the key concept for understanding rebuild risks.
URE: Silent RAID Killer
What is URE
An Unrecoverable Read Error is a read error that the disk cannot correct. The sector stays unreadable even after repeated attempts.
Occurrence Statistics
Every disk has a URE rate specification, which states how often an unrecoverable error may occur per number of bits read:
| Disk type | URE rate |
|---|---|
| Consumer HDD | 1 in 10^14 bits |
| Enterprise HDD | 1 in 10^15 bits |
| Enterprise SSD | 1 in 10^17 bits |
Mathematics – Why It's a Problem
Let's calculate the URE probability for a 12 TB RAID 5 rebuild with consumer disks, taking the specified URE rate at face value:
12 TB = 12 × 10^12 bytes = 96 × 10^12 bits
URE rate = 1 error per 10^14 bits
Probability of reading 12 TB WITHOUT an error:
P(OK) = (1 - 1/10^14)^(96×10^12) ≈ e^(-0.96) ≈ 38%
Probability of at least 1 URE:
P(URE) = 1 - P(OK) ≈ 62%
Reading 12 TB from consumer disks carries a roughly 60% chance of hitting at least one URE.
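The same calculation can be repeated for the other disk classes in the table. This is a minimal sketch that treats the specified URE rate as the actual error rate and assumes errors are independent, which is a simplification. The function name `ure_probability` is ours.

```python
import math


def ure_probability(bytes_read, bits_per_error):
    """Probability of at least one URE when reading `bytes_read` bytes.

    Assumes independent errors at exactly the specified rate.
    """
    bits_read = bytes_read * 8
    # log1p keeps precision for the tiny per-bit error probability.
    p_ok = math.exp(bits_read * math.log1p(-1.0 / bits_per_error))
    return 1.0 - p_ok


twelve_tb = 12e12  # bytes
for label, rate in [("Consumer HDD", 1e14),
                    ("Enterprise HDD", 1e15),
                    ("Enterprise SSD", 1e17)]:
    print(f"{label}: {ure_probability(twelve_tb, rate):.1%}")
```

For 12 TB this gives about 62% for consumer HDDs, about 9% for enterprise HDDs, and about 0.1% for enterprise SSDs.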
Consequences for RAID 5
For RAID 5, a single URE during rebuild typically means the entire rebuild fails. The controller has no way to calculate the missing data if one of the input sectors is unreadable.
Result: the rebuild fails, the array remains in a degraded state, and if another disk fails, all data is lost.
Why RAID 6 is Safer
RAID 6 has two independent parities. With one disk failed, a single URE during rebuild is not a problem, because the controller can still calculate the data using the second parity.
That's why we recommend RAID 6 for:
- Large arrays (6+ disks)
- Large disks (4TB+)
- Consumer disks (worse URE rate)
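To show why the second parity helps, here is a minimal sketch of the usual P+Q scheme over GF(2^8). We assume the polynomial 0x11D and generator 2, as used for example by Linux software RAID; a specific hardware controller may do it differently. All function names are illustrative. The example recovers two missing data blocks in one stripe: one from the failed disk and one lost to a URE.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) with the polynomial 0x11D."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11D
        b >>= 1
    return result


def gf_pow(a, n):
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    # a^255 == 1 for every non-zero a, so a^254 is the inverse.
    return gf_pow(a, 254)


def pq_parity(data):
    """P is the XOR of all blocks, Q the XOR of 2^i * block_i."""
    p, q = bytearray(len(data[0])), bytearray(len(data[0]))
    for i, block in enumerate(data):
        coef = gf_pow(2, i)
        for j, byte in enumerate(block):
            p[j] ^= byte
            q[j] ^= gf_mul(coef, byte)
    return bytes(p), bytes(q)


def recover_two(data, x, y, p, q):
    """Rebuild the data blocks at positions x and y from P and Q."""
    a, b = bytearray(p), bytearray(q)
    for i, block in enumerate(data):
        if i in (x, y):
            continue
        coef = gf_pow(2, i)
        for j, byte in enumerate(block):
            a[j] ^= byte                   # a = Dx ^ Dy
            b[j] ^= gf_mul(coef, byte)     # b = 2^x*Dx ^ 2^y*Dy
    gx, gy = gf_pow(2, x), gf_pow(2, y)
    inv = gf_inv(gx ^ gy)
    dx = bytes(gf_mul(inv, b[j] ^ gf_mul(gy, a[j])) for j in range(len(p)))
    dy = bytes(a[j] ^ dx[j] for j in range(len(p)))
    return dx, dy


stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
p, q = pq_parity(stripe)
damaged = [stripe[0], None, stripe[2], None]   # disk 1 failed, URE on disk 3
print(recover_two(damaged, 1, 3, p, q) == (stripe[1], stripe[3]))
```

With a single XOR parity, the same stripe would be unrecoverable, because one equation cannot determine two unknown blocks.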
Probability of Failure During Rebuild
Risk Table
| Situation | Rebuild failure probability (estimate) |
|---|---|
| RAID 5, 4×1TB, new disks | ~1-5% |
| RAID 5, 4×4TB, 3-year-old disks | ~10-20% |
| RAID 5, 8×8TB, 4-year-old disks | ~30-40% |
| RAID 5, 8×12TB, 5-year-old disks | ~40-60% |
| RAID 6, 8×12TB, 5-year-old disks | ~5-15% |
Factors Increasing Risk
Disk age: Older disks = more wear = higher probability of URE and failure.
Disk size: Larger disks = more data to read = higher probability of URE.
Number of disks: More disks = more potential failure points.
SMART warnings: Disks with warnings have significantly higher probability of failure during rebuild.
Hot Spare – Solution or Illusion?
What is Hot Spare
A hot spare is a spare disk that is connected to the RAID array but unused. When a disk fails, it automatically takes the place of the failed disk and the rebuild starts.
Advantages
Automatic start: No waiting for new disk, rebuild starts immediately.
Shorter degraded time: Smaller window when array is vulnerable.
Disadvantages
Rebuild is still risky: A hot spare doesn't reduce the rebuild risks themselves, such as URE, the domino effect of further disks failing, and disk stress.
False sense of security: "We have a hot spare, we're safe." No, you just reach the rebuild phase sooner.
Cost: Disk that normally does nothing.
Recommendation
Hot spare YES, but be aware of its limits. It is a supplement to backups, not a replacement for them.
Correct Procedure for Rebuild
Before Rebuild
1. Full backup (if possible): If the array is still readable, back up critical data first. It is your insurance in case the rebuild fails.
2. SMART check of all disks: Check the SMART values of the remaining disks:
- Reallocated Sector Count
- Current Pending Sector
- Spin Retry Count
If any disk shows warnings, don't rebuild. Professional recovery is the better choice.
3. Documentation: Record the following:
- Model and serial numbers of disks
- Disk positions
- RAID configuration
- SMART values
4. Plan B: What will you do if the rebuild fails? Have a contact for professional recovery ready.
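The SMART check from step 2 can be partly automated. The sketch below assumes the report comes from smartmontools in JSON form (for example the output of `smartctl -A -j` on a SATA disk, parsed with `json.loads`) and that attributes appear under `ata_smart_attributes.table`. The rule "any raw value above zero is a warning" is our simplifying assumption, and `smart_warnings` and the sample data are illustrative.

```python
# SMART attribute IDs for the three values listed above.
WATCHED_ATTRIBUTES = {
    5: "Reallocated Sector Count",
    197: "Current Pending Sector",
    10: "Spin Retry Count",
}


def smart_warnings(report):
    """Return warnings for watched attributes with a non-zero raw value."""
    table = report.get("ata_smart_attributes", {}).get("table", [])
    warnings = []
    for attr in table:
        label = WATCHED_ATTRIBUTES.get(attr.get("id"))
        raw = attr.get("raw", {}).get("value", 0)
        # Assumption: any non-zero raw value counts as a warning.
        if label is not None and raw > 0:
            warnings.append(f"{label}: raw value {raw}")
    return warnings


# Illustrative sample shaped like a parsed smartctl JSON report.
sample_report = {
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 8}},
            {"id": 10, "name": "Spin_Retry_Count", "raw": {"value": 0}},
            {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 2}},
            {"id": 194, "name": "Temperature_Celsius", "raw": {"value": 41}},
        ]
    }
}

for warning in smart_warnings(sample_report):
    print(warning)
```

Run it against every remaining disk before the rebuild and keep the output with the rest of the documentation from step 3.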
During Rebuild
1. Minimize I/O: Shut down applications that use the array. Less load means lower risk.
2. Monitoring: Monitor rebuild progress and disk temperature. High temperature increases the risk.
3. Be prepared for failure: If the rebuild fails or errors appear, stop immediately and call for help.
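How you monitor progress depends on the platform. Hardware controllers have their own vendor tools. As one example, the sketch below assumes Linux software RAID (md), where `/proc/mdstat` shows a progress line during rebuild. The sample text and the function name `parse_rebuild_status` are illustrative.

```python
import re

# Matches the progress line that md prints while rebuilding or resyncing.
PROGRESS_PATTERN = re.compile(
    r"(?P<action>recovery|resync)\s*=\s*(?P<percent>[\d.]+)%"
    r".*?finish=(?P<finish_min>[\d.]+)min"
    r"\s+speed=(?P<speed_k>\d+)K/sec"
)


def parse_rebuild_status(mdstat_text):
    """Extract rebuild progress from /proc/mdstat text, or None if idle."""
    match = PROGRESS_PATTERN.search(mdstat_text)
    if match is None:
        return None
    return {
        "action": match["action"],
        "percent": float(match["percent"]),
        "finish_hours": float(match["finish_min"]) / 60,
        "speed_mib_s": int(match["speed_k"]) / 1024,
    }


# Illustrative sample; on a real system read open("/proc/mdstat").read().
SAMPLE_MDSTAT = """\
md0 : active raid5 sdd1[4] sdc1[2] sdb1[1] sda1[0]
      11720658432 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
      [=========>...........]  recovery = 47.0% (1836236488/3906886144) finish=212.4min speed=162400K/sec
"""

status = parse_rebuild_status(SAMPLE_MDSTAT)
print(status)
```

A falling speed or a growing finish estimate during rebuild is worth investigating, since it can indicate a disk that is struggling to read.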
After Rebuild
1. Verify integrity: Run a consistency check (scrub) if the controller supports it.
2. Test backup: Verify that the backup is current and functional.
3. SMART check: Check the SMART values again, because the rebuild may have revealed latent problems.
Alternatives to Rebuild
Professional Recovery
Instead of risky rebuild, data can be recovered professionally:
- Sector copy of each disk
- Virtual RAID reconstruction
- Work with copies, not originals
Advantages:
- Safer (we don't work with originals)
- Can recover even with multiple failures
- Expert diagnostics
Disadvantages:
- Cost
- Time (days instead of hours)
Restore from Backup
This is the safest option. If you have a functional backup:
- Create new RAID array
- Restore data from backup
- Done
This is why having backups matters.
Upgrade to RAID 6
If you have to deal with the failure anyway, consider an upgrade:
- New controller supporting RAID 6
- New disks (different batches)
- Data migration from backup
When Better Not to Rebuild
Any Remaining Disk with SMART Warning
If any remaining disk shows SMART warnings, a rebuild is a gamble. Professional recovery is safer.
Very Old Disks (5+ years)
With old disks, the probability of a URE and of domino failure is very high. Consider recovery instead of a rebuild.
Critical Data Without Backup
If you have no backup and the data is critical, a rebuild is too risky. Professional recovery is the only safe path.
Already Failed Attempt
If the first rebuild failed, a second attempt has an even lower chance of success, because the disks are now more worn. Call professionals.
FAQ
How long does rebuild take?
Depends on capacity, disk speed, and load. Approximately:
- 4TB: 8-16 hours
- 8TB: 16-32 hours
- 12TB+: 1-3 days
Can I use server during rebuild?
You can, but you'll slow down the rebuild and increase the risk. For critical data, we recommend keeping operations to a minimum.
Is rebuild on SSD safer?
Generally yes. Enterprise SSDs have a better URE rate (1 in 10^17 bits vs 1 in 10^14 for consumer HDDs) and have no moving parts that can fail mechanically. The rebuild is faster and less risky.
Rebuild failed, what now?
Stop further attempts immediately. The disks are in worse condition than before the rebuild. Contact professionals.
Need Safer Solution?
If you have a RAID array in a degraded state and are worried about the rebuild, we can help. Professional recovery is safer than a risky rebuild.
24/7 Hotline: +420 775 220 440