Sorry for the clickbait headline, but this is important. As of this writing, I’m watching in horrified fascination as Arq Backup validates an 8.8 TB backup set on Amazon Cloud Drive. It started on March 6th at 02:00 and is not close to being finished. Arq has been validating for over 4 days(!) and probably won’t finish until the 7th day, when it will rest. 😏
(Boy was I optimistic when I first wrote this. See updates below for the reality.)
This would be fine if it did this maintenance activity in the background in parallel with the backups, or at least paused to let other backups run. It doesn’t. My hourly backups have been patiently waiting for 106 (count ’em) hours and will probably be stalled for ~170 hours before the validation finishes. I don’t dare stop this validation because I don’t know whether Arq is smart enough to resume next time, or if it starts all over from the beginning again if it never completes once. I do know that if I interrupt a validation to allow other backups to run, the validation does not resume for that backup the next time it runs. It waits the default 60 days before it validates again. I’ve been told that subsequent validations supposedly won’t take as long (this is the second validation of this set), but in the meantime no backups for days.
To aggravate the situation, Arq doesn’t warn you that this validation has started and your data is unprotected. Unless you watch it like a hawk, as I do, you will be blissfully unaware that your data is at risk. Unprotected for days at a time. That critical project you’re working on? I sure hope you have other backup schemes like Time Machine.
You can’t rely on Arq as your sole backup means.
Unfortunately, Arq is my sole offsite backup means. For a week my data will be at risk, backed up only on a Time Machine disk. In case of fire or theft, I’m screwed. The risk is low, but isn’t that why we back up offsite?
As I said to Arq Support in early January when I first noticed this aberrant behaviour,
Backup programs are like insurance. You hope you never need it, but it can be a life saver when you do. Would you be happy with a car that turned off ALL safety systems (air-bags, seatbelt pretensioners, stability control, anti-lock brakes) for hours of driving because it was running a diagnostic on the air-bags and wouldn’t stop until you manually halted it?
And it didn’t warn you?
Arq’s behaviour is by design. Backup software that doesn’t back up for days, and doesn’t tell you it’s not backing up. By design. Imagine! 😱
Arq is still better than CrashPlan, although I don’t recall CrashPlan halting backups for this long.
Remember how I said I’d have to rely on my Time Machine backup because Arq was so unreliable? Yeah, well, I found out this morning that Time Machine silently stopped backing up 36 hours ago, leaving my data with no backups for that window of time. Fortunately, since I’ve known Time Machine to do this on occasion, I had written an audit script to notify me when it had stopped for more than 24 hours. I had to set the threshold as high as 24 hours because Time Machine periodically reindexes for many hours and I was getting too many false alerts. Once Time Machine starts indexing, there’s nothing you can do but let it run. AND, as of macOS Sierra (siooma?), Time Machine only runs when it feels like it.
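For anyone wanting a similar safety net, here is a minimal sketch of the staleness check at the heart of such an audit. It is not the actual script mentioned above. It assumes the newest snapshot path ends in a folder named like 2017-03-08-020000 (on many systems `tmutil latestbackup` prints a path of that shape, but the format varies between macOS versions), and the function names are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumption: snapshot folders are named like "2017-03-08-020000".
SNAPSHOT_FORMAT = "%Y-%m-%d-%H%M%S"


def snapshot_time(path: str) -> datetime:
    """Parse the timestamp from the last component of a snapshot path."""
    name = path.rstrip("/").rsplit("/", 1)[-1]
    return datetime.strptime(name, SNAPSHOT_FORMAT)


def is_stale(last_backup: datetime, now: datetime,
             threshold: timedelta = timedelta(hours=24)) -> bool:
    """Return True when the newest backup is older than the threshold."""
    return now - last_backup > threshold
```

Run something like this from a scheduled job and send yourself a notification whenever `is_stale` returns True. The 24-hour default mirrors the threshold described above; a lower value will trip on long reindexing runs.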
Yesterday, Arq decided it needed to re-upload files it backed up a long time ago that haven’t changed. Does this mean the validation failed and Arq is just doing its job to ensure the backup is intact? If so, who borked the original backup? Arq or Amazon? Any time a validation feels it needs to re-upload files, that should be a major red flag to the developer to either ensure his program isn’t screwing up, or for him to write up a serious bug report to the cloud service being used as the backup destination. Someone’s program is buggy.
As of this writing, I’m forecasting Arq will be finished validating in 16 days (gasp!). Did I mention all other backups have stopped waiting for this validation to complete? So yeah, 16 days without any hourly backups running. By design.
This is the two week anniversary of Arq stopping all backups while it validates a backup set. PARTY TIME!
No, wait. I should be in mourning for the loss of my backups. I’ve been using backup software on the Macintosh since Redux on a Mac Plus, when I backed up a massive 100 MB hard disk onto 100 floppy disks of 1.4 MB each. I’ve forgotten all the backup software I’ve tried. In recent memory I administered Retrospect at a company that backed up the entire disk on each of 10,000 computers using 100+ servers at 10 sites, every day. I’ve used Time Machine and CrashPlan for personal use. I’ve never seen backup software that stops backing up for over two weeks while it does maintenance. This is what Arq Backup does by design. I can’t stress that enough.
Neither Redux nor Retrospect needed to do maintenance. The worst I can remember CrashPlan stopping backups for maintenance was a week, and I think they fixed that as I only saw maintenance for hours while it synchronized blocks before I gave up on it and moved to Arq.
I’m hoping Arq’s next validation of this set is much shorter as I’ve been promised by support. That’s why I’m adamant about letting this validation complete since it’s the second validation I remember seeing and I wouldn’t consider two weeks to be a short amount of time. I aborted the previous validation because I needed the other backups to run. I suspect Arq doesn’t resume from an aborted validation, but starts over.
Since Arq’s default interval before validating is 60 days, if you have a large amount of data and a fast uplink to Amazon Cloud Drive, you can probably expect to see the same behaviour I’m seeing when it does your first validation. Using my ETA of ~16 days for 8.8 TB, expect Arq to take 45 hours per TB to validate the backup in your first 60 days. I don’t know how much faster it would be to a local drive. I’ll know in a couple of months if Arq will, in fact, validate faster the next time. But it’s totally unacceptable to halt backups for two weeks even once. Hourly backups should run hourly, right? Unfortunately, I’m not sure the Arq developers agree with me.
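To put that rule of thumb into numbers, here is a rough back-of-the-envelope sketch. The 45 hours per TB figure comes from this one backup set on Amazon Cloud Drive, so treat it as a single data point rather than a benchmark; the function names are made up for illustration.

```python
HOURS_PER_TB = 45.0  # rate observed for this one backup set, not a benchmark


def estimated_validation_hours(size_tb: float,
                               hours_per_tb: float = HOURS_PER_TB) -> float:
    """Estimate how long a first validation might take, in hours."""
    return size_tb * hours_per_tb


def estimated_validation_days(size_tb: float,
                              hours_per_tb: float = HOURS_PER_TB) -> float:
    """Same estimate expressed in days."""
    return estimated_validation_hours(size_tb, hours_per_tb) / 24
```

For 8.8 TB this works out to roughly 396 hours, or about 16.5 days. By the same rate, a 2 TB set would be looking at nearly four days without backups.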
IMHO, Arq desperately needs to multi-task validations and backups. Even different backup sets should proceed concurrently if the source and destinations are different between sets. I’m not holding out much hope for that though.
Update 2017-03-22 16:26:04:
🎼 Celebrate! Celebrate! 🎶 Dance to the music! 🎵
Arq Backup finally—FINALLY—finished validating a 9 TB backup after
16 days, 13 hours, 26 minutes, and 0 seconds
810 backups missed
Un-be-lievable. Backup software not backing up for over 16 days without warning, leaving you at risk of data loss, is like…
- Your car disabling all safety systems for 16 days of driving while it runs a routine diagnostic, but doesn’t warn you.
- Your security system DVR not recording any video for 16 days while it runs a disk check, but doesn’t warn you.
- The aircraft anti-collision system not being functional for 400 flying hours because it’s doing a self-test, without informing the pilot.
You get the idea. To say I am appalled at this design choice is to be generous. I can only hope the developers of Arq Backup understand the severity of this scalability issue, ask themselves “what the hell were we thinking,” and get it fixed ASAP.
And one additional bug that needs to be fixed. I have Arq configured NOT to “Include file list in backup logs and email reports.” At the end of the validation, it emailed a 60,346 line message detailing every block it uploaded.
I’ll close by saying…
It is never acceptable for any critical software service to stop performing its primary function for 16 days, let alone without warning the user.
Has it been 60 days already?! My, how time flies when you don’t have to babysit backups. Yes, it’s been 60 days since Arq did a validation so it was scheduled to do another on the 6th. I know you’re all wondering if what support told me about validations being shorter after the first one was true. I won’t leave you in suspense.
Yes. The validation that started on the 6th was indeed much faster than the validation that started this posting. Instead of taking 397.5 hours to validate, it took a mere 26 hours, 43 minutes, and 29 seconds. I would celebrate this except for one teensy detail.
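For the curious, the two durations can be compared directly. This is just arithmetic on the figures reported in this post (16 days, 13 hours, 26 minutes versus 26 hours, 43 minutes, 29 seconds); the variable names are made up for illustration.

```python
from datetime import timedelta

# Durations as reported in this post.
full_validation = timedelta(days=16, hours=13, minutes=26)
repeat_validation = timedelta(hours=26, minutes=43, seconds=29)

# Dividing one timedelta by another yields a plain float ratio.
speedup = full_validation / repeat_validation

full_hours = full_validation.total_seconds() / 3600
repeat_hours = repeat_validation.total_seconds() / 3600
```

The repeat validation comes out a little under 15 times faster, which still leaves more than a full day with no hourly backups.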
None of my hourly backups ran for 26 hours!
The other nasty discovery is that Arq does not resume from an aborted validation. It starts over from the beginning. This was the third validation, and I had aborted the first one to let other backups run. That is why the second validation took 16.5 days: it had started from the beginning.
With large backups, you simply must let Arq finish validating or you will be perpetually in a lengthy validation cycle as it starts from the beginning each time. Sad, but true.
- Don’t manually abort validations.
- Don’t reboot your computer.
- Don’t do anything that would cause the validation to fail.
Arq will penalize you severely if you do. Oh, you still need those hourly backups to run? Tough. Arq doesn’t care. You’ll have to find some other way to safeguard your data while Arq does maintenance instead of performing its primary function of backing up your data. And since Arq doesn’t warn you it’s off to La La Land, you’d better watch it closely if you value your data.
Man, I hope Arq Backup 6 fixes this über-serious design flaw.
Arq did an object cleanup today. It’s not clear what that means as Arq’s Help is notoriously weak on this and other areas of its operation. I’m guessing it has something to do with thinning backups.
It ran for over 7 hours, once again blocking all other backups from running.
It’s nice that Arq does all this validation and cleanup. It would be much nicer if it didn’t stop doing its primary function, backing up our data, while it did maintenance. Can you imagine a restaurant that stopped serving customers while an employee mopped the floor or cleaned the washrooms? Or if the city shut down a residential street while the street sweeper cleaned the roads? I consider reliably backing up my data to be far, far more important.
Update 2018-01-23: Arq support has announced the release of version 5.11 with background validation obsoleting this post. 😁
However, I will wait a week or so before updating to see if early adopters report any problems. This is a major change but a most welcome one.