Backups

November 30, 2011

I always thought that one of the great things about digital technology is the ability to have backups - physical items can break, but with digital data there is no reason why you should ever lose it, because creating exact copies is easy. And yet few people I know have a convincing backup strategy. Since I will give a lecture at the filmArche film school next week on workflows with digital files and backups, I thought this would be a good time to write down the most important points.

Your hard drives will fail

The question is not whether your hard drive will fail, but when. Not having a backup of your important data is negligent, and it is easily avoided. So I think it's well worth thinking about it a bit. If reading the title of this blog post made you feel a little guilty because you do not have a backup of your important data, read on. I promise to explain things in simple terms and walk you through some common backup scenarios for individuals or small groups of people.

Oh and one little disclaimer: thinking about backups can be a lot more complex than what I present here. This is just the bare minimum any normal person should know about backups in this digital age :)

What is a backup?

Let's start with a simple thought - what is a backup? It's a complete, independent copy of your data that does not share a single point of failure with the original copy. What's a single point of failure? Anything that can go wrong and destroy both copies at once. For example, if you have a backup on the same hard drive as the original file, then this hard drive is a single point of failure. If it dies, both your backup and the original data are gone.
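To make the "same hard drive" case concrete, here is a minimal Python sketch that checks whether two folders live on the same filesystem. The function name share_device is made up for this example. It only catches the most obvious single point of failure: two partitions on one physical disk will look like separate devices, so a False result does not prove the copies are independent.

```python
import os

def share_device(path_a, path_b):
    """Return True if both paths are stored on the same filesystem/device."""
    # st_dev identifies the device (volume) a file or folder lives on.
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

# Hypothetical usage, the paths are placeholders for your own folders:
# if share_device("/Volumes/DataA/photos", "/Volumes/DataA/photos_backup"):
#     print("Warning: original and backup are on the same drive!")
```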

This single point of failure thing is the key. I worked on a video shoot once where they shot with a camera that records to flash cards. The content of the flash cards was copied onto a hard drive and then to another backup hard drive before the contents of a flash card could be deleted and reused. So far so good. But then at the end of the first day the guy who did the data wrangling packed both hard drives into the same backpack and put it with the rest of the equipment for someone else to transport to the next location. It didn't occur to him that if that backpack fell or got lost, both copies would be affected - it was the single point of failure.

What to back up?

Ideally you would just back up everything and be done with it. But some things are trickier to back up than others. Let's start with the obvious thing: all your "user" data should be backed up, i.e. your photos, the texts you write, the ("project") files of the applications you work with (spreadsheet software, video editing software, databases etc.). This is the stuff that you really should not lose. I work as a photographer, so all the photos I take clearly fall into that category, as do all the business-related files like my accounting software's files.

The nice thing about this category of "user data" is that you usually work with it on a regular basis and thus know where it is. Ideally you do what I do and put all that stuff on its own hard drive(s), away from the main operating system hard drive. This makes it easy to identify the most important stuff to back up (this one hard drive).

The second class of data is stuff that the software you use saves but that you do not directly interact with. Preferences, for example. I work with software like Adobe Lightroom or Microsoft Visual Studio, both of which are very complex pieces of software that have a ton of user settings and user-generated presets. If you were to lose these, it would probably not be the end of the world, but it would suck. The nice thing about this group of data is that it is usually rather small.

The third group of data is stuff like your operating system or the installed applications. It would be nice if you could just copy all this stuff onto its own disk and, if the main hard drive fails, restore it to a new disk and be done with it, but this usually doesn't work. Operating systems need a boot loader and need to have certain data at certain sectors on the disk, stuff like that. So you need disk imaging software to back up this class of data, which is why I do not bother to back it up. If the main system drive fails, it will take me a day or so to install the OS and the rest of the software again from the original DVDs, but that is ok. Your mileage may vary though - if you simply can't afford a day of downtime, it may make sense to create a scheme where you can back this kind of data up as well, or to have a second computer ready as a standby machine so you can switch quickly.

What kind of backup to do?

Most of the data we produce is not static - text files change, photos are edited, new files are created all the time. So a backup needs to be done regularly. And with this comes an important decision: do you need just one copy of your data, or is it important to be able to go back to the way things looked some time ago? If all you need is an up-to-date copy, then it may be enough to just run a program that mirrors all the changes that happened onto your one backup. This is called a 1:1 backup. But if you need a history of your data, things become more difficult.

When you need past versions of your data, one approach is to buy X disks and use them in turn. So if you have 7 backup drives and you create a 1:1 backup to one of them every day, then you can go back in daily increments up to one week into the past. But this is expensive. So there is software that lets you do this in a clever way and store only the data that changed since the last backup. Because, you know, usually only a few files change between two backups.
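The "use X drives in turn" scheme boils down to a little bit of arithmetic. A minimal sketch in Python, where the function name drive_for_day and the idea of numbering the drives from 0 are my own assumptions for illustration:

```python
from datetime import date, timedelta

def drive_for_day(day, drive_count=7):
    """Return the 0-based number of the backup drive to use on `day`."""
    # Counting days from a fixed origin and wrapping around means every
    # drive gets used once before the first one is overwritten again.
    return day.toordinal() % drive_count

# Which drive is due today, and which one holds yesterday's state?
today = date.today()
print("Back up to drive", drive_for_day(today))
print("Yesterday's copy is on drive", drive_for_day(today - timedelta(days=1)))
```

With 7 drives, the copy on each drive is at most a week old by the time it gets overwritten.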

There are a lot of different ways to do these so-called incremental backups. The easiest on the Mac is Time Machine - on the PC it's a bit more difficult (Genie Timeline does something similar). For smaller amounts of data, online services like Dropbox usually provide some sort of history (although you have to trust the service provider to take good care of your data). But even Time Machine and Genie Timeline don't work well if you need to manage several disks, which is a common case today.

I have recently discovered that the powerful command line tool rsync has the ability to create incremental backups with hard links, much like Time Machine does. I will write up my findings in another blog post.
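Until then, here is a rough Python sketch of the hard link idea itself. This is not how rsync is implemented (rsync offers the feature through its --link-dest option); it is only an illustration, and the function names and the "same size and same modification time means unchanged" rule are assumptions I made for the example. Every snapshot folder looks like a full copy, but unchanged files are hard links to the previous snapshot and take up no extra space.

```python
import os
import shutil

def _unchanged(current, old):
    """Treat a file as unchanged if size and modification time match."""
    a, b = os.stat(current), os.stat(old)
    return a.st_size == b.st_size and int(a.st_mtime) == int(b.st_mtime)

def snapshot(source, previous, target):
    """Copy `source` into a new `target` folder. Files that are unchanged
    since the `previous` snapshot are hard-linked instead of copied.
    Pass previous=None for the very first snapshot."""
    for folder, _dirs, files in os.walk(source):
        rel = os.path.relpath(folder, source)
        os.makedirs(os.path.normpath(os.path.join(target, rel)), exist_ok=True)
        for name in files:
            current = os.path.join(folder, name)
            new = os.path.join(target, rel, name)
            old = os.path.join(previous, rel, name) if previous else None
            if old and os.path.isfile(old) and _unchanged(current, old):
                os.link(old, new)            # same data, no extra disk space
            else:
                shutil.copy2(current, new)   # new or modified: real copy

# Hypothetical usage, the paths are placeholders:
# snapshot("/Volumes/DataA", "/Volumes/BackupA/2011-11-29", "/Volumes/BackupA/2011-11-30")
```

Hard links only work within one filesystem, so all snapshots have to live on the same backup drive.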

A good solution for the real world

If all the data you care about is a few hundred MB and you usually have internet access, then a service like Dropbox does all a normal user needs. But nowadays, even my grandparents have a few dozen GB of photos etc., and I have around 5 TB of important data that I need to keep safe. So I will now describe the setup I use, which I think is pretty safe and not overly complicated.

Step 1: data organization

It's important to know where your important data is. So I have a policy not to use the usual "My Documents" or "Documents" folder on my OS hard drive but instead to put all the important data on dedicated data disks. Let's call the two data disks A and B. All your important data should be on these two disks. If the main OS drive dies, your main data should survive, even without the backups.

Step 2: get two more hard drives for each data hard drive

Why two, you may ask? Two reasons: First, while your backup drive is attached to your computer, your computer is a single point of failure. If you have a really nasty virus that wipes all your attached hard drives (or worse, encrypts them and extorts you for the password :) ) and you only have one backup, then that's it. You had no backup because your two copies had a single point of failure. The second reason is that you should keep one of your two backup drives at another location. At a friend's place, your office, your lover's place, it doesn't matter. Just pick a secure place, store it there and swap the two disks every now and then (maybe every week or so). This way, even if you get robbed or your house burns down, the data will still exist somewhere. It will be a couple of days, maybe (if you're lazy) a few weeks old, but at least most of it will still exist. (Hint: if you do not encrypt your backup drives, then you will want to store them with people you fully trust :) )

Step 3: Create your first backup

Because Time Machine and similar tools usually do not work with more than one hard drive, I recommend using simple 1:1 backups instead (or a clever rsync strategy, which I will describe in another blog post soon). With today's huge disks, even a 1:1 backup that overwrites everything takes quite long, so I recommend using rsync on Linux or Mac or robocopy on Windows to create efficient 1:1 copies that only copy what has changed. This way I can back up 3 TB of data every night within 1-3 hours.

Rsync and robocopy come preinstalled with Mac OS X and Windows respectively, and both have GUI frontends available for those who do not enjoy working on the command line (e.g. arrsync and yarcgui). Rsync creates a 1:1 copy with the -a option (add --delete if files you removed from the data disk should also disappear from the backup); robocopy uses /MIR for the same result.
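If you would rather script this than click through a GUI, the two commands can be wrapped in a few lines of Python. This is only a sketch: the function names and the example paths are made up, and I add rsync's --delete option so that rsync, like robocopy /MIR, also removes files from the backup that no longer exist on the data disk. Be careful with that: a mirror run in the wrong direction deletes data.

```python
import platform
import subprocess

def mirror_command(source, target, system=None):
    """Build the command line for a 1:1 mirror of `source` onto `target`."""
    system = system or platform.system()
    if system == "Windows":
        return ["robocopy", source, target, "/MIR"]
    # The trailing slash tells rsync to copy the contents of `source`,
    # not the folder itself, into `target`.
    return ["rsync", "-a", "--delete", source.rstrip("/") + "/", target]

def run_mirror(source, target):
    """Run the mirror and return the tool's exit code."""
    return subprocess.run(mirror_command(source, target)).returncode

# Hypothetical usage, the paths are placeholders for your own drives:
# run_mirror("/Volumes/DataA", "/Volumes/BackupA")
```

Note that robocopy returns non-zero exit codes even when the copy succeeded (values below 8 are not errors), so don't treat every non-zero code as a failure.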

With the GUI frontends it is pretty simple to set up one copy job per hard drive (just tell it from where to where it should copy the data).

Once you have finished setting up rsync or robocopy to copy the data disks to the backup drives, let it run once and check that everything worked (this first run will take a few hours per TB, possibly longer if you use a slow USB 2 or FireWire connection). Subsequent runs should be much faster.
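For the "check that everything worked" part, a quick sanity check is to walk the data disk and make sure every file exists on the backup with the same size. A minimal Python sketch (the function name is made up, and comparing sizes is a deliberately cheap check that will not notice corrupted content of equal length):

```python
import os

def missing_or_different(source, backup):
    """Return the relative paths of all files under `source` that are
    missing from `backup` or have a different size there."""
    problems = []
    for folder, _dirs, files in os.walk(source):
        rel = os.path.relpath(folder, source)
        for name in files:
            rel_path = os.path.normpath(os.path.join(rel, name))
            original = os.path.join(folder, name)
            copy = os.path.join(backup, rel_path)
            if not os.path.isfile(copy):
                problems.append(rel_path)
            elif os.path.getsize(copy) != os.path.getsize(original):
                problems.append(rel_path)
    return sorted(problems)

# Hypothetical usage, the paths are placeholders for your own drives:
# for path in missing_or_different("/Volumes/DataA", "/Volumes/BackupA"):
#     print("Problem with", path)
```

An empty list means every file made it across with the right size.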

Step 4: Set up Dropbox or similar for your most important, smallish files

The most important files we have are often pretty small. Many may be text files. Many will change frequently. For this kind of data it is reasonable to use Dropbox. So, in addition to the setup described above, I advise you to use Dropbox for data that is rather small, changes a lot, and where earlier versions of a file may be useful in the future.

Step 5: Set up reminders to switch backup drives, test backups

It's important to switch your two sets of backup drives regularly and to check that the backup works, so set up reminders in your calendar to do so. The best way to check whether a backup works is to go to a different computer and try to open your files. If your files work with all their dependencies (e.g. linked media files in a video editing project), then you are safe. If not, it's time to improve your setup.

A note on the side: RAID is not a backup

One last thing: some people have RAID setups configured for their data and think that they don't need any further backup. This is wrong. RAID is a system that protects you against hard drive failure by using redundant drives, but of course the RAID system itself, the computer it is attached to and the apartment that computer is located in are all single points of failure. RAID systems are nice, but they are not a replacement for backups.

Final thoughts

Whew, quite a post. This stuff may seem a bit complicated but it is the simplest version I found that keeps me safe and makes reasonable compromises for my personal use case. I could go into a lot of detail on the various thoughts on why I prefer uncompressed backups etc but I think this will do for now. If you found this post useful, if you have questions or if you spotted an error, please let me know in the comments below!