By Tim Torian, Torian Group, Inc.
Backups are an essential part of your IT strategy, even if you just have a single computer at your home office. Changes in technology have opened up new options for backup. Cloud computing offers large amounts of storage at low cost. Improvements in broadband service make it more practical to consider backing up over your internet connection. Imaging tools combined with Virtual Machine technology make it feasible for small business to recover from a lost server in hours instead of days.
However, the fundamentals of disaster prevention have not changed. A backup and disaster prevention strategy is still created by thinking through these issues:
- Identify the data that is critical to your business, and where it is stored. In particular, be aware of where your email is, and watch for programs that use some form of database that runs on the server, which may have special backup requirements.
- Consider the types of things that could go wrong, and the consequences.
- Determine how long you can be without specific types of data or functionality if these things did go wrong, and what the cost would be, including the cost of loss or temporary unavailability. This determines your “service level”.
- Create a backup and disaster prevention plan which allows you to recover functionality within the requirements of your service level.
- Develop a plan to monitor and regularly test your backup strategy.
Here is how to create a business continuity plan as it relates to IT:
Identify planning responsibilities.
Who in the organization is responsible for business continuity planning? If necessary, budget time and resources needed. Executive level commitment and accountability are critical for success.
Business continuity planning is part of the ______________ job role.
_______________ is responsible for creating, maintaining and managing implementation of our plan.
Identify critical business functions that depend on IT
Determine what data is critical to your business, and how it is stored. List applications affecting business processes, and what they depend on. This often will include creating a physical and logical network map, and documenting software licenses and maintenance/support agreements.
Determine which business processes depend on the computers / Internet.
Here is a typical list:
Accounting data, inventory data.
Client information (Contact list, billing information, leads database, client files, etc.)
Industry specific software or database
Custom software, or customizations to your software.
Office documents (Word, Excel etc.)
Graphics, business card design, brochures etc.
Email, Calendar, contacts
Internet access: Online databases, research tools, vendor and client web sites / ordering tools.
Web site – often offsite.
Cell phones / PDAs, if used for business.
Identify Risks. Make a list of what could happen, and how likely it is. If you have a security plan, there will be some useful overlap. Most businesses will need to consider these situations:
- Someone deletes a file accidentally, or overwrites the wrong file. An older version of a specific file or program is needed from a particular date in the past.
- Something goes wrong with the computer or device where the data is stored, and it needs to be replaced. Both the programs and data on the system need to be recreated. This could include human error, hardware failure, viruses, theft, etc.
- A service such as internet access, email, or power is unavailable.
- Something happens to the entire building where the data is stored. You need to be able to recover by using an offsite copy of your programs and data.
- A key person is lost. The procedures and information they have in their head are gone, and someone is taking over their role.
Other things to consider:
Damage to data: Viruses, disgruntled employees, Hackers, Data corruption etc.
Incompetence leading to data or equipment damage.
Equipment failure. Most likely parts to fail are fans, hard drives, power supplies.
Theft or vandalism.
Vendor issues: buggy software, business closure or merger, critical services unavailable.
A good security plan is also critical to keeping the integrity of your data.
Determine acceptable down time. This is how long you can afford to be without access to a particular business function or information.
Analyze the effects of an outage over time to identify the maximum allowable time that you would want to risk running your enterprise without a particular function. Also analyze the effects of the outage across related resources and dependent systems to identify any cascading effects that may occur as a disrupted system affects other processes that rely on it.
The point at which the cost of a disruption equals the cost of recovery represents the amount of investment that your organization should make in an availability and disaster recovery solution.
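As a rough illustration of this break-even reasoning, the sketch below compares the expected yearly loss from an outage with the yearly cost of a solution that shortens it. The function names and every figure in it (cost per hour, outage frequency, recovery times) are hypothetical placeholders, not recommendations; substitute your own estimates.

```python
def expected_annual_outage_cost(hourly_cost, hours_down, outages_per_year):
    """Expected yearly loss from one type of outage (all inputs are estimates)."""
    return hourly_cost * hours_down * outages_per_year


def worth_buying(current_hours_down, improved_hours_down,
                 hourly_cost, outages_per_year, solution_annual_cost):
    """Return True if the downtime avoided is worth at least what the solution costs."""
    before = expected_annual_outage_cost(hourly_cost, current_hours_down, outages_per_year)
    after = expected_annual_outage_cost(hourly_cost, improved_hours_down, outages_per_year)
    return (before - after) >= solution_annual_cost


if __name__ == "__main__":
    # Hypothetical example: one server failure every 4 years, $500 lost per hour down.
    # Recovery currently takes 32 working hours; an imaging solution would cut that to 4.
    print(worth_buying(32, 4, hourly_cost=500, outages_per_year=0.25,
                       solution_annual_cost=1200))
```

With these made-up numbers the avoided loss is $3,500 per year against a $1,200 yearly cost, so the example prints `True`. A calculation like this only frames the decision; it is no more accurate than the estimates fed into it.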
Determine your level of acceptable data loss. Can you get by with losing data since last night? Last week? Some data may need to be backed up more often if you can’t do without it after a disaster.
Develop a plan. For each risk, determine possible steps that could be taken, and the associated costs.
Prevention: what you can do ahead of time to avoid problems.
Mitigation: what you plan to do if something happens.
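One possible way to keep this list organized is a simple risk register that records each risk with its estimated likelihood, its cost, and the planned prevention and mitigation steps. The sketch below is only an illustration: the `Risk` class, its field names, and ranking by expected yearly cost are assumptions made for the example, and the sample entries are invented.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    yearly_likelihood: float   # estimated chance of happening in a year, 0 to 1
    cost_if_it_happens: float  # estimated loss in dollars, including downtime
    prevention: str            # what you can do ahead of time
    mitigation: str            # what you plan to do if it happens

    @property
    def expected_yearly_cost(self):
        return self.yearly_likelihood * self.cost_if_it_happens


def by_priority(risks):
    """Sort risks so the largest expected yearly cost comes first."""
    return sorted(risks, key=lambda r: r.expected_yearly_cost, reverse=True)


if __name__ == "__main__":
    # Invented sample entries; replace with your own risks and estimates.
    register = [
        Risk("Accidental file deletion", 0.9, 300,
             "User training", "Restore file from nightly backup"),
        Risk("Server hard drive failure", 0.1, 8000,
             "Mirrored drives", "Restore server image to spare hardware"),
    ]
    for risk in by_priority(register):
        print(f"{risk.name}: ${risk.expected_yearly_cost:,.0f} per year")
```

Sorting this way is just a starting point for discussion; a rare event with a very high cost may still deserve attention ahead of its place in the list.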
Implement the plan, and test
Set priorities, allocate resources, and track progress. Assign responsibility and schedule necessary recurring tasks. Build in accountability for specific results.
Typically this consists of some initial work, and then ongoing maintenance.
Testing your recovery plan is essential (test a restore from backup, etc.).
Schedule regular review. Set a date to review the plan. In particular, as the business changes, there may be changes in where critical business data is stored, new applications to consider. Changes in business process or staff should trigger a review of the plan.
Schedule regular tests of recovery steps. Keep documentation current.
Typical Strategies to include in your plan
Have an acceptable use policy. This should include a policy on where business data is and is not stored. A clear policy reduces the chances of getting viruses, and limits other things users can do to damage the systems.
Have and follow a security plan. Again, this prevents problems.
Have and follow a preventative maintenance plan. Monitoring and maintenance can often prevent problems before they disrupt business.
Clearly define job roles for backup and disaster prevention. Build in accountability.
Keep good documentation.
Use a backup program. Get backup software that can back up any specialized software such as databases or email systems. Make sure you can restore a file from as far back in time as needed by keeping multiple copies of your backups.
Use software that will make a complete image of key computers. Make these images regularly, and keep them separate from the computer being imaged – on an external drive. Back up the image offsite. Test your ability to restore an image and use the system.
Keep critical data offsite
In addition to offsite backups and images, keep a list of software licenses with keys, documentation on configuration and installation procedures sufficient to rebuild critical equipment, and vendor contact information including account numbers and logins. Keep a copy of all software needed to rebuild your IT systems offsite. In particular, the server operating system and the backup software.
Put redundant parts in servers
Servers can be ordered with redundant power supplies, and redundant hard drives. These are the parts most likely to fail.
Buy more than one of the same equipment
Troubleshooting is much quicker and easier if you have the ability to swap out identical parts. Establish hardware standards.
Use a Battery backup. Put a hefty UPS on each server, with software that will automatically shut it down if the power stays off. Sudden loss of power can cause the server data to be corrupted. An orderly shutdown prevents this. Schedule regular tests of the battery, and replace it when needed. They usually last 2-3 years.
Keep critical spare parts. If justified, keep spares of power supplies, hard drives, and other key parts for equipment which can’t be down long enough to order parts.
Here are some specific suggestions based on the required service level
If internet access is critical, you may want a router that will handle redundant internet connections, and purchase internet services from at least 2 different independent sources (such as DSL and cable).
Small offices without a server – 1-2 weeks of down time to replace a computer.
Typically critical data consists of some Excel and Word documents, a QuickBooks database and POP3-based email, and possibly some photos.
Use an external USB drive for regular backups, with workstation backup software. Subscribe to an offsite backup service such as Carbonite or Mozy. We recommend the Vembu StoreGrid backup, which can back up both onsite and over the internet.
Alternatively, you can rotate flash drives that you take home.
Small Offices without a server (or critical workstations)– 1 day down time.
In addition to the above, use imaging software with the ability to restore to dissimilar hardware. Schedule nightly images, and back up a copy of the image offsite nightly. If needed you can restore your entire computer to a new workstation purchased locally within a day.
1-4 Days downtime for servers with a failed part
The biggest delay in getting a server up and running again is usually the time it takes to get a replacement server or replacement parts. You may be able to get replacement hardware under warranty, often overnight. Ways to mitigate this include having spare parts, having a system which could act as a server in emergencies, and having same day or next day hardware warranty service. It is also important to have good documentation if identical hardware is needed for a rebuild/restore.
Usually the sequence is:
1. Get the hardware working again. If the hard drive data is intact, you are back to work.
If not, you can use an image as described below. The alternative is:
2. Re-install the operating system
3. Reconfigure any hardware, drivers etc. to get it minimally working.
4. Install the backup software
5. Restore from backup
6. Test, fix any remaining issues caused by going to a previous point in time with the system.
Servers – under 1 day downtime
Make regular backups of data. Use this to restore in case of a failed database, viruses, or user error. Use imaging software in combination with point in time backups. Make an image of the server nightly, as a scheduled task. Image to an external USB drive, or dedicated backup server. Make a copy of the image file and take it offsite regularly, or use an offsite backup program to do it automatically. In case of a server failure, purchase a hefty workstation, install the Virtual Machine (VM) host program, import your image to a virtual machine, and run the server as a VM until you get replacement hardware. When the new server arrives, image the VM, and restore it to the new server.
Currently we recommend using Acronis True Image Echo server for the imaging, and Vembu backup for internet based offsite copies. Alternatively, you can rotate External USB drives for offsite backup.
Tape backups are now more expensive, less reliable, and harder to restore than using a combination of external drives and internet backups.
Less than 1 day downtime for critical servers
Keep at least 2 servers with identical hardware and excess capacity (drive space, memory and processing power). Run critical applications on one server, install a backup copy (updated daily or more frequently) on the 2nd server. Create a plan for running the application on the 2nd server if the first fails. This includes how to handle the change in IP address and server name for connecting clients, and how to restore the latest data when failing over.
Use backup software which has “bare metal restore” capability, such as Veritas Backup Exec Continuous Protection (about $800) or Acronis True Image Echo Server ($600). This will allow you to recover to another server in the time it takes to run the restore process, usually a couple of hours, plus the time it takes to set up the recovery server.
Zenith Infosystems provides a managed care solution which includes a backup server and backup software for between $1000 and $3000, depending on the number of servers backed up. The server is constantly backed up to a virtual machine image, which can be activated and be working in less than an hour. The backup server is intended as a temporary solution, and will not be as responsive, but will keep you working.
Running servers as virtual machines makes this much easier to manage, as you can have a VM image copied over and running in a couple of hours. Microsoft Hyper-V comes bundled with Windows Server 2008, making it a good value for newer servers. Backup software which supports backing up a VM image will be needed.
SQL Server offers a feature called “log shipping”, which allows a critical server to keep another backup server updated with all database changes. This can be useful for applications that depend on SQL Server.
Keep in mind also, that if your plan requires a tech onsite, the response time of the tech will affect how long till you are back up.
Less than 1 hour downtime for critical servers
For critical servers, run the server in a Virtual Machine environment. Use VM management software that automatically creates a duplicate image of the system every few minutes. Use management software (usually on a separate server) that monitors the running servers, and automatically fails over to the duplicate image (on separate hardware) in case of a failure that is not handled by redundant power supplies and hard disks.
VMware ESX (http://www.vmware.com/products/esxi) is one option. This is expensive ($7,000 in addition to server hardware). Citrix Xen is another option.
Less than a few minutes down time
Clustered servers are often used where services can’t go down and high performance is required. Multiple servers run the same application, or are set up to automatically fail over to a “hot spare” server. If one server fails, another takes over within a few seconds. This solution requires a more complex recovery plan, and still has single points of failure. This is also expensive, and requires skilled ongoing maintenance.
Have a backup plan, which includes full “point in time” backups of all important data. No matter how reliable your servers are, you still need multiple backups. If something happens to the data, you need to be able to go back to a time before the damage occurred. Identify what should be backed up and how often – how much work in each critical application could you afford to lose? This is part of the disaster prevention plan. Locate any critical business data which is not stored on your server, and include it in your backup plan.
Be aware of the time involved in locating and restoring a backup if needed. This is particularly important if you are using an offsite backup service. The recovery time will depend on a working internet connection (which in some cases relies on your server), and the bandwidth available to copy the backup image. For example, a moderately sized server backup could take 12 or more hours to copy down over your internet connection. For this reason, we recommend a combination of onsite and offsite backups.
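To get a feel for this, the download time can be estimated from the backup size and the connection speed. The sketch below is a back-of-the-envelope calculation only; the function name, the sample figures, and the `efficiency` factor (an assumed fraction of the advertised speed that is actually achieved) are illustrative assumptions.

```python
def restore_hours(backup_gb, download_mbps, efficiency=0.8):
    """Estimate the hours needed to download a backup.

    backup_gb      -- size of the backup in gigabytes (10**9 bytes)
    download_mbps  -- advertised download speed in megabits per second
    efficiency     -- assumed fraction of the advertised speed actually achieved
    """
    bits = backup_gb * 8 * 10**9
    bits_per_second = download_mbps * 10**6 * efficiency
    return bits / bits_per_second / 3600


if __name__ == "__main__":
    # Hypothetical example: a 100 GB server image over a 20 Mbps connection.
    print(f"{restore_hours(100, 20):.1f} hours")
```

With these sample figures the estimate is about 13.9 hours, before counting the time to rebuild the server itself. Running the numbers for your own backup size and connection shows quickly whether an offsite-only restore fits your service level.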
Make sure you have offsite copies of critical software needed to restore your system, as well as the backup data. At a minimum this should include the server operating system and the backup software, and associated install keys.
Rotate your backup images in the following way: Have backups for each day Monday through Thursday (for example). Have a Friday 1, Friday 2, and Friday 3 backup, which rotate weekly. Have a Month 1, Month 2, etc., which are used on the last Friday of the month. Here is a typical rotation schedule:
Monday – Monday backup
Tuesday – Tuesday backup
Wednesday – Wednesday backup
Thursday – Thursday backup
Friday – Friday 1 backup
Monday – Monday backup (Erase and reuse)
Tuesday – Thursday – Erase and reuse daily
Friday – Friday 2 backup
Monday – Thurs – Erase and reuse daily backups
Friday – Friday 3 backup
Monday – Thursday – Erase and reuse daily backups
Friday – Month 1 backup
Monday – Thursday – Erase and reuse daily backups
Friday – Erase and reuse Friday 1 backup
Monday – Thursday – Erase and reuse daily backups
Friday – Month 2 backup
This rotation allows you to go back to most any point in time and recover from bad data.
This is often setup to run automatically from your backup software.
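If your backup software does not handle the rotation for you, the rule above can be expressed in a few lines. The sketch below is one possible reading of the schedule, not a description of any particular backup product: the function name is made up, weekends are assumed to have no backup, the monthly set is assumed to be numbered by calendar month, and a month with a fourth non-final Friday is assumed to wrap back to the Friday 1 set.

```python
from datetime import date, timedelta

DAILY = ["Monday", "Tuesday", "Wednesday", "Thursday"]


def backup_set_for(day, weekly_sets=3, monthly_sets=12):
    """Return the name of the backup set to (re)use on `day`, or None on weekends."""
    weekday = day.weekday()              # Monday is 0, Friday is 4
    if weekday < 4:
        return f"{DAILY[weekday]} backup"
    if weekday > 4:
        return None                      # assumption: no weekend backups
    # A Friday is the last of its month if the following Friday is in another month.
    if (day + timedelta(days=7)).month != day.month:
        return f"Month {(day.month - 1) % monthly_sets + 1} backup"
    friday_number = (day.day - 1) // 7   # 0 for the first Friday of the month
    return f"Friday {friday_number % weekly_sets + 1} backup"


if __name__ == "__main__":
    # Print the set to use for each day of January 2024 as a sample.
    day = date(2024, 1, 1)
    while day.month == 1:
        print(day.isoformat(), backup_set_for(day))
        day += timedelta(days=1)
```

Printing the schedule for a month or two and posting it next to the backup drives makes it easier for whoever swaps the media to pick the right one.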
Make sure your backup program verifies the backup, and that a person checks daily to see that the verify was successful, and that the backup was actually performed. Unchecked backups are a frequent cause of problems. Perform a test restore on a regular basis.
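A test restore can be spot-checked by comparing the restored files against the originals. The sketch below is a minimal, generic illustration using file checksums; it is not a feature of any backup product named in this article, and the function names and folder arguments are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(source_dir, restored_dir):
    """Return relative paths that are missing or different in the restored copy."""
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        relative = src.relative_to(source_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            problems.append(str(relative))
    return problems
```

Calling `find_mismatches` with the live data folder and the test restore folder returns a list of files to investigate. Files that changed after the backup was taken will also show up as mismatches, so the comparison is most meaningful for data that has not been edited since the backup ran.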
External hard drives are replacing tapes as the backup medium for onsite backups. A 1.5 Terabyte external drive is now under $250. A tape backup drive and tapes can easily cost over $1000.
Keep a current backup off site. This can be done by rotating tapes and taking one home, rotating external hard drives, or using an offsite (internet) backup service. If you have multiple sites, you can set up a backup between sites.
Keep a copy of everything you need to recreate your server off site. The tapes will not do you much good without the backup software you used to create them.
General suggestions for disaster prevention
Have a plan for recovering from the following:
Hard drive crashes – Mirrored drives or RAID arrays are the best protection from this. Critical workstations can be set up with SATA RAID without much additional expense. How long can you be without a computer? Do you have a spare? How much work is involved in recreating your workstation configuration (installing and configuring applications)? Consider using standardized workstation hardware, and creating an install image which could be used to restore your configuration to new hardware.
Power supply failures – Have redundant power supplies, or keep a spare.
Utility power failure – Use an uninterruptible power supply (UPS) on the server. Consider UPS protection for network hubs and workstations if work is critical enough that essential work could be lost or damaged. Remember, anything not saved (work in progress) on your workstation vanishes when the power goes off, and if the file was on the server, it could be corrupted when the file is not closed (depending on the application).
Complete server replacement – If you can’t afford more than one server, consider having at least one workstation built from the same or similar parts which you could rob, or use as a backup server in an emergency. The alternative is to have a certain and immediate source of supply for a replacement. It may be worth pre-configuring a workstation with server software and setting it up for dual boot or virtual machine hosting, and having regular backups to this backup server.
If possible, go through the exercise of simulating the loss of your server. What would you do? Who would do it? How much would it cost, in replacement cost and lost work?
Network switch failure – Have a spare, or a source for a rapid replacement. None of the workstations can work without the switch.
Keep a list of software installed, licenses, settings, vendor & support information, and procedures to reinstall if necessary. Keep a network log, and document the configuration of essential components. This includes passwords, Users and groups, User and group rights, Directory structures, hardware installed and resources used by hardware, drive partitioning & configuration.
Document licensing, vendor contact information, and job responsibilities relating to the network, including relevant disaster planning. Document any relevant WAN configuration information – in the simplest case, a list of fax and DSL modem phone numbers, login account info, and how they are used. Keep a map of your network cabling. Keep a copy somewhere other than on the server - preferably offsite.
Keep a separate box or file for each computer that contains all manuals, disks, and spare parts for that machine. Magazine file boxes work great. If the cost to replace is great enough, keep backups of critical software and documentation off site.
If you have a hosted website, is it being backed up? Make sure it is included in your backup plan.
Security planning is closely related to disaster prevention. Poor security can lead to corrupted or compromised data. A security policy and plan can prevent a disaster.
Some additional general thoughts on planning for catastrophe:
Are there people who are essential to your business? Is their knowledge documented as much as possible? Who could act as a backup in emergency?
What forms/physical files are essential to your business? Can you recreate them in the event of a disaster?
Do you have vendors or vendor related information that is critical? Can this be recovered or replaced quickly enough? Are there vendors that could not be replaced if they were lost (for example, if they have a disaster)?
Does your phone system depend on utility power? (if it is a PBX it does.) How long will it run without power, and what do you do then?
Tim Torian has taught computer networking at the College of Sequoias and Cal Poly Extension. He has a BS in Computer Science, and has been consulting on computer networking for the past 30 years. His industry certifications include: Cisco CCNA and CCNI, Microsoft MCSE. He was recognized as Entrepreneur of the Year for 2008 by the Tulare County EDC. He is president of Torian Group, Inc., which provides a full range of technology consulting services to local businesses, including computer services, networking, web and custom software development. www.toriangroup.com