Wednesday, July 9, 2008

Data center disaster recovery considerations checklist

Looking back on my data center disaster recovery experiences as both an IT director and consultant, I regularly encountered organizations in various stages of defining, developing, implementing and improving their disaster recovery capabilities. Disaster recovery policies and architectures, especially in larger organizations, are complex. There are lots of moving parts: standards and procedures to be defined, people to be organized, technology functions to be coordinated and application dependencies to be identified and prioritized. Add to this mix the challenge of grappling with the inherent uncertainties (chaos) associated with a disaster – whatever the event might be – and the complex becomes even more convoluted.

It is critical to agree on some fundamental assumptions so that both internal and external parties (think stakeholders and stockholders) recognize the need to address these many facets of disaster recovery development. Failure to do so will only lead to significant problems down the road.

I've given many presentations that address the "DR Expectations Gap," in which business assumptions concerning recoverability are often misaligned with actual IT capabilities. Unless explicit assumptions are clearly identified and communicated, the heroes of yesterday's recovery will become tomorrow's disaster recovery scapegoats.

Key among these assumptions, of course, is establishing classes of recovery in terms of RTO and RPO, but there are also a number of fundamental considerations that need to be measured, weighed and incorporated into the disaster recovery planning process. Here are a few practical planning items whose assumptions must be stated explicitly in order to drive an effective disaster recovery design and plan:

  1. Staff: Will the IT staff be available and able to execute the disaster recovery plan? How will they get to the alternate disaster recovery site? Are there accommodations that need to be made to ensure this? When a disaster hits, you had better understand that some of your staff will stay with their families rather than immediately participate in data center recovery.
  2. Infrastructure: What communications and transportation infrastructure is required to support the plan? What if the planes aren't flying or the cell phones aren't working or the roads are closed?
  3. Location: Based on the distance of the disaster recovery site, what categories of disaster will or will not be addressed? According to best practices, that site should be far enough away not to be affected by the same disaster – is yours?
  4. Disaster declaration: How does a disaster get declared, and who can declare it? When does the RTO "clock" actually start?
  5. Operation of the disaster recovery site: How long must it be operational? What will be needed to support it? This is even more important if you're using a third party (e.g., what's in my contract?).
  6. Performance expectations: Will applications be expected to run at full performance in a disaster recovery scenario? What level of performance degradation is tolerable and for how long?
  7. Security: Are security requirements in a disaster scenario expected to be on par with pre-disaster operation? In some specific cases, you may require even more security than you originally had in production.
  8. Data protection: What accommodations will be made for backup or other data protection mechanisms at the disaster recovery site? Remember, after day one at your recovery site, you'll need to do backups.
  9. Site protection: Will there be a disaster recovery plan for the disaster recovery site? And if not immediately, then who's responsible and when?
  10. Plan location: Where will the disaster recovery plan be located? (It had better not be in your primary data center.) Who maintains it? How will it be communicated?

Obviously, there are many more considerations that are critical to identify and address for successful disaster recovery, but hopefully this tip has helped to point you in the right direction.

Sunday, July 6, 2008

UPS Apparent & Real Power

"80%" figures come from several different things and, since the same percentage number is coincidentally used, it can be confusing as to what is "required" versus what is "recommended". Two of these "80%" figures are strictly engineering-related, but are not generally understood. If they are well known to you, my apologies, but I think they bear explanation.

First is Power Factor ("pf"), which is the way engineers deal with the difference between "Real" and "Apparent" power. In our industry, the pf is usually created by reactive devices such as motors and transformers. "Apparent Power" is Volts x Amps. This is the "VA" rating of the equipment (or "kVA" if it's divided by 1,000). "Real Power" is Watts or kW – the "useful work" you get from electricity. Since you have stated that you are using Liebert hardware, we'll assume the 0.8 Power Factor on which their designs (and most UPS designs) are based. With this in mind, what's important to understand is that there are really two UPS ratings you can't exceed. For your 400 kVA UPS, 100% load means 400 kVA (kiloVolt-Amperes), but it also means 320 kW (kiloWatts). That comes from the formula kW = kVA x pf. In years past, computer devices with a 0.8 pf were common, so both the kW and kVA ratings of the UPS were reached essentially simultaneously. However, since most of our computers today are designed with much better Power Factors (between 0.95 and 0.99), it is virtually certain that the kW rating will be your limiting factor, not the kVA rating. (For example, a device measuring 10 Amps at 120 Volts draws 1,200 VA or 1.2 kVA of "Apparent Power." If the Power Factor is 0.8, it consumes only 960 Watts or 0.96 kW of "Real Power" – the same ratio as your UPS ratings. However, with a 0.95 pf, the "Real Power" is 1,140 Watts or 1.14 kW, so the kW capacity of the UPS will be reached before the kVA limit.)
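The kW = kVA x pf relationship, and the question of which UPS rating a given load reaches first, can be checked in a few lines of Python. This is only an illustrative sketch: the function names are invented for this example, and the 400 kVA and 0.8 pf defaults are the nameplate figures assumed in the discussion above, not measured values.

```python
import math

def real_power_watts(volts, amps, pf):
    """Real power (W) = apparent power (V x A) x power factor."""
    return volts * amps * pf

def limiting_rating(load_kva, load_pf, ups_kva=400.0, ups_pf=0.8):
    """Return which UPS rating ("kW", "kVA" or "both") the load uses up fastest.

    ups_kva and ups_pf are assumed nameplate values for illustration.
    """
    ups_kw = ups_kva * ups_pf                   # 400 kVA x 0.8 = 320 kW
    kva_fraction = load_kva / ups_kva           # share of the kVA rating in use
    kw_fraction = load_kva * load_pf / ups_kw   # share of the kW rating in use
    if math.isclose(kw_fraction, kva_fraction):
        return "both"
    return "kW" if kw_fraction > kva_fraction else "kVA"

# The 10 A, 120 V device from the text at the two power factors discussed.
print(real_power_watts(120, 10, 0.8))
print(real_power_watts(120, 10, 0.95))

# A 300 kVA load: at 0.8 pf both ratings fill together; at 0.95 pf kW governs.
print(limiting_rating(300, 0.8))
print(limiting_rating(300, 0.95))
```

With any load power factor above the UPS design pf of 0.8, the kW fraction is the larger of the two, which is why the kW reading is the one to watch.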

The second "80%" number is the National Electrical Code (NEC) requirement for Circuit Breaker Ratings. NEC states that you can't continuously load any circuit to more than 80% of the Breaker Rating. This means, for example, that a 20-Amp Breaker on a 120-Volt circuit, running light bulbs or heating appliances which have a pf of 1.0, cannot be continuously loaded to more than 1,920 Watts (120 x 20 x 80%). A 20-Amp breaker can handle a full 20-Amp load for a short time, such as when a motor starts, but running that current on a sustained basis will eventually cause it to trip. That's the way they're designed. This, however, has little to do with how a UPS can be loaded, since all the circuit breakers are designed to operate within legal range when the UPS is at capacity. I explain it only because some people have thought this limited the total UPS loading. It does not.
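The breaker arithmetic above is simple enough to express as a one-line function. The sketch below is illustrative only; the function name and the `derating` parameter are assumptions made for this example, and it covers only the simple 80% continuous-load figure described here, not the full set of code rules.

```python
def continuous_limit_watts(breaker_amps, volts, pf=1.0, derating=0.80):
    """Maximum continuous load (W) on a branch circuit under the 80% rule."""
    return breaker_amps * volts * pf * derating

# The 20-Amp, 120-Volt example from the text (pf of 1.0).
print(round(continuous_limit_watts(20, 120)))  # 1920
```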

Now to the third "80%" consideration. Any piece of electrical equipment generates heat when it operates, and the more power it handles, the more heat it produces. That's where the "Real Power" goes; it is converted to heat. Industrially rated devices, such as large UPS systems, are designed to withstand this heat – at least so long as proper cooling and ventilation are provided to remove it in the manner for which the equipment was designed. However, heat eventually causes a breakdown of electrical insulation and shortens the life of components, especially if it's applied over a long period of time. Therefore, "good practice" has always been to operate electrical equipment at 80% or less of rated capacity simply to ensure longer life, as well as to compensate for the fact that virtually nothing actually gets cooled and ventilated in the field as well as it does in the lab, or as perfectly as the specifications call for. But top-quality equipment (and any 400 kVA UPS is bound to be from a manufacturer making high-quality goods) is designed with enough "headroom" to run its entire rated life at 100% loading. If there's a weak point, of course, full-level operation will expose it, and it can then be fixed, but that's not what we're talking about here. Although there is no "written rule" I'm aware of, 80% continuous loading is the generally accepted "rule-of-thumb" for maximizing the service life and reliability of electrical equipment.

Now let's discuss the "Parallel Redundant" or "Power Tie" configuration, because that sheds a different light on the situation. In this configuration, as you obviously are aware, you must manage your power so that neither UPS is loaded beyond 50% of its continuous load rating. (Again, both kW and kVA readings must be examined, with the kW reading likely to be the governing one.) This is so the loss of either UPS, whether due to failure or to an intentional maintenance shutdown, will not load the remaining UPS beyond 100% of its rated load. Should we be concerned that one UPS is running at 100% when the other is shut down? Not at all. Even if this load continues for some number of days, the UPS is only operating as it was designed. It is highly unlikely that a few hours, or even a few days, of full-load operation will shorten its life (again, hidden flaws notwithstanding, in which case any load may cause a failure at some point in time). What we should consider here is normal conditions, where each UPS is operating at less than 50% of its capacity. This is obviously well below the 80% rule-of-thumb, so the UPS is literally coasting. Under this loading, it should run for many years beyond its expected life span if kept clean and the batteries remain in good condition.
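The 50% rule for a parallel redundant pair amounts to asking whether the combined load would still fit on one surviving UPS, checked against both the kW and the kVA rating. A minimal sketch follows, assuming a 400 kVA / 320 kW rating per UPS; the function name and the sample readings are hypothetical.

```python
def survives_loss_of_one(loads_kw, loads_kva, ups_kw=320.0, ups_kva=400.0):
    """True if the combined load still fits on a single surviving UPS.

    loads_kw and loads_kva hold the metered reading from each UPS.
    Both ratings are checked because either one can be the limit.
    """
    return sum(loads_kw) <= ups_kw and sum(loads_kva) <= ups_kva

# Hypothetical readings at roughly 0.95 pf.
print(survives_loss_of_one([150, 150], [158, 158]))  # True: 300 kW, 316 kVA
print(survives_loss_of_one([170, 170], [179, 179]))  # False: 340 kW exceeds 320 kW
```

In the second case the kVA total (358) is still under 400, so only the kW check catches the problem, which matches the point that the kW reading is likely to govern.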

Therefore, your stated limit of 40% on each UPS is ultra-conservative. Obviously, you don't want to be so close to the 50% level on either UPS that someone plugging in a temporary device runs you over, but in your situation you have some 32,000 Watts of "2N redundant headroom" on each UPS at the 40% level (10% of 320 kW), and that's a lot of expensive cushion. (Estimate $1,200 per kW for each UPS, and you're at more than $76,000 in "insurance.")
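The headroom and cost figures can be reproduced as follows. This is a rough sketch using the numbers quoted above (320 kW per UPS, a 40% operating limit against a 50% ceiling, and the $1,200 per kW estimate); the function and parameter names are made up for this example.

```python
def headroom_cost(ups_kw=320.0, operating_limit=0.40, redundancy_ceiling=0.50,
                  ups_count=2, dollars_per_kw=1200.0):
    """Return (unused kW per UPS, estimated cost of that unused capacity)."""
    headroom_each_kw = ups_kw * (redundancy_ceiling - operating_limit)
    total_cost = headroom_each_kw * ups_count * dollars_per_kw
    return headroom_each_kw, total_cost

each_kw, cost = headroom_cost()
print(f"{each_kw:.0f} kW per UPS, about ${cost:,.0f} across both")
```

Raising `operating_limit` toward 0.50 shows how quickly the cost of the cushion falls as the limit approaches the redundancy ceiling.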

As you observed, any good UPS can sustain a little over 100% rating for a short time, so if you happened to exceed 50% temporarily, and a failure were to occur, the second UPS might Alarm, but it should continue to function at least long enough for you to accomplish some manual load-shedding. You have also mentioned, however, allowing capacity for parallel installations and change-outs, which is a valid reason to operate below the 50% level. Only you can determine how much margin you really need for that purpose, but since redundant UPS capacity is expensive, as noted above, it might be more cost-effective to run test setups on lesser, rack-mounted UPS's (full-time, not "line interactive") than to maintain a high level of headroom on a parallel-redundant system.

Surges should not be a particular concern. There's not much in the data center that can cause a significant surge. Those tend to occur on the input side, and are part of what the UPS is supposed to get rid of. (Hopefully, you have good surge protection on your bypass feeders.) The one big thing that must be evaluated in choosing UPS systems for redundant operation is the "step loading function." In the event of a failure, the resulting sudden load shift can be 100%, literally doubling the load on one UPS virtually instantaneously. The UPS must be able to sustain this rapid change, and maintain stable output voltage, current, frequency and waveform, to be suitable for redundant service. This is an easy performance item to verify with, and to compare among, manufacturers.

Regarding your PDU's, the kW capacity is dependent on the pf of your data center loads. Liebert's PDU Technical Data Manual instructs you to assume a 0.8 pf if the actual pf is unknown. This would mean each of your 225 kVA PDU's could deliver only 180 kW of power. Today, it is more probable that the pf is on the order of 0.95, as discussed above, which would mean that each 225 kVA PDU could deliver about 214 kW. (Incidentally, you should be able to read the kW, kVA and pf from your PDU metering systems.)

If we understand your PDU configuration, you have only two 225 kVA units connected to your 400 kVA parallel redundant UPS, rather than two per UPS which would be a total of four. If this is correct, then with 0.8 pf loads it would be possible to run each PDU at 89% of capacity without exceeding the kW rating of your redundant UPS (89% x 180 kW x 2 PDU's = 320 kW). If the loads are closer to a 0.95 pf, which is likely, then you could load each PDU to only about 75% of maximum before reaching the limit of your UPS capacity (75% x 214 kW x 2 PDU's = 321 kW). This is obviously well below the 80% "rule of thumb" for both the PDU's and the UPS's. In most data centers, because we like to minimize the number of devices on a single circuit, branch circuits are rarely loaded to more than a fraction of capacity, so the 80% breaker maximum is rarely a consideration, and total PDU loadings are often far less than maximum.
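The PDU figures above follow from the same kW = kVA x pf formula. The sketch below assumes the configuration as we read it (two 225 kVA PDU's fed from 320 kW of redundant UPS capacity); the function names are hypothetical and the power factors are the two values discussed, not measurements.

```python
def pdu_kw_capacity(pdu_kva, pf):
    """kW a PDU can deliver at a given load power factor."""
    return pdu_kva * pf

def max_pdu_load_fraction(ups_kw, pdu_kva, pf, pdu_count=2):
    """Fraction of each PDU's capacity usable before the UPS kW limit is hit."""
    total_pdu_kw = pdu_count * pdu_kw_capacity(pdu_kva, pf)
    return min(1.0, ups_kw / total_pdu_kw)

for pf in (0.8, 0.95):
    capacity = pdu_kw_capacity(225, pf)
    fraction = max_pdu_load_fraction(320, 225, pf)
    print(f"pf {pf}: {capacity:.0f} kW per PDU, load each to about {fraction:.0%}")
```

The better the load power factor, the more kW each PDU can deliver, so the UPS kW rating is reached at a lower percentage of PDU capacity.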

But this is another place where redundancy must be considered. If you have only two PDU's, and you are connecting dual-corded equipment plus, as you indicate, single-corded equipment with Static Transfer Switches (STS), then you must maintain the total load on both PDU's at no more than 100% of one PDU's capacity. The easiest way to ensure this is to keep the loads fairly evenly balanced, and below 50% of capacity on each PDU. In your case, assuming (2) 225 kVA PDU's and equipment with a pf of 0.95, the load on each PDU should be no more than 107 kW or 113 kVA. This would result in a total maximum load on your UPS of only 214 kW (67% of the 320 kW redundant capacity) or just 33.5% capacity on each UPS. This is gross under-utilization of the UPS. But if you load either of your 225 kVA PDU's to more than 50% of capacity (assuming the total now exceeds 100%), and either PDU must be shut down for service, or its main breaker trips, then the total load will instantly shift to the remaining PDU, which will now be overloaded and will also shut down – if not immediately, then in a short time.
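The redundancy limits for a two-PDU arrangement can be worked through the same way. This is a sketch under the assumptions stated above (two 225 kVA PDU's, a 0.95 pf, 320 kW of redundant UPS capacity, and loads that fully transfer to the surviving PDU); the function names and dictionary keys are invented for illustration.

```python
def redundant_pdu_limits(pdu_kva=225.0, pf=0.95, ups_kw=320.0):
    """Per-PDU limits when two PDU's must each be able to carry the whole load."""
    per_pdu_kva = 0.5 * pdu_kva          # keep each PDU at or below 50%
    per_pdu_kw = per_pdu_kva * pf
    total_kw = 2 * per_pdu_kw            # combined load seen by the UPS system
    return {
        "per_pdu_kva": per_pdu_kva,
        "per_pdu_kw": per_pdu_kw,
        "total_kw": total_kw,
        "share_of_redundant_ups": total_kw / ups_kw,
        "share_per_ups": total_kw / (2 * ups_kw),
    }

def surviving_pdu_overloaded(load_a_kva, load_b_kva, pdu_kva=225.0):
    """True if losing either PDU would push the other past its rating."""
    return load_a_kva + load_b_kva > pdu_kva

limits = redundant_pdu_limits()
print(f"{limits['per_pdu_kw']:.0f} kW or {limits['per_pdu_kva']:.1f} kVA per PDU")
print(f"{limits['share_of_redundant_ups']:.0%} of redundant UPS capacity")
print(surviving_pdu_overloaded(130, 110))  # True: 240 kVA on a 225 kVA PDU
```

The results agree with the rounded figures in the text: about 107 kW per PDU and roughly two-thirds of the 320 kW redundant UPS capacity.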

My preference is to use more, smaller PDU's in order to maintain redundancy, as well as to gain as many discrete circuits as possible. One can then configure data hardware connections to minimize PDU vulnerability, as well as for UPS redundancy, make better use of each PDU, and realize the full usable capacity of the UPS. Without an actual diagram of your installation, of course, we are just speculating as to how you are configured. There are several ways to connect an installation, and our response is based on the way we read your question.

You seem to be running your UPS very conservatively. We are not so sure about how you are running your PDU's, but it seems you have a reasonably good understanding of the principles. We hope this answers your questions, gives you a little more insight, and perhaps lets you confidently get more from your valuable UPS.