Risk Assessments for Critical Operations Power Systems

Engineers are asking many questions about the new Article 708, Critical Operations Power Systems (COPS), in the 2008 National Electrical Code (NEC). For example: How do we comply with the documented risk assessment required in Section 708.4? Unlike the other special systems of Chapter 7, there is subtext to the new article that may depend upon mandates from state or federal authorities.

By Michael A. Anthony, PE, Robert Arno, Robert Schuerger, PE, and Evangelos Stoyas
June 1, 2008

Fine Print Note No. 1 reads: “Critical operations power systems are generally installed in vital infrastructure facilities that, if destroyed or incapacitated, would disrupt national security, the economy, public health, or safety, and where enhanced electrical infrastructure for continuity of operation has been deemed necessary by governmental authority.”

To NEC traditionalists who believe the NEC should remain a prescriptive installation code for electrical safety in building premises wiring only, this may be unsettling. They want clear, bright-line code that makes their point at construction board of appeals meetings. Although a few sections of the code already address design, Article 708 arrives in the special systems chapter with subtleties not seen in the other Chapter 7 articles that govern backup power systems: commissioning, maintenance, and testing. It has annexes with new words that seem to make paperwork as important as wiring (see the sidebar “COPS terms of art”).

It also raises issues of how these COPS requirements will be financed. You can build the cost of a generator into the first cost of a building on the basis of NFPA 101. However, because the COPS is an integral part of the building, is ensuring 72 hours of fuel for it (as required in 708.22(C)) considered an operating expense? Conversely, can a homeland security grant help pay for an infrastructure upgrade where the designated critical operations area (DCOA) is an embedded system in a multi-function building? Does testing for compliance with 708.6 mean that third-party agencies need to be paid from a county homeland security grant or from the same municipal budget that covers inspections for general commercial occupancies?

These are questions that will be answered as 708 is integrated into other standards and codes. Article 708 cross-references six other NFPA standards within itself. Progress in the integration process will be measured by how fast we see reciprocal referencing of 708 in other NFPA, International Code Council, Federal Emergency Management Agency (FEMA), and Dept. of Homeland Security documents; and how fast precedents track in white papers and advisories of public utility commissions and county emergency management agencies.

Here, we describe the characteristics of an NEC-compliant risk assessment for a single designated critical operations area. We assume it does not trigger a multi-trade infrastructure upgrade of the larger facility within which the COPS is embedded. We borrow heavily from the technical foundation provided by the U.S. Army Corps of Engineers (USACE) Power Security Enhancement Program, whose technical manuals, used in the design and operation of Command, Control, Communications, Computer, Intelligence, Surveillance, and Reconnaissance (C4ISR) facilities, have been released for civilian distribution.

Because the geometry of regional critical infrastructure typically is widely scattered, keep in mind that the risk assessment methods described here can be scaled outward to encompass an entire city, county, or state. The emergency operations plan required in 708.64 should reflect a comprehensive understanding of critical operations as a system-wide entity that may encompass several locations networked together as a single operation.

STEP 1: DETERMINE LOCATION(S)

A template for a regional risk assessment was introduced before the 2008 NEC went to print, before it had been established that the technical panel assigned to the task was dealing with a Chapter 7 special system rather than a Chapter 5 special occupancy. The article included a qualitative discussion of how jurisdictions might rank regional risks. Simply put, DCOAs had to be scaled according to the relative likelihood of natural or human-made disasters.

In a follow-up article, a nominal prioritization procedure that resembled a mode criticality “multi-voting” technique promoted by the American Society for Quality was used to produce a disciplined regional risk assessment. An example ranking the likelihood of earthquakes, convective weather, and pipeline accidents was tabulated for a southeastern Michigan county.
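A nominal prioritization of this kind can be sketched in a few lines. The three hazards are the ones named above; the ballots, the 3-2-1 point scale, and the function name are assumptions made only for illustration and are not the procedure or the results from the follow-up article.

```python
from collections import defaultdict

def rank_hazards(ballots):
    """Sum each voter's points per hazard and rank hazards by total points."""
    totals = defaultdict(int)
    for ballot in ballots:
        for hazard, points in ballot.items():
            totals[hazard] += points
    # Highest total first; ties are broken alphabetically so the order is stable.
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical ballots: each stakeholder gives 3 points to the hazard judged
# most likely, 2 to the next, and 1 to the least likely.
ballots = [
    {"convective weather": 3, "pipeline accident": 2, "earthquake": 1},
    {"convective weather": 3, "pipeline accident": 2, "earthquake": 1},
    {"pipeline accident": 3, "convective weather": 2, "earthquake": 1},
]

ranking = rank_hazards(ballots)
for hazard, score in ranking:
    print(f"{hazard}: {score}")
```

A real assessment would weight the votes by consequence as well as likelihood; the sketch only shows how individual judgments roll up into a single ordered list.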

In identifying candidate locations that can meet 708.5 requirements for physical security, it is likely that adopting jurisdictions will have a choice of existing emergency response/disaster management centers. After all, there has been a great build-out of emergency management facilities—even before Article 708 came along—and we have to assume that they are able to communicate with one another. It’s important to keep communication channels open, so the writers of the 2008 version of Article 708 appended references to the signaling between these agencies in Annex G: Supervisory Control and Data Acquisition. The same topic is covered in substantial detail elsewhere.

Ideally, COPS should not contain a single point of failure that would allow both the normal electrical service and the emergency backup power to be affected by a single incident. This is more difficult to achieve than it first appears with only one transfer switch.

The normal design for uninterruptible power supply (UPS) batteries is 10 to 15 min of runtime, which powers the UPS output continuously while the generator starts and comes on line. For a facility that has personnel on-site 24/7, providing a means either to transfer the switch manually under no load (a capability built into most automatic transfer switches, or ATSs) or to bypass the switch would solve this single point of failure. For an unmanned site, redundant components would be required. These kinds of failures can be identified by a fault tree analysis or a failure modes and effects analysis.

Emergency and normal electrical equipment should be installed separately, at different locations, and as far apart as feasible. Beware of co-locating essential or redundant feeders with other major utilities. Some types of facilities—where tornadoes or blasts rank high on the regional risk assessment—need redundant electrical distribution to critical areas. Central utility shafts may be subject to damage.

One qualitative method of evaluating the architectural, utility, and site layout aspects of this part of the risk assessment is found in the “Building Vulnerability Assessment Checklist” developed by the Department of Veterans Affairs and described in detail in a FEMA manual. The reference takes the user through a consistent security evaluation of designs at various levels.

There are other methods, such as those used by USACE in the evaluation of the utility systems for prospective C4ISR facilities (see the sidebar “Risk assessment: determining location”).

Keep in mind that if a DCOA remains embedded within an existing general occupancy building, the COPS for the DCOA will have to be protected from changes to the mechanical and electrical infrastructure associated with the general occupancy parts of the building.

STEP 2: DETERMINE REQUIRED AVAILABILITY

Availability is the percentage of time the system is in operation or is available for operation. Suppose the state emergency management agency issues a directive that all city fire departments must have backup power available to at least “three nines” (99.9%), less than the 72 hours required in 708.22(C). Three nines would not be an unfamiliar metric: the power provided by most U.S. utilities is available out to three and four nines. The agency is asking for utility-like availability at the DCOA using the power sources listed in Part III of 708.

Identifying an acceptable risk is one of the most difficult decisions for higher governmental authorities to make. While the cost of additional nines is a well-known parameter in the business continuity industry guided by NFPA 1600, “Standard on Disaster/Emergency Management and Business Continuity Programs,” these agencies would have to look at the economics of additional nines to public service organizations with different missions.

Getting availability out to three nines could be costly if fuel storage or emissions restrictions apply to the site. As the availability table in Annex F shows, three nines of availability translates into a potential 8.76 hours of downtime in a year. If getting to three nines means periodic testing that triggers a local cap on emissions, designers may have to investigate other feasible sites. They could develop vehicle-mounted generators covered in 708.20(F)(6) as an option, along with fuel-handling logistics.
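The arithmetic behind the 8.76-hour figure is worth making explicit. The short sketch below converts an availability target into potential annual downtime; it assumes an 8,760-hour year (leap years ignored), and the function name is ours rather than anything defined in Annex F.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours; leap years are ignored

def annual_downtime_hours(availability):
    """Potential hours of downtime per year for a given availability (0 to 1)."""
    return (1.0 - availability) * HOURS_PER_YEAR

for label, target in [("two nines", 0.99), ("three nines", 0.999),
                      ("four nines", 0.9999), ("five nines", 0.99999)]:
    print(f"{label:>11}: {annual_downtime_hours(target):8.3f} hours per year")
```

Each additional nine cuts the downtime allowance by a factor of 10, which is why the cost of the next nine tends to rise so steeply.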

Administrative options should not be overlooked in the hunt for savings. Limiting the use of parts of the overall facility provides a workaround, sometimes as simple as scaling the size of the DCOA to the number of people or the level of training of the people who will use it.

To achieve 708 objectives with a single engine-generator set suggests a considered design in which administrative procedures and power chain hardware are balanced carefully. One characteristic of the reliability engineer’s art is how administrative options are translated into the system model. Translating system needs or requirements into reliability numbers is critical in establishing a facility that meets the 708 requirements. Caution should be exercised here not to fall into a “cookie cutter” approach.

STEP 3: BASE CASE: ESTABLISH AVAILABILITY POTENTIAL

Several different tools exist to model the availability potential of the COPS design. In general, reliability engineers use modeling tools with the capability of statistical simulation to incorporate variations in component capacities associated with consumables such as water and fuel. Failure rate and repair time data play an important role in the analysis of a system to determine whether it meets requirements.
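To give a feel for what statistical simulation means here, the following is a minimal Monte Carlo sketch for a single repairable component. It assumes exponentially distributed times to failure and times to repair, which is a common simplification and our assumption, not a requirement of Article 708; the MTBF and MTTR values are hypothetical. Commercial tools model whole networks and consumables, which this sketch does not attempt.

```python
import random

def simulate_availability(mtbf_hours, mttr_hours, horizon_hours, seed=0):
    """Estimate availability by alternating random up and down periods."""
    rng = random.Random(seed)
    clock = 0.0
    uptime = 0.0
    while clock < horizon_hours:
        # Time until the next failure, truncated at the end of the study period.
        up = min(rng.expovariate(1.0 / mtbf_hours), horizon_hours - clock)
        uptime += up
        clock += up
        if clock >= horizon_hours:
            break
        # Time to repair; any part beyond the horizon simply is not counted as uptime.
        clock += rng.expovariate(1.0 / mttr_hours)
    return uptime / horizon_hours

# Hypothetical component: fails on average every 1,000 hours, takes 10 hours to repair.
estimate = simulate_availability(mtbf_hours=1_000.0, mttr_hours=10.0,
                                 horizon_hours=5_000_000.0, seed=1)
analytic = 1_000.0 / (1_000.0 + 10.0)
print(f"simulated {estimate:.5f} vs analytic {analytic:.5f}")
```

The simulated value should settle close to MTBF / (MTBF + MTTR) as the study period grows, which is a useful sanity check before trusting a more elaborate model.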

IEEE reliability experts began collecting data on failure rates and repair times more than 20 years ago and published it in the Gold Book (IEEE Std 493). This effort ran parallel to an international trend in total quality management (TQM) that employed statistical and probabilistic methods to improve component and system quality. A reference to TQM also shows up in Annex F. Realizing the importance of gathering failure rate and repair cost data to its own mission for civilian infrastructure security, the USACE funded the most comprehensive reliability data collection effort to date. The failure data used in any DCOA/COPS study should be included in the appendix of the risk assessment.

Unlike the cable reactance data in the wiring tables of Chapter 9 of the NEC—some of which is the better part of 100 years old now—power chain component reliability data are far more dynamic, reflecting improvements to “commodity” components of a COPS such as engines, generators, transfer switches, and UPS systems.

None is a commodity in a strict economic sense. They are components, lumped as parameters within a system, and manufacturers are always improving them. Because they tend to be factory-assembled and can be shipped to the system site, they can be treated as interchangeable commodities in a COPS. The core value of a COPS lies in the proper application of those commodity components to produce an initial and continually verifiable nameplate availability.

Skilled reliability engineers will apply their judgment in reconciling the competing requirements for redundancy and simplicity. They will hit the availability target with the least number of components, concentrating dollars on the components that will yield the largest value. Construction and maintenance activities are reciprocal partners in that effort, so the reliability engineer will balance attention to both the specifics and the interconnectedness of the first-cost/long-term O&M cost conundrum.

With commercial software, one can enter the failure and repair data from user-defined libraries into branch and node input data screens similar to the manner in which short circuit or load flow studies take in system data.
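The branch-and-node idea reduces, in its simplest form, to combining component availabilities in series and in parallel. The sketch below assumes independent failures and uses hypothetical availability figures; it is not the algorithm of any particular commercial package.

```python
from math import prod

def series(*availabilities):
    """All components must work: multiply the availabilities."""
    return prod(availabilities)

def parallel(*availabilities):
    """At least one component must work: one minus the product of unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)

# Hypothetical figures: utility and generator are redundant sources,
# and both feed the load through a single automatic transfer switch.
utility, generator, ats = 0.999, 0.99, 0.9999

sources = parallel(utility, generator)
system = series(sources, ats)
print(f"redundant sources: {sources:.6f}")
print(f"sources through one ATS: {system:.6f}")
```

With these numbers the redundant sources reach about five nines, but the single transfer switch in series pulls the system back below four. That is the single point of failure discussed under Step 1, expressed numerically.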

STEP 4: HITTING THE TARGET

Another set of useful tools for focusing on the areas that need the most attention in the system improvement process is Failure Modes and Effects Criticality Analysis (FMECA) and Fault Tree Analysis (FTA).

Each method is described in Annex F, Part I, and each employs a different probability distribution equation, but both identify the weak members of a system that contribute to prospective unavailability. These procedures focus efforts on the critical branches and nodes of the system that need the most improvement.
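Neither method is reproduced here, but the underlying goal of finding the weak members can be illustrated with a much simpler ranking by unavailability. The sketch assumes a series chain of independent components, so that each component's unavailability approximates its contribution to system downtime. The labels come from the sample parts list; the MTBF and MTTR values are hypothetical.

```python
def unavailability(mtbf_hours, mttr_hours):
    """Long-run fraction of time a repairable component is down."""
    return mttr_hours / (mtbf_hours + mttr_hours)

def rank_weak_links(components):
    """Return (label, unavailability, share of total), worst contributor first."""
    scored = [(label, unavailability(mtbf, mttr))
              for label, (mtbf, mttr) in components.items()]
    total = sum(u for _, u in scored)
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [(label, u, u / total) for label, u in ranked]

# Hypothetical (MTBF hours, MTTR hours) pairs for three components.
components = {
    "SB1": (500_000.0, 8.0),
    "GEN": (1_000.0, 20.0),
    "ATS1": (100_000.0, 5.0),
}

weak_links = rank_weak_links(components)
for label, u, share in weak_links:
    print(f"{label:>5}: unavailability {u:.6f} ({share:.1%} of total)")
```

A full FMECA or FTA also accounts for failure modes, severity, and the logic of how failures combine, so this ranking is only a first screen for where improvement dollars might go.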

These quantitative methods are new to the NEC, but are not new to other NFPA documents. For example, NFPA 1600 mentions FTA in its Annex A. Annex A of the 2006 “Vehicular Fuel Systems Code — NFPA 52” also mentions FMECA as a hazard analysis method.

FMECA and FTA methods are standard operating procedure in many industries that continually seek to improve designs for products and processes. They are required for compliance with safety and quality standards such as ISO 9001, QS 9000, and ISO/TS 16949. Substantial how-to information specific to mission critical power systems appears in another technical manual.

Note that four-nines availability exceeded the initial target of three nines. If four nines is too expensive, the reliability engineer might try to “value engineer” the COPS with less expensive components at initial construction to be compensated with a more robust testing plan in the long run.

Design engineers also can run cases to demonstrate to the inspector how COPS availability erodes over time unless maintenance tasks are performed, or how reliability might grow as commodity components burn in. Other cases could be run to show how too-frequent use, coupled with over-testing, would trigger a longer repair time in a high-speed generator because of a manufacturer-recommended engine overhaul.

STEP 5: PERFORMANCE TESTS

The development and implementation of functional performance tests (FPTs) described in Part II of 708 apply an operational perspective to system design. Functional tests are developed during design and performed after construction to demonstrate that the COPS will function according to the desired nameplate availability. Because FPTs are based on the actual installation, the tests may need to be modified if the system changes.

Although baseline test results are required in 708.8, the FPT described in Annex F is optional material, “included for informational purposes only.” The FPT would be not just commissioning of components or subsystems, but a full system-wide performance test exercising as many functions as possible. It should be a multilevel simulated failure exercise to ensure system readiness. Ensuring its nameplate availability throughout the lifecycle of the COPS can be accomplished by applying NFPA 70B, Recommended Practice for Maintenance of Electrical Equipment.

Business continuity companies are seeking business models wherein the commodity portion of sophisticated systems (UPS, on-site generators, transfer switches) can be released turnkey to a supplier who can build to availability specification and manage the supply chain to keep it there. Emerging specifications for connectability and maintainability allow these systems to integrate faster with broader business continuity networks. Installing the best COPS will require the establishment of new partnerships and the development of new supply chains. The more economically we can build these systems, the more systems we can build, which is better overall.

THE BOTTOM LINE

The subject of electric power security is a minefield of sensitivities about boundaries and budgets, risk, and civil readiness. Interdependent systems that support electricity supply are not perfect and institutional mechanisms to support reliability, security, and survivability need to strengthen at the building premises level. NFPA understood the need and found the means to convey the best practices of the business continuity industry into public sector emergency preparedness.

Article 708 looks a lot like performance-based design, something the building safety community still tends to put at a distance. The science involved in developing a COPS is at least as sophisticated as the multi-disciplinary science advanced by the Society of Fire Protection Engineers and described in Chapter 5 of the Life Safety Code (NFPA 101), the Uniform Fire Code (NFPA 1), or the Building Construction and Safety Code (NFPA 5000).

Despite NFPA’s extensive coverage of performance-based practice, most jurisdictions still regard performance-based designs as the exception rather than the rule. There are at least three reasons for this, each loosely related:

The difficulty in verifying the claims of substantial equivalency among complex systems. It is easy to compare two nameplates, but hard to compare two binders full of facility engineering documentation, even if you could find it, and even if it were up-to-date. The prospect of splitting the functionality of one engine-generator system among emergency, legally required, and COPS loads exacerbates the problem.

An aversion to anything that cannot be counted. Prescriptive methods such as “one smoke detector every 30 ft. down an egress corridor” or “two sources are always better than one,” while sometimes wasteful, can be verified by the naked eye. The price we pay for visibility and standardization is that we overbuild.

The insurance company wants to see a prescriptive solution. Enough said.

We have to be careful about what could be perceived as regulatory excess. We do not want to under-do COPS, but we cannot overdo them either. Jurisdictions always have the option of delaying adoption or even ignoring any NEC requirement. Experienced electrical professionals know that the cheapest time to build backup power systems is the day before the next major regional contingency.

The prospect of stepped-up regulation frequently stimulates innovation to avoid it, creates opportunities for unproven technologies, or creates improvements in commodity components. Unless the jurisdiction can afford the multiple-use of generating sources contemplated in 708.22(B), we know that backup systems are not perceived to have value until they are needed.

Nevertheless, a vast industry process is just booting up. Innovations in fast-turn design and construction techniques, including partnerships that span the power supply chain, have emerged. Other ways of steering capital to COPS, though not necessarily cheaper, may lie in integrating them into regional distributed power regimes. While these are typically more expensive than centralized systems, in a distributed power regime, COPS would become more common so we could reach the distributed generation “tipping point” envisioned by alternative energy futurists. The concept is already tracking at the Federal Energy Regulatory Commission.

For the moment, lack of electricity should not be among the problems of first responders and disaster recovery teams. Instead of trying to manage a crisis within a crisis, Article 708 establishes the framework for managing a plan within a crisis.

Sample parts list from which a statistical framework can be built and the numbers run. Each component has a failure rate measured in mean time to failure and a mean time to repair, both measured in hours.

ITEM                                LABEL
Utility                             UTL
Dist swbd                           SB1
Circuit breaker: draw out type      CB4
Generator                           GEN
Circuit breaker: draw out type      CB4
Auto transfer switch                ATS1
Dist swbd                           SB1
Circuit breaker: molded case        CB4
Circuit breaker: molded case        CB4
Rectifier                           RECT
Static switch                       SS4
Inverter                            INVR1
Battery/cap                         BAT CH
Battery                             BAT

The results of the sample COPS hitting the prescribed availability target.

System: Utility and generator to ATS feeding a switchboard, UPS module with battery and static bypass switch.

MTBF (hrs)      MTTR (hrs)      Availability
115,900.10      1.55            0.9999867
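The three figures in the results above are tied together by the standard steady-state relationship, availability = MTBF / (MTBF + MTTR). The sketch below checks the published numbers against that formula and counts the nines; the function names are ours, and the small tolerance in the nines count is only there to guard against floating-point rounding.

```python
import math

def inherent_availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def count_nines(availability):
    """Number of leading nines, e.g. 0.999 -> 3 (epsilon absorbs rounding error)."""
    return math.floor(-math.log10(1.0 - availability) + 1e-9)

sample = inherent_availability(115_900.10, 1.55)
print(f"availability: {sample:.7f}")
print(f"nines: {count_nines(sample)}")
```

The result agrees with the tabulated availability to within rounding of the published MTTR, and it lands in four-nines territory.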

COPS terms of art

Power reliability and power security are often used interchangeably and have become the focus of an expanding intellectual history involving the combined—but not always harmonious—efforts of standards writers. Annex F, the informative material in the 2008 National Electrical Code that accompanies Article 708, contains a description of these concepts. For the purpose of this article, we refine these concepts further with the use of a simple notational identifier to distinguish between reliability (the measured, scalar quantity) and reliability (the general term).

Reliability (the general term) reflects the overall state of a system, as in the Fine Print Note of NEC Section 700.12, for example. Reliability (the measured quantity) is a number, typically expressed as a percentage or a decimal, that reflects the probability and frequency of failures over a given duration of time or number of cycles. It appears in the NEC Annex F. In general industry practice, it represents the probability that a product or service will operate properly for a specified period of time under design operating conditions without failure. It is a time-dependent metric; the longer the time interval, the lower the reliability, regardless of what the system design is. The better the system design, the higher the probability of successful operation for a longer period of time.
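The time dependence is easy to see numerically. The sketch below assumes a constant failure rate, so that reliability over a mission of t hours is exp(-t / MTBF); that exponential model is our simplifying assumption, not a definition taken from Annex F. The MTBF is the sample system figure from the results table.

```python
import math

def reliability(mission_hours, mtbf_hours):
    """Probability of running the whole mission without failure (constant failure rate)."""
    return math.exp(-mission_hours / mtbf_hours)

MTBF_HOURS = 115_900.10  # sample COPS figure from the results table

for mission in (72, 720, 8_760):
    print(f"{mission:>5} hours: reliability {reliability(mission, MTBF_HOURS):.4f}")
```

The same system that is very likely to run 72 hours without a failure is noticeably less likely to run a full year without one, even though its availability is unchanged.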

Availability is always measured in terms of percentage of uptime versus downtime. The closer to 100%, the better.

The limited vulnerability design concept, which originated in work by the U.S. Army Corps of Engineers, describes a building structure designed to detect potential terrorist threats, isolate resulting damage, and promote survival of personnel affected by an event while propagating continued parallel mission activity.

Risk assessment: determining location

Determination of location in the risk assessment of critical operations power systems (COPS) can be difficult for emergency management agencies that have many informed (and vocal) stakeholders in land/space issues. We illustrate below how the U.S. Army Corps of Engineers designs its designated critical operations areas (DCOAs). The location of a civilian DCOA will depend heavily upon whether meeting Article 708 requires a new stand-alone facility, or whether a DCOA within an existing building can be “hardened” to the requirements. Getting this decision right may be the most costly aspect of meeting 708 requirements.

The interior floor plans below illustrate how the limited vulnerability concept is used to describe a building structure designed to detect potential terrorist threats, isolate resulting damage, and promote survival of personnel affected by an event, while propagating continued parallel mission activity. It relies on compartmentalizing the construction of the DCOA facility into multiple zones, each of which is separated from the other zones by barriers adequate to withstand the range of potential threats. Site-specific factors will determine how the COPS engine-generator and fuel supply would be positioned with respect to the secure perimeter.

Anthony is senior electrical engineer at the University of Michigan, Ann Arbor. Arno is director of the C4ISR group at EYP Mission Critical Facilities. Schuerger is director of risk assessment at EYP Mission Critical Facilities. Stoyas is a consulting engineer who was a member of Code Panel 20 that developed Article 708 and a member and former chair of the technical panel that writes NFPA 70B.