Project success is the sum of all parts

In a data center, uptime is more than just the sum of its pieces.

09/27/2012


In most cases, following best practices can resolve nearly any engineering issue. Let me give you an example.

On the surface, the data center had all the pieces: good engineering, commissioning, proactive maintenance, training, procedures, monitoring, and redundancy. Uptime should be assured. Yet in one reported outage event earlier this summer, it all came crashing down. What happened? 

Several things happened. A cable fault caused a loss of utility power from two substations, which started the mess; such an event is a bit uncommon but not rare. The site went to generator backup; so far, so good. But then one generator that had been maintained earlier went offline due to overheating. The culprit was a cooling fan that did not operate, perhaps because of human error following maintenance. The distributed loads then failed over to another protected power feed, which immediately opened because of an incorrectly configured circuit breaker. Finally, the IT software attempted to transfer the affected loads transparently to another site, but the transfer failed, causing a catastrophic crash. On the surface, none of the four events should have affected loads. In fact, no two or even three of them should have.
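The arithmetic behind that last point can be sketched in a few lines. This is a rough illustration only, not a reliability model of the site in question: the per-layer failure probabilities are hypothetical placeholders, and the layers are assumed to fail independently. It shows why stacked redundancy looks safe on paper, and how latent defects such as an inoperative fan or a misconfigured breaker quietly remove layers before the event ever starts.

```python
from math import prod


def p_load_loss(layer_failure_probs):
    """Probability that every protective layer fails on demand.

    Assumes the layers fail independently, which is the optimistic
    assumption redundancy calculations usually make.
    """
    return prod(layer_failure_probs)


# Hypothetical probabilities that each layer fails when called upon
# after a utility loss. These numbers are illustrative, not measured.
as_designed = {
    "generator": 0.01,
    "alternate_feed": 0.01,
    "site_failover": 0.05,
}

# Same layers, but two carry latent defects (fan not operating,
# breaker misconfigured), so they are certain to fail on demand.
with_latent_defects = dict(as_designed, generator=1.0, alternate_feed=1.0)

if __name__ == "__main__":
    print(f"as designed:         {p_load_loss(as_designed.values()):.2e}")
    print(f"with latent defects: {p_load_loss(with_latent_defects.values()):.2e}")
```

Under these assumed numbers, two undetected defects raise the chance of load loss by four orders of magnitude, leaving the outcome resting on the single remaining layer.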

This story is not unique. Several other major data center load losses have occurred in the past year alone, all with some kind of redundant configuration. Yet a recent study I conducted on data center availability found dozens of data centers with less than Tier IV resiliency that experienced no load losses, in some cases for periods of more than 15 years. So what lesson, if any, is behind all of this?

Data center availability, or uptime, is more than the sum of its pieces. The successful sites we found routinely employed “best practices” at virtually all phases of design, construction, and operation. Engineering took into account lessons learned from early grid collapses, technology improvements, and cumulative industry failure assessments. Monitoring was excellent and done in real time, often with trending information. Configurations, even if less than 2N+1, were sound for the mission and, where budgets were at issue, more redundant in the areas known to be weakest, such as batteries rather than magnetics. Labeling was clear and complete.

Commissioning was not just done, but done correctly. Maintenance was reliability-centered, proactive, and managed. Security and change procedures were rigid. Vendor technicians were not allowed to go off script. If any component with system impact was changed out, that subsystem was tested before it was relied upon. There were double sign-offs on method of procedure/method and procedure (MOPs/MAPs) before any actions on critical gear. Escalation procedures were in place and instantly available on the building management system/network management system (BMS/NMS). And in the optimal case, almost every conceivable failure eventuality was brainstormed by stakeholders, with simulations scripted and rehearsed for both system effect and personnel training. All of these separate considerations must be integrated, not just into the formal plan, but into the mind-set. In large measure, it is attitude. And commissioning and continued testing are particularly critical.

Commissioning, or testing in a broad sense, provides peace of mind to data center management. “We passed Cx” sounds good, but what does it mean? You load-banked a UPS or genset for hours. You checked harmonics. Great, but what happens in a real-world transfer scenario when phase displacement exists, or poor power factor (PF), or high capacitance, or cumulative inrushes, or improperly high recharge, or something else? A 15-minute multi-string sealed battery was run-down tested, but was it tested with one string out? The replacement circuit breaker (CB) arrived, but did you notice that it was set to fastest trip for legal reasons?
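To see why the one-string-out question matters, consider a back-of-the-envelope estimate. The sketch below assumes identical parallel strings sharing a constant-current load and uses a Peukert-style scaling with a hypothetical exponent; the function name, the four-string layout, and the exponent value are illustrative assumptions, not figures from any tested site. Real sealed batteries at high discharge rates should be checked against the manufacturer's discharge tables and an actual run-down test.

```python
def runtime_with_strings_out(rated_minutes, total_strings, strings_out,
                             peukert_k=1.2):
    """Estimate battery runtime when some parallel strings are unavailable.

    Assumes identical strings sharing a constant-current load, so each
    remaining string carries total_strings / remaining times its normal
    current. Runtime is scaled by that ratio raised to -peukert_k.
    peukert_k is a hypothetical exponent; k = 1.0 gives the optimistic
    linear estimate.
    """
    if not 0 <= strings_out < total_strings:
        raise ValueError("strings_out must be between 0 and total_strings - 1")
    remaining = total_strings - strings_out
    current_ratio = total_strings / remaining
    return rated_minutes * current_ratio ** (-peukert_k)


if __name__ == "__main__":
    # Hypothetical 15-minute battery built from four parallel strings.
    linear = runtime_with_strings_out(15, 4, 1, peukert_k=1.0)
    derated = runtime_with_strings_out(15, 4, 1)
    print(f"linear estimate, one string out:  {linear:.2f} min")
    print(f"Peukert-style estimate (k = 1.2): {derated:.2f} min")
```

Even the optimistic linear estimate drops well below the rated 15 minutes, and the derated figure is lower still, which is why a run-down test with all strings in service says little about the degraded case.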

Every subsystem should be individually wrung out at worst-case conditions, and then the entire system should be tested in the worst-case anticipated loading and transfer scenarios. Any major change or modification to critical elements afterward calls for a reasonable retest to confirm the result. This requires been-there, done-that experience to execute well, but these steps might have prevented a crash one summer day.

Murphy’s Law says if something can go wrong, it will. To combat Murphy’s Law when engineering building systems, follow best practices.

 


Dennis DeCoster is CEO of Mission Critical West. He is an acknowledged expert in critical facilities infrastructure, and has consulted for hundreds of data centers and satellite earth stations over his 30-year career.


