Architecture for mitigating effects of external faults

Choosing tools and techniques for creating fault-tolerant control environments and networks.

By Dave Denison, Emerson Process Management, October 31, 2011

Experiencing a fault in a plant automation system is an undesirable and often unpredictable event. Its impact on plant operations may vary from virtually no effect to creating a catastrophe with potential loss of life, property destruction, or damage to the environment. Each plant site must establish carefully engineered strategies for mitigating the effects of a fault on plant operations and must develop action plans for containing the risk. In general, it is desirable for users to select a state-of-the-art plant automation system that is flexible enough to mitigate potential risks, while still affordable for critical or noncritical application needs.

Faults, failures, and defects

A fault is simply a failure, a defect, or a flaw in a component or a device. In the context of plant automation systems, a fault in a device causes the device to malfunction, so it does not provide its expected or designed function. Most process automation systems are processor based and typically include hardware and software components, both of which can fail. The robustness of the system architecture will dictate the impact of a fault. In a well-designed system, a fault may not result in any type of failure. On the other hand, the same fault in a poorly designed system may cause multiple instances of failure.

Faults can be classified as either internal or external. Internal faults are often caused by systematic design faults such as hardware design flaws or software defects and bugs. Such faults are generally repeatable for a given set of inputs. Randomly occurring internal faults in a device aren’t usually repeatable and may result from defects introduced during manufacturing. Suppliers of process automation systems usually minimize these faults through stringent quality standards.

External faults and disturbances originate outside a device but may propagate into the device, causing it to fail. Examples include environmental effects (electromagnetic interference, temperature changes), operational faults (operator errors), accidental damage (power surges, physical damage to network equipment), and maintenance/installation faults (improper grounding, shorting).

Mitigation strategies can minimize the impact of external faults on the functionality of the system. While developing mitigation strategies, focus on selecting strategic components of the process automation system to provide fault tolerance at a reasonable cost.

Architecture of a plant automation system

Consider a typical three-level hierarchical architecture for a manufacturing enterprise. Starting at the bottom, a plant automation system typically includes a control layer with field devices and controllers as a foundation for performing real-time control functions. In the middle, integrated with the control layer, is a plant-operations level that provides operator, supervisory, and maintenance functions. Manufacturing processes are automated by various components within these first two levels.

The top level is the business system that processes tasks such as resource planning, accounting, and management reporting. The plant automation system also provides integration in real time between business systems and plant operations. In general, the criticality of performing assigned tasks within a predefined time interval is lower at the business systems layer than it is at the control layer. That is, control tasks must be executed within a significantly shorter time interval than operation tasks. Operation tasks, in turn, are executed at a shorter time interval than business-transaction-oriented tasks.

The impact of a fault on plant operations may vary widely depending on the criticality of the device that has failed. At the control level, failure of a multiloop controller regulating a high-pressure, exothermic reaction may be catastrophic since the failure could result in loss of life, property destruction, or damage to the environment. On the other hand, at an operations or business systems level, if a printing device used for creating a report fails, it may be viewed as noncritical in nature. Given these differences in criticality, the architecture of a process automation system should give the user the flexibility to develop mitigation strategies that match the potential risk, and should keep the various fault-tolerant solutions affordable on a per-application basis.

Increasing plant availability

Fault tolerance is the ability of a system to perform its function correctly even in the presence of faults. The purpose of fault tolerance is to increase the reliability and availability of a system, allowing it to respond gracefully to an unexpected fault. The level of gracefulness in a fault condition may be measured in terms of the availability of the system and operational degradation to system functionality. State-of-the-art architecture would provide the user with high system availability and a low level of degradation without negatively impacting cost, performance, and ease of use of the system.

Fault tolerant architecture uses three techniques to minimize impact of a fault:

  • Fault recovery
  • Fault containment, and
  • Redundancy.

Fault recovery techniques use fault detection mechanisms followed by a series of steps to recover system functionality lost due to the fault. Examples include error-correcting code (ECC) memory, watchdogs, software checkpointing, and others.
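As a rough illustration of how a watchdog and software checkpointing can work together, the following Python sketch detects a stalled task and rolls it back to its last known-good state. The class names, the one-second timeout, and the batch state fields are hypothetical and chosen only for the example; time is passed in explicitly so the sketch stays deterministic.

```python
import copy


class Watchdog:
    """Software watchdog: expires if not kicked within timeout_s seconds."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.last_kick = now

    def kick(self, now):
        self.last_kick = now

    def expired(self, now):
        return (now - self.last_kick) > self.timeout_s


class CheckpointedTask:
    """Keeps a copy of the last known-good state so work can be rolled back."""

    def __init__(self, state):
        self.state = state
        self._checkpoint = copy.deepcopy(state)

    def commit(self):
        # Fault-free cycle: remember the current state as known-good.
        self._checkpoint = copy.deepcopy(self.state)

    def recover(self):
        # Fault detected: discard the current state and restore the checkpoint.
        self.state = copy.deepcopy(self._checkpoint)


task = CheckpointedTask({"batch_step": 0, "total": 0.0})
dog = Watchdog(timeout_s=1.0, now=0.0)

# Normal cycle at t = 0.5 s: do the work, commit, kick the watchdog.
task.state["batch_step"] = 1
task.state["total"] = 12.5
task.commit()
dog.kick(now=0.5)

# Simulated fault: the state is corrupted and the task stops kicking.
task.state["total"] = float("nan")

# At t = 2.0 s a supervisor finds the watchdog expired and rolls back.
if dog.expired(now=2.0):
    task.recover()
    dog.kick(now=2.0)
```

In a real controller the watchdog is usually a hardware timer and recovery may involve a restart, so this should be read only as a picture of the detect-then-recover sequence.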

A fault containment technique prevents a fault from propagating within the system in order to limit the amount of damage. The damage may be compartmentalized or isolated using containment barriers such as firewalls, intrinsically safe I/O systems, microprocessor memory management units, and others.
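The containment idea can be pictured in software with a small Python sketch in which each control task runs behind its own barrier, so a fault in one task cannot stop the others. This is only an in-process analogy to the hardware and network barriers named above; the function `run_contained`, the loop names, and the fail-safe value of 0.0 are assumptions made for the example.

```python
def run_contained(tasks, fail_safe_output=0.0):
    """Run each task behind its own barrier.

    A task that raises is isolated: it gets the fail-safe output and is
    reported as faulted, while the remaining tasks still execute.
    """
    outputs, faulted = {}, []
    for name, task in tasks.items():
        try:
            outputs[name] = task()
        except Exception:
            outputs[name] = fail_safe_output
            faulted.append(name)
    return outputs, faulted


def flow_loop():
    return 42.0


def level_loop():
    # Simulated fault inside one task.
    raise ValueError("input out of range")


def temp_loop():
    return 71.3


outputs, faulted = run_contained(
    {"flow": flow_loop, "level": level_loop, "temp": temp_loop}
)
print(outputs, faulted)
```

Only the faulted loop falls back to its fail-safe output; the flow and temperature loops are unaffected, which is the point of a containment barrier.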

Redundancy for high-availability applications

Redundancy may be defined as having more of a resource than the amount minimally necessary to perform a desired function. A redundancy-based technique for mitigating the impact of a failure often uses duplication of components. These components have the ability to take over execution of the desired function from one another should one of them fail. 

Redundancy provides fault tolerance for a wide range of faults with virtually no operational decay. It also enables execution of other applications such as fault detection and online upgrades. However, using component redundancy to provide a fault tolerant architecture may result in higher costs, a larger footprint, and an increase in internal system complexity. These penalties may be overcome by a modular incremental design, use of very-large-scale integrated circuits, and transparency of redundancy from the user perspective.
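To put rough numbers on the availability gain from duplication, the sketch below uses the standard steady-state relationships: availability of one unit is MTBF / (MTBF + MTTR), and n redundant units, any one of which is sufficient, are all down with probability (1 - A)^n. The MTBF and MTTR figures are illustrative assumptions, not data for any product, and the calculation assumes the units fail independently.

```python
def availability(mtbf_h, mttr_h):
    """Steady-state availability of a single repairable unit."""
    return mtbf_h / (mtbf_h + mttr_h)


def redundant_availability(a, n):
    """Availability of n redundant units when any one of them is sufficient.

    Assumes independent failures; common-cause faults make the real
    figure lower than this estimate.
    """
    return 1.0 - (1.0 - a) ** n


HOURS_PER_YEAR = 8760

# Illustrative figures only: 10,000 h between failures, 8 h to repair.
single = availability(mtbf_h=10_000, mttr_h=8)
pair = redundant_availability(single, n=2)

print(f"single unit: {single:.6f}, "
      f"{(1 - single) * HOURS_PER_YEAR:.2f} h/year expected downtime")
print(f"redundant pair: {pair:.8f}, "
      f"{(1 - pair) * HOURS_PER_YEAR:.4f} h/year expected downtime")
```

The independence assumption is the weak point of this estimate, which is one reason the schemes described below pay attention to common-cause and systematic faults.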

Redundancy schemes for control, communications, and power conversion functions include:

  • Simple duplication
  • Diverse technologies
  • Active/hot standby redundancy, and
  • Lockstep redundancy.

Simple duplication uses at least two components performing the same function independently of one another. With their independent operation, there is no need for synchronization or coordination between the components. Simple duplication offers high availability for most external faults, but places responsibility on the user to ensure that there is adequate redundancy. Figure 1 illustrates four operator workstations demonstrating simple duplication in that they are all configured to perform the same function.

Diverse technologies prevent systematic failures by deploying differing types of duplicate components. For example, a critical temperature measurement may use both a thermocouple and RTD. As another example, field inputs and outputs may be interfaced to the controller by using hardwired and wireless technologies. However, this may increase user configuration, maintenance effort, and life-cycle costs. 
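The thermocouple and RTD example could be handled in software along the following lines. This is a hedged sketch: the function name, the 2 °C deviation limit, the use of `None` for a failed sensor, and the choice to report the higher reading when the sensors disagree are all assumptions for illustration, not a method described in this article.

```python
def select_temperature(thermocouple_c, rtd_c, max_deviation_c=2.0):
    """Combine two diverse temperature measurements.

    Returns (value, status). A reading of None represents a failed sensor.
    """
    if thermocouple_c is None and rtd_c is None:
        return None, "both failed"
    if thermocouple_c is None:
        return rtd_c, "rtd only"
    if rtd_c is None:
        return thermocouple_c, "thermocouple only"
    if abs(thermocouple_c - rtd_c) > max_deviation_c:
        # Sensors disagree: report the higher reading as the conservative
        # choice for an overheating hazard (an assumption of this sketch).
        return max(thermocouple_c, rtd_c), "deviation alarm"
    return (thermocouple_c + rtd_c) / 2.0, "ok"


print(select_temperature(150.0, 151.0))
print(select_temperature(None, 151.0))
print(select_temperature(150.0, 160.0))
```

Because the two sensors rely on different physical principles, a systematic fault affecting one type is less likely to affect the other, and the deviation check gives a simple way to notice it.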

Flexible system architecture

Critical process control applications such as high-pressure or exothermic reactions often require high assurance that failure of a component controlling these processes would have no impact on the availability of the process. Some process control suppliers offer a common architecture for two highly integrated yet independent systems for implementing projects demanding a diverse range of plant availability.

Figure 1 illustrates a distributed control system (DCS) that is used for process automation applications and a safety instrumented system (SIS) that is used for process safety applications (e.g., up to SIL 3 level). The systems are integrated using the same communications network, shared workstations, and other components. Both provide a fully redundant architecture that supports the use of active/hot standby redundancy and lockstep redundancy configurations to switch over automatically when a fault is detected.

Figure 2 illustrates a pair of controllers in an active/hot standby redundancy configuration. In this approach the components are in two distinct operational states. The active controller is responsible for updating its standby backup and for making switchover decisions. The differentiation of roles provides for disparate fault detection and for temporal resistance to common-cause faults. One drawback to the active/hot standby approach is that every redundancy switchover involves a very small but finite window of time in which the standby transitions into the active state. During this window, logic execution is briefly suspended and outputs are held at their last value.
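The behavior described above (state updates to the standby, then a brief hold of the outputs during switchover) can be sketched as follows. The `Controller` and `RedundantPair` classes, the PI logic, and the tuning values are hypothetical. As a simplification, the pair object makes the switchover decision here, and the switchover window is modeled as exactly one scan.

```python
class Controller:
    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.integral = 0.0  # control state that must survive a switchover

    def execute(self, error, dt=1.0, kp=2.0, ki=0.5):
        """One scan of a simple PI calculation; returns the output."""
        self.integral += error * dt
        return kp * error + ki * self.integral


class RedundantPair:
    def __init__(self, active, standby):
        self.active, self.standby = active, standby
        self.output = 0.0

    def scan(self, error):
        if not self.active.healthy:
            # Switchover scan: no logic executes, the output is held at its
            # last value, and the standby takes over the active role.
            self.active, self.standby = self.standby, self.active
            return self.output
        self.output = self.active.execute(error)
        # The active updates its standby so it can resume from the same state.
        self.standby.integral = self.active.integral
        return self.output


pair = RedundantPair(Controller("A"), Controller("B"))
out1 = pair.scan(1.0)          # A executes normally
pair.active.healthy = False    # simulated fault in A
out2 = pair.scan(1.0)          # switchover: output held at its last value
out3 = pair.scan(1.0)          # B continues from the synchronized state
print(out1, out2, out3, pair.active.name)
```

Because the standby received the integral term on every scan, it resumes from the state the active left behind rather than from zero, which keeps the disturbance at switchover small.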

Figure 3 illustrates a lockstep redundancy scheme that eliminates the switchover window of time characteristic of the previous approach. In a lockstep approach both the active and standby redundant components are concurrently active and operating in a lockstep manner to ensure that there is no switchover latency. Within each lockstep redundancy component, an internal voting mechanism determines the health of the device and the state of the output.
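A simplified picture of lockstep operation is sketched below. The article does not describe the internal voting mechanism, so the two-channel comparison inside each component is an assumption, as are the function names and the pressure trip limit.

```python
def lockstep_component(channel_a, channel_b, inputs):
    """Run two internal channels on the same inputs and compare the results.

    Returns (output, healthy). A mismatch marks the component unhealthy.
    """
    a = channel_a(inputs)
    b = channel_b(inputs)
    if a == b:
        return a, True
    return None, False


def lockstep_pair(components, inputs):
    """Every component executes on every scan; a healthy one drives the output."""
    results = [lockstep_component(ch_a, ch_b, inputs) for ch_a, ch_b in components]
    for output, healthy in results:
        if healthy:
            return output
    return None  # no healthy component: the caller must force a safe state


def trip_logic(inputs):
    # Hypothetical safety logic: demand a trip above 10.0 pressure units.
    return inputs["pressure"] > 10.0


def stuck_channel(inputs):
    # Simulated fault: this channel never demands a trip.
    return False


components = [(trip_logic, stuck_channel), (trip_logic, trip_logic)]
trip = lockstep_pair(components, {"pressure": 12.0})
print(trip)
```

Since the second component has already computed its result in the same scan, the output is available immediately when the first component is voted unhealthy, with no handover step in between.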

Multiple layers of component redundancy are supported at the process I/O interface level, including fully redundant traditional and HART I/O cards, communication links, and power supplies. In addition, an adaptive wireless mesh network provides multiple communications paths between the controller and its configured wireless field devices. These multiple layers of redundancy permit the user to select a desired level of component redundancy at an affordable cost.

Business system level redundancy

At the operations and business system level, it may be highly desirable to opt for redundant components to mitigate the risk associated with interrupted production and lost product, or loss of vital data, such as plant history data required for regulatory compliance. The architecture of the system is designed to support redundancy of advanced control functions. This may include production of batches and campaigns in a batch-oriented manufacturing process and redundant data server functions such as redundant OPC servers to integrate process data with business system applications. Standard benefits of using redundant components, such as automatic switchover, online upgradeability, and bumpless transition, that are built into control-level redundancy components are also available at the operations level.

Redundancy at the core

Fault tolerance in automation systems should be considered essential for meeting the requirements of critical process applications. Fault tolerance may be implemented through a variety of techniques. A customized effort is required for each plant site to balance risk of failure against the affordability of each fault tolerant solution. A state-of-the-art process automation system allows a user to select mitigation strategies based on fault tolerant components at control, operations, and business system levels in order to optimize reliability, lower cost, and reduce the risk of failure.

Dave Denison is software engineering manager for Emerson Process Management.