Applying network best practices via Ethernet network redundancy

Implementing redundancy in an industrial network and choosing an appropriate application method can ensure continuing reliability and operational success.

By Ken Austin, Phoenix Contact USA, August 2, 2016

Among the most relevant, yet overlooked, issues to address when planning an industrial Ethernet network is the use of an appropriate redundancy mechanism. Consider that a communication failure in a production network can create costly downtime, cause the loss of important company data, or even create the conditions for serious damage to production equipment or, worse, injury to personnel. A redundant physical layer network structure protects against production downtime by ensuring that communication remains available, even if errors occur.

The following questions should be asked and discussed carefully. Is redundancy required? If so, which redundancy method is best for the application? Network planners and end users often consider redundancy extremely costly or too daunting to deploy. A decision to have redundant connectivity in the network does necessitate the higher cost of managed switches, as well as the added time and effort of initial configuration and continuing management. However, the additional equipment cost and resources can be more than offset by continuous system uptime, superior network monitoring and management capabilities, and reduced troubleshooting effort through extended diagnostics (see Figure 1).

Network redundancy involves the integration of hardware with software so that the network remains available in the event of a single point of failure. The communication system, the industrial network, is the core of every modern automation project. To handle network errors, a protocol can be selected from various options and integrated into the infrastructure elements. Redundancy methods can be categorized into four groups: IEEE open source, proprietary sub-second, standardized high-availability network redundancy protocol, and zero-interruption (bumpless) redundancy (see "Redundancy methods and examples"). Each category has distinguishing characteristics and is particularly suited to certain applications and requirements (see "Table: Redundancy methods and choices" below).

During the planning stage, if it is determined that the industrial network must be resilient and able to automatically recover in the event of a Layer 1 failure, a redundancy mechanism absolutely must be employed. After that determination is made, attention must turn to the selection of an appropriate redundancy mechanism. If the process that the network supports can withstand up to a couple of seconds of delay while a network re-convergence takes place, then tried and true rapid spanning tree protocol (RSTP) can be used, and there is no need to look further.

Conversely, when the application is less tolerant of an extended outage and a communication gap of seconds may cause system alarms or input/output (I/O) faults, a high-speed mechanism should be deployed. These mechanisms often are proprietary, but they can provide sub-second recovery times, superior to IEEE standards-based redundancy, and can accommodate hundreds of switches in single or multiple rings.
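The selection reasoning above can be sketched as a simple filter on recovery time. This is only an illustration: the class and function names are hypothetical, the figures are the worst-case values quoted elsewhere in this article, and the 2-second value for RSTP is an assumption standing in for "a couple of seconds." A real selection also must weigh topology, cost, and vendor lock-in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    name: str
    worst_case_recovery_s: float  # upper bound quoted in this article
    standardized: bool            # False for vendor-specific mechanisms


# Illustrative figures only, taken from the text of this article.
METHODS = [
    Method("PRP", 0.0, True),                # bumpless, no switchover
    Method("DLR", 0.003, True),              # less than 3 milliseconds
    Method("MRP", 0.2, True),                # 200 milliseconds maximum
    Method("Proprietary ring", 0.5, False),  # 15 to 500 milliseconds
    Method("RSTP", 2.0, True),               # ASSUMPTION: "a couple of seconds"
    Method("STP", 50.0, True),               # 30 to 50 seconds
]


def candidates(tolerable_outage_s, require_standard=False):
    """Return names of methods whose quoted worst case fits the tolerance."""
    return [
        m.name
        for m in METHODS
        if m.worst_case_recovery_s <= tolerable_outage_s
        and (m.standardized or not require_standard)
    ]


# A process that tolerates a 2-second gap can use RSTP or anything faster.
two_second_options = candidates(2.0)
# A process that faults after 100 milliseconds needs DLR or PRP.
fast_options = candidates(0.1)
```

Filtering on the worst case rather than the typical recovery time is the conservative choice, because an I/O fault is triggered by the longest gap, not the average one.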

Spanning tree protocol (STP)

Ethernet networks with redundant data paths will form a meshed topology with impermissible loops. Due to these loops, data packets can circulate endlessly within the network and also can be duplicated. STP is an open protocol that is described in IEEE 802.1D: IEEE Standard for Local and Metropolitan Area Networks: Media Access Control (MAC) Bridges. It is an Open Systems Interconnection (OSI) Layer 2 protocol that guarantees a loop-free local area network (LAN). It is based on an algorithm developed by renowned software designer and network engineer Radia Perlman while she was employed at Digital Equipment Corp. STP made it possible to extend the network with redundant links. In this way, an automated backup path was provided, without creating closed loops in the network, in the event an active link dropped out for whatever reason.

To apply this protocol and gain the maximum benefit, as with the other redundancy methods, the switches used must support the protocol. After the interruption of a segment, it can easily take 30 to 50 seconds before the alternative path becomes available. This timer-based delay is unacceptable for controls, and 30 seconds is extremely long for any monitoring application. Generally, the standard STP delay in executing a recovery is too long to be acceptable in an industrial application. Unfortunately, the strengths of STP are what make it inherently unsuitable for redundant ring structures.

The complexity that allows STP to support a variety of topologies limits its performance in a relatively simple redundant ring. Thus, when a fault occurs in a ring, the obvious solution is to treat the interrupted ring as two separate network segments until the link layer break is remedied. Given that there is only one fault recovery solution in a ring, the typical time taken by standard STP to collect fault data and process the messaging to create an analysis of that fault is likely unacceptable. 

RSTP

To deal with the shortcomings of STP, IEEE established RSTP in 2001. RSTP is a standardized, open redundancy method (IEEE 802.1D-2004) supported by a vast range of managed switches regardless of their manufacturer. The protocol supports ring and tree topologies, as well as meshed networks, and can easily be enabled in any managed network. The protocol initially was described in IEEE 802.1w-2001: Rapid Reconfiguration of Spanning Tree. Then, in the 2004 revision, the IEEE 802.1D standard declared the original STP superfluous and recommended the use of RSTP instead whenever possible. IEEE 802.1w is therefore incorporated into the 802.1D standard.

The RSTP algorithm calculates the network tree structure with one switch configured as the root (see Figure 2). Different redundant physical connections can be created within the network. Without a redundancy mechanism, these would result in loops that quickly congest the entire network and cause failures. RSTP converts this topology into a tree structure, albeit inverted, by blocking the ports whose paths the algorithm ranks lower. This creates the necessary logical, loop-free environment. With an infrastructure device configured as the network root and logical blocks created from that root, every other switch can be reached via exactly one path. If a network error does occur, such as a broken or disconnected cable, then a new active path is created automatically.
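The idea of reducing a meshed topology to a loop-free tree can be illustrated with a short sketch. It is a deliberate simplification: it assumes a central view of all links and uses a breadth-first search from the root, whereas real RSTP switches exchange bridge protocol data units and compare bridge IDs and path costs. The function name and the four-switch topology are hypothetical.

```python
from collections import deque


def spanning_tree(links, root):
    """Return (active, blocked) link sets for a loop-free tree rooted at root."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)

    visited = {root}
    active = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbors.get(node, [])):
            if nxt not in visited:
                # First path found to this switch becomes its active link.
                visited.add(nxt)
                active.add(frozenset((node, nxt)))
                queue.append(nxt)

    # Every link not needed for the tree is logically blocked.
    blocked = {frozenset(link) for link in links} - active
    return active, blocked


# Four switches in a ring plus one redundant diagonal link (A-C).
links = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("A", "C")]
active, blocked = spanning_tree(links, root="A")

# The A-B cable breaks: recompute on the remaining links.
remaining = [link for link in links if set(link) != {"A", "B"}]
active_after, blocked_after = spanning_tree(remaining, root="A")
```

In this example the tree initially blocks the B-C and C-D links. After the A-B cable breaks, the previously blocked B-C link becomes active so that switch B is reachable again.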

The recovery times experienced with RSTP are significantly lower than those of the original spanning tree (hence the name) and are specifically 1 second to a few seconds, instead of the 30- to 50-second times of the original iteration. Depending on the application, the recovery time of RSTP already may be fast enough to ensure dependability.

RSTP has had quite a lengthy tenure as the redundancy of choice in many IT and industrial network installations. Although faster redundancy schemes have been developed during the ensuing years, today, RSTP actually remains quite a viable choice for the average application, especially where device cost may be a concern.

Proprietary redundancy-extensions of RSTP

To meet the ever-expanding requirements for a faster recovery of the automation network, many manufacturers have developed proprietary redundancy mechanisms to attain recovery times of less than 1 second. Often developed as an extension of RSTP, proprietary redundancy schemes differ in their recovery speeds and setup complexities. They employ various concepts, such as fast polling by the root bridge to gauge ring health and to initiate the fastest potential recovery of the network. In some proprietary technologies, the ring ports that are used to connect the switch to the ring—as well as perform the interconnection or coupling of rings—are established automatically.

In others, these ports must specifically be selected and configured before connecting the switches to form the ring. As an example of operation, upon power-up, the switches will automatically choose the last switch in the ring. As the last switch, this device will be the final one to power up, then proceed to block a ring port, and finally send out packets to initially determine the health of the ring. During operation, each switch in the ring will independently monitor the status of its ring ports, and upon a ring link failure, the adjacent switches will send a link-down message out on the ring to the last switch, which unblocks the previously blocked port.
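The ring behavior described above can be sketched as a small simulation. It is not any vendor's implementation: the class, its method names, and the link numbering are assumptions made for illustration, and the link-down messaging is reduced to a single method call.

```python
class Ring:
    """Switches 0..n-1 in a ring; link i joins switch i and switch (i + 1) % n."""

    def __init__(self, n):
        self.n = n
        self.failed = set()
        # The "last" switch blocks one of its ring ports: here, link n - 1.
        self.blocked = n - 1

    def link_down(self, link):
        """Adjacent switches report a failure; the last switch unblocks its port."""
        self.failed.add(link)
        if link != self.blocked:
            self.blocked = None

    def forwarding_links(self):
        """Links that are neither failed nor logically blocked."""
        return [i for i in range(self.n)
                if i not in self.failed and i != self.blocked]

    def all_reachable(self):
        """True if the forwarding links connect every switch to switch 0."""
        up = set(self.forwarding_links())
        seen, stack = {0}, [0]
        while stack:
            s = stack.pop()
            # Link s leads to the next switch; link s - 1 leads to the previous one.
            for link, peer in ((s, (s + 1) % self.n),
                               ((s - 1) % self.n, (s - 1) % self.n)):
                if link in up and peer not in seen:
                    seen.add(peer)
                    stack.append(peer)
        return len(seen) == self.n


ring = Ring(5)
healthy = ring.all_reachable()          # line structure via links 0 to 3
ring.link_down(1)                       # cable between switch 1 and 2 breaks
recovered = ring.all_reachable()        # blocked link 4 now forwards
ring.link_down(3)                       # a second, simultaneous break
after_second_fault = ring.all_reachable()
```

The last step shows the limit of any single ring: it rides through one fault, but a second simultaneous break isolates part of the network.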

Regardless of which proprietary mechanism is selected, recovery times of 15 to 500 milliseconds can be realized when a network connection or switch drops out. Recovery in no more than 500 milliseconds is achievable even in extensive automation networks with heavy traffic loads and a considerable number of address entries in the media access control (MAC) tables of the switches. As with other technologies, recovery times scale with network size and can be much shorter for installations with fewer terminals (smaller MAC tables). In any of these scenarios, sub-second recovery can be achieved, and quality of service can thereby be maintained in redundant automation networks.

Device level ring (DLR)

The device level ring (DLR) redundancy protocol is part of the EtherNet/IP standard. DLR realizes recovery times of less than 3 milliseconds and therefore provides nearly bumpless switchover (see Figure 3). DLR is supported by many current EtherNet/IP field devices, such as I/O modules or programmable controllers that can be networked in a ring topology using DLR via an integrated two-port switch function. The use of switches that natively support DLR enables the simultaneous integration of several devices on the DLR, while the integral system diagnostics of DLR allow faults and errors to be identified and remedied quickly.

The DLR protocol supports a single ring topology; multiple rings or overlapping rings are not allowed. It is possible, though, when using suitable switches, to connect a redundant ring or operate multiple restricted rings, because the DLR protocol information does not leave the individual ring or appear in other rings. In Figure 4, the topology shows two DLR rings, each of which makes up an individual segment. These segments are redundantly meshed and connected with one another via RSTP redundancy, while each segment remains separated from the others so that no protocol information can leave its ring.

The DLR need not be tied exclusively to EtherNet/IP. In many non-EtherNet/IP topologies, such as networking wind turbines, a redundancy mechanism is required that switches over as fast as possible in the event of an error to prevent system alarming. In these instances, as well, the DLR redundancy mechanism can guarantee a switchover of the transmission path in less than 3 milliseconds. 

Media redundancy protocol (MRP)

The media redundancy protocol (MRP) is for ring installations and is part of IEC 62439-2: Industrial communication networks. MRP guarantees a maximum recovery time of 200 milliseconds in the event of an interruption in the ring topology, with a maximum of 50 MRP node devices. MRP is supported by Profinet switches and many Profinet field devices to achieve increased reliability directly at the device level in a machine network. The integrated error diagnostics allow errors to be located and corrected quickly.

MRP also is part of the Profinet standard. In an MRP ring, a ring manager blocks one port to obtain an active line structure. In the event of a network error, the ring splits into two isolated lines, which are linked together again when the ring manager releases the blocked port. Recovery times typically are less than 100 milliseconds.

Parallel redundancy protocol (PRP)

In contrast with the other technologies, PRP does not need to change the active topology in the event of a network error. PRP operates on two parallel networks with redundant transmission. According to IEC 62439-3, all data telegrams are transmitted twice via two autonomous networks. This means that uninterruptible or bumpless communication is attainable, even if there is a failure in one of the networks. Given that each data frame is sent over the two networks, the receiving node processes the message that arrives first and rejects the second copy when it is received. PRP handles the copying and forwarding, as well as the ultimate rejection, of the duplicate messages, all at Layer 2. In doing so, PRP also makes the double network invisible to the higher layers in the communication stack.
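The first-copy-wins behavior described above can be sketched in a few lines. This is a minimal illustration, not the IEC 62439-3 algorithm: it assumes each frame carries a source identifier and a sequence number, and it never ages out old entries, which a real implementation must do. The function and class names are hypothetical.

```python
def send(source, seq, payload, lan_a_up=True, lan_b_up=True):
    """Return one copy of the frame for each LAN that is currently up."""
    copies = []
    if lan_a_up:
        copies.append((source, seq, "A", payload))
    if lan_b_up:
        copies.append((source, seq, "B", payload))
    return copies


class DuplicateDiscard:
    """Accept the first copy of each frame and reject the later duplicate."""

    def __init__(self):
        # ASSUMPTION: unbounded memory; real devices age these entries out.
        self.seen = set()

    def receive(self, source, seq, lan, payload):
        key = (source, seq)
        if key in self.seen:
            return None          # duplicate from the other LAN, rejected
        self.seen.add(key)
        return payload           # first arrival, passed to higher layers


rx = DuplicateDiscard()

# Both networks healthy: two copies arrive, one is delivered.
normal = [rx.receive(*copy) for copy in send("plc1", 1, "start")]

# LAN A has failed: the single copy on LAN B is delivered without delay.
degraded = [rx.receive(*copy) for copy in send("plc1", 2, "stop", lan_a_up=False)]
```

Because the receiver never waits for a failure to be detected, the loss of one network does not interrupt delivery; the higher layers see exactly one copy of each frame in both cases.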

It is not uncommon to find PRP deployed in critical areas of application, such as in energy switchgear systems, chemical manufacturing, and wastewater treatment systems. Because PRP does not require any reconfiguration time, its use is particularly suited to the critical infrastructure sectors whose systems and networks are considered so vital that their incapacitation could have a debilitating effect on national security, the economy, and/or public health and safety.

Applying best practices

In the end, the analysis and design phase of an industrial communications network will necessitate asking a broad variety of questions and will address several potential technology selections. Whether the question concerns the network transmission speed (Ethernet, fast Ethernet, Gigabit, and beyond) and media type (copper or fiber), or the implementation of high-end functionalities, such as port-based dynamic host configuration protocol (DHCP), link aggregation, or remote authentication dial-in user service (RADIUS) authentication, each selection will be important in its own right. The decision whether to implement a redundancy mechanism in an industrial network, and then the choice of the most appropriate method for the application, will be of paramount importance to the ongoing reliability of the network and the overall success of the operation (see Figure 5).

Redundancy methods and examples

Redundancy methods can be categorized into four groups: IEEE open source, proprietary sub-second, standardized high-availability network redundancy protocol, and zero-interruption (bumpless) redundancy. The following groups include explanations and examples. 

IEEE open-source redundancy: This method has the slowest fault recovery time, but it is the least expensive and the easiest redundancy to implement and maintain. This redundancy can be applied to any topology selection including mesh and includes STP and RSTP.

Proprietary sub-second redundancy: Generally, there is little-to-no cost difference in equipment or time compared to RSTP. Only ring structures are possible, but it does allow the flexibility of various ring configurations, such as multiple rings, dual redundant rings, and large capacity rings. It also includes the ring-coupling and dual-homing connectivity concepts. Unlike IEEE open redundancy, the selection of a proprietary redundancy technology locks the network—or at least that ring segment—to one switch manufacturer’s technology and devices. Due to the variations in the operation of the products, switch interoperability within one ring is not feasible. This method includes extended ring redundancy and fast ring detection. 

Standardized high-availability network redundancy protocol: Ordinarily for ring topologies only, these mechanisms are beacon-based and maintain a hierarchy with a ring manager or supervisor and participating ring nodes that are able to process the messaging. This method includes MRP and DLR. 

Zero-interruption (bumpless) redundancy: This method can be used in a line or ring topology and under certain conditions can be deployed in a mesh configuration. However, there is a price to be paid for the speed because it requires additional hardware, either integrated into the switching infrastructure or in the end devices. This method includes high-availability seamless redundancy (HSR) and PRP.

Ken Austin is lead product marketing specialist for Ethernet at Phoenix Contact USA.

This article appears in the Applied Automation supplement for Control Engineering and Plant Engineering.
