How to build resilient industrial networks and reduce downtime

Resiliency is crucial for industrial networks because it helps companies withstand failures, faults and disturbances that lead to downtime.

By Henry Martel August 15, 2024

Industrial networking insights

  • Network resilience in industrial settings prevents costly downtime, ensuring continuous operations and safeguarding against high temperatures, electrical interference, and harsh conditions.
  • Key resilience strategies include redundancy, robustness, resourcefulness, and rapidity, with protocols enhancing network stability and recovery.

In industrial environments, network downtime often results in delays, production losses and even potential danger to employees. Resiliency is crucial in industrial Ethernet switching because it enables networks to withstand failures, faults and disturbances that lead to downtime.

It’s important to understand the basics of network resilience in industrial Ethernet switching as well as the key strategies and technologies for achieving it, including how to implement network redundancy mechanisms and the spanning tree protocol (STP).

What is network resilience?

Resilience refers to the capacity of a network to withstand disturbances so it can continue offering services at an acceptable level. Resilient networks ensure efficient administration, oversight, and operation of factory infrastructures and critical processes.

While maintaining a resilient network with high availability is difficult even in the best of operating conditions, industrial environments pose additional challenges. Risks that could affect network reliability and performance include high temperatures, electrical interference, unforeseen network outages and harsh environmental conditions.

According to estimates from Gartner, the average manufacturing company loses over $300,000 for each hour of downtime. Other research suggests this estimate may be overly conservative, putting the number two or three times higher. By restoring network functions when they are interrupted, resilient industrial networks help prevent downtime and the associated costs.

A resilient network infrastructure strives for 99.999% uptime in its operations. Also known as the “five nines” of network availability, this translates into a little over five minutes of downtime per year. Only a highly resilient network infrastructure can meet such demands.
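
As a rough check on the arithmetic, the downtime permitted by an availability target can be computed directly. The sketch below is illustrative only: the function name and the 365-day year are assumptions, and the cost column simply applies the per-hour estimate cited above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    # Cost uses the $300,000-per-hour estimate quoted in the article.
    cost = minutes / 60 * 300_000
    print(f"{target}% -> {minutes:.2f} min/year, roughly ${cost:,.0f}")
```

Each additional nine cuts the permitted downtime by a factor of 10, which is why the last nine is usually the most expensive to achieve.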

Network redundancy and network resilience

Network redundancy and network resilience are often used interchangeably. However, network redundancy is just one dimension of network resiliency. It is part of the so-called “four Rs” of network resilience: redundancy, robustness, resourcefulness and rapidity.

Network redundancy is the practice of maintaining a duplicate in the form of extra physical or virtual hardware or connections. In the event a device or connection goes down, another picks up its job and normal network operation resumes. Without a backup disaster recovery plan or effective layer 2 redundancy, users face an uphill climb to get systems back up and running.

A common example of redundancy is a redundant firewall featuring an active and a standby mode. This configuration consists of a primary and secondary unit. The secondary unit sits idle in standby mode while monitoring the active primary unit’s health status. If it detects the active unit has failed, the secondary unit moves from standby to active. A variation of this configuration is to have both firewalls set to active modes, equally sharing responsibilities for routing and security policy enforcement. If one fails, the other takes over its duties along with performing its own.
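
The active/standby behavior described above can be sketched in a few lines. This is a simplified model, not any vendor's implementation: the class names, the `healthy` flag and the `health_check` method are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Firewall:
    name: str
    healthy: bool = True


class ActiveStandbyPair:
    """Hypothetical model of a primary/secondary firewall pair."""

    def __init__(self, primary: Firewall, secondary: Firewall) -> None:
        self.active = primary
        self.standby = secondary

    def health_check(self) -> Firewall:
        """Promote the standby if the active unit has failed; return the active unit."""
        if not self.active.healthy and self.standby.healthy:
            self.active, self.standby = self.standby, self.active
        return self.active


pair = ActiveStandbyPair(Firewall("fw-primary"), Firewall("fw-secondary"))
pair.active.healthy = False      # simulate a failure of the primary unit
print(pair.health_check().name)  # the secondary is now active
```

In practice the health check would run continuously, typically over a dedicated heartbeat link between the two units.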

Ethernet switching redundancy protocols

This brings us to industrial Ethernet switching network redundancy. This type of redundancy refers to the ability of a redundant network to survive a failure in its switch-to-switch links by providing an alternative data path.

Star topology

To illustrate this point, consider a basic star topology. If one device in a star network wants to send data to another device, it first sends the info to the connecting network device at the center of the star that then transmits the data to the designated device.

The disadvantage of this single-path design is that if the network switch at the center fails, all attached nodes are cut off and their users can’t participate in network communication. More generally, in a single-path design any hardware failure, power outage or cable disconnection along the path will interrupt communication for the devices that depend on it.
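
The single point of failure in a star can be demonstrated with a small reachability check. The sketch below assumes a hypothetical star with four devices; the node names and the `reachable` helper are illustrative only.

```python
from collections import deque


def reachable(links, start, failed=frozenset()):
    """Return the set of nodes reachable from start, skipping failed nodes."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    if start in failed:
        return set()
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, ()):
            if nxt not in seen and nxt not in failed:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical star: four devices attached to one central switch.
star = [("switch", "plc1"), ("switch", "plc2"), ("switch", "hmi"), ("switch", "server")]

print(reachable(star, "plc1"))                     # every node
print(reachable(star, "plc1", failed={"switch"}))  # only plc1 itself
```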

Figure 1: Star network with switch at center. Courtesy: Antaira Technologies

To get around these limitations and improve redundancy, network administrators can add segments or additional industrial switches, or use another type of topology such as mesh, link aggregation and redundant rings. It’s important to note that whenever computers share information over a LAN with redundant pathways, looping issues can emerge and bring about broadcast storms.

Broadcast storm

A network can be taken down by flooding it with bogus broadcast frames, which prevents important frames from getting on the network or reaching their destination. The two major sources of these frames are malicious denial-of-service attacks and failing Ethernet devices. There have been fewer of the latter in recent years as Ethernet device quality has improved.

A bad configuration also might cause this issue. Normally, a broadcast frame is passed through a switch to all ports. It is a broadcast like the name says and goes to everyone. However, a switch with broadcast storm protection turned on will see too many broadcast frames and squelch them down, preventing them from propagating throughout the network.

Once the stream of broadcasts has subsided, the switch resets itself and permits the traffic to pass once again. Broadcast storm protection is turned on by default in most switches. Some applications that intentionally send large volumes of broadcast traffic might require it to be turned off, but this is rare.
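
One way to picture storm protection is as a per-port counter that resets at a fixed interval. The sketch below is a simplified model under that assumption; real switches implement this in hardware with vendor-specific thresholds, and the class name, window length and limit here are hypothetical.

```python
class StormControl:
    """Hypothetical per-port broadcast limiter using a fixed one-second window."""

    def __init__(self, max_broadcasts_per_second: int) -> None:
        self.limit = max_broadcasts_per_second
        self.window_start = 0
        self.count = 0

    def allow(self, now: float) -> bool:
        """Return True if a broadcast frame arriving at time `now` may be forwarded."""
        window = int(now)
        if window != self.window_start:
            # A new one-second window has begun, so the counter resets itself.
            self.window_start = window
            self.count = 0
        self.count += 1
        return self.count <= self.limit


port = StormControl(max_broadcasts_per_second=3)
burst = [port.allow(0.1 * i) for i in range(6)]  # six frames within the first second
later = port.allow(1.5)                          # one frame in the next second
print(burst, later)
```

The first three frames of the burst pass, the rest are dropped, and forwarding resumes once the next window starts.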

Spanning tree protocols

To break looping cycles and avoid broadcast storms, network administrators have long implemented STP, a popular layer 2 protocol. STP prevents network loops by blocking redundant ports. A blocked port still receives STP control frames, known as bridge protocol data units (BPDUs), but it does not forward data traffic to other devices on the network. STP disables links that are not part of the spanning tree, leaving just one active path between any two network nodes. When a network failure does occur, a blocked port is reopened so that data can be rerouted around the failure and devices can continue communicating. Which port is blocked depends on the topology and configuration of the network.
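
The idea of keeping one active path and blocking the rest can be sketched with a breadth-first search. This is a deliberate simplification: real STP elects a root bridge by bridge ID and compares path costs carried in BPDUs, whereas the sketch below assumes a hypothetical three-switch triangle, picks the lowest switch name as the root and counts hops.

```python
from collections import deque


def spanning_tree(links, root):
    """Split links into forwarding (tree) and blocked sets with a breadth-first search."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    seen, queue, forwarding = {root}, deque([root]), set()
    while queue:
        node = queue.popleft()
        for nxt in sorted(neighbors[node]):
            if nxt not in seen:
                seen.add(nxt)
                forwarding.add(frozenset((node, nxt)))
                queue.append(nxt)
    blocked = {frozenset(link) for link in links} - forwarding
    return forwarding, blocked


# Hypothetical triangle of three switches, which contains one loop.
links = [("sw1", "sw2"), ("sw2", "sw3"), ("sw1", "sw3")]
root = min(node for link in links for node in link)  # stand-in for root bridge election

forwarding, blocked = spanning_tree(links, root)
print(sorted(tuple(sorted(link)) for link in blocked))  # the redundant link

# If a forwarding link fails, recomputing the tree brings the blocked link back.
remaining = [link for link in links if frozenset(link) != frozenset(("sw1", "sw2"))]
forwarding_after, blocked_after = spanning_tree(remaining, root)
print(sorted(tuple(sorted(link)) for link in forwarding_after))
```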

Figure 2: Network devices showing spanning tree protocol. Courtesy: Antaira Technologies

There are three versions of the protocol: STP (IEEE 802.1D), Rapid STP (RSTP, 802.1w) and Multiple STP (MSTP, 802.1s). The main advantage of RSTP over STP is its reduced convergence time. When the topology changes, RSTP can often react within a few seconds, whereas STP with default timers can take up to 50 seconds. MSTP extends RSTP to virtual LANs (VLANs). It maps a group of VLANs to a single multiple spanning tree (MST) instance, improving network performance and stability by ensuring only one active path exists between any two nodes in an MST instance. MSTP divides a switched network into regions, and each region can run multiple independent spanning tree instances. MSTP not only facilitates rapid network convergence but also lets the data flows from different VLANs be forwarded along separate paths.
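
The 50-second figure for classic STP follows from its default timers. As a worked example, the sketch below adds them up; the function name is illustrative, and the timer values are the IEEE 802.1D defaults, which administrators can change.

```python
# Default IEEE 802.1D timer values, in seconds.
MAX_AGE = 20
FORWARD_DELAY = 15


def stp_worst_case_convergence(max_age=MAX_AGE, forward_delay=FORWARD_DELAY):
    """Seconds until a blocked port forwards after an indirect failure.

    The port waits for the stored BPDU to age out (max age), then spends one
    forward delay in the listening state and one in the learning state.
    """
    return max_age + 2 * forward_delay


print(stp_worst_case_convergence())  # 50
```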

Ethernet networks must not have loops. Spanning tree protocols prevent loops by disabling one of the connections. If one of the working connections fails, spanning tree enables the originally disabled link, restoring connectivity. RSTP differs from STP by using a faster mechanism to block and unblock the links. MSTP works on VLANs rather than on physical interfaces alone, which allows it to block data from a single VLAN that has created a loop while allowing other VLANs, which are not looped, to continue using the link.
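
The per-VLAN behavior of MSTP can be illustrated with a simple lookup. The VLAN numbers, instance numbers and blocked links below are assumed values chosen for illustration, not the output of a real MSTP calculation.

```python
# Hypothetical MSTP region: VLANs are grouped into two spanning tree instances.
vlan_to_instance = {10: 1, 20: 1, 30: 2, 40: 2}

# Each instance blocks a different link in the same physical triangle (assumed values).
blocked_link_by_instance = {
    1: frozenset(("sw2", "sw3")),
    2: frozenset(("sw1", "sw2")),
}


def link_forwards_vlan(link, vlan):
    """Return True if the given link carries traffic for the given VLAN."""
    instance = vlan_to_instance[vlan]
    return frozenset(link) != blocked_link_by_instance[instance]


print(link_forwards_vlan(("sw2", "sw3"), 10))  # False: blocked in instance 1
print(link_forwards_vlan(("sw2", "sw3"), 30))  # True: forwarding in instance 2
```

Because each instance blocks a different link, every physical link carries traffic for at least one group of VLANs instead of sitting idle.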

Other network resilience strategies and protocols to consider

Besides STP, RSTP and MSTP, there are several other resilience protocols and technologies. Three worth noting are Ethernet ring protection switching (ERPS), link aggregation and virtual router redundancy protocol (VRRP).

Ethernet ring protection switching (ERPS)

The open standard ITU-T G.8032 ERPS protocol arranges nodes in a ring configured to prevent loop issues and specifies a network recovery time of less than 50 ms. While the nodes are arranged in a ring, one link is always blocked to prevent the creation of a loop. This way, traffic can flow in both directions around the ring but always stops at the blocked link. If another link in the ring goes down, it becomes the blocked link and the previously blocked link is opened, allowing data flow to continue at the same rate with almost no interruption.

Figure 3: ERPS ring example. Courtesy: Antaira Technologies

ERPS rings can also be connected in multiple layers to create larger stacks. Even over hundreds of miles of fiber connections, the protected ring structure of ERPS means that connections remain stable and a ping is unlikely to drop. If you’re building out a new redundant network framework that prioritizes rapid recovery, ERPS may be the best choice.

Again, Ethernet networks must not have loops. ERPS, like STP, disables a link to remove the loop from the network. As with the spanning tree protocols, if a working link fails, the previously disabled link is re-enabled, restoring connectivity. While STP can be used in a mesh-like network, disabling multiple links to prevent loops, ERPS can only be implemented in a ring. By limiting the design to a ring, ERPS can provide faster healing times (sub-50 ms) to the network.
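
The ring behavior can be sketched as a choice of which single link carries no traffic. In G.8032 the link that is blocked during normal operation is called the ring protection link (RPL); the four-switch ring, the choice of RPL and the function names below are assumptions made for illustration, and the sketch ignores the protocol's signaling and timers.

```python
def ring_links(nodes):
    """Return the links of a ring as a list of (node, next_node) pairs."""
    return [(nodes[i], nodes[(i + 1) % len(nodes)]) for i in range(len(nodes))]


def blocked_link(rpl, failed=None):
    """Return the single link that carries no traffic.

    In the idle state the ring protection link (RPL) is blocked. If another
    link fails, the failed link is the break in the ring and the RPL is opened.
    """
    if failed is not None and failed != rpl:
        return failed
    return rpl


nodes = ["sw1", "sw2", "sw3", "sw4"]  # hypothetical four-switch ring
links = ring_links(nodes)
rpl = links[-1]                       # assume sw4-sw1 is the RPL

idle = blocked_link(rpl)
after_failure = blocked_link(rpl, failed=("sw2", "sw3"))
active = [link for link in links if link != after_failure]
print(idle, after_failure, active)
```

In both states exactly one link is out of service, so the remaining links form a loop-free chain that still reaches every switch.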

Link aggregation

Link aggregation bundles multiple individual Ethernet links between two devices so the links act as a single logical link. This can be done without having to use STP to disable a redundant link. The most typical use is connecting a switch to another switch, a server, a network-attached storage device or a multi-port access point.

Figure 4: Link aggregation example. Courtesy: Antaira Technologies

Besides optimizing load balancing, an important reason for using link aggregation is to provide fast and transparent recovery. An aggregated set of ports is referred to as a link aggregation group, or LAG, and each of its links must be the same type of Ethernet and configured identically. The physical links operate in an active-active or active-backup setup, meaning if one physical link fails, the others take over the traffic previously sent over the failed link.

Link Aggregation Control Protocol (LACP) is a point-to-point protocol that creates redundancy and increased bandwidth between devices, typically industrial switches. In the example above, connecting two Ethernet switches together with two links would ordinarily create a loop. LACP avoids this by creating one logical link out of the two links, eliminating the issues caused by a loop. Both links can transmit different data at the same time, doubling the available bandwidth. If one link fails, the other can still carry data. Up to eight links can be bound together to form a single LACP connection.
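
A common way to spread traffic across the members of a LAG is to hash each flow's addresses to a member link. The sketch below assumes a CRC32 hash over source and destination MAC addresses purely for illustration; actual hash inputs and algorithms are vendor-specific, and the port names and addresses are hypothetical.

```python
import zlib


def pick_link(flow, active_links):
    """Choose a member link for a flow by hashing its addresses (illustrative hash)."""
    key = "|".join(flow).encode()
    return active_links[zlib.crc32(key) % len(active_links)]


lag = ["port1", "port2"]  # hypothetical two-link LAG
flows = [
    ("00:11:22:33:44:01", "00:11:22:33:44:aa"),
    ("00:11:22:33:44:02", "00:11:22:33:44:aa"),
    ("00:11:22:33:44:03", "00:11:22:33:44:aa"),
]

before = {flow: pick_link(flow, lag) for flow in flows}

lag.remove("port1")  # simulate a failed member link
after = {flow: pick_link(flow, lag) for flow in flows}
print(before, after)
```

Hashing keeps all frames of one flow on the same link, which preserves frame order, and after a failure every flow lands on the surviving link.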

Virtual router redundancy protocol (VRRP)

VRRP is an open standard protocol that enhances network reliability by providing router redundancy for network services. It does this by grouping several physical routers into a single virtual router. Hosts send packets to the virtual router’s IP address, and the industrial router in the group with the highest priority acts as the master and forwards them. The group’s other routers stay in standby mode, prepared to take over if the master router malfunctions.
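
The master election can be sketched as picking the reachable router with the highest priority. The router names, priorities and documentation-range addresses below are hypothetical, and the sketch leaves out VRRP's advertisement timers and preemption settings; breaking ties on the higher IP address follows the VRRP specification.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass
class Router:
    name: str
    priority: int  # configurable VRRP priority, 1-254
    address: IPv4Address
    up: bool = True


def elect_master(routers):
    """Return the reachable router with the highest priority (ties: highest address)."""
    candidates = [r for r in routers if r.up]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.priority, r.address))


group = [
    Router("router-a", 200, IPv4Address("192.0.2.2")),
    Router("router-b", 100, IPv4Address("192.0.2.3")),
]
first = elect_master(group)
group[0].up = False  # simulate a failure of the current master
second = elect_master(group)
print(first.name, second.name)
```

Because hosts only ever address the virtual router, the change of master requires no reconfiguration on their side.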

In an interconnected industrial world, a network outage can be catastrophic. However, many organizations continue to run on outdated technologies, which can impede growth, raise cybersecurity threats, and reduce productivity. Modernizing an industrial network is not just about upgrading outdated technology, but about improving resiliency.


Author Bio: Henry Martel is a field application engineer with Antaira Technologies. He has over 10 years of IT experience along with skills in system administration, network administration, telecommunications, and infrastructure management. He has also been a part of management teams that oversaw the installation of new technologies on public works projects, hospitals, and major retail chains.