Design for a fallible world

As control engineers, we tend to think that keeping the process running is our only job, but just because the process is operating doesn't mean that the plant is operating. A larger issue also needs to be addressed when IT systems are used in manufacturing operations: a system architecture designed to withstand IT failures.


The need for a robust system architecture was made obvious in the early 2000s, when worms such as LoveBug (2000) and MSBlast (2003) shut down IT systems around the world. These events revealed two types of manufacturing companies: those that continued production and those that had to shut down because of the worms or the IT response to the worms. Some companies were able to "raise the drawbridge" and continue production by disconnecting their operational systems from their business systems. Other companies could not separate their business and operational systems and had to shut down production. These responses illustrated the difference between tightly coupled systems and loosely coupled systems and between global systems and local systems.

Many IT departments have an unspoken bias toward tightly coupled global systems because such systems are generally easier to build and maintain. For example, maintaining a single global instance of a document management system is easier than maintaining one instance per site. In addition, if connections are needed to other systems, it is easier to hardcode a tight synchronous connection than to design and implement a standards-based asynchronous connection.

When everything is working, tightly coupled global systems are fine, but when things go wrong, the effect can be catastrophic. The best approach to system design is to follow the well-known saying, "Expect the best, but plan for the worst."

Companies that require absolutely reliable systems, such as banks and other financial institutions, will often set up separate servers, networks, and support organizations for the critical systems. These are separate from the normal business systems supporting HR, purchasing, and logistics. This same approach is required for manufacturing companies when the IT systems are critical to maintaining plant operations.

One of the first steps in this approach is to identify the systems critical to operations. Every system in use by operations needs to be examined. For example, if a company is using a centralized configuration management server for controlling all changes to PLC code, then when that system becomes unavailable, the maintenance group may be unable to make any emergency maintenance changes in PLC code during an IT outage. Another example of a critical system may be a global license manager that grants right-to-use for software packages. If an IT outage occurs, then users may be unable to access displays, reports, or recipe management systems because they can't obtain the right-to-use license.
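The examination described above can be recorded as a simple inventory. The following is a minimal sketch, not a prescribed method: the `Dependency` fields, the `at_risk` helper, the "Plant historian" entry, and every local/global and criticality classification are illustrative assumptions, not data from a real plant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    name: str            # system that operations relies on
    hosted: str          # "local" (on site) or "global" (reached over the WAN)
    needed_to_run: bool  # True if production or emergency maintenance stops without it


def at_risk(dependencies):
    """Return names of systems that are needed to run but are hosted globally."""
    return [d.name for d in dependencies if d.needed_to_run and d.hosted == "global"]


# Hypothetical inventory; the classifications are assumptions for illustration.
inventory = [
    Dependency("PLC configuration management server", "global", True),
    Dependency("Software license manager", "global", True),
    Dependency("Document management system", "global", False),
    Dependency("Plant historian", "local", True),
]

exposed = at_risk(inventory)
print(exposed)
```

Anything the helper returns is a candidate for a local instance, a local fallback, or a documented manual procedure.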

When designing manufacturing systems to withstand IT failure, a good approach is to use local systems instead of global systems. This eliminates outages caused by WAN failures, which were a major cause of plant shutdowns during the MSBlast attack. Multiple local systems are more expensive to maintain, but they are much more robust in the face of failures.

Another good approach for designing robust systems is to use system interfaces that are asynchronous and buffered. This approach tolerates temporary loss of communications or system failures without causing cascading system failures. Interfaces based on messaging systems are especially robust in the face of network and application failures. These interfaces should be the default choice for systems that do not need real-time synchronous communication.
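A buffered, asynchronous interface can be sketched as a store-and-forward queue. This is a minimal in-memory illustration under stated assumptions: `StoreAndForward`, `FakeLink`, and the message fields are hypothetical names, and a production interface would more likely use a messaging product with a persistent queue so that buffered data survives a restart.

```python
from collections import deque


class StoreAndForward:
    """Buffer messages locally and forward them when the link is available."""

    def __init__(self, send, max_buffered=1000):
        self._send = send  # callable that raises ConnectionError on failure
        # Bounded buffer: when full, the oldest message is discarded.
        self._pending = deque(maxlen=max_buffered)

    def publish(self, message):
        """Queue the message and return at once; never waits on the remote system."""
        self._pending.append(message)

    def flush(self):
        """Deliver queued messages in order; stop at the first failure."""
        delivered = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # link is down; keep the message for the next attempt
            self._pending.popleft()
            delivered += 1
        return delivered

    def pending(self):
        return len(self._pending)


class FakeLink:
    """Stand-in for the connection to a business system (hypothetical)."""

    def __init__(self):
        self.up = False
        self.received = []

    def send(self, message):
        if not self.up:
            raise ConnectionError("business network unreachable")
        self.received.append(message)


link = FakeLink()
buffer = StoreAndForward(link.send)
buffer.publish({"batch": 1, "status": "complete"})
buffer.publish({"batch": 2, "status": "complete"})

first_attempt = buffer.flush()   # link down: nothing delivered, nothing lost
link.up = True
second_attempt = buffer.flush()  # link restored: both delivered in order
```

The operational system only ever calls `publish`, so an outage on the business side delays delivery instead of stalling production. The bounded buffer is a design choice: it protects the local system from running out of memory at the cost of dropping the oldest messages during a very long outage.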

Fortunately, not all solutions for robust operations in the face of IT outages need to be technology oriented. Phone lists and faxes can be used to collect critical decision information, and paper backup systems can be used to record critical information. Whatever the mix of technology and procedure, the worst situation is to design an operational system on the assumption that there will be no failures in network infrastructure or in IT applications.

Author Information

Dennis Brandl is the president of BR&L Consulting, a consulting firm focusing on manufacturing IT solutions, based in Cary, N.C.
