Design for a fallible world

As control engineers, we tend to think that keeping the process running is our only job, but a running process does not mean the plant is operating. A larger issue also needs to be addressed when IT systems are used in manufacturing operations: the system architecture must be designed to withstand IT failures.


The need for a robust system architecture was made obvious in 2003 when the MSBlast and LoveBug worms shut down IT systems around the world. These events revealed two types of manufacturing companies: those that continued production and those that had to shut down, either because of the worms or because of the IT response to them. Some companies were able to "raise the drawbridge" and continue production by disconnecting their operational systems from their business systems. Other companies could not separate their business and operational systems and had to shut down production. These responses illustrated the difference between tightly coupled and loosely coupled systems, and between global and local systems.

Many IT departments have an unspoken bias toward tightly coupled global systems because such systems are generally easier to build and maintain. For example, maintaining a single global instance of a document management system is easier than maintaining one instance per site. In addition, if connections to other systems are needed, it is easier to hardcode a tight synchronous connection than to design and implement a standards-based asynchronous connection.

When everything is working, tightly coupled global systems are fine, but when things go wrong, the effect can be catastrophic. The best approach to system design is to follow the well-known saying, "Expect the best, but plan for the worst."

Companies that require absolutely reliable systems, such as banks and other financial institutions, will often set up separate servers, networks, and support organizations for the critical systems. These are separate from the normal business systems supporting HR, purchasing, and logistics. This same approach is required for manufacturing companies when the IT systems are critical to maintaining plant operations.

One of the first steps in this approach is to identify the systems critical to operations. Every system in use by operations needs to be examined. For example, if a company uses a centralized configuration management server to control all changes to PLC code, then the maintenance group may be unable to make emergency changes to PLC code during an IT outage. Another example of a critical system may be a global license manager that grants right-to-use for software packages. If an IT outage occurs, users may be unable to access displays, reports, or recipe management systems because they can't obtain the right-to-use license.
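The examination described above can be kept as a simple inventory. The following is a minimal sketch in Python, assuming each operational function is recorded with the services it depends on and whether each service sits inside the plant or is reached over the WAN. The class names, the "local"/"global" labels, and the inventory entries are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class Dependency:
    name: str
    location: str  # "local" (inside the plant network) or "global" (reached over the WAN)


@dataclass
class OperationalSystem:
    name: str
    critical: bool  # True if operations cannot continue without it
    dependencies: list = field(default_factory=list)


def at_risk_during_it_outage(systems):
    """Map each critical system to the global services it depends on."""
    report = {}
    for system in systems:
        if not system.critical:
            continue
        global_deps = [d.name for d in system.dependencies if d.location == "global"]
        if global_deps:
            report[system.name] = global_deps
    return report


# Hypothetical inventory, for illustration only.
inventory = [
    OperationalSystem("PLC code changes", True,
                      [Dependency("configuration management server", "global")]),
    OperationalSystem("Operator displays", True,
                      [Dependency("license manager", "global"),
                       Dependency("HMI server", "local")]),
    OperationalSystem("Monthly energy report", False,
                      [Dependency("corporate data warehouse", "global")]),
]

for name, deps in at_risk_during_it_outage(inventory).items():
    print(f"{name}: depends on {', '.join(deps)}")
```

Anything the report lists is a candidate for a local instance, a buffered interface, or a manual fallback procedure.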

When designing manufacturing systems to withstand IT failure, a good approach is to use local systems instead of global systems. This eliminates outages caused by WAN failures, which were a major cause of plant shutdowns during the MSBlast attack. Multiple local systems are more expensive to maintain, but they are much more robust in the face of failures.
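Where a fully local instance is not practical, one possible middle ground is a site-local cache that keeps honoring the last answer from a global service for a limited time. The sketch below applies that idea to the license manager example. It is an assumption made for illustration, not a description of any real license product: the `fetch_grant` callable, the grace period, and the package names are all hypothetical.

```python
import time


class LocalLicenseCache:
    """Site-local record of grants issued by a (hypothetical) global license server."""

    def __init__(self, fetch_grant, grace_seconds=72 * 3600, clock=time.time):
        # fetch_grant(package) -> bool; expected to raise ConnectionError
        # when the global server cannot be reached.
        self._fetch_grant = fetch_grant
        self._grace_seconds = grace_seconds
        self._clock = clock
        self._granted_at = {}  # package name -> time of last successful grant

    def is_licensed(self, package):
        now = self._clock()
        try:
            granted = self._fetch_grant(package)
        except ConnectionError:
            # Global server unreachable: honor a sufficiently recent local grant.
            last = self._granted_at.get(package)
            return last is not None and now - last <= self._grace_seconds
        if granted:
            self._granted_at[package] = now
        else:
            self._granted_at.pop(package, None)
        return granted


# Illustration with a simulated server and a simulated clock.
server = {"up": True}
now = [0.0]


def fetch_grant(package):
    if not server["up"]:
        raise ConnectionError("global license server unreachable")
    return package == "recipe-manager"


cache = LocalLicenseCache(fetch_grant, grace_seconds=3600, clock=lambda: now[0])
print(cache.is_licensed("recipe-manager"))  # server reachable, grant recorded locally
server["up"] = False
now[0] = 1800.0
print(cache.is_licensed("recipe-manager"))  # outage, still inside the grace period
```

The grace period is a business decision: it trades license enforcement against the ability to keep operators working through an outage.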

Another good approach for designing robust systems is to use system interfaces that are asynchronous and buffered. This approach tolerates temporary loss of communications or system failures without causing cascading system failures. Interfaces based on messaging systems are especially robust in the face of network and application failures. These interfaces should be the default choice for systems that do not need real-time synchronous communication.
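A buffered, asynchronous interface is often built as a store-and-forward queue: the plant system writes each message to a local buffer and returns immediately, and the buffer is drained whenever the link is available. The following is a minimal in-memory sketch, assuming a `send` callable that raises `ConnectionError` when the remote system cannot be reached. The class, the message text, and the simulated link are hypothetical; a production interface would normally use a messaging product and persist the buffer to disk.

```python
from collections import deque


class StoreAndForwardBuffer:
    """Holds outbound messages locally until the remote system accepts them."""

    def __init__(self, send, max_messages=10_000):
        self._send = send
        # Bounded buffer: when full, the oldest message is discarded.
        self._pending = deque(maxlen=max_messages)

    def publish(self, message):
        """Queue a message and try to deliver; never raises on link failure."""
        self._pending.append(message)
        self.flush()

    def flush(self):
        """Deliver queued messages in order. Return True if the buffer is empty."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return False  # keep the message and retry on a later flush
            self._pending.popleft()
        return True

    def __len__(self):
        return len(self._pending)


# Illustration with a simulated WAN link.
link = {"up": False}
received = []


def send_to_business_system(message):
    if not link["up"]:
        raise ConnectionError("WAN link is down")
    received.append(message)


buffer = StoreAndForwardBuffer(send_to_business_system)
buffer.publish("batch 101 complete")  # link down: message is held locally
buffer.publish("batch 102 complete")
print(len(buffer), "messages waiting")

link["up"] = True
buffer.flush()  # link restored: messages are delivered in order
print(received)
```

The caller of `publish` never blocks on the remote system and never sees its failures, which is what keeps an IT outage from cascading into operations.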

Fortunately, not all solutions for robust operations in the face of IT outages need to be technology oriented. Phone lists and faxes can be used to collect critical decision information, and paper backup systems can be used to record critical information. The worst mistake, however, is to design an operational system on the assumption that there will be no failures in network infrastructure or in IT applications.

Author Information

Dennis Brandl is the president of BR&L Consulting, a consulting firm focusing on manufacturing IT solutions, based in Cary, N.C.
