Design for a fallible world

As control engineers, we tend to think that keeping the process running is our only job, but just because the process is operating doesn't mean that the plant is operating. A larger issue also needs to be addressed when using IT systems in manufacturing operations. That issue is a system architecture designed to tolerate IT failures.

By Dennis Brandl, BR&L Consulting, January 1, 2004

The need for a robust system architecture was made obvious in 2003 when the MSBlast and LoveBug worms shut down IT systems around the world. These events revealed two types of manufacturing companies: those that continued production and those that had to shut down because of the worms or the IT response to the worms. Some companies were able to “raise the drawbridge” and continue production by disconnecting their operational systems from their business systems. Other companies could not separate their business and operational systems and had to shut down production. These responses illustrated the difference between tightly coupled and loosely coupled systems, and between global and local systems.

Many IT departments have an unspoken bias toward tightly coupled global systems because such systems are generally easier to build and maintain. For example, maintaining a single global instance of a document management system is easier than maintaining one instance per site. In addition, if connections are needed to other systems, it is easier to hardcode a tight synchronous connection than to design and implement a standards-based asynchronous connection.

When everything is working, tightly coupled global systems are fine, but when things go wrong, the effect can be catastrophic. The best approach to system design is to follow the well-known saying, “Expect the best, but plan for the worst.”

Companies that require absolutely reliable systems, such as banks and other financial institutions, will often set up separate servers, networks, and support organizations for the critical systems. These are separate from the normal business systems supporting HR, purchasing, and logistics. This same approach is required for manufacturing companies when the IT systems are critical to maintaining plant operations.

One of the first steps in this approach is to identify the systems critical to operations. Every system in use by operations needs to be examined. For example, if a company uses a centralized configuration management server to control all changes to PLC code, then during an IT outage that makes the server unavailable, the maintenance group may be unable to make emergency changes to PLC code. Another example of a critical system may be a global license manager that grants right-to-use for software packages. If an IT outage occurs, users may be unable to access displays, reports, or recipe management systems because they can’t obtain the right-to-use license.
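One way to make this review concrete is to keep a simple inventory of which operational tasks depend on which IT systems. The following is a minimal sketch in Python, not a prescribed method: the `Dependency` record, the `at_risk` function, and every inventory entry are hypothetical names and values chosen to mirror the examples above.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Dependency:
    """One operational task and the IT system it relies on (hypothetical model)."""
    function: str       # what operations needs to do
    system: str         # the IT system that task depends on
    location: str       # "local" (on site) or "global" (reached over the WAN)
    has_fallback: bool  # True if a manual or offline procedure exists


def at_risk(deps: Iterable[Dependency]) -> List[str]:
    """Return tasks that rely on a global system and have no fallback."""
    return sorted(
        {d.function for d in deps if d.location == "global" and not d.has_fallback}
    )


# Illustrative entries only; a real inventory comes from examining each system.
INVENTORY = [
    Dependency("emergency PLC code change", "configuration management server", "global", False),
    Dependency("open operator displays", "license manager", "global", False),
    Dependency("record batch data", "site historian", "local", False),
    Dependency("receive production schedule", "business planning system", "global", True),
]

if __name__ == "__main__":
    for task in at_risk(INVENTORY):
        print("Stops during a WAN or IT outage:", task)
```

Tasks flagged by such a list are candidates for a local instance, a buffered interface, or a documented manual procedure.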

When designing manufacturing systems to withstand IT failure, a good approach is to use local systems instead of global systems. This eliminates outages caused by WAN failures, which were a major cause of plant shutdowns during the MSBlast attack. Multiple local systems are more expensive to maintain, but they are much more robust in the face of failures.

Another good approach for designing robust systems is to use system interfaces that are asynchronous and buffered. This approach tolerates temporary loss of communications or system failures without causing cascading system failures. Interfaces based on messaging systems are especially robust in the face of network and application failures. These interfaces should be the default choice for systems that do not need real-time synchronous communication.
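The idea of an asynchronous, buffered interface can be sketched as a store-and-forward link. The Python below is an illustration under stated assumptions rather than a production design: `StoreAndForwardLink` and `FakeBusinessSystem` are hypothetical names, the buffer is held in memory (a real interface would normally persist it or use a messaging product), and a failed delivery is assumed to raise `ConnectionError`.

```python
from collections import deque
from typing import Callable, Deque, List


class StoreAndForwardLink:
    """Queues outbound messages locally and forwards them when the link is up."""

    def __init__(self, send: Callable[[str], None], max_buffered: int = 10_000) -> None:
        self._send = send
        # Bounded buffer: once full, appending drops the oldest message.
        self._buffer: Deque[str] = deque(maxlen=max_buffered)

    def publish(self, message: str) -> None:
        """Queue the message and attempt delivery; a link failure never reaches the caller."""
        self._buffer.append(message)
        self.flush()

    def flush(self) -> int:
        """Deliver buffered messages in order; return how many were sent."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break  # link is down; keep the remaining messages for later
            self._buffer.popleft()
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of messages still waiting to be delivered."""
        return len(self._buffer)


class FakeBusinessSystem:
    """Stand-in for a remote business system that can go offline."""

    def __init__(self) -> None:
        self.online = True
        self.received: List[str] = []

    def receive(self, message: str) -> None:
        if not self.online:
            raise ConnectionError("business system unreachable")
        self.received.append(message)


if __name__ == "__main__":
    erp = FakeBusinessSystem()
    link = StoreAndForwardLink(erp.receive)
    link.publish("batch 101 complete")
    erp.online = False                  # simulated IT outage
    link.publish("batch 102 complete")  # production continues; message is buffered
    print("pending during outage:", link.pending)
    erp.online = True                   # connection restored
    print("forwarded after recovery:", link.flush())
```

The plant-side caller only ever calls `publish`, so an outage on the business side delays reporting but does not stop the operational system.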

Fortunately, not all solutions to robust operations in the face of IT outages need to be technology oriented. Phone lists and faxes can be used to collect critical decision information, and paper backup systems can be used to record critical information. The worst choice, however, is to design an operational system on the assumption that there will be no failures in network infrastructure or in IT applications.

Author Information
Dennis Brandl is the president of BR&L Consulting, a consulting firm focusing on manufacturing IT solutions, based in Cary, N.C. dbrandl@brlconsulting.com