Open Systems Reliability

Downtime is the last thing operations people want their control systems to experience, especially where system utilization and/or finished goods value is high. Good reliability and safety in process equipment and related controls is near the top of every control engineer's buyer's list, followed closely by budget constraints and a need to rapidly install and get the system operational.

By Dr. William M. Goble, October 1, 2001



Sidebars: How Linux addresses open systems; What makes software reliable?; DLL hell finally exorcised


Many control engineers consider “open systems” to be at least part of the answer to cost issues, but do open systems present a higher safety or availability risk than proprietary or closed systems?

Most people understand purchase price shouldn’t be the sole criterion for selecting a system, but system reliability and availability are seldom factored into the decision-making process to produce hard numbers for an acceptable amount of unplanned downtime. Fewer still produce hard numbers for the associated costs of lost production, wasted energy and raw materials, safety and environmental consequences, customer impact, and so on.

To avoid later surprises, both open and proprietary control systems should be evaluated to ensure everyone understands the sources, possibilities, and consequences of unplanned downtime resulting from system failures.

Promise of open systems

It has been more than 20 years since we first started hearing about “open systems.” Early presentations showed industrial controllers of various manufacturers connected using standard networks with computer systems and human-machine interface (HMI) consoles. Each piece of equipment contained standardized connectors and communications protocol and onboard intelligence to handle device identification and data needs. When connected and powered, a new piece of equipment would automatically identify itself and begin sharing data with other connected equipment.

The promise was that control engineers would spend most of their time working on process improvements and very little time worrying about the system and its communication needs. The promise was a highly flexible control system environment, where “best of breed” products could be selected, deployed, and used. During these visionary presentations, nobody explained that control engineers would need to understand operating system nuances, dynamic link libraries (DLLs), and how to make multiple communication protocols share the same wire.

Although earlier attempts were made, most agree the first big push toward open systems came with the General Motors MAP (Manufacturing Automation Protocol) effort of the 1980s. Untold hours were spent defining protocols followed by a massive education effort. A groundswell was initiated when operating companies sent letters to equipment manufacturers stating that in the future, supplier systems would only be considered if they supported MAP. The vision was about to become reality.

Unfortunately, the communications standard evolved into an “everything for everybody” protocol and became quite complex. In fact, the underlying software became so complex it often required twice the computing power and several times the memory of the controller software it was supposed to support. Even when that level of computational power was made available, data rates seldom exceeded 500 variables per second, about the same performance as a simple serial bus.

Complexity of the software introduced higher system failure rates and unexplained computer crashes. Some were traced to undefined (proprietary) message content, others remained untraceable and never explained. A complex communications mechanism just did not work, even in the name of “open.”

By the early 1990s, once again it appeared the goal of “open” would be met, or at least that’s what the advertising wars implied.

Most major DCS (distributed control system) and PLC (programmable logic controller) vendors were touting open as the main system attribute. While the offering was far from the ideal vision, control equipment vendors actually began to deliver on some of the promises. Using Ethernet technology and the de facto TCP/IP standard, third-party operator stations actually connected with some DCS and PLCs. But again, strange, nonrepeatable, and unexpected system failures appeared, oftentimes long after the initial startup. When these sorts of problems appeared, which vendor was responsible? Finger pointing became blatant, and control engineers began to realize they did need to learn about operating system nuances, DLLs, and how to make multiple communication protocols share the same wire. In a few instances, the vision became a nightmare.

Approaches to open systems

Even today the entire open system vision remains elusive with two fundamentally different approaches being pursued by different groups.

One approach involves the concept of “open source” and adoption of de facto standards from the personal computer (PC) environment using either the Microsoft or the open-source Linux operating systems. This leverages lower-cost or free software constructed on Microsoft or Linux foundations with commodity PCs to deliver the ultimate in low-purchase-cost solutions.

The second approach is the fieldbus approach, where boundaries are established around a piece of equipment or an operational unit and a standardized communication protocol is established within the boundary, much the same as the original MAP concept.

Despite years of fieldbus wars (and skirmishes still exist) as various factions pushed their version of a protocol, two versions are gaining widening acceptance in the marketplace: Profibus and FOUNDATION fieldbus. Both have a wide range of capabilities including data transfer, programming support, and network management. And both have a wide and growing range of certified available products.

While methods vary for each organization, certification testing helps users confidently select and install different manufacturers’ products on the same fieldbus network, reducing the uncertainty and finger pointing of previous open-solution attempts.

Open system reliability and safety

Open system offerings are starting to show promise. But is such a system reliable and safe enough? Reliability is defined as “the probability of successful completion of intended functions during an interval of time” and all sources of hardware and software failures count.
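To turn that definition into a hard number, one common simplification is to assume a constant failure rate, in which case reliability over an interval decays exponentially. The sketch below rests on that assumption, which the article does not state, and the failure rate used is a hypothetical figure chosen only for illustration.

```python
import math


def reliability(hours: float, failure_rate_per_hour: float) -> float:
    """Probability of no failure during `hours`, ASSUMING a constant failure rate."""
    if hours < 0 or failure_rate_per_hour < 0:
        raise ValueError("hours and failure rate must be non-negative")
    return math.exp(-failure_rate_per_hour * hours)


# Hypothetical figure: on average one failure per 20,000 operating hours.
rate = 1 / 20_000

# Chance of running a full year (8,760 hours) without a failure: roughly 0.65.
one_year = reliability(8_760, rate)
```

Under these assumed numbers, about one system in three would see at least one failure in a year, which is the kind of figure that can be set against the cost of an hour of lost production.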

When considering all possibilities for failure in an open system, it is no wonder there have been transient unexplained problems. From a reliability and safety perspective, the world of open systems is nothing like the world of proprietary DCS/PLC systems. Compared to the proprietary world, things are more flexible, more complex, and relatively uncontrolled, at least in terms of ensuring all the parts and pieces work in harmony (a goal of certification).

System failures typically occur when some unanticipated input enters the system. The rogue input can be thought of as “stress” on a software system. The software’s ability to respond to that stress without failure represents its “strength.” (See “What makes software reliable?” sidebar.) For software to be strong, the developers must spend a lot of time thinking about all the things that can go wrong and then do something to prevent the wrong from happening. This is difficult since most software developers have enough trouble thinking about getting things working with expected input conditions. In many cases, unanticipated input conditions cause unanticipated output responses. Things happen like writing to the wrong place in memory or starting a chain of events that can crash a computer, wipe out memory, or even transmit a wrong output.

In the single-vendor proprietary system, communication protocols were designed by one team, perhaps even one person. Specific messages needed for interaction between an operator station and a controller were clearly defined. Even though it was unlikely that a bad message would be sent, error-checking routines knew what the incoming messages would be and rejected everything else. All software from device driver to message handler was written and understood by one design team working from a single functional specification. In such an environment, it is relatively simple to build “strong” software.
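The “know what is coming and reject everything else” style of error checking can be sketched in a few lines. This is an illustration only, not any vendor’s protocol: the message names, field names, and ranges below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical closed set of messages an operator station may send to a
# controller, with the allowed range for each field. Nothing else is accepted.
ALLOWED_MESSAGES = {
    "SET_SETPOINT": {"loop_id": (1, 64), "value": (0.0, 100.0)},
    "ACK_ALARM": {"alarm_id": (1, 9999)},
}


@dataclass
class Message:
    kind: str
    fields: dict


def validate(msg: Message) -> bool:
    """Return True only for a known message whose fields are all in range."""
    spec = ALLOWED_MESSAGES.get(msg.kind)
    if spec is None:
        return False  # unknown message type
    if set(msg.fields) != set(spec):
        return False  # missing or unexpected fields
    for name, (low, high) in spec.items():
        value = msg.fields[name]
        # Reject non-numeric values (and booleans, which Python treats as ints).
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not low <= value <= high:
            return False  # out of range (this also rejects NaN)
    return True
```

A receiver built this way accepts `Message("SET_SETPOINT", {"loop_id": 3, "value": 42.5})` and silently drops an unknown `"REBOOT"` message or a setpoint of 250.0. The filter is only easy to write when the whole message set fits in one table, as it does in a single-team design.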

Even so, many software failures have occurred, but when that happens, it’s clear who’s responsible, and the failures become the responsibility of the developers. In the proprietary DCS/PLC environment, one design team has clear responsibility for reliable and safe system operation. This is neither probable nor possible in an open-system environment.

In the open environment, it is more likely that unexpected data can be communicated. General-purpose communication protocols have become more complex, making it harder for software developers to filter unexpected data messages. It is unlikely that one design team even understands the whole design, and it’s very murky who is responsible when a failure occurs. As a result, systems get rebooted, failure events go unreported, and control and automation systems become less reliable.

Moving into an open-source PC environment, there is more flexibility and more opportunity for trouble. In this less-controlled environment, dynamic link libraries developed by different companies are often incompatible, and installation procedures aren’t always elegant enough to prevent one from overwriting another. (See “How Linux addresses open systems” and “DLL hell finally exorcised” sidebars.)

Versions of the operating system written for different countries do not always react identically. What happens when the input “1.000” (the number one with four significant figures) in the U.S. version of the operating system is interpreted as 1,000 (one thousand) by the European version? What happens when language dependency is not accounted for in the software design?
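The ambiguity is easy to reproduce. The sketch below uses a hypothetical helper, `parse_number`, that takes the separators as explicit arguments rather than inheriting them from the operating system’s regional settings; the same text then yields two values a factor of 1,000 apart.

```python
def parse_number(text: str, decimal_sep: str, group_sep: str) -> float:
    """Parse text using explicitly stated separators instead of OS defaults."""
    if decimal_sep == group_sep:
        raise ValueError("decimal and grouping separators must differ")
    # Drop grouping marks, then normalize the decimal mark to a period.
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return float(cleaned)


# The same characters read under two conventions.
us_value = parse_number("1.000", decimal_sep=".", group_sep=",")  # 1.0
eu_value = parse_number("1.000", decimal_sep=",", group_sep=".")  # 1000.0
```

Making the convention an explicit part of the interface, or of the message itself, is one way to keep a setpoint from changing meaning when the software moves to a differently configured machine.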

While many groups are working diligently to improve the quality of the software they produce, independent software quality audits have shown that many organizations still produce software at a level called “chaos.” Audits have revealed that software safety and reliability techniques are often not understood by software developers, and while many developers are good at what they do, variability is considerable. That alone introduces risk of control system failure.

Abandon open systems?

Is the risk of unsafe operation or downtime with open systems too great? With all this potential trouble, shouldn’t we abandon the idea of open systems? Of course not; the benefits will someday be spectacular. But until then, tread very carefully. Understand where such equipment can be safely used. Understand the cost of downtime and consider the entire life-cycle costs. Consider getting the details of each vendor’s software quality, safety, and reliability procedures, and if that information is unavailable, buyers should beware.

Avoid complexity by not allowing general-purpose use of personal computers where control is performed. Do not install anything that is not required: no games, no office tools, no flight simulators. Perform system testing only after all hardware and software is installed. When upgrades or new hardware or software are installed, retest everything.

The promise of open systems will someday arrive, and the promise can be achieved without sacrificing safety or reliability. Eventually more and more developers will conduct system and software hazard analysis to identify potential problems. System developers and integrators will conduct failure modes and effects analysis on software and system designs. Standard interfaces will include message filtering and error checking to the level sufficient for process control.

Yes, the day is approaching when control engineers will spend their day improving the manufacturing process. Until that day arrives, ask questions and apply common sense in where and how you deploy open systems.


Author Information

Dr. William Goble is co-founder and president of

How Linux addresses open systems

Pros and cons of Microsoft-based software appear openly and are frequently disseminated across the Internet; Linux pros and cons (especially the cons) are much more difficult to locate.

Part of this disparity is likely the result of some users’ feelings when comparing giant Microsoft with the “Linux guy” in the next cubicle. The recently announced Open Controls Laboratory (Westborough, Mass.) is headed by Dr. Peter Wurmsdobler, a recognized expert on real-time Linux. Dr. Wurmsdobler, involved with the LinuxPLC (PuffinPLC) project since early 2000, provided perspective on how an open-source operating system, like Linux, can be considered a viable candidate for use in high-availability control and automation systems.

CE: Operating system failures have occurred because of incompatible dynamic linking library modules being loaded by different applications. Please explain how similar incompatibility issues are handled in Linux.

Dr. Wurmsdobler: The content of a Microsoft dynamic link library, for instance, is split into several layers and priority spaces in Linux. First, everything that concerns hardware becomes a module that can be directly compiled into the Linux kernel or separately compiled as a dynamically loadable kernel module. Each module and kernel contain version-control numbers. If an attempt is made to load an incompatible module (i.e., a module requiring function calls not supported by the kernel) or there is a version mismatch, the module insertion is rejected by the kernel.

Second, user-space libraries also contain version control and are placed in predefined directories accessed by a dynamic linker/loader. When an application is launched, library function calls are verified. If a mismatch occurs or a function call is missing, the application aborts.

A significant benefit of Linux is that multiple versions of the same library functions can exist without application confusion. This is made possible by a symbolic link pointer set by the application to the library functions that are actually used.
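The side-by-side arrangement can be imitated with ordinary files and symbolic links. The sketch below is a toy model rather than the real loader: the library name `libdemo` and its version numbers are made up, and on a typical Linux system links of this kind are maintained by tools such as `ldconfig`.

```python
import os
import tempfile


def resolve(lib_dir: str, soname: str) -> str:
    """Follow a versioned link name to the real library file it points at."""
    return os.path.basename(os.path.realpath(os.path.join(lib_dir, soname)))


with tempfile.TemporaryDirectory() as lib_dir:
    # Two releases of a made-up library installed side by side.
    for real_file in ("libdemo.so.1.4.2", "libdemo.so.2.0.1"):
        open(os.path.join(lib_dir, real_file), "w").close()

    # One link per major version; each application asks for the one it was built against.
    os.symlink("libdemo.so.1.4.2", os.path.join(lib_dir, "libdemo.so.1"))
    os.symlink("libdemo.so.2.0.1", os.path.join(lib_dir, "libdemo.so.2"))

    old_app_gets = resolve(lib_dir, "libdemo.so.1")
    new_app_gets = resolve(lib_dir, "libdemo.so.2")
```

Because the older application keeps asking for `libdemo.so.1`, installing release 2 does not change the code it runs.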

CE: Microsoft Windows uses a somewhat standardized error handling dialogue box to notify users of problems. In an open source environment, how are error messages standardized and who’s the keeper of an error code library?

Dr. Wurmsdobler: Standard error messages of system and library calls are defined as numbers in a header file that are translated into POSIX (Portable Operating System Interface) compliant text and made available to any number of destinations including terminals, printers, and files.

This approach is significantly different from the Microsoft Windows Operating Systems where the operating system and the user interface are combined.

The Linux approach seems well-suited to control applications, such as might run in a LinuxPLC where no window system exists, and provides more flexibility because error messages can be directed to another network-connected device where a systems engineer is able to trace the error to its root cause.
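The number-to-text translation is visible from most languages. As a small sketch, Python’s standard `errno` and `os` modules expose the system’s error numbers and their messages; the helper names `describe` and `report` are hypothetical, and the exact message wording depends on the platform.

```python
import errno
import io
import os
import sys


def describe(code: int) -> str:
    """Return 'NAME (number): text' for a system error number."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"


def report(code: int, destination=sys.stderr) -> None:
    """Write the description to any file-like destination."""
    destination.write(describe(code) + "\n")


# The same message can go to a terminal, a file, or an in-memory log.
log = io.StringIO()
report(errno.ENOENT, log)
report(errno.EACCES, log)
```

Since the destination is just a writable stream, a network socket wrapped as a file object would serve equally well, which is the flexibility described above.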

CE: Control system failures have occurred as a result of commercially available operating systems that have been “personalized” for different world areas. For example, different interpretations of commas and periods in numbers, or different units of measure have been traced as root cause system failures. How are similar problems identified, handled, and/or avoided in the Linux world?

Dr. Wurmsdobler: Linux evolved from UNIX, which was internationally developed; thus language and number argument problems were addressed during the early days of UNIX development. Today, Linux tries to remain compliant with international and open standards, especially POSIX.

On the application side, most Linux applications expecting numbers will implement the distinction at the application layer.

CE: In an open-source environment how does one verify the quality of the software?

Dr. Wurmsdobler: The question can also be asked of closed software development environments. Having a documented software development process and/or “owning” all the software developers doesn’t ensure quality software.

The development process can be flawed or, more likely, the process is not well understood or completely followed by the developers, often because of marketing-driven time constraints.

In the open-source world, a person begins a project to solve a specific user need. Often the development community takes over parts of the project, makes suggestions and improvements, and self-monitors and evaluates the produced results. Most of the time, this allows the original user to remain involved in the evolution of the solution, a much more natural process.


Dave Harrold, senior

What makes software reliable?

Within the reliability engineering community there is a directly applicable notion known as the stress-strength concept: all failures occur when some “stress” is greater than the associated “strength.”

This stress-strength concept appears throughout the mechanical and civil engineering community where the stress is a mechanical force and strength is the ability of a structure to resist that force.

Electronic hardware is exposed to many kinds of physical and environmental stress, including heat, humidity, chemicals, shock, vibration, electrical surge, electrostatic discharge, radio waves, and others.

Operational stress includes incorrect input commands, incorrect maintenance procedures, bad calibration, improper grounding, and others.

How are these stresses applicable to software, and what are the things that can stress software-based systems?

Software stress examples

The console of an industrial operator station had been functioning normally for two years. On one of the first shifts of a new operator, the console stopped updating the screen and would not respond to subsequent operator commands.

The unit was powered down and successfully restarted with no hardware failures.

With more than 400 units in the field and 8,000,000+ operating hours, the console manufacturer found it difficult to believe a software fault existed in such a mature product.

Despite doubts, the console manufacturer instructed software and hardware product development teams to conduct a thorough root-cause analysis.

One of the manufacturer’s test engineers visited the customer site and interviewed the new operator. During the visit the engineer noted, “This guy is very fast on the keyboard.”

That piece of information permitted the manufacturer’s development team to trace the source of the problem to the operator striking the alarm acknowledgment key within 32 milliseconds of the alarm silence key, causing a memory write problem followed by a processor lockup.

In another example, a computer stopped working after receipt of a network communication message. Testing and analysis eventually revealed the message came from a device with an incompatible operating system. While the sending device used the correct “frame” format, the operating system of the receiving device utilized different data formats. Because the receiving computer did not check for a compatible data format, the data bits within the frame were incorrectly interpreted. The situation caused the computer to fail within a few seconds.
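A mismatch of this kind can be reproduced with byte order alone. In the sketch below the payload layout and the format tag value are hypothetical: the sender writes a tagged big-endian value, one receiver blindly reads the bytes in its own convention, and another checks the tag first.

```python
import struct

# Hypothetical payload: a 2-byte format tag followed by a 4-byte float,
# both big-endian. The tag value is made up for this sketch.
FORMAT_TAG = 0xBE01


def encode(value: float) -> bytes:
    """Sender: pack the tag and the value in big-endian order."""
    return struct.pack(">Hf", FORMAT_TAG, value)


def decode_unchecked(payload: bytes) -> float:
    """Receiver with a different convention and no check: reads little-endian."""
    return struct.unpack("<f", payload[2:6])[0]


def decode_checked(payload: bytes) -> float:
    """Receiver that verifies length and data format before trusting the bits."""
    if len(payload) != 6:
        raise ValueError("unexpected payload length")
    tag, value = struct.unpack(">Hf", payload)
    if tag != FORMAT_TAG:
        raise ValueError("unsupported data format")
    return value


frame = encode(75.0)
good = decode_checked(frame)    # 75.0
bad = decode_unchecked(frame)   # a meaningless, near-zero number
```

Both receivers accept the frame without complaint; only the checked one would refuse a payload whose tag it does not recognize, which is the missing test described above.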

Many other examples of software failure are documented. Most can be traced to some combination of events considered unlikely, rare, or even impossible.

As with hardware, a software failure occurs when stress is greater than strength. The strength of a software system is a function of several things, including the number of software faults (human design errors, or “bugs”) present and the amount of error checking and data validation conducted.

Software system stress is the combination of inputs, timing, and stored data acted upon by the processor. Inputs and timing of inputs may be a function of other computer systems, operators, or both.

DLL hell finally exorcised

According to information obtained from Microsoft’s (Redmond, Wash.) web site, the number one source of fragility in Microsoft’s Windows operating system, commonly known as DLL hell, has been exorcised in Windows 2000 Professional.

Dynamic link libraries (DLLs) appeared in early releases of Microsoft Windows with a goal of sharing as many resources as possible while conserving, at the time, valuable disk and memory space.

Today, disk space and memory are relatively inexpensive and the use and benefits of DLLs make less sense.

Microsoft’s solution is Windows File Protection (WFP), a two-part mechanism for protecting critical system files.

The first WFP mechanism uses digital-signature cryptographic technology that verifies the source of a system file and alerts WFP when modification of a critical system file is attempted.

The second WFP mechanism is the System File Checker, a tool that manages a catalog of file versions. If a cataloged file is missing or corrupted, WFP renames the affected catalog file and retrieves a cached version from the Dllcache folder. If a cached version is unavailable, WFP makes a request for a replacement copy.
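The check-then-restore idea can be sketched generically. The code below is not Microsoft’s implementation: the function names, the use of SHA-256 digests as the “catalog,” and the `.bad` suffix for the set-aside copy are all assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def digest(path: Path) -> str:
    """Fingerprint a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def check_and_restore(target: Path, catalog: dict, cache_dir: Path) -> str:
    """Return 'ok', 'restored', or 'needs-replacement' for one protected file."""
    expected = catalog[target.name]
    if target.exists() and digest(target) == expected:
        return "ok"

    cached = cache_dir / target.name
    if cached.exists() and digest(cached) == expected:
        if target.exists():
            # Set the altered file aside rather than deleting it.
            target.replace(target.with_name(target.name + ".bad"))
        shutil.copy2(cached, target)
        return "restored"

    # No trustworthy cached copy: someone must supply the original media.
    return "needs-replacement"
```

Called after an installer has overwritten a cataloged file, the function puts the cached version back; if the cache is also missing or altered, it reports that a replacement copy is needed.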

Unfortunately, there is apparently no way to add these features to existing Windows 95/98 and NT installations, so upgrading to Windows 2000 Professional is the only solution. However, implementation of the new rules governing component installation is likely to be way off for some developers, possibly creating a new class of critical file management problems.