Managing Risk: Don’t Fall Flat


KEY WORDS

Advanced process control and automation

Computer software

Open systems

Quality assurance

Safety

Standards and regulations

System analysis and design

Testing

Sidebars: Failing to follow best practices puts software at risk; Managing risk improves production

Broad grins on the faces of skydivers in free fall come from two things. First, the exhilaration of free falling through space just can’t be concealed; and second, skydivers are confident they have properly assessed and mitigated the risk before stepping out of an airplane. Careful inspection and preparation of equipment, hours of rigorous training, and reliable back-up systems permitted skydiving enthusiasts to make more than 3.25 million jumps in the U.S. in 1998 with a fatality rate of less than 0.001%.

In process and manufacturing environments, risk assessment is too frequently reserved for formal hazard and operability (HAZOP) studies conducted as part of a project. The reality is, hardly a day goes by when each of us is not faced with some form of risk assessment at home, in our travels, or at work. It may be choosing to use a chair with rollers instead of a ladder, accelerating through a traffic light that’s about to turn red, or attempting a heroic act to prevent a plant shutdown; each is a risk we have chosen to take. In more situations than we care to count, it’s blind luck that keeps us out of harm’s way. It shouldn’t be like that. Most of the time there is time to proactively assess risk and develop appropriate precautions.

Manage risk, reduce surprises

Skydivers are averse to surprises during a jump. Many choose to go beyond compliance with regulations, which require a manually operated backup parachute, and embrace risk management. For a $1,200 investment, skydivers move from merely complying with regulations to managing risk with a computerized automatic parachute activation system equipped with altitude and rate-of-descent sensors that trigger an automatic opening of the parachute when conditions become unsafe.

Like skydivers, manufacturers have a choice between compliance and risk management.

In a Feb. 1999 AMR Research (Norwood, Mass.) report on environmental health and safety, senior analyst Leif Eriksen contrasted the no-win situation of compliance management with the competitive advantages offered by embracing risk management. According to Mr. Eriksen’s report, compliance management is a reactive game of ever-changing regulations, and an ad-hoc approach often results in point solutions that cost more to support than to purchase. Mr. Eriksen recommends following the lead of forward-thinking, multiplant corporations that use life-cycle cost to gain competitive advantage. According to Mr. Eriksen, companies adopting a cradle-to-grave risk management philosophy spend less time focused on current regulations and more time ensuring they’re able to respond to any regulation.

When risk is proactively managed, fewer surprises occur, and in nearly every business arena fewer surprises equate to less worry and more profit. For example, U.S. Federal Reserve Chairman Alan Greenspan recently informed the banking industry to expect “significant” changes in banking supervision. Mr. Greenspan is rightly concerned that recent bank mergers, changes in banking strategy, and banks’ penchant for risk increase the chance of a single bank failure significantly damaging the U.S. or world economies.

For the first part of the 21st century, assembling mitigation strategies to avoid economic collapse will be important news, but it’s doubtful they will garner the TV and print coverage Y2K has received during the past several years.

Y2K has educated much of the world about the importance of risk assessment and mitigation. Perhaps risk assessment and mitigation aren’t the exact words in use, but that’s exactly what Y2K is all about. Anyone who has read even one article about Y2K is aware that software developers in the ’60s, ’70s, and even into the ’80s saved computer memory by storing years as two digits (i.e., 99 = 1999, 00 = 1900 or 2000). What we didn’t know was that software written 20 or 30 years ago would still be in use today.
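
To make the two-digit ambiguity concrete, the following minimal C++ sketch (not drawn from any actual remediation project) contrasts the legacy "assume 19xx" interpretation with the "windowing" fix many Y2K teams applied; the pivot year of 70 is an assumption chosen only for illustration.

#include <iostream>

// Legacy-style interpretation: every two-digit year is assumed to be 19xx,
// so a stored "00" becomes 1900 instead of 2000.
int legacyYear(int twoDigitYear) {
    return 1900 + twoDigitYear;
}

// Windowed interpretation: a pivot (here 70, an assumption) splits the range,
// so 70-99 map to 1970-1999 and 00-69 map to 2000-2069.
int windowedYear(int twoDigitYear, int pivot = 70) {
    return (twoDigitYear >= pivot ? 1900 : 2000) + twoDigitYear;
}

int main() {
    std::cout << "Stored '99': legacy=" << legacyYear(99)
              << ", windowed=" << windowedYear(99) << '\n';  // 1999, 1999
    std::cout << "Stored '00': legacy=" << legacyYear(0)
              << ", windowed=" << windowedYear(0) << '\n';   // 1900, 2000
    return 0;
}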

When year 2000 appeared on the horizon, two Y2K mitigation efforts commenced. The first involved programmers reviewing billions of lines of software code to find and fix two-digit problems. The second is occurring among those who doubt the Y2K problem has been found and fixed. This second group is preparing for the worst by buying generators, stockpiling food and water, burying money in the backyard, and bracing for a return to primitive living.

Assuming the problem will be found and fixed (or mostly so), our very nature requires that we identify additional benefits to help justify the billions of dollars and millions of hours expended on behalf of Y2K.

Rollover advantages

Significant benefits manufacturing users associate with Y2K efforts include:

Most manufacturing sites and information technology groups were amazed at the quantity and variety of software and embedded devices on or near the plant floor affected by two-digit date rollover problems. Developing this unique, one-time inventory has forced most companies to reconsider how systems are specified and deployed;

Many Y2K teams used this one-time opportunity to replace or bring their manufacturing software and systems to the latest revision level. Where single copies of a vendor’s software existed among multiple copies of another vendor’s software, the “oddball” software was replaced. Similarly, Y2K teams replaced entire control systems, sometimes because the old system couldn’t be made Y2K compliant, and sometimes because the change facilitated easier future support; and

Much of the software requiring Y2K compliance review was monolithic, poorly and/or inconsistently constructed, and frequently void of meaningful comments. These deficiencies—found as frequently in plant floor systems as in corporate enterprise systems—highlight the importance of developing, using, and maintaining sound software design, implementation, testing, and change-control guidelines and standards.

At a recent AMR Research conference, senior analyst Kevin Prouty shared his past experience with poorly designed and documented software.

When it came time to add a new press to the plant floor, the software development engineer told Mr. Prouty it would take less time to write new logic from scratch than to figure out and rework the existing press logic.

Stories like this are too common and exist not because of Y2K, but because software developed for plant-floor use has been called everything except software and thus avoided the scrutiny of “real” software. That needs to change!

Fortunately, one way or the other, Y2K-related problems will soon be behind us. Assuming we have not returned to primitive living, we should apply what we already knew, but ignored, and what we have learned.

Avoiding a repeat of the past

In September 1999, NASA’s (U.S. National Aeronautics and Space Administration) Mars Climate Orbiter was lost as it attempted to enter orbit around Mars because two teams of software engineers used different programming standards: one team used metric units, the other used English units, and the software program designed to establish Mars orbit “broke.”
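
As a hypothetical illustration (these types are not NASA’s actual code), wrapping values in distinct unit types is one way a programming standard can turn a metric/English mix-up into a compile-time error rather than a silent numerical one.

#include <iostream>

// Distinct types for each unit; a value of one type cannot be passed
// where the other is expected.
struct NewtonSeconds { double value; };
struct PoundSeconds  { double value; };

// The trajectory model expects SI impulse only.
void applyImpulse(NewtonSeconds impulse) {
    std::cout << "Applying " << impulse.value << " N*s\n";
}

int main() {
    applyImpulse(NewtonSeconds{450.0});    // compiles and runs
    // applyImpulse(PoundSeconds{450.0});  // would not compile: wrong unit type
    return 0;
}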

In the future, the sophistication and number of devices with software touching the plant floor will increase, and unless they are properly managed, incidents similar to NASA’s Mars orbiter loss will occur.

Like NASA, users increasingly rely on teams of user and system integrator personnel to configure, program, and integrate application-specific solutions. To avoid NASA-like errors, users should review how they specify, select, purchase, accept, maintain, and support software-dependent products and application solutions.

Y2K has already demonstrated that software can exist for 20+ years. Ensuring software remains maintainable over such long periods requires the use of proven standards and procedures throughout the software’s life.

Some user companies have “preferred” integrators and engineering contractors that differ from one plant area to another. It’s fine to use different contract personnel, but risk increases if the end-user’s software development and programming standards are not harmonized with the integrators’ standards (i.e., the metric-vs.-English-units syndrome).

Among the questions users should ask themselves and their suppliers/integrators: do you have in place, and can you verify, that you:

Follow established software development standards? Development standards identify the individuals responsible for developing software specifications, specify authority, and define the sequence of steps to design, submit, review, and approve software specifications;

Follow established software-programming standards? Programming standards define the software’s logical structure, how pre-engineered and tested library modules are used, how new software is organized and documented, how variables are named, how revisions are managed, how testing is performed and who is responsible for it, and how errors are documented, corrected, and verified to be removed (a hypothetical illustration of such conventions follows this list);

Have consistent procedures to document and manage changes throughout the software’s life?

Have consistent procedures on how system-wide hardware and/or software upgrades will be managed, tested, and validated?

Maintain copies of every software revision you have ever delivered to a customer (or used to make a customer product), and ensure duplicate copies exist off-site to protect against loss in the event of a disaster?

Have provisions to make your software available in the event your company is merged or no longer exists? and

Have procedures to document, test, and verify hardware and/or software decommissioning does not adversely impact remaining devices and/or software programs?
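
As a hypothetical illustration of the kinds of conventions a programming standard might mandate (a module header, a revision history tied to change requests, and tag-based variable naming), consider the sketch below; the format, tags, and names are assumptions, not any particular company’s standard.

// ---------------------------------------------------------------------------
// Module  : FIC-101 flow-control logic (hypothetical example)
// Author  : responsible engineer named per the development standard
// Rev 1.0 : 12-Mar-1999  Initial release; reviewed and approved per standard
// Rev 1.1 : 05-Apr-1999  Corrected alarm deadband; change request CR-0042
// ---------------------------------------------------------------------------
#include <iostream>
#include <string>

// Naming convention (assumed): variables carry the instrument tag they serve,
// followed by their role and engineering units.
struct FlowLoop {
    std::string tag = "FIC-101";  // instrument tag
    double fic101_sp_gpm  = 0.0;  // setpoint, gallons per minute
    double fic101_pv_gpm  = 0.0;  // process variable, gallons per minute
    double fic101_out_pct = 0.0;  // controller output, percent
};

int main() {
    FlowLoop fic101;              // instance follows the documented conventions
    fic101.fic101_sp_gpm = 125.0;
    std::cout << fic101.tag << " setpoint " << fic101.fic101_sp_gpm << " gpm\n";
    return 0;
}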

Open software standards, such as Microsoft Foundation Class, OLE for Process Control (OPC), ActiveX, Visual Basic (VB), and others, are used in new control and automation systems and offer users greater flexibility and freedom of choice. But with flexibility comes the responsibility of taking the time and effort to establish sound software design, programming, testing, and support standards and procedures.

With flexibility comes responsibility

Traditional plant-floor control and automation system programming has been dominated mainly by instrument and process engineers and technicians who understand the process and are willing to learn to “program.” Some have adopted, developed, or evolved their own programming standards and guidelines. But with few exceptions, control and automation programs have made little use of modular programming techniques, making it difficult to determine when undocumented or unauthorized changes occur.

Many newer software development environments include audit trail and revision management tools. Systems without this capability can take advantage of third-party tools that automate software collection, archiving, revision comparison, and reporting (search www.controleng.com/buyersguide for companies providing software revision tracking tools).
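
As a minimal sketch of the idea behind such tools (the file names and logic are hypothetical, not any vendor’s product), a revision tracker can flag an undocumented change simply by comparing a checksum of the last approved program export against the current one.

#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <string>

// Compute a simple checksum of a program export file.
std::size_t checksumOfFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    std::string contents((std::istreambuf_iterator<char>(in)),
                         std::istreambuf_iterator<char>());
    return std::hash<std::string>{}(contents);
}

int main() {
    // Paths are placeholders for the archived and current program exports.
    std::size_t approved = checksumOfFile("archive/press_logic_rev7.pgm");
    std::size_t current  = checksumOfFile("online/press_logic_current.pgm");

    if (approved != current)
        std::cout << "WARNING: program differs from last approved revision\n";
    else
        std::cout << "Program matches approved revision\n";
    return 0;
}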

Today, new control and automation systems use object-oriented programming techniques.

Object-oriented programming permits development of a software object that mimics a physical entity (e.g., a motor or flow controller). Each object has associated attributes (e.g., operator faceplate, tag, engineering units, commands, and I/O channels). Objects can be created and tested once, then used as templates to provide consistency and reduce implementation effort.
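
The sketch below illustrates the idea in C++; the Motor class and its attributes are illustrative assumptions, not any vendor’s actual object model.

#include <iostream>
#include <string>
#include <utility>

// A reusable control "object": written and tested once, then instantiated
// as a template for every motor in the plant.
class Motor {
public:
    Motor(std::string tag, int outputChannel)
        : tag_(std::move(tag)), outputChannel_(outputChannel) {}

    // Commands exposed to the operator faceplate.
    void start() { running_ = true;  writeOutput(); }
    void stop()  { running_ = false; writeOutput(); }

    // Attributes available to HMI faceplates, alarms, and historians.
    const std::string& tag() const { return tag_; }
    bool running() const { return running_; }

private:
    void writeOutput() const {
        // In a real system this would drive the configured I/O channel.
        std::cout << tag_ << " -> channel " << outputChannel_
                  << (running_ ? " RUN" : " STOP") << '\n';
    }

    std::string tag_;
    int outputChannel_;
    bool running_ = false;
};

int main() {
    Motor feedPump("P-101", 3), ventFan("F-204", 7);  // two instances from one template
    feedPump.start();
    ventFan.stop();
    return 0;
}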

New Unified (hybrid) Control Systems (UCSs) designed to recognize objects and use only the attributes necessary for each unified device’s mission eliminate repetitive programming and simplify integration without compromising the openness to use products from other suppliers.

Being able to rely on the object-oriented development environment provided by UCSs goes a long way toward managing software development risk, but it doesn’t eliminate it. The very standards used to develop the UCS (i.e., Microsoft Foundation Class, ActiveX, OPC, etc.) provide the flexibility for users and/or integrators to deploy custom applications using VB or C++. During the sales cycle that sounds good, but it opens up the possibility (again) of having poorly constructed and poorly documented software as an integral part of the control and automation system. Users must understand and accept the responsibilities of selecting a flexible, open control system. Only after users have applied a skydiver’s risk management mentality to their control and automation systems will they, too, wear broad grins, knowing what to expect and exhilarated by the possibilities.

For more information about AMR Research, visit www.controleng.com/freeinfo.

Failing to follow best practices puts software at risk

Avoiding software errors requires following good software design, development, testing, and change-management practices. When any of these elements is missing or skipped, the integrity of the software is compromised, and the risk to people and assets becomes scary.

The following software incidents emphasize the importance of following best practices to reduce software risk.

Nov. 24, 1997, the U.S. Federal Aviation Administration faces massive computer shutdowns. A computer programmer finds an internal clock error due to a counter rolling over from 31 to 32 bits. Over 1 million lines of code are checked, 150,000 changes are made, and the “fix” is completed before the fateful day.

Nov. 26, 1996, a computer software upgrade is incorrectly installed, causing the computer to log off and lock out 2,000 directory-assistance operators.

Feb. 19, 1996, Air Canada flight 899 drags its tail section on liftoff. A recent software modification to the cargo-loading program to include a new class of aircraft is not tested for interaction with existing aircraft loading calculations. Flight 899 is improperly loaded. Hand calculations later verify that the computer’s loading data were incorrect.

Jan. 11, 1997, employee reprograms fast-food cash register, pockets $3,600, gets caught, and receives 10-year jail sentence.

Nov. 13, 1995, owners of computers with specific BIOS hear “Happy Birthday” song over and over on boot-up. Discharged employee’s last act before leaving the company was to program the BIOS to sing Happy Birthday on his birthday.

April 28, 1994, a computer analyst visiting a casino picks 19 of 20 numbers three times in a row and wins $620,000. The casino machine’s internal clock, used to seed its random-number generator, is missing. Each time the machine is reset, it generates the same number sequence.

June 1985 to Jan. 1987, six patients are overdosed during radiation therapy. Software designed for use in one radiation-therapy machine was modified for use in a different machine. Machine safety was controlled by software, but the machines used different computer processors with different interrupt cycles. Radiation therapists were so familiar with the machine’s setup sequence that they could enter data faster than the computer could read and store the setup data. The patient radiation dosage input was missed, and the machine defaulted to the 100% dosage level.

Software is everywhere; it affects our daily lives in ways we can’t even imagine. When those responsible for designing, developing, testing, and maintaining reliable, robust software skip one important step, just one time, millions of lives can hang in the balance, including yours and mine.

Managing risk improves production

Substantial time and cost savings resulted when GE Plastics’ Resin 2 plant (Ottawa, Ill.) completed its review under the OSHA (U.S. Occupational Safety and Health Administration) 1910.119 Process Safety Management (PSM) regulation. Reactor start-up times went from 40 min. to 20 min., and the time required to achieve on-spec product went from 8 hrs. to 2 hrs.

Being challenged to implement GE’s Six Sigma quality-improvement process at the same time the OSHA-mandated, five-year PSM review came due was at first viewed as a major resource drain, until GE personnel got the idea of applying the Six Sigma process to meet the OSHA PSM mandate.

Team meetings determined three elements were required to achieve success: good team chemistry, agreed-upon team challenge, and use of the right tools.

To address team chemistry, Resin 2 personnel assembled representatives from technology, operations, engineering, and safety. Four core team members were selected based on their knowledge of the process under study. Additional resources, including drafting, maintenance, and operations, were used as needed. Three of the core team members had recently completed Six Sigma green belt training and a master black belt was assigned to provide overall project guidance.

To be successful, the team challenged themselves to avoid finger-pointing and to focus their investigations on process safety and improvement opportunities.

Following investigation, Fault Tree Analysis (FTA) was chosen as the modeling tool for the processes used to control production quality. The team felt that because FTAs are constructed from the top down using failure logic, being able to identify root causes would lead to quality and production improvements. To help develop and analyze the FTAs, GE contracted with Triconix (LeMarque, Tex.) to facilitate the process and provide software tools to document and analyze findings.

Before starting, the team gathered piping and instrumentation diagrams (P&IDs), operating procedures, reaction kinetics and thermodynamics reports, maintenance records, and instrumentation and equipment failure-rate data.

Starting with a list of all reactants and additives, tables were constructed listing each ingredient’s purpose, product quality issues, and processing steps. This table information, combined with the P&IDs and equipment failure-rate data, provided everything needed to construct the fault trees.

Using FTA tools, each fault tree was analyzed, with quantified results as the final output. The core team compared analysis results with actual plant experience. Where discrepancies were discovered, adjustments were made and a new analysis was conducted. This process was repeated until the team was comfortable that the FTA results accurately represented operational experience.
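
As a minimal sketch of how quantified fault-tree results are obtained, assuming independent basic events (the failure probabilities below are invented for illustration and are not GE’s data), an OR gate combines inputs as one minus the product of the survival probabilities, while an AND gate multiplies the failure probabilities.

#include <iostream>
#include <vector>

// AND gate: all inputs must fail for the gate output to fail.
double andGate(const std::vector<double>& p) {
    double result = 1.0;
    for (double pi : p) result *= pi;
    return result;
}

// OR gate: the output fails if at least one input fails.
double orGate(const std::vector<double>& p) {
    double noneFail = 1.0;
    for (double pi : p) noneFail *= (1.0 - pi);
    return 1.0 - noneFail;
}

int main() {
    // Hypothetical top event: off-spec startup occurs if the feed valve sticks
    // OR (the temperature sensor drifts AND the operator misses the alarm).
    double valveSticks  = 0.002;
    double sensorDrifts = 0.01;
    double alarmMissed  = 0.05;

    double top = orGate({valveSticks, andGate({sensorDrifts, alarmMissed})});
    std::cout << "Top-event probability per startup: " << top << '\n';
    return 0;
}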

Armed with accurately quantified information, the team proceeded to develop and prioritize recommendations to reduce risk and improve quality and throughput during process startup. Though some recommendations cost money to complete, management was so impressed with the methodology, results, and confidence of the team that funding was approved.

What the team learned, and later applied to another part of the plant, was that dissecting the process and the control system bit-by-bit enabled them to identify areas of improvement. These improvements likely would have been overlooked using a less rigorous risk management method.

For more information about Triconix, visit www.controleng.com/freeinfo.

To read an expanded version of how GE Plastics used Six Sigma and Fault Tree Analysis to improve the process and meet OSHA compliance, visit

For more information on the Six Sigma quality improvement process, see CE, Jan 99, p62 and CE, Mar 99, p87.