Mission-critical should be the mission

For many applications, the cost of failure is too high not to have mission-critical capabilities in supervisory control and data acquisition systems. See three reasons control systems fail and five ways to build resiliency.

By Chris Little February 9, 2021


Learning Objectives

  • Why control systems fail: Architecture, cyber attacks and underestimating data recovery costs.
  • Potential problems can exist below the surface and go unseen for long periods.
  • Five ways to build control system resiliency include redundancy, backups in real time and integrated software platforms.

Explore the conditions that allow gaps to emerge in the most hardened process control systems

With many things in life, we can be on the brink of failure and not even know it. This is never truer than in mission-critical control systems such as supervisory control and data acquisition (SCADA) software. Be aware of how blind spots develop even when smart people actively look for them.

Sinkholes show how potentially disastrous gaps emerge unnoticed. They seem to appear without warning, but they develop over long periods and leave plenty of clues. Water and gas utilities may experience pressure losses. Telecommunications companies may notice intermittent signal losses. City workers may fill pavement cracks. Seldom does anyone unify these data points into predictive information.

Control systems fail for three main reasons.

1. Why control systems fail: Architecture

Single points of failure: This is one of the most common reasons systems go down. This may be one hard drive, server, programmable logic controller (PLC), office location, or network. Trace the path from input/output (I/O) points, to the PLC, to the human-machine interface (HMI), to thin clients and alarm notifications. Identify individual components that can take down a system.
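
The trace described above can be sketched in a few lines of Python. This is a minimal illustration, not a feature of any SCADA product: the component names and the `LINKS` topology are hypothetical, and the check simply asks whether data can still travel from the I/O points to the alarm notifier when one component is removed.

```python
from collections import deque

def has_path(links, source, sink, removed=None):
    """Breadth-first search from source to sink, skipping one removed node."""
    if removed in (source, sink):
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def single_points_of_failure(links, source, sink):
    """List every intermediate component whose loss cuts source off from sink."""
    nodes = set(links) | {n for targets in links.values() for n in targets}
    return sorted(
        n for n in nodes - {source, sink}
        if not has_path(links, source, sink, removed=n)
    )

# Hypothetical topology: one PLC, two network switches, one HMI server.
LINKS = {
    "io_points": ["plc_1"],
    "plc_1": ["switch_a", "switch_b"],
    "switch_a": ["hmi_server"],
    "switch_b": ["hmi_server"],
    "hmi_server": ["thin_client", "alarm_notifier"],
    "thin_client": [],
    "alarm_notifier": [],
}

spofs = single_points_of_failure(LINKS, "io_points", "alarm_notifier")
print(spofs)  # ['hmi_server', 'plc_1']
```

In this made-up layout the duplicated switches are not flagged, but the lone PLC and the lone HMI server are. Repeating the check for each destination (thin clients, historians, alarm call-outs) gives a first-pass inventory of components that can take down a system.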

Limited levels of redundancy: SCADA specifications typically require server failover. The problem is that not all redundancy is equal. Most platforms support only two redundant servers. Worse, most use third-party historians, which require a different methodology for failover and synchronization.

Virtual redundancy: Virtualized servers are an important tool for IT departments to manage systems. Developers can create multiple server instances, each with its own OS, running on one physical computer. The obvious problem is that the physical computer is a single point of failure. Complex virtualized designs also can make it harder to spot points of failure.
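
One way to catch this is to compare each redundant pair against an inventory of which physical machine hosts each virtual server. The sketch below assumes such an inventory exists. The server names, host names and the two dictionaries are hypothetical and would come from the virtualization platform's own records in practice.

```python
def shared_host_groups(vm_hosts, redundancy_groups):
    """Return {group: host} for groups whose VMs all run on one physical host."""
    flagged = {}
    for group, vms in redundancy_groups.items():
        hosts = {vm_hosts[vm] for vm in vms}
        if len(hosts) == 1:
            flagged[group] = next(iter(hosts))
    return flagged

# Hypothetical inventory: virtual server -> physical computer.
VM_HOSTS = {
    "scada-primary": "host-01",
    "scada-backup": "host-01",
    "historian-a": "host-01",
    "historian-b": "host-02",
}

# Hypothetical redundant sets that are supposed to back each other up.
REDUNDANCY_GROUPS = {
    "scada": ["scada-primary", "scada-backup"],
    "historian": ["historian-a", "historian-b"],
}

flagged = shared_host_groups(VM_HOSTS, REDUNDANCY_GROUPS)
print(flagged)  # {'scada': 'host-01'}
```

Here the two SCADA servers look redundant on paper, yet both would be lost if host-01 failed.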

2. Why control systems fail: Cyber attacks

Distributed denial of service (DDoS) – For this common strategy, the attacker floods the target’s network with meaningless requests. One solution is to employ virtual private networks (VPNs) between servers and remote I/O devices and to avoid using public IP addresses. You also can configure a firewall to reject excessive requests and accept requests from whitelisted computers at specific times.
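
The firewall policy described above combines three tests: is the sender whitelisted, is the request inside its permitted hours, and has it sent too many requests recently? The sketch below shows that logic in Python for illustration only. It is not firewall configuration, and the class name, IP address, hours and limits are all assumptions. In a real deployment, these rules would live in the firewall itself.

```python
from collections import defaultdict, deque
from datetime import datetime, time, timedelta

class RequestGate:
    """Accept requests only from whitelisted IPs, in their hours, under a rate cap."""

    def __init__(self, whitelist, max_requests, window_seconds):
        self.whitelist = whitelist                 # {ip: (start_time, end_time)}
        self.max_requests = max_requests
        self.window = timedelta(seconds=window_seconds)
        self._recent = defaultdict(deque)          # ip -> times of accepted requests

    def allow(self, ip, now):
        hours = self.whitelist.get(ip)
        if hours is None:
            return False                           # unknown sender
        start, end = hours
        if not (start <= now.time() <= end):
            return False                           # outside permitted hours
        recent = self._recent[ip]
        cutoff = now - self.window
        while recent and recent[0] <= cutoff:
            recent.popleft()                       # forget requests outside the window
        if len(recent) >= self.max_requests:
            return False                           # excessive requests
        recent.append(now)
        return True

# Hypothetical policy: one remote site, 6 a.m. to 6 p.m., 3 requests per 10 seconds.
gate = RequestGate({"10.0.0.5": (time(6, 0), time(18, 0))},
                   max_requests=3, window_seconds=10)
t0 = datetime(2021, 2, 9, 9, 0, 0)
results = [gate.allow("10.0.0.5", t0 + timedelta(seconds=i)) for i in range(5)]
print(results)  # [True, True, True, False, False]
```

Requests from any address not on the list are rejected before any per-sender state is stored, so a flood from unknown sources does not consume memory in this sketch.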

Ransomware – By tricking users into opening email links or inserting infected USB drives (beware of that nice camera found in the parking lot), attackers encrypt a company’s data and then sell the company the decryption key. Avoiding ransomware requires training and vigilance. There is always a “first time” for new exploits.

Assume bad guys already are in – Develop ways to limit the damage intruders can do once they are past security. Recently, a company that provides IT solutions to U.S. businesses and governmental organizations discovered its software had been compromised. Hackers had the ability to access the networks of more than 18,000 customers for weeks before being discovered.

3. Why control systems fail: Underestimated cost of data recovery

What happens if something does fail? In addition to the loss of real-time monitoring and control, what is the cost of recovering lost data? This can be hundreds of times more than the cost of the system itself.

  • Manual syncing of data: Assuming that there are backups to work from, it is often a long and cumbersome process to manually synchronize secondary computers or backed-up databases.
  • Data loss: Data may be permanently lost, resulting in inaccurate reporting. This can have a knock-on effect: if these reports are assumed to be correct, they can cause operational inefficiencies for years.
  • Complexity of procedures: Restoration process complexity can lead to errors in the re-inputting of data.

Five ways to build control system resiliency

  1. System-wide redundancy: Many software platforms are limited to a primary and a backup server. Some products provide unlimited levels of redundancy. Ensure that there is robust failover for all components, such as alarm notifications (email, SMS text message, voice-to-speech call out), thin clients and networks. Also, if a redundant network is part of the design, is the backup alarmed for failure?
  2. Real-time system backup and bi-directional synchronization: Traditionally, SCADA systems are backed up offline or online. The former involves shutting down the system, leaving operators blind and unable to manage alarms. The latter can corrupt data during the process. Few platforms automatically sync historical data after failover. Often a separate backup methodology is required for third-party historians. Automating backups may require custom scripting. Manual backups are easily forgotten. Systems that support bi-directional synchronization provide real-time synchronization of all the services that make up SCADA systems. In addition to the historian, this includes events, alarms, security, and application settings. This means each SCADA server can be an up-to-the-second copy of the entire application without missed backups.
  3. Integrated software platforms: The sinkhole example shows how gaps emerge over time when disparate pieces are cobbled together. Many platforms use third-party products for core components, such as historians, alarm notifications, thin clients and scripting. An integrated platform should ensure that everything works together across new software versions and eliminate the risk that components are altered or discontinued by manufacturers. A unified approach requires one install, license agreement, training track and support contract.
  4. Application version control: Many system failures result from malicious acts by disgruntled workers or the unexpected consequences of innocent configuration changes. When things go wrong, it is vital to identify who did what and immediately roll back to the last known working version. While some SCADA providers support third-party version control, there are benefits to this being a native component, such as the ability to automatically distribute the encrypted change list across all servers.
  5. Fast response to vulnerabilities from the vendor: Software platforms regularly release new versions and features and often connect to devices developed long after applications are deployed. This makes it all but certain that security gaps will appear over time. The Industrial Control Systems Cyber Emergency Response Team (ICS-CERT) regularly conducts vulnerability analysis on products used in critical infrastructure. When ICS-CERT identifies a potential security exploit, it contacts the vendor, which then has time to patch the vulnerability and distribute the solution before the vulnerability (and hopefully the fix) is made public.
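
To make the bi-directional synchronization in item 2 concrete, the sketch below merges the historian records of two servers after a failover so that each ends up holding every record the other logged. It is a simplified illustration under stated assumptions, not any vendor's method: records are keyed by a hypothetical (tag, timestamp) pair, the tag name and values are invented, and a key present on both sides is left untouched rather than reconciled.

```python
def synchronize(server_a, server_b):
    """Copy records missing on either side to the other.

    Returns (records copied to A, records copied to B).
    """
    missing_on_b = {k: v for k, v in server_a.items() if k not in server_b}
    missing_on_a = {k: v for k, v in server_b.items() if k not in server_a}
    server_b.update(missing_on_b)
    server_a.update(missing_on_a)
    return len(missing_on_a), len(missing_on_b)

# Hypothetical history: the primary logged up to t=101, then failed.
primary = {
    ("pump1.flow", 100): 12.5,
    ("pump1.flow", 101): 12.7,
}
# The backup missed t=101 but kept logging during the outage.
backup = {
    ("pump1.flow", 100): 12.5,
    ("pump1.flow", 102): 13.1,
    ("pump1.flow", 103): 13.0,
}

copied_to_primary, copied_to_backup = synchronize(primary, backup)
print(copied_to_primary, copied_to_backup)  # 2 1
print(primary == backup)                    # True
```

A one-way restore from the backup would have lost the reading at t=101. Copying in both directions leaves neither server with a hole in its history.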

Chris Little is in media relations at Trihedral. Edited by Mark T. Hoske, content manager, Control Engineering, CFE Media and Technology, mhoske@cfemedia.com.

KEYWORDS: Automation implementation advice, SCADA, control system resiliency


Have you looked at control system resiliency?

Online extra 

VTScadaLIGHT is free and can be used for applications with up to 50 I/O on up to 10 PCs, with free video training. https://www.vtscada.com/light 
