Four ways spreadsheets limit data analytics

Tools needed for data cleansing, visualization, contextualization, and modeling.

Process industry firms have collected manufacturing data for decades. With each step-change advance in hardware and software, organizations generate and collect more data, characterizing process conditions, supply-chain metrics, and other production aspects.

Nevertheless, companies struggle to convert the data they collect into useful information and insights that improve the reliability, safety, and profitability of process units, plants, and businesses. As data volumes grow, these challenges intensify.

An industrial revolution, driven by the Industrial Internet of Things (IIoT), is unfolding based on advanced computerization, sensor proliferation, and wireless technologies. These dramatically expand the types and volumes of data to store and analyze, and they require a better analytics approach.

Historically, process manufacturers have used spreadsheets to organize data collected in tabular form. Originally meant for accounting and finance, spreadsheets were never a great match for large volumes of time-series data. They did, however, allow software-enabled formula building, as well as calculations across multiple sheets.

Engineers therefore adopted spreadsheets for data analytics projects, resulting in labor- and time-intensive processes. In addition, sharing results and collaborating with others using spreadsheets was difficult. As companies amassed more data, they struggled to find efficient ways to share data-driven insights within the organization.

Advanced analytics software is the means to overcome these challenges and barriers. To better understand how, let's look at four spreadsheet limitations, as well as how each is addressed by analytics solutions.

Volumes of data

Process manufacturing and monitoring systems produce massive amounts of data that collectively characterize process conditions, operation and product flows, and equipment condition. Data related to control systems is generated in varying forms. The general approach is to assemble all data related to an investigation into a spreadsheet and then do the analysis. The sheer volume of data collected from multiple sources quickly erodes the ability to conduct effective analyses.

Before doing analytics, data must be sorted and cleansed, and the number of data points in the spreadsheet reduced. Instrumentation signals are reformatted to fit the spreadsheet column/row paradigm, as illustrated in Figure 1. The stated limit for a Microsoft Excel worksheet is about one million rows. A common process-system sensor sampling frequency is once per minute, which equates to more than half a million rows per year for a single signal. If the sampling frequency is once every 30 seconds, or if the user wants to review two years of data, the row limit is exceeded and it becomes impossible to look at all the data in proper resolution.
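The arithmetic is easy to check. The short Python sketch below is a back-of-the-envelope illustration, not part of any particular product: it counts the rows one sensor signal generates in a year at common sampling intervals and compares them with Excel's worksheet limit.

```python
# Back-of-the-envelope check: rows generated per year by one sensor signal
# at common sampling intervals, compared with Excel's worksheet row limit.

EXCEL_ROW_LIMIT = 1_048_576            # rows per Excel worksheet
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

for interval_s in (60, 30, 10, 1):     # sampling interval in seconds
    rows_per_year = SECONDS_PER_YEAR // interval_s
    verdict = "fits" if rows_per_year <= EXCEL_ROW_LIMIT else "exceeds the limit"
    print(f"{interval_s:>2}-second sampling: {rows_per_year:>10,} rows/year ({verdict})")
```

One-minute sampling already produces more than 525,000 rows per year for a single signal; at 30 seconds, one year of one signal exceeds the limit before any second signal or calculation column is added.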

In addition, files that stretch the limits of spreadsheet capacity will experience performance issues. Layering in multiple sets of data and calculations, having numerous large files open at once, and linking to other applications and macros hinder spreadsheet usability. All these capabilities are commonly required by an engineer’s or scientist’s process data workflow. With spreadsheets, users must make concessions on the type and sampling of data segments. 

Data isolation

While related to volume limits, data isolation is a separate issue. For example, each time a team member accesses process data, they first download it into a separate, duplicative file. This is a one-time snapshot extraction: if the data changes or updates, the query must be redone, with ramifications for subsequent calculations, cleansing, and insights. Large files are difficult to share across an organization and keep in sync, especially when multiple users are viewing the same data sets and sources.

Given IIoT and the cloud, creation of more and larger databases is a continuing trend. In addition, not all of the data, databases, and users are in one location. Remote databases and users further complicate the task of getting the proper data to users.

Dense data manipulation

Once the relevant data is assembled in a spreadsheet, how do users find data-driven insights? Engineers are most interested in how data behaves over time and in relation to other system elements. For example, temperature, pressure, feedstock quality, and conversion rate all vary over time and are related to one another through the process.

As in any analysis, the user must first identify the process points of high interest, such as optimum steady-state conditions, critical-equipment vibration trends, shutdowns, emissions events, and other parameters. Time is a factor for each. Engineers analyze data aggregated across shifts, weeks, months, or years to identify trends and root causes.

To do this in a spreadsheet, users sort columns and rows to identify the data points to consider. This sorting and cleansing is done systematically with spreadsheet functions, but 70% of the 10 most-used functions Microsoft lists for Excel are for data wrangling, not data analytics, which is where value is delivered.

Data manipulation consumes 50% to 90% of the time spent developing spreadsheet applications, as illustrated in Figure 2. Spreadsheet algorithms can sort and slice data, but the manipulation and calculation approaches are not transparent, and they can be difficult to remember and share with colleagues.

For instance, for a monthly unit report or a quarterly emissions assessment, the data must be re-queried, and any manual steps must be reproduced or automated with macros. If the analysis is done infrequently, or by a different person, it can take significant time to learn or re-learn the spreadsheet's data machinations. Some teams keep separate documentation to describe workflows, but the lack of transparency in macros hinders reproduction of any analysis.
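To make the wrangling-versus-analytics split concrete, here is a minimal sketch, assuming a hypothetical historian export with timestamp, value, and quality columns. It shows the cleansing and down-sampling steps a script captures explicitly, where a spreadsheet buries them in manual sorts, formulas, and macros.

```python
# Illustrative only: typical wrangling steps captured as a repeatable script.
# The file name and column names are hypothetical.
import pandas as pd

# Load one signal's history exported from a historian.
df = pd.read_csv("reactor_temp_export.csv", parse_dates=["timestamp"])
df = df.set_index("timestamp").sort_index()

# Wrangling: the 50% to 90% of the effort described above.
df = df[df["quality"] == "GOOD"]            # drop bad-quality samples
df = df.dropna(subset=["value"])            # drop missing values
hourly = df["value"].resample("1h").mean()  # reduce resolution to hourly means

# Only now does the analysis that delivers value begin.
print(hourly.describe())
```

Because every step is written down, the same report can be rerun by anyone on the team without re-learning a chain of manual spreadsheet operations.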

Limited collaboration, reporting

Once large data sets are sorted and sifted, how are data-driven insights shared and distributed? Spreadsheets offer limited capabilities for extracting and sharing the information in an analysis, and calculation opacity makes collaboration and reproduction of results difficult. Also, for size and simplicity reasons, what is typically shared is a picture of the analysis rather than the spreadsheet itself.

Further, due to data isolation and dense-manipulation limitations, work done in spreadsheets must be centrally accessible and rigorously maintained. This becomes difficult once results are transferred into another form for distribution. Reporting and sharing often consist of copy-and-paste efforts, or an insert/link/query/re-query (and file crash) workflow.

Advanced analytics software addresses these four spreadsheet limitations, and others, to deliver quicker insights, as shown in the following example.

Energy for cereal processing

The energy management team at a cereal manufacturing facility was tasked with finding ways to reduce energy consumption. The batch-cook step in the cereal process consumed significant volumes of super-heated water, so it was identified as a leading area for improvement.

The project required collaboration between Matt, the process operations engineer, and Lauren, the energy management project leader. Both the process operations and energy management teams were challenged to find energy-saving measures.

The batch-cook kettles consumed significant volumes of hot water to maintain proper temperatures. To optimize energy consumption, a new hot water addition manifold with software to control freshwater additions was proposed. With tighter temperature control strategies, the kettles could reduce the fluid drained from the previous batch, minimizing hot-water additions while maintaining required cook temperatures. The procedure would conserve energy by reducing the volume of new super-heated water additions.

To develop a data-driven solution to this problem, Matt exported the batch-cook data from the process historian and the execution system into individual spreadsheets. A new control system had been installed, so Matt abandoned the old system's data and started over with a new analysis. The volume of data limited export capabilities and required narrowing the analysis from a year to the most recent quarter. In addition, the time-series data from the historian and the execution system did not line up, so the data from the two systems had to be time-synchronized manually. Using a brute-force effort, Matt created a condensed, low-resolution data view from a few batches and extrapolated the results for a year.
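That manual time-synchronization step is the kind of work a few lines of code can make repeatable. The sketch below is a hypothetical illustration, not the method used in the project, and the file, tag, and column names are invented: it aligns each historian sample with the most recent batch-execution event using pandas.

```python
# Hypothetical sketch: aligning time-series data from two systems whose
# timestamps do not match sample for sample.
import pandas as pd

historian = pd.read_csv("kettle_temperatures.csv", parse_dates=["timestamp"])
batch_events = pd.read_csv("batch_execution_events.csv", parse_dates=["timestamp"])

# merge_asof requires both frames to be sorted by the key.
historian = historian.sort_values("timestamp")
batch_events = batch_events.sort_values("timestamp")

# Attach each temperature sample to the most recent batch event within
# five minutes, instead of lining rows up by hand in a spreadsheet.
aligned = pd.merge_asof(
    historian,
    batch_events,
    on="timestamp",
    direction="backward",
    tolerance=pd.Timedelta("5min"),
)
print(aligned.head())
```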

Despite Matt's efforts to mine the available batch-cook data, the results were mixed. The new spreadsheets were still too large to share easily with the energy management team, and due to their size and complexity, they periodically crashed after calculations and charts were added. Matt and Lauren invested significant time sifting and sorting through the data manipulations to find practical results, as illustrated in Figure 3.

A quick review showed Matt spent most of his time cleaning, sorting, sifting, copying, and pasting data of interest into spreadsheets. He spent little time on the actual analysis of the proposed operational changes, although that task was still quite onerous. During the project, Matt was promoted to a new position, and a new engineer took over the task. Differences in data preparation methods between the two process engineers introduced discrepancies in the results.

This case history illustrates the challenges faced when analyzing process data using spreadsheets. Despite all their efforts, the process operations group and the energy management team were continuously redoing the same steps to wrangle data. The problem was not a lack of data, and it rarely is in any industrial facility. Rather, the data was not time-synchronized and was spread across various locations and formats. Using spreadsheets to share data with different groups also proved difficult, and effective visualization and reporting were nearly impossible.

Applying advanced analytics

Advanced analytics software removes roadblocks found when interacting with large data sets. In the case history example, the teams were interested only in finding data concerning super-heated water fills for the batch kettles over an extended period. The data of interest was in different locations and varying formats, and outlier data, such as that gathered during downtime for cleaning and maintenance, had to be removed from the data sets.
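Removing those outlier periods is straightforward when data is filtered by its operating context. As a rough illustration only, with a hypothetical status tag and flow threshold, the script below keeps just the samples recorded while a kettle was actually running.

```python
# Hypothetical sketch: excluding data gathered during cleaning and
# maintenance downtime before analyzing hot-water fills.
import pandas as pd

data = pd.read_csv("kettle_fill_data.csv", parse_dates=["timestamp"])

# Keep only samples recorded while the kettle was running, using a
# unit-status tag and a minimum fill-flow threshold as the filter.
running = (data["unit_status"] == "RUNNING") & (data["fill_flow_kg_min"] > 1.0)
clean = data[running]

print(f"Kept {len(clean)} of {len(data)} rows after removing downtime periods.")
```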

Spreadsheets required brute-force efforts to remove or exclude nonessential information. Too much time was spent copying and pasting data to other spreadsheets for continuing calculations and visualizations.

On the other hand, advanced data analytics software accesses data where it resides. Copying and pasting isn't required because the software combines high-resolution data from multiple sources (Figure 4). Simple query commands facilitate data targeting and support streamlined sorting, cleansing, and assembly of only the data needed from historians and other sources.
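The difference is between exporting everything and querying only what is needed. The sketch below is a generic illustration of that idea, not Seeq's interface; it assumes a hypothetical SQL-accessible store of historian data, with invented table, tag, and column names, and pulls a single tag for a single quarter.

```python
# Generic illustration (not Seeq's interface): querying only the tag and
# time window of interest instead of exporting entire histories.
import sqlite3
import pandas as pd

conn = sqlite3.connect("historian_mirror.db")  # hypothetical SQL-accessible store

query = """
    SELECT timestamp, value
    FROM sensor_readings
    WHERE tag = ? AND timestamp BETWEEN ? AND ?
    ORDER BY timestamp
"""
params = ("KETTLE_3.HOT_WATER_FLOW", "2019-01-01", "2019-03-31")
fills = pd.read_sql_query(query, conn, params=params, parse_dates=["timestamp"])

print(f"Retrieved {len(fills)} rows for one tag and one quarter, nothing more.")
```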

Seeq's advanced analytics software also supports basic calculus and other math functions engineers use to convert data into visual summaries of trends and other relevant views. Data of interest is easily assembled and contextualized for future modeling, and the training required for new users is minimal.

Documentation of tasks and results is incorporated within the work product, providing clarity. Equally important, data and results can be monitored and published without repeatability issues. Data visualization can be accomplished by exporting and publishing results in Excel or PowerPoint.

Final words

Advanced analytics software tools accelerate the process of data cleansing, visualization, search, contextualization, and modeling. Using such tools, engineers focus on knowledge capture, collaboration, and problem-solving—instead of data wrangling grunt work.

The right software tools enable users to focus on information and knowledge for more insightful decision making. These tools have widespread compatibility with databases and historians from multiple vendors. Advanced analytics software overcomes barriers and supports accessibility of process plant data to drive better decision-making.

Michael Risse is vice president and chief marketing officer at Seeq Corp., a company building advanced analytics applications for engineers and analysts to accelerate insights from process manufacturing data. He was formerly a consultant in big data applications and platforms, and prior to that worked at Microsoft Corp. for 20 years. Risse is a graduate of the University of Wisconsin at Madison and lives in Seattle, Washington.

This article appears in the IIoT for Engineers supplement for Control Engineering and Plant Engineering.
