Taking the human element out of big data analysis

MIT researchers are working on taking the human element out of big data analysis with the Data Science Machine, which is designed to not only search for patterns but design the feature set.

11/19/2015


MIT researchers are working on taking the human element out of big-data analysis with the Data Science Machine, which is designed to not only search for patterns but design the feature set. Image Courtesy: Massachusetts Institute of TechnologyBig data analysis consists of searching for buried patterns that have some kind of predictive power. But choosing which "features" of the data to analyze usually requires some human intuition. In a database containing, for example, the beginning and end dates of various sales promotions and weekly profits, the crucial data may not be the dates themselves but the spans between them, or not the total profits but the averages across those spans.

MIT researchers aim to take the human element out of big data analysis, with a system that not only searches for patterns but designs the feature set, too. To test the first prototype of their system, they enrolled it in three data science competitions, in which it competed against human teams to find predictive patterns in unfamiliar data sets. Of the 906 teams participating in the three competitions, the researchers' "Data Science Machine" finished ahead of 615.

In two of the three competitions, the predictions made by the Data Science Machine were 94% and 96% as accurate as the winning submissions. In the third, the figure was a more modest 87%. But where the teams of humans typically labored over their prediction algorithms for months, the Data Science Machine took somewhere between two and 12 hours to produce each of its entries.

"We view the Data Science Machine as a natural complement to human intelligence," said Max Kanter, whose MIT master's thesis in computer science is the basis of the Data Science Machine. "There's so much data out there to be analyzed. And right now it's just sitting there not doing anything. So maybe we can come up with a solution that will at least get us started on it, at least get us moving."

Between the lines

Kaylan Veeramachaneni, a research scientist at MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), said, "What we observed from our experience solving a number of data science problems for industry is that one of the very critical steps is called feature engineering. The first thing you have to do is identify what variables to extract from the database or compose, and for that, you have to come up with a lot of ideas." Veeramachaneni co-leads the Anyscale Learning for All group at CSAIL, which applies machine-learning techniques to practical problems in big-data analysis, such as determining the power-generation capacity of wind-farm sites or predicting which students are at risk for dropping out of online courses.

In predicting dropout, for instance, two crucial indicators proved to be how long before a deadline a student begins working on a problem set and how much time the student spends on the course website relative to his or her classmates. MIT's online-learning platform MITx doesn't record either of those statistics, but it does collect data from which they can be inferred.

Featured composition

Kanter and Veeramachaneni use a couple of tricks to manufacture candidate features for data analyses. One is to exploit structural relationships inherent in database design. Databases typically store different types of data in different tables, indicating the correlations between them using numerical identifiers. The Data Science Machine tracks these correlations, using them as a cue to feature construction.

For instance, one table might list retail items and their costs; another might list items included in individual customers' purchases. The Data Science Machine would begin by importing costs from the first table into the second. Then, taking its cue from the association of several different items in the second table with the same purchase number, it would execute a suite of operations to generate candidate features: total cost per order, average cost per order, minimum cost per order, and so on. As numerical identifiers proliferate across tables, the Data Science Machine layers operations on top of each other, finding minima of averages, averages of sums, and so on.

It also looks for so-called categorical data, which appear to be restricted to a limited range of values, such as days of the week or brand names. It then generates further feature candidates by dividing up existing features across categories.

Once it's produced an array of candidates, it reduces their number by identifying those whose values seem to be correlated. Then it starts testing its reduced set of features on sample data, recombining them in different ways to optimize the accuracy of the predictions they yield.

Massachusetts Institute of Technology (MIT)

www.mit.edu 

- Edited by Chris Vavra, production editor, Control Engineering, CFE Media, cvavra@cfemedia.com. See more Control Engineering manufacturing IT stories.



Engineers' Choice Awards
The Engineers' Choice Awards highlight some of the best new control, instrumentation and automation products as chosen by Control Engineering subscribers.
System Integrator Giants
The System Integrator Giants program lists the top 100 system integrators among companies listed in CFE Media's Global System Integrator Database.
System Integrator of the Year
Each year, a panel of Control Engineering and Plant Engineering editors and industry expert judges select the System Integrator of the Year Award winners in three categories.
How to Maximize Factory Automation Efficiency with Low Cost Machine Vision
This eGuide illustrates solutions, applications and benefits of machine vision systems.
Wireless Reliability in Harsh Environments
Learn how to increase device reliability in harsh environments and decrease unplanned system downtime.
Human Factors and the Impact on Plant Safety
This eGuide contains a series of articles and videos that considers theoretical and practical; immediate needs and a look into the future.
July 2018
Ladder logic best practices and object-oriented programming, safety instrumented systems, enclosure design issues and challenges, process control advice
June 2018
Discrete and process sensor fundamentals, autotuning controls, system integrator roundtable
May 2018
Salary and Career Survey, IT and OT convergence, robotic standards and safety, secure circuit protection
Edge Computing
This article collection contains several articles on how today's technologies heap benefits onto an edge-computing architecture such as faster computing, better networking, more memory, smarter analytics, cloud-based intelligence, and lower costs.
Data Center Design
Data centers, data closets, edge and cloud computing, co-location facilities, and similar topics are among the fastest-changing in the industry.
PLCs
Programmable logic controllers (PLCs) represent the logic (decision) part of the control loop of sense, decide, and actuate. Featured articles in this digital report compare PLCs and programmable automation controllers (PACs), industrial PCs, and robotic controllers.
SIDB

Find and connect with the most suitable service provider for your unique application. Start searching the Global System Integrator Database Now!

June 2018
Machine learning, produced water benefits, progressive cavity pumps
April 2018
ROVs, rigs, and the real time; wellsite valve manifolds; AI on a chip; analytics use for pipelines
February 2018
Focus on power systems, process safety, electrical and power systems, edge computing in the oil & gas industry
John O. Ayuk, PE, CFSE, PMP, CAP
Automation Engineer; Wood Group
Doug Baker
System Integrator; Cross Integrated Systems Group
Jose S. Vasquez, Jr.
Jose S. Vasquez, Jr.
Fire & Life Safety Engineer; Technip USA Inc.
Data Centers: Impacts of Climate and Cooling Technology
This course focuses on climate analysis, appropriateness of cooling system selection, and combining cooling systems.
Safety First: Arc Flash 101
This course will help identify and reveal electrical hazards and identify the solutions to implementing and maintaining a safe work environment.
Critical Power: Hospital Electrical Systems
This course explains how maintaining power and communication systems through emergency power-generation systems is critical.
Engineers' Choice Awards
The Engineers' Choice Awards highlight some of the best new control, instrumentation and automation products as chosen by Control Engineering subscribers.
System Integrator Giants
The System Integrator Giants program lists the top 100 system integrators among companies listed in CFE Media's Global System Integrator Database.
System Integrator of the Year
Each year, a panel of Control Engineering and Plant Engineering editors and industry expert judges select the System Integrator of the Year Award winners in three categories.
How to Maximize Factory Automation Efficiency with Low Cost Machine Vision
This eGuide illustrates solutions, applications and benefits of machine vision systems.
Wireless Reliability in Harsh Environments
Learn how to increase device reliability in harsh environments and decrease unplanned system downtime.
Human Factors and the Impact on Plant Safety
This eGuide contains a series of articles and videos that considers theoretical and practical; immediate needs and a look into the future.
July 2018
Ladder logic best practices and object-oriented programming, safety instrumented systems, enclosure design issues and challenges, process control advice
June 2018
Discrete and process sensor fundamentals, autotuning controls, system integrator roundtable
May 2018
Salary and Career Survey, IT and OT convergence, robotic standards and safety, secure circuit protection
Edge Computing
This article collection contains several articles on how today's technologies heap benefits onto an edge-computing architecture such as faster computing, better networking, more memory, smarter analytics, cloud-based intelligence, and lower costs.
Data Center Design
Data centers, data closets, edge and cloud computing, co-location facilities, and similar topics are among the fastest-changing in the industry.
PLCs
Programmable logic controllers (PLCs) represent the logic (decision) part of the control loop of sense, decide, and actuate. Featured articles in this digital report compare PLCs and programmable automation controllers (PACs), industrial PCs, and robotic controllers.
SIDB

Find and connect with the most suitable service provider for your unique application. Start searching the Global System Integrator Database Now!

June 2018
Machine learning, produced water benefits, progressive cavity pumps
April 2018
ROVs, rigs, and the real time; wellsite valve manifolds; AI on a chip; analytics use for pipelines
February 2018
Focus on power systems, process safety, electrical and power systems, edge computing in the oil & gas industry
John O. Ayuk, PE, CFSE, PMP, CAP
Automation Engineer; Wood Group
Doug Baker
System Integrator; Cross Integrated Systems Group
Jose S. Vasquez, Jr.
Jose S. Vasquez, Jr.
Fire & Life Safety Engineer; Technip USA Inc.
Data Centers: Impacts of Climate and Cooling Technology
This course focuses on climate analysis, appropriateness of cooling system selection, and combining cooling systems.
Safety First: Arc Flash 101
This course will help identify and reveal electrical hazards and identify the solutions to implementing and maintaining a safe work environment.
Critical Power: Hospital Electrical Systems
This course explains how maintaining power and communication systems through emergency power-generation systems is critical.
Engineers' Choice Awards
The Engineers' Choice Awards highlight some of the best new control, instrumentation and automation products as chosen by Control Engineering subscribers.
System Integrator Giants
The System Integrator Giants program lists the top 100 system integrators among companies listed in CFE Media's Global System Integrator Database.
System Integrator of the Year
Each year, a panel of Control Engineering and Plant Engineering editors and industry expert judges select the System Integrator of the Year Award winners in three categories.
How to Maximize Factory Automation Efficiency with Low Cost Machine Vision
This eGuide illustrates solutions, applications and benefits of machine vision systems.
Wireless Reliability in Harsh Environments
Learn how to increase device reliability in harsh environments and decrease unplanned system downtime.
Human Factors and the Impact on Plant Safety
This eGuide contains a series of articles and videos that considers theoretical and practical; immediate needs and a look into the future.
July 2018
Ladder logic best practices and object-oriented programming, safety instrumented systems, enclosure design issues and challenges, process control advice
June 2018
Discrete and process sensor fundamentals, autotuning controls, system integrator roundtable
May 2018
Salary and Career Survey, IT and OT convergence, robotic standards and safety, secure circuit protection
Edge Computing
This article collection contains several articles on how today's technologies heap benefits onto an edge-computing architecture such as faster computing, better networking, more memory, smarter analytics, cloud-based intelligence, and lower costs.
Data Center Design
Data centers, data closets, edge and cloud computing, co-location facilities, and similar topics are among the fastest-changing in the industry.
PLCs
Programmable logic controllers (PLCs) represent the logic (decision) part of the control loop of sense, decide, and actuate. Featured articles in this digital report compare PLCs and programmable automation controllers (PACs), industrial PCs, and robotic controllers.
SIDB

Find and connect with the most suitable service provider for your unique application. Start searching the Global System Integrator Database Now!

June 2018
Machine learning, produced water benefits, progressive cavity pumps
April 2018
ROVs, rigs, and the real time; wellsite valve manifolds; AI on a chip; analytics use for pipelines
February 2018
Focus on power systems, process safety, electrical and power systems, edge computing in the oil & gas industry
John O. Ayuk, PE, CFSE, PMP, CAP
Automation Engineer; Wood Group
Doug Baker
System Integrator; Cross Integrated Systems Group
Jose S. Vasquez, Jr.
Jose S. Vasquez, Jr.
Fire & Life Safety Engineer; Technip USA Inc.
Data Centers: Impacts of Climate and Cooling Technology
This course focuses on climate analysis, appropriateness of cooling system selection, and combining cooling systems.
Safety First: Arc Flash 101
This course will help identify and reveal electrical hazards and identify the solutions to implementing and maintaining a safe work environment.
Critical Power: Hospital Electrical Systems
This course explains how maintaining power and communication systems through emergency power-generation systems is critical.
click me