Big Data privacy for machine learning made cheaper

Rice University computer scientists have discovered an inexpensive way for tech companies to implement a rigorous form of personal data privacy when using or sharing large databases for machine learning.

By Jade Boyd, November 28, 2021
Rice University computer scientist Anshumali Shrivastava (left) and graduate student Ben Coleman discovered an inexpensive way to implement rigorous personal data privacy when using or sharing large databases for machine learning. Courtesy: Jeff Fitlow, Rice University

Rice University computer scientists have discovered an inexpensive way for tech companies to implement a rigorous form of personal data privacy when using or sharing large databases for machine learning (ML).

“There are many cases where machine learning could benefit society if data privacy could be ensured,” said Anshumali Shrivastava, an associate professor of computer science at Rice. “There’s huge potential for improving medical treatments or finding patterns of discrimination, for example, if we could train machine learning systems to search for patterns in large databases of medical or financial records. Today, that’s essentially impossible because data privacy methods do not scale.”

Shrivastava and Rice graduate student Ben Coleman hope to change that with a new method. Using a technique called locality sensitive hashing, Shrivastava and Coleman found they could create a small summary of an enormous database of sensitive records. Dubbed RACE, their method draws its name from these summaries, or “repeated array of count estimators” sketches.
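To make the idea concrete, here is a minimal Python sketch of a repeated array of count estimators. It is an illustration under stated assumptions, not the researchers' implementation: the class name RaceSketch, its parameters (rows, bits, seed) and the choice of signed random projections as the locality sensitive hash are all hypothetical, and the privacy-preserving noise step is left out. Each row hashes a record to one counter and increments it; a query averages the counters it lands on, which roughly tracks how many stored records resemble the query.

```python
import random


class RaceSketch:
    """Toy repeated array of count estimators (hypothetical names throughout)."""

    def __init__(self, dim, rows=50, bits=4, seed=0):
        rng = random.Random(seed)
        self.rows = rows
        # Assumption: signed random projections serve as the LSH family.
        # Each row gets its own `bits` random hyperplanes.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]
            for _ in range(rows)
        ]
        # One array of 2**bits counters per row.
        self.counts = [[0] * (2 ** bits) for _ in range(rows)]

    def _bucket(self, row, x):
        # Concatenate the sign bits of the projections into a bucket index.
        index = 0
        for plane in self.planes[row]:
            dot = sum(p * v for p, v in zip(plane, x))
            index = (index << 1) | (1 if dot >= 0 else 0)
        return index

    def add(self, x):
        # Each record increments exactly one counter in every row.
        for r in range(self.rows):
            self.counts[r][self._bucket(r, x)] += 1

    def query(self, q):
        # Average, over rows, of the counter the query hashes to.
        total = sum(self.counts[r][self._bucket(r, q)] for r in range(self.rows))
        return total / self.rows


sketch = RaceSketch(dim=3)
for record in [(1.0, 0.9, 1.1), (0.9, 1.0, 1.0), (-1.0, -1.0, -0.9)]:
    sketch.add(record)

near = sketch.query((1.0, 1.0, 1.0))   # close to two of the stored records
far = sketch.query((1.0, -1.0, 1.0))   # not close to any stored record
print(near, far)
```

The sketch holds only rows × 2**bits counters however many records are added, which is the sense in which it is a small summary of a large database.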

Coleman said RACE sketches are both safe to make publicly available and useful for algorithms that use kernel sums, one of the basic building blocks of machine learning, and for machine-learning programs that perform common tasks like classification, ranking and regression analysis. He said RACE could allow companies to both reap the benefits of large-scale, distributed machine learning and uphold a rigorous form of data privacy called differential privacy. Differential privacy is based on the idea of adding random noise to obscure individual information.
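The noise-adding idea behind differential privacy can be shown with the classic Laplace mechanism, one standard way to privatize a list of counts. The snippet below is a hedged sketch, not the RACE release procedure: the function name private_counts and its parameters are hypothetical, and the default sensitivity of 1 assumes that adding or removing one person's record changes the counts by at most 1 in total.

```python
import random


def private_counts(counts, epsilon, sensitivity=1.0, seed=None):
    """Return counts with Laplace noise of scale sensitivity / epsilon added."""
    rng = random.Random(seed)
    # Rate of the exponential draws; the Laplace scale is its reciprocal.
    rate = epsilon / sensitivity
    noisy = []
    for c in counts:
        # The difference of two independent exponential draws is a
        # zero-mean Laplace variable with scale 1 / rate.
        noise = rng.expovariate(rate) - rng.expovariate(rate)
        noisy.append(c + noise)
    return noisy


counts = [12, 0, 7]
print(private_counts(counts, epsilon=0.5, seed=1))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon keeps the released counts closer to the true values.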

“There are elegant and powerful techniques to meet differential privacy standards today, but none of them scale,” Coleman said. “The computational overhead and the memory requirements grow exponentially as data becomes more dimensional.”

Data is increasingly high-dimensional, meaning it contains both many observations and many individual features about each observation.
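A quick calculation hints at why dimensionality is the sticking point. As an illustrative assumption, and not necessarily the specific methods Coleman refers to, suppose a summary splits each feature into a fixed number of bins and keeps one counter per grid cell. The hypothetical helper grid_cells below shows how the number of counters grows exponentially with the number of features.

```python
def grid_cells(bins_per_feature, num_features):
    # One counter per combination of bins across all features.
    return bins_per_feature ** num_features


# Assumed setting: 10 bins per feature, with 2, 10 and 50 features.
for num_features in (2, 10, 50):
    print(num_features, grid_cells(10, num_features))
```

With 10 bins per feature, two features need 100 counters, while 10 features already need 10 billion.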

RACE sketching scales for high-dimensional data, Coleman said. The sketches are small, and the computational and memory requirements for constructing them are also easy to distribute.

“Engineers today must either sacrifice their budget or the privacy of their users if they wish to use kernel sums,” Shrivastava said. “RACE changes the economics of releasing high-dimensional information with differential privacy. It’s simple, fast and 100 times less expensive to run than existing methods.”

– Edited by Chris Vavra, web content manager, Control Engineering, CFE Media and Technology, cvavra@cfemedia.com.


Author Bio: Jade Boyd, Rice University.