Google Has Your Data, But How Do They Use It?
By David Magerman, PhD
TL;DR - Our data is everywhere. We can’t stop companies from having it. But we have a right to know how it is used. By encrypting all human behavioral data and creating an audit trail when it is decrypted and used, we can achieve enforceable data privacy policies.
[Note: During the drafting of this blog post, California real estate developer Alistair Mactaggart, one of the authors of California’s CCPA law, succeeded in getting a new privacy law on the ballot for the coming November elections (described here). The proposed measure would require the enforcement of “purpose limitation,” effectively requiring companies to allow users to limit the ways in which their behavioral data is used in machine learning algorithms. The law would require implementation of some variant of what is proposed here].
When it comes to data privacy, those of us who have been tilting at the windmill of protecting people from the abuse of their data have been getting it all wrong. We have been overly focused on the ubiquity of human behavioral data and on people’s right to be anonymized and forgotten, when we should be concentrating our efforts on clarifying and constraining how the data is used, shining a light on the machine learning algorithms that weaponize it.
Digital human behavioral data is everywhere. Our browsers track our movements around the internet. Retailers, credit card companies, and banks collect our spending behavior. Cell phone carriers and app developers track our physical location using the GPS on our phones. Alexa and Siri hear virtually all of our conversations, in our homes and whenever we are near our phones, tablets, or computers. Social media companies track every aspect of our lives, via our posts and the posts in which we are tagged. Cameras everywhere use facial recognition to find us out in the world. Our DNA is collected by various parties, whether because we give it to 23AndMe, because we have to provide it to an employer, or because someone collects it without our knowledge. The list goes on and on.
The technology exists to capture nearly every second of our lives no matter where we are. (Do you bring your cell phone into the bathroom?) Any desire we might have to prevent people from collecting, storing, and aggregating our human behavioral data is a hopeless cause. By deploying devices that can track our every move everywhere in society, including in our own ears (earbuds now ship with Alexa built in!), we have opened Pandora’s box of data promiscuity, and we will never be able to close it again.
Two years ago, the European Union’s General Data Protection Regulation (GDPR) went into effect. California followed earlier this year with its California Consumer Privacy Act (CCPA). Other jurisdictions have added their own attempts to regulate human behavioral data. One of the main focuses of these regulations is the “right to be forgotten”. Another is the anonymization of personally identifiable information (PII). These are all attempts to regulate the availability of human behavioral data, to limit companies’ ability to use it against the wishes of its presumed owners, the human beings who generated it. I have been a proponent of these efforts, and I still support further development of these regulations. However, I think all of these regulatory efforts miss a key point: enforcement.
Putting it bluntly, all of these regulatory efforts are effectively operating on the honor system when it comes to enforcement. We are trusting that companies are abiding by data collection rules. We are assuming they are deleting all copies of data they aren’t allowed to keep (and not dropping secret backups in data vaults hidden in the Rocky Mountains). We trust that they are anonymizing data before they use it, and anonymizing it in ways that truly hide the identities of the humans behind the data, rather than applying some transformation that allows the valuable parts of a person’s identity to be easily recovered by algorithms. And while there are ways of auditing compliance with these regulations, it is extraordinarily hard, if not impossible, to prove compliance, and violations are easily attributed to human error, software bugs, or other incidental mishaps. For all of the fanfare around their announcement and implementation, GDPR, CCPA, and their relatives are largely nuisances that can be easily worked around.
As I embraced this demoralizing view and started to confront the reality that the current approaches to data privacy aren’t going to work, I had an epiphany that has transformed my view of how to approach this problem going forward. Ultimately, the data isn’t the problem. The data exists. Frankly, the data exists whether we collect it or not. We have DNA. We have spending histories. We have locations. There are images of us everywhere. We exist, and our data is just the residue of our existence.
The problem is how we USE that data: the algorithms. Machine learning has become so powerful and accurate that it can use the data we feed it to model us in ways that can harm us. Facial recognition algorithms can build models that identify us when we don’t want to be found, to people we don’t want finding us. Marketing analytics tools can build models that predict how companies can convince us to buy things we don’t want or to pay more for things than we should. Social media companies deploy machine learning to decide what information to show us to keep us glued to their platforms. Machine learning algorithms can take data from a relatively small sample of people and extrapolate from it to predict the behavior of whole swaths of society.
It isn’t the data itself that is the problem. It’s the way the data is used that can harm us. And that observation points toward a way to protect ourselves from the misuse of human behavioral data, by good and bad actors alike. The answer is counterintuitive. Right now, there is a movement to anonymize data, to disconnect it from its source. I think the beginning of the solution is the opposite: attach real identity to every piece of data that is collected anywhere, on any computer system. Then force everyone to ask permission to use it, based on how they plan to use it. Here is what I mean.
Right now, everyone is being asked, by various regulations, to identify human behavioral data and to strip it of identity, to anonymize it. But let’s say we did the opposite. We require everyone who holds even a scrap of human behavioral data to attach the identity of the person who could be identified by that data. Then we encrypt that data with a key associated with that person. Everyone has a key. And every time an algorithm needs to use that key to decrypt the information, it has to get permission from that human being first.
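To make this concrete, here is a minimal sketch in Python of what identity-tagged, permission-gated data might look like. The EncryptedRecord structure, the PermissionService interface, the purpose strings, and the store/use functions are my own illustrative names, not an existing standard, and Fernet stands in for whatever per-person key scheme a real system would use.

```python
# A minimal sketch of identity-tagged, permission-gated decryption.
# EncryptedRecord, PermissionService, store(), and use() are hypothetical
# names chosen for illustration; Fernet is a stand-in for a real
# per-person key management scheme.
from dataclasses import dataclass
from cryptography.fernet import Fernet  # pip install cryptography


@dataclass
class EncryptedRecord:
    subject_id: str      # the person identifiable from this data
    ciphertext: bytes    # behavioral data encrypted with that person's key


class PermissionService:
    """Hypothetical consent broker; a real one would consult the data
    subject's stated preferences (see the responder sketch below)."""
    def ask(self, subject_id: str, purpose: str) -> bool:
        raise NotImplementedError


def store(subject_id: str, key: bytes, raw: bytes) -> EncryptedRecord:
    """Attach identity to the data and encrypt it under that person's key."""
    return EncryptedRecord(subject_id, Fernet(key).encrypt(raw))


def use(record: EncryptedRecord, key: bytes, purpose: str,
        permissions: PermissionService) -> bytes:
    """Decrypt only after the data subject has approved this purpose."""
    if not permissions.ask(record.subject_id, purpose):
        raise PermissionError(f"{record.subject_id} denied use for: {purpose}")
    return Fernet(key).decrypt(record.ciphertext)
```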
[Note: In order to be impactful, this permissioning needs to extend to derivative uses of data as well. If a person’s data is decrypted and used in an algorithm that produces summary statistics, and those summary statistics are later used in another machine learning algorithm, that use needs to be permissioned as well. The permissioning rules need to tag along with all derivative data sets, and the auditing needs to include those uses; otherwise, companies could easily work around these safeguards the way they do now with weak attempts at anonymizing data].
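One way permissions could “tag along” is for every derived data set to carry the identities and allowed purposes of its sources, so any later use can be re-checked against the most restrictive source. The structure and field names below are assumptions for illustration, not an existing standard.

```python
# A sketch of permissioning rules tagging along with derivative data.
# PermissionedData and its fields are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class PermissionedData:
    values: list                                         # raw records or summary stats
    source_subjects: set = field(default_factory=set)    # whose data is inside
    allowed_purposes: set = field(default_factory=set)   # purposes all sources permit


def derive(summaries: list, sources: list) -> PermissionedData:
    """Summary statistics inherit the constraints of everything they summarize."""
    subjects = set().union(*(s.source_subjects for s in sources))
    # A derivative may only be used for purposes that every source permits.
    purposes = set.intersection(*(s.allowed_purposes for s in sources))
    return PermissionedData(summaries, subjects, purposes)
```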
This isn’t as arduous or infeasible as it might sound. People could give blanket permission for certain kinds of uses of their data. All the company needs to do is ask. Companies could build automated or semi-automated tools to respond to these requests based on an individual’s preferences. Companies are already building such tools to respond to CCPA and GDPR requests, and they are doing it quite effectively. The key feature of this solution is that the “user” of the data is an algorithm: the company deploying each algorithm that uses the data would have to describe the algorithm’s purpose and get permission from the human represented by the data before using it.
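As a sketch of such tooling, a semi-automated responder could apply a person’s blanket rules per purpose and escalate anything unmatched for a manual decision. The class name and purpose labels here are illustrative assumptions; something like this could play the role of the PermissionService in the earlier sketch.

```python
# A sketch of a semi-automated consent responder: blanket rules answer
# routine requests, and unmatched requests are escalated and denied by
# default. Purpose labels and the class name are illustrative assumptions.
class PreferenceResponder:
    def __init__(self, blanket_rules: dict):
        self.blanket_rules = blanket_rules   # purpose -> True (allow) / False (deny)
        self.pending = []                    # requests awaiting a human decision

    def ask(self, subject_id: str, purpose: str) -> bool:
        if purpose in self.blanket_rules:
            return self.blanket_rules[purpose]
        self.pending.append((subject_id, purpose))  # escalate; default to deny
        return False


# Example: allow medical research, deny insurance pricing, review the rest.
responder = PreferenceResponder({"medical-research": True,
                                 "insurance-pricing": False})
```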
You might be happy to have your DNA used for medical research. You might not be as happy if it is used to investigate crimes or to build models for pricing health insurance. You might be willing to have your spending data used by banks that want to build risk models for making loans, if they compensate you for providing the information. You might want retailers to use your shopping data to help you find products you might want to buy. Or you might not.
Today, if you sell your data or simply let people use it, you can’t control HOW it is used. Regaining that control is the key to protecting ourselves from the abuse of human behavioral data in machine learning algorithms. It’s the algorithms that use the data that matter, and having the ability to control which algorithms are allowed to use our data, and what goals those algorithms achieve for the companies that deploy them, is the key to defusing the dangers of the ubiquity of human behavioral data.
And there’s one more significant advantage to the decryption-by-use approach to protecting human behavioral data: accountability. By forcing everyone to ask for permission every time they use data, you create an audit trail that systematically catalogs every use of every piece of data by every algorithm. In real time, the audit trail is as boring as a box of accounting receipts. But if a suspected bad actor is ever accused of misusing human behavioral data, the audit trail of how that data was used by algorithms within the company would create a roadmap to accountability. Retrospectively, a company could be forced to justify each use, to explain each algorithm, to prove the validity and permissibility of each use of a person’s data.
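Here is a minimal sketch of what such an audit trail could look like: one append-only, hash-chained entry per decryption, recording whose data was used, for what purpose, and by which algorithm. The field set is an assumption about what an auditor would need, not an existing format.

```python
# A sketch of a decryption audit trail: one append-only entry per use,
# hash-chained so that retrospective tampering is detectable. The fields
# are an assumption about what an auditor would need, not a standard.
import hashlib
import json
from datetime import datetime, timezone


class AuditTrail:
    def __init__(self):
        self.entries = []
        self._last_hash = "0" * 64   # genesis value for the hash chain

    def record(self, subject_id: str, purpose: str, algorithm: str) -> dict:
        entry = {
            "subject": subject_id,
            "purpose": purpose,
            "algorithm": algorithm,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "prev": self._last_hash,
        }
        self._last_hash = hashlib.sha256(
            json.dumps(entry, sort_keys=True).encode()).hexdigest()
        entry["hash"] = self._last_hash
        self.entries.append(entry)
        return entry
```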
If we could get to the point where we had an audit trail for every decryption of every piece of human behavioral data used by machine learning algorithms, then, for the first time, we could create an enforcement system for protecting human behavioral data that has some teeth. The onus would be on the user of the data to prove that their use was valid and permitted. We would no longer be operating on the honor system. Wouldn’t that be a better world to live in?