Date:
Location:
Kirsten Morehouse, PhD student, Harvard
Topic: Recent discovery of a hidden peril of open science
Description: I recently submitted a paper with Brian Nosek and Benedek Kurdi that may be of broad interest to our social area. It explores an unintended consequence of the open data revolution: re-identification, or the ability to combine demographic information to reveal a person’s identity without direct identifiers (such as email addresses or IP addresses). For example, Sweeney (2000) demonstrated that just three variables from the United States Census – gender, ZIP code, and date of birth – uniquely characterized 87% of the U.S. population. That is, 87% of the U.S. population had a unique combination of these three variables (e.g., woman born on 1/18/1996 who lives in 02142). As a consequence, those individuals – including the only woman born on 1/18/1996 who lives in 02142 – can be identified using this minimal but publicly available information.
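To make the uniqueness idea concrete, here is a minimal Python sketch that counts how many records in a dataset have a one-of-a-kind combination of gender, ZIP code, and date of birth. It is an illustration only, not the pipeline from the manuscript: the toy records and the function names `share_unique` and `smallest_group` are hypothetical, and the "smallest group size" measure borrows the idea of k-anonymity, a term not used in the description above.

```python
from collections import Counter

# Hypothetical toy records: (gender, ZIP code, date of birth). Not real data.
records = [
    ("woman", "02142", "1996-01-18"),
    ("man", "02142", "1996-01-18"),
    ("woman", "02138", "1990-07-04"),
    ("woman", "02138", "1990-07-04"),
    ("man", "02139", "1985-03-22"),
]


def share_unique(rows):
    """Return the fraction of rows whose combination of values appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    unique_rows = sum(1 for row in rows if counts[row] == 1)
    return unique_rows / len(rows)


def smallest_group(rows):
    """Return the size of the smallest group of identical rows (the k in k-anonymity)."""
    return min(Counter(rows).values())


print(f"Share of unique records: {share_unique(records):.0%}")
print(f"Smallest group size: {smallest_group(records)}")
```

In this toy example, three of the five records are unique on the three variables, so anyone who knows those values for a person could single out that person's row. Adding further columns such as race/ethnicity or level of education can only split groups further, never merge them.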
Crucially, this risk of re-identification is especially relevant to psychological science (and social psych, in particular) because (a) datasets often include a host of additional demographic information (e.g., race/ethnicity, level of education), which heightens the risk of re-identification; (b) the demographic information collected by psychological scientists also exists in other public datasets, so records can be linked across sources to reveal sensitive information (e.g., health information); and (c) data sharing is becoming the norm across subdisciplines.
In the manuscript, we (a) introduce psychologists to the issue of re-identification risk, and (b) provide a complete pipeline for assessing re-identification risk, identifying appropriate risk mitigation and data sharing strategies, and implementing those strategies.