Data reidentification is a potential threat to individuals and entities represented in a dataset, especially when the data are sensitive in nature. Threat is a function of the probability of reidentification and the consequences of being reidentified. Identifiers are features of the dataset that may be used to reidentify an individual, also referred to as [[Personally Identifiable Information]] (PII). Direct identifiers, such as a name, address, or Social Security number, can identify an individual on their own. Indirect identifiers can identify an individual in combination with other identifiers or by context. A threat actor may attempt reidentification through case-by-case matching or computer-based approaches like fuzzy or probabilistic matching, and may mosaic multiple datasets together to reidentify individuals.

## RAMPUP Process

Use the RAMPUP process to address reidentification risk in datasets.

1. Risk Assessment: identify the risk and probability of reidentification
2. Mitigation Plan: apply technique(s) for mitigating the risk to individuals
3. Utility Profile: verify that the final mitigated dataset is representative of the original data

## Mitigation strategies

### Categorical

* **Recode:** re-code data, especially by grouping (e.g., recode town name to county)
* **Local suppression:** redact any categories with fewer than some threshold number of individuals

### Numeric

* **Micro-aggregation:** round values or create bins (e.g., exact age to age ranges)
* **Add noise:** add noise to the data (e.g., "jitter" coordinates by some distance in a random direction). When adding noise to a variable that might be summarized (e.g., age), add noise in a way that maintains the mean across the entire dataset.
* **Top/bottom coding:** recode outliers, or any values above or below specific numeric thresholds (e.g., the highest- and lowest-cost houses in an area)

### Overall

* **Aggregation:** report summary values for groups (e.g., counts or means) rather than individual records
* **Redaction:** remove the record or variable from the dataset.
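A few of the numeric strategies above can be sketched in Python. This is a minimal illustration, not a vetted anonymization library; the function names, bin width, and noise scale are all assumptions chosen for the example.

```python
import random


def bin_age(age, width=10):
    """Micro-aggregation: replace an exact age with a range label."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"


def add_mean_preserving_noise(values, scale=2.0, seed=0):
    """Add noise: jitter each value, then shift all values so the
    dataset-wide mean matches the original (per the note above)."""
    rng = random.Random(seed)  # fixed seed only for reproducibility here
    noisy = [v + rng.uniform(-scale, scale) for v in values]
    drift = sum(noisy) / len(noisy) - sum(values) / len(values)
    return [v - drift for v in noisy]


def top_bottom_code(value, lo, hi):
    """Top/bottom coding: clamp outliers to the threshold values."""
    return max(lo, min(hi, value))
```

For example, `bin_age(34)` yields the range `"30-39"`, and `top_bottom_code(120, 0, 99)` recodes the outlier to the top threshold `99`, while `add_mean_preserving_noise` perturbs individual values but leaves the mean of the column unchanged.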
## K-anonymity

K-anonymity is a technique to reduce the chance of data reidentification. Consider the variables in the dataset that might be used to reidentify individuals or may reveal sensitive information: demographic data, variables that others could observe easily or would be expected to know, and sensitive variables (e.g., HIV status, sexual orientation). Count the number of individuals in each group represented by a unique combination of these variables. All groups must have at least $k$ individuals. $k$ can be any number, but $3$, $5$, and $10$ are common values; the higher the threat, the larger $k$ should be, recognizing that data utility may decrease with higher values of $k$. Also consider the sample size relative to the population size when choosing $k$: if the sample represents a significant proportion of the population, higher values of $k$ are warranted. If any group has fewer than $k$ individuals, apply the mitigation strategies above.

> [!Tip]+ Additional Resources
> - [De-Identifying Government Datasets: Techniques & Governance NIST SP 800-188](https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-188.pdf)
> - [Exposed! A Survey of Attacks on Private Data](https://privacytools.seas.harvard.edu/files/privacytools/files/pdf_02.pdf)
> - [Guidelines for Evaluating Differential Privacy Guarantees NIST SP 800-226 ipd](https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-226.ipd.pdf)
> - [A Brief History of Data Anonymization](https://aircloak.com/history-of-data-anonymization/)

See also: [[Differential privacy]]
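The k-anonymity check described above, counting individuals per unique combination of identifying variables and flagging groups smaller than $k$, can be sketched as follows. The function name, the dict-of-records representation, and the example quasi-identifiers are assumptions for illustration only.

```python
from collections import Counter


def k_anonymity_violations(records, quasi_identifiers, k=5):
    """Count records in each equivalence class (unique combination of
    quasi-identifier values) and return the classes with fewer than k
    members, i.e., the groups that still need mitigation."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return {combo: n for combo, n in groups.items() if n < k}
```

Usage on a toy dataset with two quasi-identifiers (binned age and a 3-digit ZIP prefix): three records share the combination `("30-39", "021")`, but `("40-49", "021")` appears only once, so with $k = 3$ that group is flagged for recoding, suppression, or another strategy from the list above.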