Polygenic risk scores have become a standard tool for summarising genetic predisposition to disease, widely used by researchers, clinicians, and consumer DNA testing companies. New research shows the scores carry a privacy vulnerability that their reputation for mathematical opacity has largely obscured.
Gamze Gürsoy and Kirill Nikitin at Columbia University demonstrated that the scores can be reverse-engineered to reconstruct an individual’s underlying genetic data with 94.6 percent accuracy, correctly predicting 2,450 single-nucleotide polymorphisms (SNPs) per person. The finding, according to the announcement, has direct implications for insurance contexts, anonymous data sharing, and research involving vulnerable populations.
The attack works because of a specific structural feature of how polygenic risk scores are built. Each SNP contributing to a score is multiplied by a weight specified to as many as 16 digits of precision. That precision sharply constrains the number of genetic combinations that could produce any given final score. The researchers describe the theoretical challenge as analogous to the knapsack problem in mathematics — computationally hard in general, but tractable when the weights are sufficiently precise and the model sufficiently small.
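To see why precise weights pin down the genotype, here is a minimal sketch (our own illustration, not the researchers' code). A toy 12-SNP model with high-precision weights produces a score, and brute-force enumeration of every possible genotype vector recovers the one that generated it:

```python
import itertools
import random

random.seed(7)

# Hypothetical toy model: 12 SNPs with high-precision effect weights,
# standing in for the small (<= 50 SNP) models examined in the study.
n_snps = 12
weights = [random.uniform(-0.5, 0.5) for _ in range(n_snps)]

# A "hidden" genotype: each SNP carries 0, 1, or 2 copies of the risk allele.
true_genotype = tuple(random.choice((0, 1, 2)) for _ in range(n_snps))

# The only thing the attacker sees: the final polygenic risk score.
observed_score = sum(w * g for w, g in zip(weights, true_genotype))

# Brute-force inversion: enumerate all 3**12 = 531,441 genotype vectors
# and keep those whose score matches the observed one exactly.
matches = [g for g in itertools.product((0, 1, 2), repeat=n_snps)
           if sum(w * x for w, x in zip(weights, g)) == observed_score]

print(len(matches))
print(matches[0] == true_genotype)
```

With weights this precise, an accidental collision between two different genotypes is vanishingly unlikely, so the enumeration typically returns exactly one candidate: the true genotype. Real attacks must cope with noise and larger models, which is where the chaining described below comes in.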
How the Attack Chains Together
Gürsoy and Nikitin ran 298 polygenic risk models using 50 SNPs or fewer against genetic data from 2,353 individuals. Because a single SNP often appears across multiple models, the researchers could use SNPs recovered from smaller models to help solve larger ones — a daisy-chain method that extends the reach of each successful reconstruction.
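The daisy-chain idea can be sketched with two overlapping toy models (again our own illustration, with hypothetical weights): solve the small model outright, then substitute the recovered SNPs into the larger model to shrink its search space.

```python
import itertools
import random

random.seed(3)

def score(weights, genotype):
    # Polygenic score: weighted sum of risk-allele dosages (0, 1, or 2).
    return sum(w * g for w, g in zip(weights, genotype))

def invert(weights, target, fixed=()):
    # Enumerate genotypes for the non-fixed SNPs; 'fixed' holds dosages
    # already recovered from a smaller, overlapping model.
    free = len(weights) - len(fixed)
    return [fixed + tail for tail in itertools.product((0, 1, 2), repeat=free)
            if score(weights, fixed + tail) == target]

# Two overlapping toy models: model B's first three SNPs are model A's.
w_a = [random.uniform(-0.5, 0.5) for _ in range(3)]
w_b = w_a + [random.uniform(-0.5, 0.5) for _ in range(3)]

hidden = tuple(random.choice((0, 1, 2)) for _ in range(6))
score_a = score(w_a, hidden[:3])
score_b = score(w_b, hidden)

# Step 1: solve the small model outright (3**3 = 27 candidates).
recovered_a = invert(w_a, score_a)[0]

# Step 2: plug those dosages into the larger model, leaving only
# 3**3 candidates to check instead of 3**6.
recovered_b = invert(w_b, score_b, fixed=recovered_a)[0]
print(recovered_b == hidden)
```

Each solved model cuts the exponent for the next, which is why small models act as footholds for attacking larger ones.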
The downstream identification risk is concrete. Just 27 SNPs proved sufficient to identify an individual within a pool of half a million samples. Family members could be predicted with up to 90 percent precision. The researchers also noted that individuals of African and East Asian descent were more easily identified, a direct consequence of their underrepresentation in existing genetic reference databases.
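A back-of-the-envelope calculation (our illustration, not the paper's analysis) shows why so few SNPs suffice: each SNP takes one of three dosage values, so 27 SNPs span far more combinations than there are samples in the pool.

```python
# 27 SNPs, each with 3 possible dosages, span 3**27 combinations --
# roughly 7.6 trillion, dwarfing a pool of half a million samples.
combinations = 3 ** 27
pool = 500_000
print(combinations)          # 7,625,597,484,987
print(combinations // pool)  # ~15 million combinations per sample
```

With that many combinations per sample, any given 27-SNP profile is overwhelmingly likely to be unique in the pool.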
Of 447 small, high-precision models in a public polygenic score database, Gürsoy says all are vulnerable to this class of attack. The practical threat scenarios include health insurers reconstructing genetic data from a summary report submitted by a patient, and individuals sharing scores anonymously being re-identified through public genealogy databases.
Where the Risk Sits
“Because the final polygenic risk score is constrained by a finite number of ways you could arrive at that number, and a statistically likely arrangement of the underlying SNPs, it can be deduced with a high degree of accuracy,” Gürsoy said. The researcher framed the overall risk as low but real under specific conditions, calling for consideration in research study design — particularly when vulnerable populations are involved.
Ying Wang at Massachusetts General Hospital, responding to the findings, said existing data protections and computational bottlenecks constrain the practical exploitability of the method. Wang added that the results “may serve as a caution that small models should be treated as potentially sensitive data in clinical reporting and informed consent discussions.”
The work sits at an intersection that genetic privacy frameworks have not fully addressed: the gap between sharing a summary statistic and sharing the underlying data it encodes.
This article is a curated summary based on third-party sources.