Can you guys comment on Yi’s analysis of the leaked police data? Do you think his methodology is valid?
Yes. Given the circumstances, it seems pretty reasonable.
Some points:
(1) He had access to the 250k-record sample made available in the leak. (We do not.)
(2) He plotted population counts on the Y-axis against age bins on the X-axis to get a chart of the population structure.
(3) He ran at least two data quality checks. First, using the demographic information (name, age, address) attached to each individual in the leak, he grouped the records by county and observed that the sample is spread widely enough geographically to include individuals from "almost every county". Second, he computed the share of each surname in the sample and compared those shares against the 2010 census data he has access to, concluding that they are similar (e.g. the surname Li is X% of the sample versus a similar X% share in the census).
Given these checks, it seems reasonable to treat the sample as a random subset of the entire database, which is a good thing: the population structure curve of the sample should then be reasonably close to the one for the whole 1B-person data set.
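For anyone who wants to see what those steps amount to, here is a rough Python sketch of the age-bin tally and the two checks. To be clear, this is not Yi's code: the toy records, the field names (`surname`, `age`, `county`), and the helper functions are all placeholders I made up for illustration, and the real analysis would run on the actual sample and the 2010 census surname shares.

```python
from collections import Counter

# Hypothetical toy records standing in for the leaked sample (not real data).
records = [
    {"surname": "Li", "age": 34, "county": "A"},
    {"surname": "Wang", "age": 8, "county": "B"},
    {"surname": "Li", "age": 61, "county": "C"},
    {"surname": "Zhang", "age": 27, "county": "A"},
    {"surname": "Wang", "age": 45, "county": "B"},
]

def age_structure(records, bin_width=5):
    """Count people per age bin, keyed by the lower edge of each bin."""
    counts = Counter((r["age"] // bin_width) * bin_width for r in records)
    return dict(sorted(counts.items()))

def county_coverage(records, all_counties):
    """Fraction of the known counties that appear at least once in the sample."""
    seen = {r["county"] for r in records}
    return len(seen & set(all_counties)) / len(all_counties)

def surname_shares(records):
    """Share of each surname in the sample, as fractions summing to 1."""
    counts = Counter(r["surname"] for r in records)
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}

def max_share_gap(sample_shares, reference_shares):
    """Largest absolute difference between sample and reference surname shares."""
    names = set(sample_shares) | set(reference_shares)
    return max(
        abs(sample_shares.get(n, 0.0) - reference_shares.get(n, 0.0))
        for n in names
    )
```

The age counts from `age_structure` are what would go on the Y-axis of the population chart. High county coverage and a small `max_share_gap` against the census shares are the kind of evidence that supports treating the sample as roughly random, though neither check proves it.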
Thanks for the question, Lee!
Thanks!