What are the statistical consequences of using one- or low-dimensional representations of high-dimensional phenotypic traits as the basis for genome-wide association studies (GWAS)? We recently completed a simulation study to explore this issue. Our results, currently submitted for publication, indicate that statistical power is maximized and the false discovery rate (FDR) is minimized when the number of measured dimensions equals the true dimensionality of the traits under study. The cost in statistical performance caused by under-sampling dimensions can be dramatic and is difficult to recover by increasing sample size. In contrast, the performance loss from the error variation added by over-sampling dimensions is relatively minor. Moreover, our simulations show that matching the true dimensionality when measuring traits boosts statistical power even in the presence of strong linkage disequilibrium (LD). This suggests that, when compared within a group of candidate SNPs in disequilibrium, P-values alone can be a reliable predictor of causality.
Our simulation study suggests: (1) that the mismatch between sampled and true dimensionality is an essential component of power, though it is rarely if ever included in pre-study power analyses; (2) that in the absence of estimates of dimensionality it is preferable to over-sample rather than under-sample the dimensions (e.g., principal components) under study; and (3) that in the presence of LD and sufficient statistical power, validation is more likely if P-values are used to choose among candidate SNPs that have been grouped into correlated sets, irrespective of their distance in the genome.
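The dimensionality argument above can be illustrated with a small toy simulation. The sketch below is not the simulation design used in our study; it is a minimal illustration under stated assumptions. A single SNP affects each of `true_dim` independent trait dimensions with the same effect `beta`, any measured dimension beyond `true_dim` is pure noise, and the test statistic is simply the sample size times the sum of squared genotype-trait correlations over the first `d` measured dimensions. The significance threshold is estimated empirically from null replicates. All function names, parameter names, and default values are hypothetical, and LD is not modeled.

```python
import random


def squared_correlation(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate case, e.g. a monomorphic SNP
    return sxy * sxy / (sxx * syy)


def simulate_statistics(rng, n, maf, betas, max_dim):
    """Simulate one SNP and return cumulative statistics T_1..T_max_dim.

    T_d = n * (sum of squared genotype-trait correlations over the
    first d measured dimensions). Dimensions beyond len(betas) carry
    no genetic signal (assumption: they are independent noise).
    """
    # Genotype = number of minor alleles (0, 1, or 2).
    genotypes = [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]
    total, stats = 0.0, []
    for j in range(max_dim):
        beta = betas[j] if j < len(betas) else 0.0
        trait = [beta * g + rng.gauss(0.0, 1.0) for g in genotypes]
        total += n * squared_correlation(genotypes, trait)
        stats.append(total)
    return stats


def estimate_power(true_dim=3, max_dim=6, n=200, maf=0.3, beta=0.2,
                   reps=300, alpha=0.05, seed=1):
    """Estimate power for each number of measured dimensions d = 1..max_dim."""
    rng = random.Random(seed)
    null = [simulate_statistics(rng, n, maf, [0.0] * true_dim, max_dim)
            for _ in range(reps)]
    alt = [simulate_statistics(rng, n, maf, [beta] * true_dim, max_dim)
           for _ in range(reps)]
    power = {}
    for d in range(1, max_dim + 1):
        # Empirical (1 - alpha) quantile of the null statistic for this d.
        null_d = sorted(s[d - 1] for s in null)
        threshold = null_d[min(reps - 1, int((1 - alpha) * reps))]
        power[d] = sum(s[d - 1] > threshold for s in alt) / reps
    return power
```

Calling `estimate_power()` returns a dictionary mapping each number of measured dimensions to its estimated power, so the values for `d` below, equal to, and above `true_dim` can be compared directly. Because the trait dimensions here are independent and equally affected, the sketch only conveys the qualitative trade-off between missing signal and adding noise; it makes no claim about the magnitudes in realistic data.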