Supplementary Material for: Transmission Disequilibrium Tests Based on Read Counts for Low-Coverage Next-Generation Sequence Data
datasetposted on 12.08.2015 by Kim W.
Datasets usually provide raw data for analysis. This raw data often comes in spreadsheet form, but can be any collection of data, on which analysis can be performed.
The purpose of this paper is the introduction of new statistical methods for case-parent trio association studies based on the read counts that can be obtained from next-generation sequencing (NGS) experiments. This work focuses on the inclusion of low-coverage data into the case-parent trio design without genotype classification or imputation. Two different approaches are considered: (1) a likelihood-based approach implementing a 15-component parametric mixture model and (2) a model-free approach that applies non-parametric statistical methods to the ratios of the read counts to coverage. Simulation studies are conducted to evaluate the performances of the proposed tests. In addition, the non-centrality parameters of the mixture likelihood-based tests are derived to determine sample sizes and coverage for a NGS experimental design. As an example, the sample sizes to maintain specified powers of a published adolescent idiopathic scoliosis (AIS) study are presented. The simulation results show that the tests using the genotypes classified by the maximum Bayesian posterior probability have significantly inflated type I error rates for low-coverage data. The tests using the posterior probabilities instead of the classified genotypes show lower power than the proposed tests. Generally, power for the likelihood-based approach is higher than that for the non-parametric ratio-based approach. For the AIS example, approximately 654 trios with 4× coverage are necessary to maintain 90% power when detecting an association of odds ratio 2 at a locus with a minor allele frequency of 0.35 at the level of significance α = 5 × 10-8. By comparison, approximately 416 trios with 25× coverage are required to maintain the same power with the same settings. The R and C source codes to calculate the proposed test statistics, the sample sizes and power can be obtained by contacting the author (firstname.lastname@example.org).