Yaqian (Angela) Zhu
Assessing and Comparing the Application of Bootstrap Methods
Bootstrapping is a method for estimating or approximating the sampling distribution of a statistic; it is often described as a resampling procedure that relies on numerical approximation. Constructing confidence intervals requires knowledge of the sampling distribution or its percentiles.
When the sampling distribution of a statistic and its characteristics are unknown, they need to be estimated from data, which can be done using bootstrap approaches. The purpose of this project is to assess which bootstrapping methods perform well with respect to confidence interval coverage.
We will numerically survey bootstrap approaches applied to various statistical estimators under different underlying distributions to assess and compare their accuracy. Furthermore, we will consider research that has already been done in this area and compare its findings with our study.
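The project's simulations were carried out in R. As a minimal language-neutral illustration of the basic idea, here is a sketch of the percentile bootstrap confidence interval in Python using only the standard library (the function name, sample size, and number of replicates are illustrative choices, not taken from the project):

```python
import random
import statistics

def percentile_bootstrap_ci(data, stat=statistics.mean, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI: resample the data with replacement,
    recompute the statistic on each resample, and take empirical
    quantiles of the sorted replicates as the interval endpoints."""
    reps = sorted(stat(random.choices(data, k=len(data))) for _ in range(n_boot))
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

random.seed(1)
sample = [random.gauss(0, 1) for _ in range(30)]
lo, hi = percentile_bootstrap_ci(sample)
print(lo, hi)
```

The same resampling loop works for any statistic (mean, median, or a model coefficient), which is what makes the bootstrap broadly applicable when the sampling distribution is unknown.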
- Week 1:
- I met with my mentor, Dr. Kolassa, to discuss the goals and direction of this project. He gave me a paper as well as two books that introduced me to bootstrapping and its various approaches. I also wrote R code to construct percentile and residual confidence intervals for common distributions, such as the normal and uniform, to get an idea of bootstrapping procedures.
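Assuming the "residual" interval here corresponds to what Davison and Hinkley call the basic bootstrap interval (an assumption on my part), the construction reflects the percentile bounds around the observed statistic. A standard-library Python sketch:

```python
import random
from statistics import mean

def basic_bootstrap_ci(data, stat=mean, n_boot=2000, alpha=0.05):
    """Basic bootstrap interval: take the percentile bounds of the
    bootstrap replicates, then reflect them around the observed
    statistic: (2*theta_hat - hi_q, 2*theta_hat - lo_q)."""
    theta = stat(data)
    reps = sorted(stat(random.choices(data, k=len(data))) for _ in range(n_boot))
    lo_q = reps[int((alpha / 2) * n_boot)]
    hi_q = reps[int((1 - alpha / 2) * n_boot) - 1]
    return 2 * theta - hi_q, 2 * theta - lo_q

random.seed(6)
x = [random.uniform(0, 1) for _ in range(25)]
lo, hi = basic_bootstrap_ci(x)
print(lo, hi)
```

The reflection step is what distinguishes this interval from the percentile interval; the two coincide only when the bootstrap distribution is symmetric about the observed statistic.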
- Week 2:
- On Monday, I gave my presentation introducing my project. I read a section on empirical comparisons of bootstrap confidence sets to evaluate coverage accuracy and performance. To better understand the basis for these methods, I reviewed background material on asymptotic theory and Edgeworth expansions. Furthermore, to see the coverage rate of regular confidence intervals for the mean, I simulated data from four symmetric distributions: normal, uniform, Cauchy, and Laplace.
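A coverage-rate check like the one described can be sketched in a few lines of standard-library Python (the project itself used R). This toy version uses the normal quantile in place of the t quantile for simplicity, which slightly undercovers at small n, and simulates only the normal case; the sampler can be swapped for other symmetric distributions:

```python
import random
from statistics import NormalDist, mean, stdev

def coverage_z_ci(sampler, true_mean, n=50, n_sims=2000, conf=0.95):
    """Estimate the fraction of normal-theory intervals
    mean +/- z * s / sqrt(n) that cover the true mean."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    hits = 0
    for _ in range(n_sims):
        x = [sampler() for _ in range(n)]
        half = z * stdev(x) / n ** 0.5
        hits += abs(mean(x) - true_mean) <= half
    return hits / n_sims

random.seed(2)
cov = coverage_z_ci(lambda: random.gauss(0, 1), true_mean=0.0)
print(round(cov, 3))
```

Note that for the Cauchy distribution the mean does not exist, so the "true value" being covered there is the location parameter (the center of symmetry) rather than a mean.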
- Week 3:
- To assess the performance of the various bootstrap methods, we drew samples of sizes 10, 20, and 50 from several symmetric continuous probability distributions: the normal, uniform, Cauchy, and Laplace. Using R, we generated bootstrap replicates of the mean and median and calculated the corresponding confidence intervals. We repeated this procedure many times in a Monte Carlo simulation to estimate the percentage of the generated intervals that cover the true value of the distribution's parameter. These percentages will be used to compare the bootstrap methods while taking into account characteristics of the underlying distribution.
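The nested structure of this experiment, a Monte Carlo loop around an inner bootstrap loop, can be sketched as follows in standard-library Python (the project used R; the sample size and replicate counts here are kept small for illustration and are not the project's settings):

```python
import random
from statistics import mean, median

def percentile_ci(data, stat, n_boot=500, alpha=0.05):
    """Percentile bootstrap interval for an arbitrary statistic."""
    reps = sorted(stat(random.choices(data, k=len(data))) for _ in range(n_boot))
    return reps[int((alpha / 2) * n_boot)], reps[int((1 - alpha / 2) * n_boot) - 1]

def mc_coverage(sampler, stat, true_value, n=20, n_sims=300):
    """Monte Carlo estimate of the proportion of bootstrap intervals
    that cover the true parameter value."""
    hits = 0
    for _ in range(n_sims):
        x = [sampler() for _ in range(n)]
        lo, hi = percentile_ci(x, stat)
        hits += lo <= true_value <= hi
    return hits / n_sims

random.seed(3)
cov_mean = mc_coverage(lambda: random.uniform(-1, 1), mean, 0.0)
cov_median = mc_coverage(lambda: random.uniform(-1, 1), median, 0.0)
print(cov_mean, cov_median)
```

Comparing the estimated coverage to the nominal 95% level, across statistics, sample sizes, and distributions, is exactly the comparison the study performs.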
- Week 4:
- I made graphs of the coverage percentages and average confidence interval lengths to compare the performance of the different bootstrap methods for the distributions of interest; these two measures are used to assess each bootstrap method. Furthermore, I ran simulations for the gamma distribution, an example of a non-symmetric distribution, to compare its behavior with that of the symmetric distributions. I have begun writing up our methodology and results.
- Week 5:
- We found that while the standard t-test method and the Studentized bootstrap method produce confidence intervals whose coverage percentages are closest to 95%, these methods also tend to produce longer intervals. Thus, to consider both measures simultaneously, I made scatter plots of average confidence interval length vs. coverage percentage residual. My mentor also gave me a book introducing survival analysis, and I familiarized myself with the "survival" package in R.
- Week 6:
- I ran Cox regressions for several of the survival data sets available in R and computed the hazard ratio for each. Furthermore, I simulated exponential data and computed hazard ratios for it. I made confidence intervals for bootstrap replicates of the hazard ratio. For some of the confidence intervals generated by the boot.ci function, the bounds were very large in magnitude, so I used the logarithm of the ratio instead.
- Week 7:
- Instead of bootstrapping the hazard ratio itself, we decided to assess the bootstrap methods for the coefficients of the Cox proportional hazards model, that is, the natural logarithm of the hazard ratio, to ensure reasonable confidence bounds. For the real data sets, coverage percentages were close to 1, but this may be due to the small sizes of the data sets. We are more interested in the results from simulating from the exponential distribution. Furthermore, on Friday I prepared and gave a presentation on the results obtained thus far for my project.
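For uncensored exponential data the setup simplifies considerably: the hazard is just the constant rate, whose maximum likelihood estimate is n / sum(x), so the log hazard ratio between two groups can be bootstrapped directly. The sketch below, in standard-library Python, is a simplification of the project's Cox-model analysis (which was done in R with censoring machinery), not a reproduction of it:

```python
import math
import random

def log_hazard_ratio(g1, g2):
    """For uncensored exponential samples the hazard equals the rate,
    with MLE n / sum(x); the log hazard ratio is the difference of logs."""
    return math.log(len(g1) / sum(g1)) - math.log(len(g2) / sum(g2))

def boot_ci_log_hr(g1, g2, n_boot=2000, alpha=0.05):
    """Percentile bootstrap interval for the log hazard ratio,
    resampling each group independently."""
    reps = sorted(
        log_hazard_ratio(random.choices(g1, k=len(g1)),
                         random.choices(g2, k=len(g2)))
        for _ in range(n_boot)
    )
    return reps[int((alpha / 2) * n_boot)], reps[int((1 - alpha / 2) * n_boot) - 1]

random.seed(4)
g1 = [random.expovariate(2.0) for _ in range(40)]  # true hazard rate 2
g2 = [random.expovariate(1.0) for _ in range(40)]  # true hazard rate 1
lo, hi = boot_ci_log_hr(g1, g2)
print(lo, hi)  # interval for the log hazard ratio; true value is log 2
```

Working on the log scale also illustrates why the project switched to the Cox coefficient: the log transform keeps the bootstrap replicates on a scale where the interval bounds stay moderate even when the raw ratio is large.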
- Week 8:
- Dr. Kolassa gave me a section to read on the accelerated failure time model, a parametric model that provides information on the rate of progression of a disease. I simulated data from the Weibull, lognormal, and exponential distributions to construct bootstrap confidence intervals for the natural logarithm of the acceleration factor. Procedures similar to those used earlier in our project were used to calculate coverage percentages.
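In the special case of an uncensored two-group log-normal AFT model, the log acceleration factor reduces to the difference in mean log survival times, which gives a simple standard-library Python sketch of the bootstrap step (the project's actual analysis used R and full AFT fitting; this toy version is an assumption-laden simplification):

```python
import math
import random
from statistics import mean

def log_acceleration_factor(t1, t2):
    """Under an uncensored two-group log-normal AFT model, the log
    acceleration factor is the difference in mean log survival times."""
    return mean(math.log(t) for t in t2) - mean(math.log(t) for t in t1)

def boot_ci_log_af(t1, t2, n_boot=2000, alpha=0.05):
    """Percentile bootstrap interval for the log acceleration factor,
    resampling each group independently."""
    reps = sorted(
        log_acceleration_factor(random.choices(t1, k=len(t1)),
                                random.choices(t2, k=len(t2)))
        for _ in range(n_boot)
    )
    return reps[int((alpha / 2) * n_boot)], reps[int((1 - alpha / 2) * n_boot) - 1]

random.seed(5)
t1 = [random.lognormvariate(0.0, 0.5) for _ in range(30)]
t2 = [random.lognormvariate(0.7, 0.5) for _ in range(30)]  # true log AF = 0.7
lo, hi = boot_ci_log_af(t1, t2)
print(lo, hi)
```

Coverage percentages are then obtained exactly as before: repeat the simulation many times and count how often the interval contains the true log acceleration factor.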
- Week 9:
- This week I made a poster and wrote the final research paper for my project. I would like to thank my mentor, Dr. Kolassa, and DIMACS for an amazing REU experience and NSF for funding my project.
-  Booth, J. G., & Sarkar, S. (1998). Monte Carlo Approximation of Bootstrap Variances. The American Statistician, 52(4), 354-357. doi:10.1080/00031305.1998.10480596
-  Davison, A. C., & Hinkley, D. V. (1997). Bootstrap Methods and Their Application. Cambridge: Cambridge University Press.
-  Higgins, J. J. (2004). An Introduction to Modern Nonparametric Statistics. Pacific Grove, CA: Brooks/Cole.
-  R Core Team (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
-  Shao, J. and Tu, D. (1995). The Jackknife and Bootstrap. New York: Springer.