DIMACS REU 2023

General Information

Student: Lily Gebhart
Mentor: John Kolassa
School: Occidental College
E-mail: gebhart (at) oxy (dot) edu
Project: Approximations for Kurtosis and Continuity on the Prentice Test

Project Description

Nonparametric statistics is a subfield of statistics that makes minimal assumptions about the distribution of the data, which makes it well suited to the analysis of real-world phenomena. The Prentice test is a nonparametric test for the two-way analysis of variance using ranks. Its null distribution is usually approximated by the Chi-square distribution. However, the exact null distribution deviates from the Chi-square approximation in certain cases commonly found in applications, motivating adjustments to the approximation. This summer, we presented adjustments to this null distribution, and to that of related tests with non-polynomial scoring systems, correcting for continuity, skewness, and kurtosis in the multivariate case.


Weekly Log

Week 1: May 31 - June 2

This week, I spent most of my time building an understanding of the two problems in nonparametric statistics I will be working on this summer. I read papers [1], [2], [3], and [4] in the list below, sections of [5], and other miscellaneous resources. This helped me understand how to implement kurtosis adjustments for the Kruskal-Wallis and Friedman tests in code. I was also able to find code corresponding to the improvements made to the Friedman test in [3] below.

Lastly, I prepared my slides for the introductory presentations next Monday. The background I had built over the course of the week paid off here, and I finished the presentation more quickly than I expected.


Week 2: June 5 - June 9

On Monday, I gave my introductory presentation to the REU cohort. It was exciting to see what everybody would be working on over the summer!

Over the course of the week, I continued to review background material for the Kruskal-Wallis test project. I kept working through [2] to understand how to replicate its results so that I can build on them later this summer. I also reviewed Chapters 1-3 and portions of Chapter 6 of [6] below, which built my background understanding of asymptotics and Edgeworth series in both univariate and multivariate forms. I also spent some time reviewing content from Real Analysis, Measure Theory, Probability Theory, and Complex Analysis using online resources.

Towards the end of the week, I began looking at how to calculate the first through fourth cumulants of a multivariate rank sum distribution. I will likely continue with this next week and hopefully get code running for the results in [2].
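To make the cumulant calculation concrete, here is a minimal univariate sketch in Python (the project code itself was written in R and is not reproduced here). It enumerates every way of assigning n of the ranks 1..N to one group and computes the first four cumulants of the resulting rank sum from its central moments, using kappa_4 = mu_4 - 3 * mu_2^2. The function name rank_sum_cumulants is hypothetical, and the multivariate case studied in the project needs joint cumulants that this sketch does not cover.

```python
from fractions import Fraction
from itertools import combinations


def rank_sum_cumulants(n, N):
    """First four cumulants of the sum of n ranks drawn without
    replacement from 1..N (all subsets equally likely, no ties)."""
    sums = [sum(c) for c in combinations(range(1, N + 1), n)]
    count = len(sums)
    mean = Fraction(sum(sums), count)

    def central(r):
        # r-th central moment, kept exact with rational arithmetic
        return sum((s - mean) ** r for s in sums) / count

    mu2, mu3, mu4 = central(2), central(3), central(4)
    # kappa_1 = mean, kappa_2 = mu_2, kappa_3 = mu_3, kappa_4 = mu_4 - 3 mu_2^2
    return mean, mu2, mu3, mu4 - 3 * mu2 ** 2


k1, k2, k3, k4 = rank_sum_cumulants(3, 9)
print(k1, k2, k3, k4)
```

For n = 3 and N = 9 the first two cumulants agree with the closed forms n(N+1)/2 = 15 and n(N-n)(N+1)/12 = 15, and the third is zero by symmetry. Brute-force enumeration like this is only practical for small examples.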


Week 3: June 12 - June 16

This week, I started to code up, in R, the approximation from [2] for the distribution of the Kruskal-Wallis test statistic under the null hypothesis. By the end of the week, I had this written and was able to start comparing it against the exact Kruskal-Wallis null distribution and against the Chi-square distribution that is often used to approximate it. Next week, I will continue making comparisons to other tests to ensure that my code for the Yarnold approximation is correct, and I will implement a few other measures to make the code faster.
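As an illustration of the kind of comparison described above, the following Python sketch (a stand-in, not the project's R code) enumerates the exact null distribution of the Kruskal-Wallis statistic H = 12 / (N(N+1)) * sum(R_j^2 / n_j) - 3(N+1) for three small groups without ties, and compares an exact tail probability with the Chi-square tail. It is restricted to three groups so that the Chi-square approximation has 2 degrees of freedom, whose tail is exactly exp(-x/2). The Yarnold approximation itself is not implemented here, and all function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import exp


def kw_null_distribution(sizes):
    """Exact null distribution of the Kruskal-Wallis statistic for
    three groups with the given sizes, assuming no ties."""
    n1, n2, n3 = sizes
    N = n1 + n2 + n3
    ranks = set(range(1, N + 1))
    counts = Counter()
    for g1 in combinations(sorted(ranks), n1):
        rest = ranks - set(g1)
        for g2 in combinations(sorted(rest), n2):
            g3 = rest - set(g2)
            s = (Fraction(sum(g1) ** 2, n1)
                 + Fraction(sum(g2) ** 2, n2)
                 + Fraction(sum(g3) ** 2, n3))
            h = Fraction(12, N * (N + 1)) * s - 3 * (N + 1)
            counts[h] += 1
    total = sum(counts.values())
    return {h: Fraction(c, total) for h, c in sorted(counts.items())}


def exact_tail(dist, x):
    """P(H >= x) under the exact null distribution."""
    return sum(p for h, p in dist.items() if h >= x)


def chi2_tail_df2(x):
    """P(X >= x) for a chi-square variable with 2 degrees of freedom."""
    return exp(-x / 2)


dist = kw_null_distribution((3, 3, 3))
h_max = max(dist)
print(float(h_max), float(exact_tail(dist, h_max)), chi2_tail_df2(float(h_max)))
```

With three groups of size 3, the largest attainable value is H = 7.2, reached by 6 of the 1680 equally likely group assignments, so the exact tail probability there is 1/280 (about 0.0036) while the Chi-square tail is exp(-3.6) (about 0.027). Gaps of this kind in small samples are what motivate a more refined approximation.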


Week 4: June 19 - June 23

This week, we implemented code that speeds up the calculation of the central moments used in our approximation. This means that we can run larger examples more quickly, which is a significant benefit for the project. However, much of the week was spent correcting this code so that it produces the right results.

I was also able to make plots comparing the Yarnold approximation and the Chi-square distribution to the distribution of the Kruskal-Wallis test statistic under the null hypothesis. These plots demonstrated that the Yarnold approximation is a better fit for the Kruskal-Wallis test statistic distribution than the Chi-square distribution, which is what we expected to find.

I also produced plots comparing the Yarnold approximation and the Chi-square distribution to the distribution of the Friedman test statistic under the null hypothesis. However, we are still working out some kinks in the code for this plot, as it is not yet computing the Yarnold approximation correctly.

Lastly, I uploaded my code to GitHub so that we could better collaborate on the code for the rest of the summer.


Week 5: June 26 - June 30

This week, I wrote a plotting function that compares four things: the exact distribution of the Friedman test statistic [1], the Chi-square distribution commonly used to approximate it, the Iman and Davenport approximation [3], and our approximation. The goal was to see how our approximation compares to previous work. I also wrote a function that plots relative and absolute error so that the differences between the approximations are easier to discern.
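The two error measures can be sketched as follows in Python (a stand-in for the R plotting code, which is not shown in this log). Absolute error is |approx - exact| and relative error divides that by the exact value. The function names are hypothetical, and the numbers in the example are made-up placeholders chosen only to show the behavior, not results from the project.

```python
def absolute_error(approx, exact):
    """Pointwise |approx - exact| for two equal-length sequences."""
    return [abs(a - e) for a, e in zip(approx, exact)]


def relative_error(approx, exact):
    """Pointwise |approx - exact| / exact; NaN where exact is zero."""
    return [abs(a - e) / e if e != 0 else float("nan")
            for a, e in zip(approx, exact)]


# Placeholder tail probabilities, for illustration only.
exact = [0.10, 0.05, 0.01]
approx = [0.11, 0.04, 0.02]

abs_err = absolute_error(approx, exact)
rel_err = relative_error(approx, exact)
print(abs_err)
print(rel_err)
```

In this placeholder example the absolute error is roughly the same at every point, but the relative error grows as the exact probability shrinks. That is why both views are useful when judging an approximation in the tail, where critical values are read off.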

With this, and a lot of help from Professor Kolassa, we were able to debug our approximation as applied to the Friedman test statistic! We were also able to work out bugs in the other approximations we compared against, including the Iman and Davenport approximation.

I was also able to start writing up our results in a paper this week, and I plan to continue with it next week after working out some details of another case we will address in the paper.
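For reference, the quantities being compared can be written down in a few lines. The Python sketch below (not the project's R code) computes the Friedman statistic from a table of within-block ranks without ties, chi2 = 12 / (b k (k+1)) * sum(R_j^2) - 3 b (k+1), and then the textbook form of the Iman and Davenport transformation, F = (b-1) chi2 / (b(k-1) - chi2), which is referred to an F distribution with k-1 and (b-1)(k-1) degrees of freedom. The function names and the small rank table are hypothetical, and the F tail probability is left out because it is not available in the Python standard library.

```python
from fractions import Fraction


def friedman_statistic(rank_rows):
    """Friedman statistic for b blocks and k treatments.
    rank_rows[i][j] is the rank (1..k) of treatment j within block i."""
    b = len(rank_rows)
    k = len(rank_rows[0])
    col_sums = [sum(row[j] for row in rank_rows) for j in range(k)]
    return (Fraction(12, b * k * (k + 1)) * sum(r * r for r in col_sums)
            - 3 * b * (k + 1))


def iman_davenport_f(chi2, b, k):
    """Textbook Iman-Davenport F transformation of the Friedman statistic."""
    denom = b * (k - 1) - chi2
    if denom == 0:
        # chi2 is at its maximum b(k-1): every block ranks identically
        return float("inf")
    return (b - 1) * chi2 / denom


# Hypothetical example: 4 blocks, 3 treatments.
ranks = [[1, 2, 3], [1, 3, 2], [2, 1, 3], [1, 2, 3]]
chi2 = friedman_statistic(ranks)
f_stat = iman_davenport_f(chi2, b=4, k=3)
print(chi2, f_stat)  # prints: 9/2 27/7
```

Here the rank sums are 5, 8, and 11, giving chi2 = 9/2 and F = 27/7 on (2, 6) degrees of freedom.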


Week 6: July 3 - July 7

This week, I continued to write up the paper and produced most of the figures for the paper. I also organized and added documentation to our code on GitHub.

I also modified our code to work for the unbalanced case, where different group and block combinations can have different numbers of replicates (or participants) as opposed to the balanced case where every group and block combination has the same number of participants. We also started working on one of the last extensions of the current project that we will be considering, which is expanding the scoring systems to work for more than just rank sums. This will enable the implemented Yarnold approximation to be expanded to more tests that rely on the Chi-square distribution for approximation of the test statistic.


Week 7: July 10 - July 14

My main objective this week was expanding the scoring systems supported by the Yarnold approximation. Last week, I wrote my own version of the Prentice test for the balanced case, which we didn't use at the time. However, to implement the scoring systems, I needed to adjust the function to work for both the unbalanced and balanced cases, which required more work than the simple formula I had implemented last week. The function was completed early in the week and performs correctly with the original scoring system, but it behaves oddly for other scoring systems. My debugging will likely carry into next week.

I also made more edits to the paper and finished the first draft of my slides for the final presentations next week. Aside from research, the REU visited IBM this past Thursday, where we heard some interesting presentations on the work of IBM and got a small tour of the facilities.


Week 8: July 17 - July 21

This week, I finished debugging, and we fixed the option for alternative scores in my Prentice test function! I was able to run an example plot and add findings on the more continuous behavior of the non-polynomial scores to our paper.

I spent much of the week preparing for my final presentation, which I gave on Thursday. It was great to see what everybody had been working on over the summer during the final presentations. After I gave the presentation, we shifted our attention to the manuscript, and I continued making edits to prepare it for publication. I also began working on my final papers for the completion of the REU.


Week 9: July 24 - July 28

This week, I made many edits to our paper, primarily for cleanliness (which involved rerunning and reformatting plots). I also added background information to justify the traditional use of the Chi-square distribution to approximate the Prentice test statistic under the null hypothesis and to describe bounds on previous approximations to the test statistic. We also extended our comparisons with the Iman-Davenport approximation to cover the Kruskal-Wallis test statistic [7] as well as the Friedman test statistic [3]. As a result, I was able to add a few more examples regarding the Kruskal-Wallis test statistic to the paper. I spent some time thinking about extending the Iman-Davenport approximation to the Prentice test, as no results have been published for this extension, but I was not able to make much progress.


References & Links

Relevant Papers & References:
  [1] The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance, Friedman 1937 - Taylor & Francis.
  [2] Asymptotic Approximations for the Probability that a Sum of Lattice Random Vectors Lies in a Convex Set, Yarnold 1972 - Project Euclid.
  [3] Approximations of the Critical Region of the Friedman Statistic, Iman & Davenport 1980 - ResearchGate.
  [4] Various Improved Approximations to Distributions of Quadratic Test Statistics for Dependent Rank Sums, Chen & Kolassa 2018 - BioMedical.
  [5] An Introduction to Nonparametric Statistics, Kolassa 2020 - Taylor & Francis.
  [6] Series Approximation Methods in Statistics, Third Edition, Kolassa 2006 - Springer.
  [7] New Approximations to the Exact Distribution of the Kruskal-Wallis Test Statistic, Iman & Davenport 1976 - Taylor & Francis.

Acknowledgements

This work was carried out while the author, Lily Gebhart, was a participant in the 2023 DIMACS REU program at Rutgers University, supported by NSF grant CNS-2150186.