Some statistics-related things

collected here to share and for reference

R & Shiny

Add a loading spinner in Shiny with shinycssloaders in two lines of code. The package is on CRAN. (Mar 2019)

You don't have to know JavaScript to use JavaScript libraries in Shiny. A guide to htmlwidgets. (Mar 2019)

To use output values in a conditionalPanel, you must first render them in the UI. See this post. (Feb 2019)

CV errors of selected models are biased, for the same reason that naive post-selection inference and estimates of effect sizes are biased. See Varma & Simon (2006), Cawley & Talbot (2010), and Bergmeir & Benítez (2012). (Feb 2020)
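A minimal simulation sketch of the selection bias, under assumptions that are mine rather than taken from the cited papers: the labels are fair coin flips, and each candidate "model" is a fixed rule on one pure-noise feature. Because the rules have no fitted parameters, their k-fold CV error equals their plain error on the sample, so no fold loop is needed. Every rule has true error 0.5, yet the CV error of the selected rule is well below that.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_new, n_models = 100, 10_000, 50

# Labels are fair coin flips, so every rule below has true error 0.5.
y = rng.integers(0, 2, size=n)
X = rng.standard_normal((n, n_models))

# Model j predicts 1 whenever feature j is positive (nothing is fitted).
preds = (X > 0).astype(int)
cv_err = (preds != y[:, None]).mean(axis=0)

# Pick the model with the smallest CV error and report that error.
best = int(np.argmin(cv_err))
selected_cv_err = float(cv_err[best])

# Honest check: the selected rule applied to fresh data.
y_new = rng.integers(0, 2, size=n_new)
x_new = rng.standard_normal(n_new)  # fresh draws of the selected feature
fresh_err = float(((x_new > 0).astype(int) != y_new).mean())

print(f"CV error of selected model: {selected_cv_err:.3f}")
print(f"error on fresh data:        {fresh_err:.3f}")
```

The gap between the two printed numbers is the optimism introduced by selecting on the CV error itself.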

The OLS solution can be written as a weighted average of all n-choose-(p+1) slopes defined by all subsets of (p+1) data points that form a plane. See a theorem of Jacobi and its generalization by Mark Berman (1988). (I got it from Ken Rice's BIOST 571 notes, Nov 2019)
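A quick numerical check of this identity, as a sketch. I am assuming the weights are the squared determinants of the (p+1) x (p+1) design submatrices (intercept column included), which is how I read the Jacobi result; the function name `ols_from_subsets` is made up for illustration.

```python
from itertools import combinations

import numpy as np


def ols_from_subsets(X, y):
    """Weighted average of exact fits through every subset of k = p+1 points.

    X is the n x k design matrix, intercept column included.
    Weights are det(X_S)^2 (assumed form of the Jacobi weights).
    """
    n, k = X.shape
    num = np.zeros(k)
    den = 0.0
    for idx in combinations(range(n), k):
        rows = list(idx)
        Xs, ys = X[rows], y[rows]
        w = np.linalg.det(Xs) ** 2
        if w < 1e-12:  # degenerate subset: the points do not define a plane
            continue
        num += w * np.linalg.solve(Xs, ys)  # plane through these k points
        den += w
    return num / den


rng = np.random.default_rng(1)
n, p = 8, 2
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y = rng.standard_normal(n)

beta_subsets = ols_from_subsets(X, y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_subsets)
print(beta_ols)
```

The loop visits all n-choose-(p+1) subsets, so this is only practical for tiny n; it is a check of the identity, not a way to fit a regression.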

The ridge solution can be written as a weighted average of 2^p regression coefficients on all possible subsets of variables. See Leamer and Chamberlain (1976). (Nov 2019)
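A small numerical check, as a sketch under my own assumptions: no intercept, a single penalty k on all coefficients, and subset J weighted by k^(p - |J|) det(X_J' X_J), with the coefficients of excluded variables set to zero. The function name `ridge_from_subsets` is hypothetical.

```python
from itertools import combinations

import numpy as np


def ridge_from_subsets(X, y, k):
    """Ridge coefficients as a weighted average of 2^p subset OLS fits.

    Assumed weights: k^(p - |J|) * det(X_J' X_J) for the variable subset J.
    """
    n, p = X.shape
    num = np.zeros(p)
    den = 0.0
    for size in range(p + 1):
        for subset in combinations(range(p), size):
            J = list(subset)
            b = np.zeros(p)  # excluded variables get coefficient zero
            if J:
                XJ = X[:, J]
                G = XJ.T @ XJ
                b[J] = np.linalg.solve(G, XJ.T @ y)  # OLS on subset J
                w = k ** (p - size) * np.linalg.det(G)
            else:
                w = float(k) ** p  # empty model: all coefficients zero
            num += w * b
            den += w
    return num / den


rng = np.random.default_rng(4)
n, p, k = 20, 3, 2.5
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

beta_subsets = ridge_from_subsets(X, y, k)
beta_ridge = np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)
print(beta_subsets)
print(beta_ridge)
```

As k grows the weight shifts toward small subsets, which is one way to see ridge shrinkage as averaging over sparser models.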

A brainteaser that I couldn't solve during an interview. Not really stats-related. (Oct 2019)

Don't use AUROC when evaluating predictive performance on rare events. The precision-recall curve handles multiple testing (through FDR) and better reflects the difficulty of the problem. See also a post by Jason Brownlee. (Jul 2019)
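A bit of arithmetic makes the point. The operating point (TPR 0.90, FPR 0.05) and the prevalences below are made-up numbers for illustration. The ROC curve passes through the same point at every prevalence, while precision (which is 1 - FDR) collapses as events get rarer.

```python
def precision(tpr, fpr, prevalence):
    """Precision (positive predictive value) implied by an ROC operating point."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


tpr, fpr = 0.90, 0.05  # one fixed point on the ROC curve (illustrative)
for prev in (0.5, 0.01, 0.001):
    ppv = precision(tpr, fpr, prev)
    print(f"prevalence={prev:<6} precision={ppv:.3f}  FDR={1 - ppv:.3f}")
```

With these numbers precision drops from about 0.95 at a prevalence of 0.5 to under 0.02 at a prevalence of 0.001, even though the ROC point has not moved.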

My second favorite statistical head-scratcher (after Stein's paradox) is the sub-optimality of thresholding procedures (partly because we found it). (Nov 2018)

The Alternating Conditional Expectation algorithm was proposed in Breiman and Friedman (1985). See a quick replication of the first example. I am surprised that I had never heard of it until now. (Feb 2018)

A useful variant of the Davis–Kahan theorem for statisticians, which bounds the distance between the subspace spanned by eigenvectors of a perturbed covariance matrix and that of its uncontaminated counterpart. Related is a variational characterization of distances between subspaces.
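For concreteness, here is a sketch of one common distance between subspaces, the Frobenius norm of the sines of the principal angles. I am assuming this is the kind of distance meant; the linked notes may use a different but equivalent form. The check uses the standard identity that this distance equals the Frobenius distance between the two projection matrices divided by sqrt(2). The matrices and the perturbation size are arbitrary.

```python
import numpy as np


def sin_theta_distance(V, V_hat):
    """Frobenius norm of sin(principal angles) between two column spaces.

    Both inputs must have orthonormal columns.
    """
    cosines = np.linalg.svd(V.T @ V_hat, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)  # guard against rounding above 1
    return float(np.sqrt(np.sum(1.0 - cosines ** 2)))


rng = np.random.default_rng(2)
p, d = 6, 2
B = rng.standard_normal((p, p))
Sigma = B @ B.T  # "clean" covariance matrix
E = rng.standard_normal((p, p))
Sigma_hat = Sigma + 0.05 * (E + E.T)  # symmetric perturbation

# eigh returns eigenvalues in ascending order: keep the top-d eigenvectors.
V = np.linalg.eigh(Sigma)[1][:, -d:]
V_hat = np.linalg.eigh(Sigma_hat)[1][:, -d:]

dist = sin_theta_distance(V, V_hat)
proj_gap = np.linalg.norm(V @ V.T - V_hat @ V_hat.T, "fro") / np.sqrt(2)
print(dist, proj_gap)
```

Working with projections or principal angles sidesteps the sign and rotation ambiguity of the eigenvectors themselves.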

An iterative method for Optimization on Stiefel manifolds by Hemant D. Tagare. This method happens to fail when optimizing tr(XX'S) over X in the Stiefel manifold when S is real symmetric (which is a common problem in statistics). See this review paper for a comparison of procedures for eigen-problems on data streams. One of the ideas is based on rank-one modification of the symmetric eigen-problem.
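For reference, the maximization version of this problem has a closed-form answer: by Ky Fan's theorem, tr(XX'S) = tr(X'SX) over orthonormal p x k matrices X is maximized by the top-k eigenvectors of S. The sketch below checks this against random points on the Stiefel manifold; it is not Tagare's iteration, and the dimensions are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
p, k = 6, 2
A = rng.standard_normal((p, p))
S = (A + A.T) / 2  # real symmetric, not necessarily positive definite

# Closed-form maximizer: eigenvectors of the k largest eigenvalues.
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
X_star = eigvecs[:, -k:]
best = float(np.trace(X_star.T @ S @ X_star))


def random_stiefel(p, k, rng):
    """A random p x k matrix with orthonormal columns (QR of a Gaussian)."""
    Q, _ = np.linalg.qr(rng.standard_normal((p, k)))
    return Q


trial_values = [
    float(np.trace(Q.T @ S @ Q))
    for Q in (random_stiefel(p, k, rng) for _ in range(1000))
]
print(best, max(trial_values))
```

Any iterative Stiefel optimizer can be sanity-checked against `best` in the same way.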

Some thoughts on trees and distance metrics. You can recover the tree structure from the distance matrix of its nodes. (see an R script)
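One way to do the recovery, sketched in Python rather than R, and not necessarily what the linked script does. I am assuming the distance matrix covers all nodes (internal ones included) and that edge lengths are positive. Two nodes are then adjacent exactly when no third node lies on the path between them. The example tree is made up.

```python
import numpy as np


def tree_distances(n, edges):
    """All-pairs path lengths for a weighted tree given as (i, j, length)."""
    D = np.full((n, n), np.inf)
    np.fill_diagonal(D, 0.0)
    for i, j, w in edges:
        D[i, j] = D[j, i] = w
    for k in range(n):  # Floyd-Warshall relaxation through node k
        D = np.minimum(D, D[:, [k]] + D[[k], :])
    return D


def recover_edges(D, tol=1e-9):
    """Pairs (i, j) with no third node k satisfying d(i,k) + d(k,j) = d(i,j)."""
    n = D.shape[0]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            on_path = any(
                abs(D[i, k] + D[k, j] - D[i, j]) < tol
                for k in range(n)
                if k != i and k != j
            )
            if not on_path:
                edges.add((i, j))
    return edges


# Illustrative tree on 5 nodes.
true_edges = [(0, 1, 1.0), (0, 2, 2.0), (1, 3, 1.5), (1, 4, 0.5)]
D = tree_distances(5, true_edges)
recovered = recover_edges(D)
print(sorted(recovered))
```

The edge lengths can then be read off as the corresponding entries of D.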

You cannot arbitrarily design full conditional distributions and expect them to be compatible (only when they are compatible can you use Brook's lemma). I was surprised by how often this is glossed over. (numerical example, theory)

There is little reason to unconditionally favor Fisher's method (sum of negative log p-values) over, say, Edgington's method (sum of p-values) for combining p-values (although it is better known that Fisher's method and Bonferroni's method are good at picking out different alternatives). In some cases of small and distributed effects, the latter wins; see numerical example, theory. In particular, Edgington's method is optimal when the p-values under the alternative are truncated exponentials.
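A sketch of the two combination rules for independent p-values, using only the standard library. The two example vectors are made up: one with a single very strong signal, one with small effects spread evenly. Fisher's combined p-value uses the closed-form chi-square tail for even degrees of freedom; Edgington's uses the Irwin-Hall distribution of a sum of uniforms.

```python
import math


def fisher_pvalue(pvals):
    """Fisher's method: -2 * sum(log p) is chi-square with 2n df under the null."""
    n = len(pvals)
    t = -sum(math.log(p) for p in pvals)  # half the chi-square statistic
    # Chi-square survival function with 2n df, closed form for even df.
    return math.exp(-t) * sum(t ** k / math.factorial(k) for k in range(n))


def edgington_pvalue(pvals):
    """Edgington's method: sum(p) follows the Irwin-Hall law under the null."""
    n = len(pvals)
    s = sum(pvals)
    # Irwin-Hall CDF at s; the alternating sum loses accuracy for large n.
    total = sum(
        (-1) ** k * math.comb(n, k) * (s - k) ** n
        for k in range(math.floor(s) + 1)
    )
    return total / math.factorial(n)


concentrated = [1e-6, 0.8, 0.8, 0.8, 0.8]  # one strong signal
diffuse = [0.2] * 5                        # small, evenly spread effects

for name, pv in (("concentrated", concentrated), ("diffuse", diffuse)):
    print(name, fisher_pvalue(pv), edgington_pvalue(pv))
```

On these inputs Fisher's method gives the smaller combined p-value for the concentrated signal and Edgington's method gives the smaller one for the diffuse signal, which matches the point above that neither rule dominates.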