Statistics 406 : Homework Solution 4


All Splus functions are in hw4.S, available by source("hw4.S").

4.1 IQR, alpha-fractile, and trimmed mean

Naive method generates all order statistic, either directly or by inversion. These methods need sorting generated random samples, but sorting cost is at least of order n*ln(n). This becomes very slow as the sample size n increases. Therefore other approaches such as sequential method or exponential spacings should be considered for this problem.

The theoretical IQR (which is 2) is obtained by solving next integral equation:

> solve(int(1/(Pi*(1+x^2)),x=-IQR/2..IQR/2)=1/2, IQR);
The theoretical trimmed mean is 0 as the p.d.f is symmetric about 0. For simulation, we should set the sample size n as well as simulation size N. I will default n=100 and N=1000 in functions. Three experiments are investigated.
  1. Naive direct generation -- iqr.cauchy.naive(), trim.cauchy.naive0(), trim.cauchy.naive()
  2. (Modified) Sequential Method -- iqr.cauchy.seq(), trim.cauchy.seq()
  3. (Modified) Exponential Spacings -- iqr.cauchy.exp(), trim.cauchy.exp()
Note that in modified sequential method, the correct algorithm is

Vi = Vi+1 * Beta( ki, ki+1 - ki ) , for i = p-1, ..., 1.

that is, the addition '+1' should be taken out in the second parameter. Therefore the third argument of the beta() is not 2 but 1. The class handout is mistaken. Be careful.

For generating random samples from cauchy distribution, see one claim in Afternote 9. Splus builtin function rcauchy() is used in all codes here. Few people use the fact that the Cauchy pdf is a special case of the tea-pdf with degrees of freedom one.

Since the simulation size is large, a program design with vectorized operation is an important issue. Specifically, we should avoid for() and apply() if possible when the algorithm takes large objects as arguments. I put some non-vectorized commands by # commenting in the beginning of lines. For example the line in iqr.cauchy.exp()

> v <- apply(w,2,cumsum)
should be replaced by
> v <- lower.tri(matrix(1,3,3),diag=T) %*% w
It's a technical issue but see how lower.tri works:
> lower.tri(matrix(1,3,3),diag=T)
     [,1] [,2] [,3] 
[1,]    T    F    F
[2,]    T    T    F
[3,]    T    T    T
Consult help(), help.start() and see help(lower.tri) for details. In general, matrix operations are much faster and reliable in Splus computation than iterative loop.

(a) IQR

> source("hw4.S')
> motif()
> lets.compare(seed=987)
            CPUtime Min. 1st Qu. Median Mean 3rd Qu. Max. 
     Naive 3.589999 1.18    1.81   2.02 2.05    2.25 3.45
Sequential 0.029999 1.24    1.82   2.05 2.06    2.25 3.46
    Exp.Sp 0.040001 1.23    1.82   2.04 2.06    2.25 3.36

> printgraph(file="hist1.eps")
> par(mfrow=c(1,1))
> boxplot(iqrout.n,iqrout.s,iqrout.e,names=c("naive","seqen","expsp"))
> printgraph(file="box1.eps")
The distributions are very much alike, The histograms (naive, sequential, exponential spacings from the top) show unimodal distribution centered around two with slight skewness to the right. The boxplot indicates several outliers on both tails. See the considerable difference of computational speed. This becomes conspicuous once we increase the sample size to n=1000.
> options(object.size=11000000)
> lets.compare(n=1000,N=1000,seed=987)
             CPUtime Min. 1st Qu. Median Mean 3rd Qu. Max. 
     Naive 24.070004 1.74    1.94   2.00 2.01    2.07 2.32
Sequential  0.049994 1.69    1.94   2.01 2.01    2.07 2.30
    Exp.Sp  0.060003 1.67    1.94   2.00 2.00    2.07 2.32
The naive method takes about six times longer while other methods take time less than twice. The histograms and the boxplot reveal that the distributions are more clustered around the center 2 but still slightly skewed to the right.

(b) Alpha fractile, trimmed mean

The comutational speed issue is much more serious in trimmed mean case as it needs to generate many order statistics at a time. First the naive case, I designed two functions:
  • trim.cauchy.naive0() uses mean(trim= ) option.
  • trim.cauchy.naive() uses sort() function and matrix operations for averaging.
  • I had expected the former one is quicker, but the reality was opposite.
    > cpu.time(out <- trim.cauchy.naive0(n=100,N=1000,seed=987))
    [1] 4.74
    > cpu.time(out <- trim.cauchy.naive(n=100,N=1000,seed=987))
    [1] 3.78
    
    The reason is the trim= option just add many Splus internal operations to mean(), which can be programmed rather efficiently outside (or even use C-subroutines). So I stick to trim.cauchy.naive() for comparison.
    > lets.compare2(n=100,N=1000,seed=987)
               CPUtime   Min. 1st Qu.    Median     Mean 3rd Qu.  Max. 
         Naive 3.81999 -0.520  -0.117 -5.53e-05  0.00324   0.116 0.589
    Sequential 1.51999 -0.593  -0.108  6.51e-03  0.00437   0.113 0.553
        Exp.Sp 0.89001 -0.691  -0.121 -1.01e-03 -0.00418   0.113 0.512
    
    The distributions are very much alike again. The histograms show unimodal symmetric distribution centered around zero. The boxplot indicates several outliers on both tails. Again, naive method takes time but the difference is not huge compared to the IQR case. Sequential method takes about 70 percent more time to exponential spacing. An interesting result comes out once we increase the sample size to n=1000.
    > lets.compare2(n=1000,N=1000,seed=987)
               CPUtime   Min. 1st Qu.    Median     Mean 3rd Qu.  Max. 
         Naive   24.30 -0.169 -0.0368  0.001130 0.001320  0.0405 0.189
    Sequential   16.03 -0.169 -0.0354  0.001280 0.000589  0.0362 0.196
        Exp.Sp  113.95 -0.165 -0.0365 -0.000726 0.000785  0.0371 0.182
    
    It is surprising how slow exponential spacing for computation
    > cpu.time(out <- trim.cauchy.exp(200,1000,0.2,seed=987))
    [1] 2.43
    > cpu.time(out <- trim.cauchy.exp(400,1000,0.2,seed=987))
    [1] 11.36
    
    So the order of time increment is more than the size of n2.
    CORRECTION : By close investigation, I discovered that the slow execution by the function trim.cauchy.exp() was simply by the matrix multiplication of the line:
    v <- lower.tri(matrix(1,p+1,p+1),diag=T) %*% w
    
    and by replacing it into more efficient code, exponential spacing is actually the quickest among the three! Since S+ is so slow in execution of large size of matrix, I made a C-code for cumulative sum of the matrix : colcumsum.c, and Splus calling function : colcumsum.q by dynamic loading. See how it works.
    > !Splus COMPILE colcumsum.c
    > dyn.load("colcumsum.o")
    > print(X <- matrix(1:20,nrow=4))
         [,1] [,2] [,3] [,4] [,5] 
    [1,]    1    5    9   13   17
    [2,]    2    6   10   14   18
    [3,]    3    7   11   15   19
    [4,]    4    8   12   16   20
    
    > matrix(colcumsum(X),nrow=nrow(X))
         [,1] [,2] [,3] [,4] [,5] 
    [1,]    1    5    9   13   17
    [2,]    3   11   19   27   35
    [3,]    6   18   30   42   54
    [4,]   10   26   42   58   74
    
    Now I modify the function in the original file and execute it.
    > options("object.size"=10e7)
    > lets.compare2(n=1000,N=1000,seed=987)
               CPUtime   Min. 1st Qu.    Median     Mean 3rd Qu.  Max.
         Naive   27.47 -0.169 -0.0368  0.001130 0.001320  0.0405 0.189
    Sequential   19.47 -0.169 -0.0354  0.001280 0.000589  0.0362 0.196
        Exp.Sp    7.82 -0.165 -0.0365 -0.000726 0.000785  0.0371 0.182
    
    The cpu time is a bit slower for Naive/Sequential as I ran it on a different machine. But the quantitative result is exactly the same. Notice the improvement for CPU on Exponential Spacing. This is truly based on the speed of "Spacing" algorithm, and now it is absolutely quicker.
    The histograms and the boxplot reveal that the distributions are more centered around the 0 almost symmetrically.

    Summary

    The generated sample distributions are alike among the methods. The results are consistent to the theoretical result. (Otherwise, why we study these methods?) Therefore, the comparative study mainly focuses on the speed of generation and the programming issues. Naive generation and inversion method needs sorts which costs computational time for large n. In IQR generation, exponential spacing is the fastest. In alpha-fractile, sequential method is the quickest but the difference to naive approch is minor. Exponential spacing is definitely inefficient in its speed. Algorithm can be much efficient by introducing vectorized calculation. Avoid for(), apply() if the iteration runs for long index run.

    Credit 80 points : Exact Result 10. Naive/Inversion Approach 20. Sequential Method 20. Exponential Spacings 20. Summary/Discussion 10.

    4.2 Variance Reduction Techniques

    We have great many possibilities for this problem, by combining several techniques altogether. However I will investigate only single technique at a time for fair comparison. Read hw42.ps for technical derivations and hw42maple.ps for maple output.
    Here is a summary for relative efficiencies.
    MethodExact Variance (*N)Relative Efficiency
    Naive 0.248*1001
    Symmetry 0.225*10-111.0
    Anal. Reexpression 0.475*10-15.21
    Anal. Reexpression with symmetry 0.225*10-111.0
    Change of Variable    : y=1/x 0.367*10-267.4
    : y=2/x2 0.114*10-2217
    : y=exp(-x2/2) 0.268*10-3925
    Importance Sampling : Uniform 0.249*10-19.95
    : Inverse Quadratic 0.367*10-267.4
    : Exponential 0.119*10-2209
    : Normal 0.125*10-2198
    Control Variable (N=10) 0.336*10-573667
    Antithetic Variable (Simple) 0.266*10-3932

    First we can easily reduce the variance by looking at the tail probability P(X > c), instead of P(0 < X < c). In our case, for symmery method, we can reduce the variance by (1-theta)*(1-2*(1-theta))/2, which is much smaller than the above result. Notice the remarkable efficiency with exponential or shifted Gaussian. The Antithetic variable method also achieve ten times larger efficiency than Naive method. From these, we expect further reduction by combination of these two.

    The Control variable method achieves highest efficiency. This is what is expected because the five terms polynomial approximation leads great accuracy in approximation. The bad news is its heavy load in computation.

    Practically, terms for contol variables are determined by the Taylor expansion. We may pick only first three terms of expansion -- constant, x2, and x4 to approximate exp(-x2/2). The variance reduction is still significant.

    Credit 80 points : Naive/Symmetry/AnalReex/Changeofv 20, Importance Sampling 10, Control Variables 20, Antithetic Variables 10, Exact/Summary/Discussion 20.