Introduction

The purpose of this how-to is to get you up and running with RUV as quickly and productively as possible. Its focus is therefore a script, ruv_starter_analysis, that runs several RUV analyses for you and formats the output as a web page, giving you a quick and easy "first look" at your data.

Some brief background: RUV is distributed as a collection of R packages. The packages are: ruv, ruv.extras, ruv.htmllatex, and several data packages. The ruv package contains the core statistical routines. The ruv.extras package includes routines for making plots and example scripts (including ruv_starter_analysis). The ruv.htmllatex package is a dependency of ruv.extras that generates html.

The core statistical routines provided by the ruv package are RUV-2, RUV-4, RUV-inv, RUV-rinv, and a few others. These routines are meant for use by you, the end-user, but they are not the focus of this how-to. For more information on these routines, please consult the standard R package documentation. For more information on the statistical methodology behind these routines, please consult the references at http://www-personal.umich.edu/~johanngb/ruv.

IMPORTANT: The analyses performed by ruv_starter_analysis should not be considered complete, final analyses, but rather as a starting point for further investigation. Moreover, the ruv_starter_analysis script should not be considered part of the "core" of RUV. It may be more properly thought of as a demo of what one can do with the ruv package (albeit an elaborate and particularly useful demo). Thus, you are encouraged to modify this script to fit your own individual needs (see "Going Further" below). Moreover, in future releases of RUV, this script may be modified, perhaps in a way that is not backwards-compatible, perhaps beyond recognition, and may even be omitted entirely. For this reason, ruv.extras is not on CRAN.

Example

The simplest usage of ruv_starter_analysis is as follows:
ruv_starter_analysis(Y, X, ctl)
All that is absolutely required is a data matrix Y, a factor of interest X, and a set of control features (control genes). For example, you could try running:
library(ruv.extras)
library(ruv.data.gender.sm)
data(gender.sm)
ruv_starter_analysis(gender.sm$Y, gender.sm$X, gender.sm$hkctl)
However, for our first example we will make use of a few additional features:
library(ruv.extras)
library(ruv.data.gender.sm)
data(gender.sm)
ruv_starter_analysis(gender.sm$Y, gender.sm$X, gender.sm$hkctl, 
                     pctl=gender.sm$pctl, geneinfo=gender.sm$geneinfo, kset=c(1,5,10),
                     outdir="gender1", webtitle="Gender Example 1")
Here, pctl is a set of positive controls, geneinfo is a matrix including gene names and chromosome numbers, kset is a set of the values of K that we wish to consider, outdir is the directory where the web page will be written, and webtitle is the html title. The webpage output by running the above commands looks like this: Gender Example 1

As you can see, the web page is divided up into 9 sections: General Info, Unadjusted, RUV-2 Combined Analyses, RUV-2 Individual Analyses, RUV-4 Combined Analyses, RUV-4 Individual Analyses, RUV-inv, RUV-rinv, and Projection Plot Table. Each of these sections can be hidden / unhidden by clicking on the section title heading. This makes navigating the web page a bit easier (since it can be quite big), and also allows you to more easily compare different analyses (e.g. by hiding the RUV-2 and RUV-4 analyses, the Unadjusted and RUV-inv analyses will be next to each other). Individual subsections (RLE plots, p-value plots, etc.) can be hidden as well.

Now let's go through and see what each of these sections has to offer.

General Info

The first information is how many samples, features (genes), and control features there are, respectively:

m = 84, n = 12600, n_c = 799

After this there are scree plots:

[Table of scree plots. Columns: Y, Y_c.]

On the left is a scree plot in which all features (genes) are considered; on the right is one in which only control features are used. Note that the logs of the eigenvalues are plotted, not the eigenvalues themselves. Note also that it is possible to hide rows / columns of the table by unchecking the boxes. This is not very helpful here, but it can be quite helpful for some of the larger tables later.
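
As a rough illustration of what such a scree plot contains, the following base R sketch draws the log eigenvalues for a simulated Y and for its control columns. It assumes the eigenvalues are the squared singular values of the uncentered matrix. The simulated data and the helper name log_eigenvalues are made up for this example, and ruv_starter_analysis may compute or scale things differently.

```r
# Simulated stand-ins for Y (m samples by n features) and ctl; not real data.
set.seed(1)
m <- 20
n <- 200
Y <- matrix(rnorm(m * n), nrow = m, ncol = n)
ctl <- rep(FALSE, n)
ctl[1:40] <- TRUE

# Hypothetical helper: log of the squared singular values of a matrix.
log_eigenvalues <- function(M) log(svd(M)$d^2)

scree_all <- log_eigenvalues(Y)                       # all features
scree_ctl <- log_eigenvalues(Y[, ctl, drop = FALSE])  # control features only

# Side-by-side scree plots: Y on the left, Y_c on the right.
op <- par(mfrow = c(1, 2))
plot(scree_all, type = "b", xlab = "Component", ylab = "log eigenvalue", main = "Y")
plot(scree_ctl, type = "b", xlab = "Component", ylab = "log eigenvalue", main = "Y_c")
par(op)
```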

After the scree plots come RLE plots:

[Table of RLE plots. Columns: Y, Y_c.]

Note that the vertical scale of the RLE plots is from -0.5 to 0.5. This default is used throughout, in order to make all of the RLE plots comparable. Usually this default is fine, but in this case it is not, because the (unnormalized) gender dataset has a huge amount of unwanted variation.
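
For reference, an RLE (relative log expression) plot is commonly built by subtracting each feature's median across samples and then drawing one boxplot per sample. The sketch below does this in base R on simulated data. It assumes Y is already on a log scale and that this common definition matches what the script plots. It also shows how a wider vertical range can be chosen when -0.5 to 0.5 is too narrow.

```r
# Simulated m-by-n data with a sample-specific shift acting as unwanted variation.
set.seed(2)
m <- 12
n <- 500
Y <- matrix(rnorm(m * n, mean = 8), nrow = m, ncol = n)
Y <- Y + rnorm(m, sd = 0.5)  # a length-m vector recycles down columns: one shift per sample

# Subtract each feature's median across samples (features are columns).
feature_medians <- apply(Y, 2, median)
rle_values <- sweep(Y, 2, feature_medians)

# boxplot() on a matrix draws one box per column, so transpose to get one box per sample.
op <- par(mfrow = c(1, 2))
boxplot(t(rle_values), ylim = c(-0.5, 0.5), outline = FALSE,
        main = "Fixed scale", xlab = "Sample", ylab = "RLE")
boxplot(t(rle_values), ylim = range(rle_values), outline = FALSE,
        main = "Data-driven scale", xlab = "Sample", ylab = "RLE")
par(op)
```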

After the RLE plots comes a canonical correlation plot.

[Canonical correlation plot.]

This plot shows, for each value of K, the square of the first canonical correlation between X and the first K left singular vectors of Y (black) and Y_c (green). If Y were just IID noise, we would expect these plots to form a diagonal line from (0,0) to (m,1). The fact that the curves lie above this line is evidence that either 1) the unwanted variation is systematically correlated with the factor of interest or 2) the negative control features (genes) are not good negative controls (and are in fact influenced by the factor of interest). However, the fact that the green curve stays relatively low when the black curve jumps up (around K = 20) suggests that the negative controls are relatively uninfluenced by X, and therefore probably are in fact good negative controls. Finally, by looking at this plot we can also conclude that we are probably safe to make K as large as 12 or so without worrying about strong correlation between W and X. Of course, we may still wish to make K even larger, if other evidence suggests that this is a good idea.
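
The curve described above can be sketched in base R. Because the singular vectors are orthonormal and X has a single column, the squared first canonical correlation (without centering) reduces to the fraction of the squared length of X captured by the first K singular vectors. The simulated data and the helper name squared_cancor are assumptions for this example, and the script itself may center the data or otherwise differ.

```r
# Simulated data in which X is unrelated to Y, so both curves should track the diagonal.
set.seed(2)
m <- 30
n <- 300
Y <- matrix(rnorm(m * n), nrow = m, ncol = n)
X <- matrix(rep(c(0, 1), each = m / 2), nrow = m, ncol = 1)
ctl <- rep(FALSE, n)
ctl[1:60] <- TRUE

# Hypothetical helper: squared canonical correlation between X and the
# first K left singular vectors of M, for K = 1, ..., m (no centering).
squared_cancor <- function(M, X) {
  W <- svd(M)$u                               # left singular vectors ("factors")
  loadings2 <- as.vector(crossprod(W, X))^2   # squared projection of X on each factor
  cumsum(loadings2) / sum(X^2)
}

cc_all <- squared_cancor(Y, X)
cc_ctl <- squared_cancor(Y[, ctl, drop = FALSE], X)

plot(seq_len(m), cc_all, type = "l", ylim = c(0, 1),
     xlab = "K", ylab = "Squared canonical correlation")
lines(seq_len(m), cc_ctl, col = "green")
abline(0, 1 / m, lty = 2)  # reference diagonal from (0,0) to (m,1)
```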

After the canonical correlation plot comes a table of principal component plots. An SVD is taken of Y (and also Y_c), and the left singular vectors are referred to as factors. We denote this matrix of factors by W_all (or W_ctl in the case of Y_c). In these plots, we plot one factor against another. This is helpful for seeing if there are any clusters or other interesting variation in the data, and also whether the variation in the control features (genes) is representative of the variation in the entire dataset. The factors of Y are plotted on the right, and the factors of Y_c are plotted on the left.

[Table of principal component plots.]

Note from the plots of Factor 1 vs Factor 2 that two clear clusters are visible. These clusters turn out to be the two chip types.
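
A plot of this kind can be reproduced in a few lines of base R. The sketch below simulates two hypothetical chip types, takes the SVD of Y and of Y_c, and plots Factor 1 against Factor 2 for each. The data, the chip variable, and the plotting choices are assumptions for illustration only.

```r
# Simulated data with a strong two-group (chip type) effect; not the gender data.
set.seed(3)
m <- 24
n <- 500
chip <- rep(c(1, 2), each = m / 2)   # hypothetical chip type per sample
chip_effect <- rnorm(n, sd = 2)      # per-feature size of the chip effect
Y <- matrix(rnorm(m * n), nrow = m, ncol = n) + outer(chip - 1, chip_effect)
ctl <- rep(FALSE, n)
ctl[1:100] <- TRUE

# Left singular vectors ("factors") of Y and of the control columns Y_c.
W_all <- svd(Y)$u
W_ctl <- svd(Y[, ctl, drop = FALSE])$u

# Factor 1 vs Factor 2, colored by chip type; Y_c on the left, Y on the right.
op <- par(mfrow = c(1, 2))
plot(W_ctl[, 1], W_ctl[, 2], col = chip, pch = 19,
     xlab = "Factor 1", ylab = "Factor 2", main = "Y_c")
plot(W_all[, 1], W_all[, 2], col = chip, pch = 19,
     xlab = "Factor 1", ylab = "Factor 2", main = "Y")
par(op)
```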

Finally, the General Info section concludes with a table of alpha plots. These are similar to the PC plots, except that instead of plotting the factors (columns of W) against one another, we plot the rows of alpha against one another. Here, alpha is equal to W_all'Y in the plots on the right, and W_ctl' Y in the plots on the left.

[Table of alpha plots.]

Note that the negative controls are plotted in green, the positive controls are plotted in purple, and everything else is plotted in gray (this is the default coloring scheme, but can be changed). These plots are useful for seeing if the control features (genes) look representative of the other features, or if there are any outliers or any other notable surprises in the data. In this case, the controls look fairly representative, and there don't appear to be any serious outliers or other surprises.
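
To make the definition of alpha concrete, here is a minimal base R sketch that computes alpha = W_all'Y on simulated data and plots its first two rows with the default coloring described above. The simulated Y, ctl, and pctl are assumptions for this example.

```r
# Simulated data; ctl and pctl are arbitrary here.
set.seed(4)
m <- 10
n <- 300
Y <- matrix(rnorm(m * n), nrow = m, ncol = n)
ctl <- rep(FALSE, n)
ctl[1:50] <- TRUE
pctl <- rep(FALSE, n)
pctl[51:55] <- TRUE

W_all <- svd(Y)$u             # m-by-m matrix of factors
alpha <- crossprod(W_all, Y)  # same as t(W_all) %*% Y; one row per factor, one column per feature

# Default-style coloring: negative controls green, positive controls purple, the rest gray.
genecolor <- ifelse(ctl, "green", ifelse(pctl, "purple", "gray"))

plot(alpha[1, ], alpha[2, ], col = genecolor, pch = 20,
     xlab = "alpha, row 1", ylab = "alpha, row 2")
```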

Unadjusted

The Unadjusted section provides RLE plots, plots of p-value distributions, variance plots, and tables of top-ranked features (genes) for an "unadjusted" analysis in which no RUV methods are applied.

The first plots are RLE plots. These plots differ from the RLE plots in the General Info section only in that here, the data has first been adjusted by known covariates Z, if in fact a Z matrix has been supplied.

Following the RLE plots is a table of p-value plots:

[Table of p-value plots: 3 rows (one per type of plot) by 5 columns (standard, ebayes, rsvar, rsvar ebayes, evar).]

P-values have been computed in 5 different ways: standard, ebayes (empirical Bayes, i.e. using the methods of limma), rsvar (rescaled variances), rsvar ebayes, and evar (empirical variances). Each of these 5 sets of p-values has been plotted in 3 different ways: first, as a histogram; second, by rank (effectively a qq plot); and third, by rank on a log-log scale.

A second table of p-value plots is then shown, in which only p-values of the negative controls are plotted. Ideally, the histograms will be flat, and the qq plots will be straight lines. The extent to which the histograms are not flat and the extent to which the qq plots are not straight lines give us some indication of how much unwanted variation is present in the data.
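
The three views can be sketched in base R. The example below computes ordinary per-feature regression p-values on simulated null data (an assumption about what "standard" means here) and draws the histogram, rank plot, and log-log rank plot, first for all features and then for the negative controls only. The helper name three_views is hypothetical.

```r
# Simulated null data: X has no effect on any feature.
set.seed(5)
m <- 20
n <- 1000
X <- rep(c(0, 1), each = m / 2)
Y <- matrix(rnorm(m * n), nrow = m, ncol = n)
ctl <- rep(FALSE, n)
ctl[1:200] <- TRUE

# Ordinary least squares for every feature at once (intercept plus X).
D <- cbind(intercept = 1, X = X)
fit <- lm.fit(D, Y)
betahat <- fit$coefficients[2, ]
sigma2 <- colSums(fit$residuals^2) / (m - ncol(D))
se <- sqrt(sigma2 * solve(crossprod(D))[2, 2])
p <- 2 * pt(-abs(betahat / se), df = m - ncol(D))

# Hypothetical helper: histogram, p-value by rank, and p-value by rank on a log-log scale.
three_views <- function(p, label) {
  hist(p, breaks = 20, main = paste(label, "histogram"), xlab = "p-value")
  plot(seq_along(p), sort(p), main = paste(label, "by rank"),
       xlab = "Rank", ylab = "p-value")
  plot(seq_along(p), sort(p), log = "xy", main = paste(label, "by rank (log-log)"),
       xlab = "Rank", ylab = "p-value")
}

op <- par(mfrow = c(2, 3))
three_views(p, "All features:")
three_views(p[ctl], "Negative controls:")
par(op)
```

With null data like this, the histograms should look roughly flat and the rank plots roughly straight.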

Following the p-value plots are variance plots.

[Table of variance plots. Columns: Standard Coloring, standard, ebayes, rsvar, rsvar ebayes, evar.]

In these plots, the squared betahat values are plotted against the estimates of sigma squared. The axes are transformed to be on a fourth-root scale. In addition, the estimated variance of betahat is plotted in black as a function of the estimate of sigma squared. There are six plots. In the first, the variance of betahat is estimated in the usual way, and the features (genes) are colored as they were in the alpha plots and the p-value plots. In the remaining five plots, the variance of betahat is estimated in several different ways, and the features are colored based on the rank of their p-values. The features with the 15 smallest p-values are colored purple, and their rank is plotted as a number. The next 25 most highly ranked features are plotted in blue; the next 35 in cyan; the next 75 in orange; and the next 150 in brown. These plots help us to visualize the consequences of the different methods of estimating the variance of betahat. Also, these plots may help us identify any features that have been "suspiciously" labelled as differentially expressed. For example, if a feature has a very small p-value, but is also an outlier with respect to its estimated value of sigma squared, we might be suspicious that the feature is not truly differentially expressed, but rather that its variance of betahat has simply been poorly estimated, or that the feature is influenced by unwanted variation that has not been properly adjusted for.
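
A single variance plot of the rank-colored kind can be sketched in base R as follows. The example uses ordinary regression estimates on simulated data, so the black curve is simply sigma squared times a constant. The other variance estimators (ebayes, rsvar, evar) are not reproduced here, and all variable names are assumptions for this example.

```r
# Simulated data and ordinary per-feature regression estimates.
set.seed(6)
m <- 20
n <- 1000
X <- rep(c(0, 1), each = m / 2)
Y <- matrix(rnorm(m * n), nrow = m, ncol = n)
D <- cbind(intercept = 1, X = X)
fit <- lm.fit(D, Y)
betahat <- fit$coefficients[2, ]
sigma2 <- colSums(fit$residuals^2) / (m - ncol(D))
xtx_inv <- solve(crossprod(D))[2, 2]  # var(betahat) = sigma2 * xtx_inv
p <- 2 * pt(-abs(betahat / sqrt(sigma2 * xtx_inv)), df = m - ncol(D))

# Color by p-value rank: top 15, next 25, next 35, next 75, next 150, everything else.
p_rank <- rank(p, ties.method = "first")
rank_color <- as.character(cut(p_rank,
  breaks = c(0, 15, 40, 75, 150, 300, Inf),
  labels = c("purple", "blue", "cyan", "orange", "brown", "gray")))

# Squared betahat against estimated sigma squared, both on a fourth-root scale.
plot(sigma2^(1 / 4), (betahat^2)^(1 / 4), col = rank_color, pch = 20,
     xlab = "sigma2 estimate (fourth-root scale)",
     ylab = "betahat squared (fourth-root scale)")

# Estimated variance of betahat as a function of sigma squared, in black.
s <- seq(min(sigma2), max(sigma2), length.out = 200)
lines(s^(1 / 4), (s * xtx_inv)^(1 / 4))

# Print the rank next to the 15 top-ranked features.
top <- p_rank <= 15
text(sigma2[top]^(1 / 4), (betahat[top]^2)^(1 / 4), labels = p_rank[top],
     pos = 3, cex = 0.7)
```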

After the variance plots is a set of tables showing how many positive controls are found in the top N most highly-ranked features (genes). By default, values of N = 20, 40, 60, 80, and 100 are shown, but this can be changed using the topcount_thresholds argument. (Note: If no positive controls are specified, this section will not exist.)

Method         N=20   N=40   N=60   N=80   N=100
standard          7      7      7      8      10
ebayes            6      7      7      7      11
rsvar             7      7      7      8      10
rsvar ebayes      6      7      7      7      11
evar              7      7      9     10      10

We see, for example, that when we calculate p-values the standard way, of the twenty genes with the smallest p-values, 7 are positive controls.
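
The counting itself is simple. A minimal base R sketch, with a hypothetical helper count_top_controls and toy inputs, looks like this:

```r
# Hypothetical helper: number of positive controls among the N smallest p-values.
count_top_controls <- function(p, pctl, thresholds = c(20, 40, 60, 80, 100)) {
  ord <- order(p)  # feature indices from smallest to largest p-value
  counts <- vapply(thresholds,
                   function(N) sum(pctl[ord[seq_len(N)]]),
                   integer(1))
  names(counts) <- thresholds
  counts
}

# Toy inputs: 100 features already sorted by p-value, 5 of them positive controls.
p <- seq(0.001, 0.1, length.out = 100)
pctl <- rep(FALSE, 100)
pctl[c(1, 2, 5, 30, 70)] <- TRUE

counts <- count_top_controls(p, pctl)
print(counts)
```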

Finally, the Unadjusted section concludes with tables of the top N most highly ranked features (genes). By default N is 40, but this can be changed using the topN argument. Tables are provided for each of the 5 methods of computing p-values.

The five tables appear in this order: standard, ebayes, rsvar, rsvar ebayes, evar.

standard
rank p BH beta chrom sym index
1 0.001 1 1.2 Y RPS4Y1 11296
2 0.02 1 0.82 Y DDX3Y 8410
3 0.07 1 -0.66 11 HBB 2045
4 0.1 1 -0.58 NA NA 1513
5 0.2 1 -0.55 11 HBB 1676
6 0.3 1 0.38 19 CIRBP 9932
7 0.3 1 0.36 Y KDM5D 7630
8 0.4 1 0.33 Y USP9Y 5916
9 0.4 1 0.34 2 GAD1 7226
10 0.4 1 0.28 X Y SLC25A6 10509
11 0.4 1 0.35 12 GAPDH 12563
12 0.4 1 -0.3 X XIST 8501
13 0.4 1 0.32 16 CRYM 8339
14 0.4 1 0.29 3 PFN2 8898
15 0.5 1 0.23 14 RPL36AL 9924
16 0.5 1 0.26 Y UTY 4494
17 0.5 1 0.21 4 SLC25A4 2823
18 0.5 1 0.21 NA NA 3856
19 0.5 1 0.23 6 FBXO9 9050
20 0.5 1 0.26 20 SNAP25 8539
21 0.5 1 0.22 2 RPL31 3685
22 0.5 1 0.3 7 ACTB 12559
23 0.5 1 0.24 3 RHOA 7354
24 0.5 1 0.25 5 RPS23 4820
25 0.5 1 0.3 14 CALM1 11225
26 0.5 1 0.24 6 RPL10A 6826
27 0.5 1 -0.26 4 SPP1 4358
28 0.5 1 -0.25 17 GFAP 10256
29 0.5 1 0.23 12 UBC 395
30 0.5 1 0.23 20 CHGB 3433
31 0.5 1 0.27 3 GAP43 7763
32 0.5 1 0.25 20 RPS21 2744
33 0.5 1 0.24 1 NMNAT2 2083
34 0.5 1 0.23 14 CALM1 11369
35 0.5 1 0.23 12 UBC 2330
36 0.5 1 0.24 12 RAN 912
37 0.5 1 0.28 17 PRKAR1A 1207
38 0.5 1 0.22 7 CYCS 5849
39 0.5 1 0.23 12 ATP2A2 9858
40 0.5 1 0.28 12 GAPDH 12565

ebayes
rank p BH beta chrom sym index
1 6e-04 1 1.2 Y RPS4Y1 11296
2 0.02 1 0.82 Y DDX3Y 8410
3 0.06 1 -0.66 11 HBB 2045
4 0.1 1 -0.58 NA NA 1513
5 0.1 1 -0.55 11 HBB 1676
6 0.3 1 0.38 19 CIRBP 9932
7 0.3 1 0.36 Y KDM5D 7630
8 0.3 1 0.35 12 GAPDH 12563
9 0.3 1 0.34 2 GAD1 7226
10 0.3 1 0.33 Y USP9Y 5916
11 0.4 1 0.32 16 CRYM 8339
12 0.4 1 -0.3 X XIST 8501
13 0.4 1 0.3 14 CALM1 11225
14 0.4 1 0.3 7 ACTB 12559
15 0.4 1 0.29 3 PFN2 8898
16 0.4 1 0.28 17 PRKAR1A 1207
17 0.4 1 0.28 7 ACTB 12557
18 0.4 1 0.28 X Y SLC25A6 10509
19 0.4 1 0.28 12 GAPDH 12565
20 0.4 1 0.28 20 GNAS 7496
21 0.4 1 0.27 3 GAP43 7763
22 0.5 1 0.26 20 GNAS 7495
23 0.5 1 -0.26 4 SPP1 4358
24 0.5 1 0.26 20 SNAP25 8539
25 0.5 1 0.26 Y UTY 4494
26 0.5 1 0.25 19 SLC17A7 6605
27 0.5 1 0.25 5 RPS23 4820
28 0.5 1 0.25 20 RPS21 2744
29 0.5 1 0.25 6 HSP90AB1 178
30 0.5 1 0.25 18 ATP5A1 10166
31 0.5 1 -0.25 17 GFAP 10256
32 0.5 1 0.24 3 RHOA 7354
33 0.5 1 0.24 8 YWHAZ 256
34 0.5 1 0.24 6 RPL10A 6826
35 0.5 1 0.24 13 DCLK1 9017
36 0.5 1 0.24 12 RAN 912
37 0.5 1 0.24 1 NMNAT2 2083
38 0.5 1 0.23 12 UBC 395
39 0.5 1 0.23 22 ATXN10 9752
40 0.5 1 0.23 12 ATP2A2 9858

rsvar
rank p BH beta chrom sym index
1 3e-24 3e-20 1.2 Y RPS4Y1 11296
2 3e-16 2e-12 0.82 Y DDX3Y 8410
3 6e-12 2e-08 -0.66 11 HBB 2045
4 1e-09 4e-06 -0.58 NA NA 1513
5 2e-08 6e-05 -0.55 11 HBB 1676
6 1e-05 0.03 0.38 19 CIRBP 9932
7 2e-05 0.04 0.36 Y KDM5D 7630
8 2e-04 0.2 0.33 Y USP9Y 5916
9 3e-04 0.4 0.34 2 GAD1 7226
10 4e-04 0.6 0.28 X Y SLC25A6 10509
11 5e-04 0.6 0.35 12 GAPDH 12563
12 5e-04 0.6 -0.3 X XIST 8501
13 7e-04 0.7 0.32 16 CRYM 8339
14 0.001 1 0.29 3 PFN2 8898
15 0.002 1 0.23 14 RPL36AL 9924
16 0.002 1 0.26 Y UTY 4494
17 0.002 1 0.21 4 SLC25A4 2823
18 0.003 1 0.21 NA NA 3856
19 0.003 1 0.23 6 FBXO9 9050
20 0.003 1 0.26 20 SNAP25 8539
21 0.003 1 0.22 2 RPL31 3685
22 0.003 1 0.3 7 ACTB 12559
23 0.004 1 0.24 3 RHOA 7354
24 0.004 1 0.25 5 RPS23 4820
25 0.004 1 0.3 14 CALM1 11225
26 0.004 1 0.24 6 RPL10A 6826
27 0.004 1 -0.26 4 SPP1 4358
28 0.004 1 -0.25 17 GFAP 10256
29 0.005 1 0.23 12 UBC 395
30 0.005 1 0.23 20 CHGB 3433
31 0.005 1 0.27 3 GAP43 7763
32 0.006 1 0.25 20 RPS21 2744
33 0.006 1 0.24 1 NMNAT2 2083
34 0.006 1 0.23 14 CALM1 11369
35 0.006 1 0.23 12 UBC 2330
36 0.006 1 0.24 12 RAN 912
37 0.007 1 0.28 17 PRKAR1A 1207
38 0.007 1 0.22 7 CYCS 5849
39 0.007 1 0.23 12 ATP2A2 9858
40 0.008 1 0.28 12 GAPDH 12565

rsvar ebayes
rank p BH beta chrom sym index
1 1e-45 1e-41 1.2 Y RPS4Y1 11296
2 9e-22 6e-18 0.82 Y DDX3Y 8410
3 1e-14 5e-11 -0.66 11 HBB 2045
4 6e-12 2e-08 -0.58 NA NA 1513
5 7e-11 2e-07 -0.55 11 HBB 1676
6 8e-06 0.02 0.38 19 CIRBP 9932
7 2e-05 0.04 0.36 Y KDM5D 7630
8 5e-05 0.07 0.35 12 GAPDH 12563
9 7e-05 0.1 0.34 2 GAD1 7226
10 9e-05 0.1 0.33 Y USP9Y 5916
11 1e-04 0.2 0.32 16 CRYM 8339
12 4e-04 0.4 -0.3 X XIST 8501
13 5e-04 0.4 0.3 14 CALM1 11225
14 5e-04 0.4 0.3 7 ACTB 12559
15 7e-04 0.6 0.29 3 PFN2 8898
16 9e-04 0.6 0.28 17 PRKAR1A 1207
17 0.001 0.6 0.28 7 ACTB 12557
18 0.001 0.6 0.28 X Y SLC25A6 10509
19 0.001 0.6 0.28 12 GAPDH 12565
20 0.001 0.7 0.28 20 GNAS 7496
21 0.001 0.8 0.27 3 GAP43 7763
22 0.002 1 0.26 20 GNAS 7495
23 0.002 1 -0.26 4 SPP1 4358
24 0.002 1 0.26 20 SNAP25 8539
25 0.003 1 0.26 Y UTY 4494
26 0.003 1 0.25 19 SLC17A7 6605
27 0.003 1 0.25 5 RPS23 4820
28 0.004 1 0.25 20 RPS21 2744
29 0.004 1 0.25 6 HSP90AB1 178
30 0.004 1 0.25 18 ATP5A1 10166
31 0.004 1 -0.25 17 GFAP 10256
32 0.004 1 0.24 3 RHOA 7354
33 0.005 1 0.24 8 YWHAZ 256
34 0.005 1 0.24 6 RPL10A 6826
35 0.005 1 0.24 13 DCLK1 9017
36 0.005 1 0.24 12 RAN 912
37 0.006 1 0.24 1 NMNAT2 2083
38 0.006 1 0.23 12 UBC 395
39 0.006 1 0.23 22 ATXN10 9752
40 0.006 1 0.23 12 ATP2A2 9858

evar
rank p BH beta chrom sym index
1 5e-276 6e-272 1.2 Y RPS4Y1 11296
2 8e-128 5e-124 0.82 Y DDX3Y 8410
3 1e-83 4e-80 -0.66 11 HBB 2045
4 3e-29 9e-26 0.38 19 CIRBP 9932
5 2e-26 5e-23 0.36 Y KDM5D 7630
6 3e-25 6e-22 -0.58 NA NA 1513
7 6e-21 1e-17 0.33 Y USP9Y 5916
8 7e-19 1e-15 -0.3 X XIST 8501
9 1e-16 2e-13 0.28 X Y SLC25A6 10509
10 5e-14 6e-11 0.26 Y UTY 4494
11 3e-13 3e-10 0.26 20 SNAP25 8539
12 5e-13 5e-10 -0.25 17 GFAP 10256
13 7e-13 7e-10 0.24 3 RHOA 7354
14 2e-12 2e-09 0.25 5 RPS23 4820
15 2e-12 2e-09 0.24 6 RPL10A 6826
16 3e-12 3e-09 0.24 1 NMNAT2 2083
17 4e-12 3e-09 0.23 12 UBC 395
18 6e-12 4e-09 0.23 12 UBC 2330
19 1e-11 9e-09 0.23 20 CHGB 3433
20 1e-11 9e-09 0.23 14 RPL36AL 9924
21 2e-11 1e-08 0.23 6 FBXO9 9050
22 3e-11 2e-08 0.23 14 CALM1 11369
23 5e-11 3e-08 0.22 2 RPL31 3685
24 6e-11 3e-08 0.22 7 CYCS 5849
25 1e-10 6e-08 0.22 NA NA 1450
26 3e-10 1e-07 0.21 1 GABRD 5121
27 4e-10 2e-07 0.21 NA NA 3856
28 5e-10 2e-07 0.21 2 NCL 2588
29 7e-10 3e-07 0.21 4 SLC25A4 2823
30 7e-10 3e-07 -0.55 11 HBB 1676
31 4e-09 1e-06 0.2 17 ATP5H 5790
32 6e-09 3e-06 0.2 2 ATP5G3 4832
33 8e-09 3e-06 0.2 NA NA 11297
34 9e-09 3e-06 0.19 19 GPX4 3943
35 1e-08 4e-06 0.19 10 ATP5C1 10186
36 2e-08 8e-06 0.19 19 CA11 4292
37 3e-08 1e-05 0.19 6 EEF1A1 10965
38 3e-08 1e-05 0.19 12 HNRNPA1 10283
39 4e-08 1e-05 0.19 6 ATP6V1G2 3036
40 5e-08 1e-05 0.19 5 HINT1 10

You may choose to hide some of the tables to make side-by-side comparisons of two tables. Each table includes a p-value, an FDR-adjusted p-value (BH), and any information included in the "geneinfo" matrix. The index column tells us the index of the feature (gene).
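
A table with these columns can be assembled in base R. The sketch below assumes that BH refers to the Benjamini-Hochberg adjustment as implemented by p.adjust. The toy p-values, effect sizes, and gene names are made up for this example.

```r
# Toy inputs for four features.
p <- c(0.01, 0.04, 0.03, 0.002)
betahat <- c(0.8, -0.2, 0.3, 1.2)
sym <- c("geneA", "geneB", "geneC", "geneD")  # hypothetical gene names

# Benjamini-Hochberg adjusted p-values, computed over all features.
bh <- p.adjust(p, method = "BH")

# Order features by p-value and build the table; index is the original position.
ord <- order(p)
top_table <- data.frame(
  rank = seq_along(ord),
  p = p[ord],
  BH = bh[ord],
  beta = betahat[ord],
  sym = sym[ord],
  index = ord
)
print(top_table)
```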

If you click on an entry of the table, you will get the results of a Google search for that entry. This is useful for quickly looking up the genes.

RUV-2 Combined Analyses

The RUV-2 Combined Analyses section is intended to help you choose a good value of K. This section includes two tables of plots. The first shows the number of top-ranked positive controls as a function of K:

[Table of plots of the number of top-ranked positive controls as a function of K. Columns: standard, ebayes, rsvar, rsvar ebayes, evar.]

Of course, if no positive controls are supplied, this table won't exist.

The second table of plots (not shown) includes RLE plots, a projection plot, and an extensive variety of p-value plots for each value of K. These plots are all duplicated in the RUV-2 Individual Analyses section (below), but are included here in one giant table so that an easy comparison between different values of K can be made. It will be very helpful to hide various rows / columns when viewing this table.

RUV-2 Individual Analyses

There is an individual analysis for each value of K. Each individual analysis is similar to the Unadjusted analysis, except that it also contains a table of projection plots:

[Table of projection plots. Columns: Standard Coloring, standard, ebayes, rsvar, rsvar ebayes, evar.]

The layout of these plots is analogous to that of the variance plots. In the first plot the features (genes) are colored as they were in the alpha plots and p-value plots. In the remaining 5 plots, the features are colored according to their rank, just as with the variance plots.

RUV-4 Combined and Individual Analyses

These are just like the RUV-2 Combined and Individual Analyses.

RUV-inv and RUV-rinv

These are just like the RUV-2 / RUV-4 individual analyses, except that there are no RLE plots.

Projection Plot Table

TODO: Describe Projection plot table

Additional Options

The full list of arguments to ruv_starter_analysis is as follows:

Y

The data. An m by n matrix, where m is the number of samples and n is the number of features.

X

The factor of interest. An m by 1 matrix, where m is the number of samples.

ctl

The negative controls. A logical vector of length n.

Z

Any additional covariates to include in the model. Either an m by q matrix of covariates, or simply 1 (the default) for an intercept term.

eta

Gene-wise (as opposed to sample-wise) covariates. These covariates are adjusted for by RUV-1 before any further analysis proceeds. A matrix with n columns.

pctl

Positive controls. A logical vector of length n.

genecoloring

A vector of length n. The colors to use when plotting genes.

samplecoloring

A vector of length m. The colors to use when plotting samples.

genetexts

A vector of length n. Any text to be used in place of symbols, when plotting genes. Elements that are NA are plotted as symbols.

sampletexts

A vector of length m. Any text to be used in place of symbols, when plotting samples. Elements that are NA are plotted as symbols.

genesymbols

A vector of length n. The plot symbols to use when plotting genes.

samplesymbols

A vector of length m. The plot symbols to use when plotting samples.

geneinfo

A matrix with n rows. Each column should contain some information about the genes (such as their names) for use in tables.

rankbybeta

Should the analysis include a ranking of the features based on the absolute value of the estimated effect size (betahat)?

topN

The number of top-ranked genes to include in tables.

topcount_thresholds

The thresholds to use when counting the number of top-ranked positive controls.

rankset

The genes to be considered when determining which are top-ranked. A logical vector. NULL implies all genes.

kset

Which values of K should be considered.

factorset

Which factors should be included in the projection plot table.

bin

The bin size in the method of empirical variances.

do_general

Should the "general" analysis be performed?

do_unadjusted

Should the "unadjusted" analysis be performed?

do_ruv2

Should the RUV-2 analysis be performed?

do_ruv4

Should the RUV-4 analysis be performed?

do_ruvinv

Should the RUV-inv analysis be performed?

do_ruvrinv

Should the RUV-rinv analysis be performed?

do_pptable

Should the factor projection plot table be created?

outdir

Directory where the web page should be written.

initialize_collapsed

Should the web page be created so that only headers are shown, and must be manually expanded?

webtitle

The title of the web page.

inputcheck

Perform a basic sanity check on the inputs, and issue a warning if there is a problem.

verbose

Verbose output.
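
As a quick illustration of the shapes described in this list, the following base R sketch builds simulated inputs of the right form and checks them. The data are made up, and placing chromosome in the first geneinfo column and gene name in the second simply mirrors the gender example; it is an assumption, not a requirement stated here.

```r
# Simulated inputs with m samples and n features.
set.seed(7)
m <- 8
n <- 50
Y <- matrix(rnorm(m * n), nrow = m, ncol = n)                # m by n data matrix
X <- matrix(rep(c(0, 1), each = m / 2), nrow = m, ncol = 1)  # m by 1 factor of interest
ctl <- rep(FALSE, n)                                         # negative controls, logical, length n
ctl[1:10] <- TRUE
pctl <- rep(FALSE, n)                                        # positive controls, logical, length n
pctl[11:13] <- TRUE
geneinfo <- cbind(chrom = rep("1", n),                       # matrix with n rows
                  sym = paste0("gene", seq_len(n)))

# Basic shape checks matching the argument descriptions.
stopifnot(
  is.matrix(Y),
  is.matrix(X), nrow(X) == nrow(Y), ncol(X) == 1,
  is.logical(ctl), length(ctl) == ncol(Y),
  is.logical(pctl), length(pctl) == ncol(Y),
  is.matrix(geneinfo), nrow(geneinfo) == ncol(Y)
)
```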

A few of these arguments warrant further comment.

First note that X must consist of only a single column. Although RUV-2, RUV-4, etc. support an X with more than one column, ruv_starter_analysis does not. This is because many of the plots (e.g. p-value histograms, projection plots) only make sense in the context of a single-column X. If you have several factors of interest, the easiest way to handle this situation is to run ruv_starter_analysis several times, each time setting X to be just one of the factors of interest. If desired, the remaining factors of interest can be included in the model by including them in the Z matrix.
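
One way to organize such repeated runs is sketched below in base R. For each factor of interest, it builds a single-column X and a Z that holds the remaining factors. Adding an explicit intercept column to Z is an assumption of this sketch (check how your version of the script treats a user-supplied Z), and the factor names, output directories, and titles are made up.

```r
# Two hypothetical factors of interest for m samples.
set.seed(8)
m <- 12
factors <- cbind(treatment = rep(c(0, 1), each = m / 2),
                 age = rnorm(m))

# One set of arguments per factor: that factor becomes X, the others go into Z.
runs <- lapply(seq_len(ncol(factors)), function(j) {
  list(
    X = factors[, j, drop = FALSE],                         # m by 1 matrix
    Z = cbind(intercept = 1, factors[, -j, drop = FALSE]),  # assumed: explicit intercept
    outdir = paste0("run_", colnames(factors)[j]),
    webtitle = paste("Factor of interest:", colnames(factors)[j])
  )
})

# With Y and ctl available and ruv.extras loaded, each run could then be started with:
# for (a in runs) {
#   ruv_starter_analysis(Y, a$X, ctl, Z = a$Z, outdir = a$outdir, webtitle = a$webtitle)
# }
```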

The do_unadjusted, do_ruv2, do_ruv4, do_ruvinv, do_ruvrinv, and do_pptable arguments can be set to FALSE in order to omit these sections of the analysis and speed up execution.

The genecoloring and samplecoloring arguments can be used to specify the colors used in the plots. If there are any NAs in the coloring vector, those samples / features will be plotted in light gray. If a coloring vector is not specified at all, by default, all features are plotted in light gray except for negative controls, which are plotted in green, and positive controls (if they are given), which are plotted in purple. NOTE: The plots are done in a special way, so that points of "rare" colors are plotted last, to ensure they are visible. So, for example, if there are 10,000 gray points, 1,000 green points, and 100 purple points, all 10,000 gray points will be plotted first, then all 1,000 green points, and finally all 100 purple points.
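
The "rare colors last" idea can be sketched in base R by ordering the points by how often their color occurs. This is only an illustration of the principle on simulated points, not the package's actual plotting code, and it does not handle NA colors.

```r
# Simulated points: 10,000 gray, 1,000 green, and 100 purple, in random order.
set.seed(9)
cols <- sample(c(rep("gray", 10000), rep("green", 1000), rep("purple", 100)))
x <- rnorm(length(cols))
y <- rnorm(length(cols))

# Frequency of each color, looked up for every point.
freq <- table(cols)
point_freq <- as.vector(freq[cols])

# Most common colors first, rarest last, so rare points are drawn on top.
ord <- order(-point_freq)
plot(x[ord], y[ord], col = cols[ord], pch = 20, xlab = "x", ylab = "y")
```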

The genesymbols and samplesymbols arguments can be used to specify the symbols used in the plots. If there are any NAs in the symbol vector, those samples / features will be plotted by default as a circle.

The genetexts and sampletexts arguments can be used to specify text that should be plotted instead of a symbol. If there are any NAs in the text vector, those samples / features will be plotted by a symbol instead.

The initialize_collapsed argument can be used to create the web page so that all of the plots / tables are initially hidden. This is particularly useful if the web page will actually be posted on a web server and viewed over the internet, since the page can then load much more quickly.

To see some of these features in action, consider a second example using the gender data:

library(ruv.extras)
library(ruv.data.gender.sm)
data(gender.sm)
genetexts = rep(NA,ncol(gender.sm$Y))
ygenes = which(gender.sm$geneinfo[,1]=="Y")
genetexts[ygenes] = gender.sm$geneinfo[ygenes,2]
ruv_starter_analysis(gender.sm$Y, gender.sm$X, gender.sm$hkctl, 
                     pctl=gender.sm$pctl, geneinfo=gender.sm$geneinfo, kset=c(1,10),
                     genecoloring = gender.sm$genecoloring, samplecoloring=gender.sm$samplecoloring,
                     samplesymbols = gender.sm$X + 1,
                     genetexts = genetexts,
                     do_unadjusted = FALSE, do_ruv2 = FALSE, do_ruvinv = FALSE, do_ruvrinv = FALSE,
                     outdir="gender2", webtitle="Gender Example 2")
The output looks like this: Gender Example 2

In this example, samples are colored by lab / chiptype: Red -- site A, HG-U95A; yellow -- site A, HG-U95Av2; black -- site B, HG-U95A; gray -- site B, HG-U95Av2; cyan -- site C, HG-U95Av2. Males are plotted as triangles, and females are plotted as circles (see PC Plots). Genes are colored as follows: Green -- negative controls; pink -- on X chromosome; blue -- on Y chromosome; purple -- on X and Y chromosomes; gray -- everything else. Moreover, genes from the Y chromosome are plotted using their gene names instead of the standard circle symbol.

Going Further

Eventually, you will probably want to modify the plots in various ways, generate plots of your own, or simply want to know in more detail what ruv_starter_analysis does. To help you, there are 4 files in the my_ruv sub-folder of this how-to:

my_ruv.R

my_ruv_simpler.R

my_ruv_simplest.R

my_ruv_plots_and_tables.R

The file my_ruv_plots_and_tables.R contains all of the plot routines in the ruv.extras package, but the names of the routines have been given the prefix "my_". For example, "ruv_scree" is renamed "my_ruv_scree." Therefore, you can easily edit the routines in any way you wish, source the file, and then use your version of the plot routines simply by adding the prefix "my_" in any of the code that calls the routines.

The file my_ruv.R is similar in nature. This file contains the script my_ruv_starter_analysis (and supporting subscripts). The only difference is the prefix "my_". Thus, if you source the files my_ruv_plots_and_tables.R and my_ruv.R you will have all of the functionality of the ruv.extras package, except that all the routines now have a "my_" prefix. Of course, you can now edit these files however you like.

my_ruv.R is a rather complicated file, especially when you first look at it. Thus, before tackling this file, it is recommended that you first examine the file my_ruv_simplest.R. This script also contains a version of my_ruv_starter_analysis, but it is greatly simplified. This version does not create a web page. Instead, it simply outputs text and plots to the screen. Also, some of the less important options have been omitted. This script is great for understanding what my_ruv_starter_analysis does. It is also a great script to edit when you wish to create your own analysis.

Finally, my_ruv_simpler.R is somewhere in between. This version does not create a web page, but it does at least save the plots to disk. Once you have an understanding of my_ruv_simplest.R you may wish to examine this file, either as a next step in understanding my_ruv.R, or simply as a convenient way to save any plots you create to disk.