Statistics in Seattle

Sunday, June 1, 2014

Weekend at Bernoulli's

I have been doing some background reading on group sequential trials for an upcoming project. One of the papers I came across in terms of the required numerical methods is "Repeated Significance Tests on Accumulating Data" (Armitage, et al, 1969). In this paper, the scenario is considered of observing an outcome and then testing whether this outcome is significant with respect to a null hypothesis. If the outcome is not significant, we accrue an additional observation and consider whether the sum of the two is significant. This process continues until either the cumulative sum is extreme enough such that the null hypothesis is rejected or the maximal sample size is attained. For my work, I am most interested in the normal distribution, but Armitage, et al tabulated results for binomial, normal and exponential distributions in their paper. (To activate 'read-along' mode for what follows, I suggest doing a quick search for the paper online.)

For the binomial distribution, Armitage et al tabulated exact calculations of rejection probabilities in their Table 1. Since I have too much free time (or want to do something easier than my homework), I decided to try reproducing the aforementioned Table 1 by simulation in R.

The scenario we consider is that for a selection of maximal sample sizes from $N =$ 10 to 150, we are interested in seeing how frequently under the null hypothesis $p = p_0 = \frac{1}{2}$ we reject the null hypothesis (make a type I error) at or before the maximal sample size when the nominal level/type I error rate is equal to $2 \alpha$. At each interim analysis $j = 1, \ldots, N$, we consider the sum $S_j$ of our binary outcomes and compare $S_j$ to the upper and lower $\alpha$-th percentiles of the Binomial$(j,p_0)$ distribution. If $S_j$ is more extreme than either bound, we reject the null hypothesis and conclude that the true proportion $p$ is not equal to $\frac{1}{2}$.

Averaging the results of 10,000 simulations, nearly all of my values agree with the first significant figure of the exact calculations in Table 1. Beyond that there are no guarantees of concordance, but that is to be expected in simulations. My code is far from optimized, so I apologize if you decide to run it for all 10,000 simulations yourself. It took somewhere between 30-60 minutes to run on my laptop.

Here's the code:

## RepSigTests.R
## 6/1/2014

## Exploring Armitage, McPherson and Rowe (1969)
## Repeated Significance Tests on Accumulating Data

## For Binomial Distributions

# We fix the sequences of max sample sizes and 2*alpha, the nominal level
n = c(seq(10,50,5),seq(60,100,10),c(120,140,150))
talpha = seq(.01,.05,.01)

# We'll consider sequential sampling under the null hypothesis H0: p = 0.5
p0 = 1/2

## After each new observation at time m of Xm = 0 or 1,
## we compare the cumulative sum Sm to bounds depending on twice alpha:
# am = qbinom(p = talpha[1]/2, size = n[1], prob = p0, lower.tail = TRUE)
# bm = qbinom(p = talpha[1]/2, size = n[1], prob = p0, lower.tail = FALSE)
## If Sm < am or Sm > bm, then we reject p = p0.

## The following function carries this process, starting with 1 observation
## and continuing the sequential analysis until either Sm is outside of the
## bounds (am,bm) or m is at least max.
seq.binom = function(talpha,p0,max){
Sm = 0
m = 0
continue = TRUE
while (continue) {
m = m + 1
am = qbinom(p = talpha/2, size = m, prob = p0, lower.tail = TRUE)
bm = qbinom(p = talpha/2, size = m, prob = p0, lower.tail = FALSE)
Xm = rbinom(n = 1, size = 1, prob = p0)
Sm = Sm + Xm
if ( Sm < am || Sm > bm || m > max) {continue = FALSE}
}
return(c(M=m,aM=am,Sum=Sm,bM=bm))
}

## This next function checks whether we rejected p = p0 or failed to reject
## at sample size m (the mth analysis).
check.binom = function(talpha,p0,max){
res = seq.binom(talpha,p0,max)
reject = res[2] > res[3] || res[3] > res[4]
}

## Next we reproduce Table 1 of Armitage, McPherson and Rowe (1969)
## using our simulations rather than exact calculations of binomial probabilities.
## First, we set the number of repetitions/seed for simulation:
B = 10000
set.seed(1990)

## Table 1
# "The probability of being absorbed at or before the nth observation in
# binomial sampling with repeated tests at a nominal two-sided significance
# level 2*alpha."

## This takes a while to run. The inner loop could be removed if the seq.binom()
## were to be written with a comparison of m to each maximal sample size in n.
tab1 = matrix(0,length(n),length(talpha))
for (j in 1:length(talpha)){
for (i in 1:length(n)){
tab1[i,j] = sum(replicate(B, check.binom(talpha[j],p0,n[i])))/B
}
}

#> tab1
#[,1] [,2] [,3] [,4] [,5]
#[1,] 0.0076 0.0215 0.0277 0.0522 0.0566
#[2,] 0.0144 0.0297 0.0454 0.0827 0.0820
#[3,] 0.0193 0.0374 0.0593 0.0911 0.1003
#[4,] 0.0272 0.0507 0.0700 0.1092 0.1218
#[5,] 0.0268 0.0550 0.0841 0.1214 0.1368
#[6,] 0.0305 0.0611 0.0934 0.1323 0.1397
#[7,] 0.0331 0.0648 0.0982 0.1284 0.1555
#[8,] 0.0374 0.0655 0.1007 0.1470 0.1657
#[9,] 0.0388 0.0715 0.1056 0.1520 0.1734
#[10,] 0.0452 0.0842 0.1154 0.1645 0.1905
#[11,] 0.0496 0.0814 0.1260 0.1812 0.2001
#[12,] 0.0521 0.0970 0.1336 0.1830 0.2206
#[13,] 0.0555 0.0994 0.1355 0.1964 0.2181
#[14,] 0.0557 0.1039 0.1508 0.2037 0.2268
#[15,] 0.0567 0.1105 0.1577 0.2094 0.2362
#[16,] 0.0636 0.1226 0.1666 0.2235 0.2534
#[17,] 0.0664 0.1181 0.1692 0.2310 0.2635

## Because everyone loves a good heatmap:
## Credit goes to this stack exchange thread for the labels:
## http://stackoverflow.com/questions/17538830/x-axis-and-y-axis-labels-in-pheatmap-in-r
library(pheatmap)
library(grid)
rownames(tab1) = n
colnames(tab1) = talpha

setHook("grid.newpage", function() pushViewport(viewport(x=1,y=1,width=0.9, height=0.9, name="vp", just=c("right","top"))), action="prepend")

pheatmap(
tab1,
cluster_cols=FALSE,
cluster_rows=FALSE,
show_rownames=T,
show_colnames=T,
border_color = "white",
main = "Table 1: Sequential Binomial Stopping Probabilities",
)

setHook("grid.newpage", NULL, "replace")
grid.text("Nominal Level (2*Alpha)", y=-0.07, gp=gpar(fontsize=16))
grid.text("Maximal Sample Size", x=-0.07, rot=90, gp=gpar(fontsize=16))

And of course, here's the heatmap:

Reference
P. Armitage, C. K. McPherson, B. C. Rowe (1969). Repeated Significance Tests on Accumulating Data. Journal of the Royal Statistical Society. Series A (General), Vol. 132, No. 2 (1969), pp. 235-244.

Tuesday, August 20, 2013

Weak Learners

Tonight's session of R bore just enough fruit that I finally am writing my sophomore entry. I was browsing the slides from a machine learning lecture given by Leo Breiman and came across a relatively simple example he used to introduce the notions of weak and strong learners en route to bagging.

Suppose we are interested in approximating a function $f(x)$ based on observed data $(x_1, y_1), \ldots, (x_n,y_n)$. Given some error in measurement due to outside influence or physical limitations, we'd like to find some method to approximate $f(x)$.

From the set of observed data, resample $m$ points in the interior. Taking these points along with the boundary values gives us a total of $m+2$ coordinate pairs. We now connect consecutive points $(x_i,y_i)$ and $(x_{i+1},y_{i+1})$ by calculating the unique line through that pair. We restrict each such line's domain to values between $x_i$ and $x_{i+1}$. This process will ultimately construct a piecewise function of the form:
$$
WL(x) = \sum_{i=1}^{m+1} \left[ \left( \frac{y_{i+1}-y_i}{x_{i+1}-x_i} \right) (x-x_i)+y_i \right] \times I_{[x_i,x_{i+1}]}(x)
$$
Where $I_{[x_i,x_{i+1}]}(x)$ is the indicator function, equal to 1 on the designated set and 0 elsewhere.

Each line segment connects a pair of consecutive points from our random subset.

The above piecewise function will be called a weak learner. We can generate $k-1$ further weak learners by resampling a different $m$ interior points. To smooth the $k$ weak learners, we average their predicted values. My code appears below for you to test, play with, and critique. How does the smooth learner behave as we increase the number of weak learners? Do you notice/expect any restrictions?

(My LaTeX converter caused me some grief, so you'll have to remove the "\" from each "\$" if you want to run my code.)

#Weak and Smoothed Learners
weak.learner = function(x, y) {
n = length(x)
m = max(1, ceiling(n/5) )
s = sample(2:(n-1), size = m, replace = FALSE, prob = NULL) s = c(1,sort(s),n)
sx = x[s]; sy = y[s]
wl = matrix(nrow = (m+1), ncol = 4)
for ( i in 1:(m+1)) {
      c1 = (sy[i+1]-sy[i])/(sx[i+1]-sx[i]) #slope
      c0 = sy[i] - c1*sx[i] #intercept
      wl[i,1] = c0; wl[i,2] = c1
      wl[i,3] = sx[i]; wl[i,4] = sx[i+1] #end points for line
}
  wl = data.frame(c0 = wl[,1], c1 = wl[,2], a = wl[,3], b = wl[,4])
  wl.hat = matrix(nrow = n, ncol = 1)
  for (i in 1:n) {
      wl.hat[i] = sum((wl\$c0 + wl\$c1*x[i])*ifelse(wl\$a <= x[i] & x[i] < wl\$b, 1, 0))
  }
  wl.hat[n] = sum((wl\$c0 + wl\$c1*x[n])*(ifelse(wl\$a < x[i] & x[i] <= wl\$b, 1, 0)))
  return(wl.hat)
}
smooth.learner = function(x,y,k) {
  n = length(x)
  learners = matrix(nrow = k, ncol = n)
  for (i in 1:k) {learners[i,] = weak.learner(x,y)}
      smooth = apply(X = learners, MARGIN = 2, FUN = mean)
  return(smooth)
}

# Example Usage:
set.seed(1)
#f = function(x){ifelse(x<.5,0,1)}
f = function(x){cos(2*pi*x)}
n = 100
x = sort(runif(n))
y = f(x) + rnorm(n, mean = 0, sd = .25)
yhat = smooth.learner(x,y,k = 50)
plot(x,y,col="BLACK",ylim=c(-2,2) )
points(x,f(x),col="RED",type="l")
points(x,yhat,col="BLUE",type="l")

This example uses a cosine function with normal errors at each observation.

Thursday, June 27, 2013

Fun with Fremont Bridge Bicyclists

Given the title of this post and its proximity to the Solstice, you will be disappointed to know that I am not writing about naked bicyclists. I apologize for any false hope I may have instilled in you.

On October 11th, 2012, the city of Seattle, WA began collecting hourly counts of cyclists crossing the Fremont bridge. They have a dedicated webpage here with interactive graphs that allow you to view the counts aggregated at daily, weekly and monthly levels, as well as explore the effects of various weather on the counts. A spreadsheet here collects the hourly counts for both Northbound and Southbound traffic. The Seattle Department of Transportation (SDOT) uses this traffic counter to help evaluate their return on investment for bicycle-related facilities, but I won't be doing anything quite as sophisticated.

My first inquiry upon seeing the separate counts for North- and Southbound cyclists is, "Gosh. I wonder what the relationship between Northbound (NB) and Southbound (SB) bike traffic is?" To explore this matter, the first thing to do is fire up R and read in the CSV file downloaded from the spreadsheet link (found above). There may be a smooth way to use the API to read in the data from URL, but I am not savvy enough for that. Implementing archaic "click and save" methodology to procure the data set, the following action happens in R:

> bikes = read.table(
+ "Fremont_Bridge_Hourly_Bicycle_Counts_by_Month_October_2012_to_present.csv",
+ header = TRUE,
+ sep = ","
+ )
> head(bikes)
Date Fremont.Bridge.NB Fremont.Bridge.SB
1 10/02/2012 12:00:00 AM 0 0
2 10/02/2012 01:00:00 AM 0 0
3 10/02/2012 02:00:00 AM 0 0
4 10/02/2012 03:00:00 AM 0 0
5 10/02/2012 04:00:00 AM 0 0
6 10/02/2012 05:00:00 AM 0 0

At a glance, note that we've got three variables: Date, Fremont.Bridge.NB, and Fremont.Bridge.SB. All those zeros in the NB and SB columns may seem a bit worrisome, but I assure you that after the first several hours we register some cyclists. In fact, take a moment to admire the data's tail. Hopefully you will agree that something isn't quite right.

> tail(bikes)
Date Fremont.Bridge.NB Fremont.Bridge.SB
15211 05/31/2013 06:00:00 PM 259 159
15212 05/31/2013 07:00:00 PM 122 117
15213 05/31/2013 08:00:00 PM 92 65
15214 05/31/2013 09:00:00 PM 41 43
15215 05/31/2013 10:00:00 PM 20 39

15216 05/31/2013 11:00:00 PM 20 18

If each row represents an hour, then we have 15,216/24 = 634 days of data. Considering the counts date back a mere 8 months at present, it seems as though there may be some evil twins (triplets, even!) in the file. Fortunately, R has a built-in for sorting out duplicates.

> bikes = unique(bikes)
> length(bikes$Date)
[1] 5809

Our 5809 unique hourly records equate to just over 242 days, or roughly 8 months.

Now we're ready to do our first plot. We'll go with NB bike counts on the horizontal axis and SB bike counts on the vertical axis.

> plot(
+ x = bikes$Fremont.Bridge.NB,
+ y = bikes$Fremont.Bridge.SB,
+ xlab = "NB Bikes in 1 Hour",
+ ylab = "SB Bikes in 1 Hour",
+ main = "Fremont Bridge Hourly Bike Counts"
+ )

It appears that linear trends are on sale two-for-the-price-of-one. You could go as far as to say something cute here like, "The two trends are different as night and day." If you did say that, Congratulations! Splitting the data by AM and PM times seems like a reasonable next step, since bicyclists who travel across the bridge in the morning will theoretically return later that day. This assumption won't catch certain individuals who are just out on a short excursion, work odd hours, return across a different bridge, or take the bus for one of their crossings, etc. Despite these shortcomings, I march blindly forward in the name of science.

Identifying said AM and PM times is as easy as one, two, substring, replace.

> n = nchar(as.character(bikes$Date[1])) #all Date entries have the same length
> am.pm = substring(bikes$Date, first = n-1)
> am = which(am.pm == "AM")
> pm = which(am.pm == "PM")
> am.pm[am] = 0
> am.pm[pm] = 1

Now for plots that reflect this split:

> plot(
+ x = bikes$Fremont.Bridge.NB[am.pm == 0],
+ y = bikes$Fremont.Bridge.SB[am.pm == 0],
+ xlab = "NB Bikes in 1 Hour",
+ ylab = "SB Bikes in 1 Hour",
+ main = "Fremont Bridge AM Hourly Bike Counts"
+ )
> plot(
+ x = bikes$Fremont.Bridge.NB[am.pm == 1],
+ y = bikes$Fremont.Bridge.SB[am.pm == 1],
+ xlab = "NB Bikes in 1 Hour",
+ ylab = "SB Bikes in 1 Hour",
+ main = "Fremont Bridge PM Hourly Bike Counts"
+ )

Plotting these two groups together (with the same scale) will help further untangle the two trends.

> plot(

+ x = bikes$Fremont.Bridge.NB[am.pm == 1],

+ y = bikes$Fremont.Bridge.SB[am.pm == 1],

+ xlab = "NB Bikes in 1 Hour",

+ ylab = "SB Bikes in 1 Hour",

+ main = "Fremont Bridge AM/PM Hourly Bike Counts",

+ xlim = c(0,700),

+ ylim = c(0,700)

+ )

> points(

+ x = bikes$Fremont.Bridge.NB[am.pm == 0],

+ y = bikes$Fremont.Bridge.SB[am.pm == 0],

+ col = "RED"

+ )

Note that the AM data points are colored red in the combined plot. The density of the points obscures the fact that there is a lot of overlap in the rectangle x: [0,100], y: [0,200].

At this point, it may be instructive to pull out the another classic line. Namely, y = x.

> points(

+ x = 0:700,

+ y = 0:700,

+ type = "l",

+ col = "BLUE"

+ )

This line divides the quadrant into regions where y>x and y<x. Points satisfying y>x are predominantly in the AM camp and those where y<x are mainly in camp PM. One possible explanation for this trend is that more workers bike to work in the South Lake Union/Downtown Seattle Area from Fremont/surrounding neighborhoods than vice versa.

As mentioned above, the split into morning and afternoon groups still leaves a region of overlapping points. Next time I'll look at this classification more carefully and then take a look at the two trends. Based on the scatterplot, I expect the regression lines to be (nearly) inverse functions.

(Now that we've gotten this far, you may wonder what it is I hope to accomplish by investigating the relationship between Northbound and Southbound bike traffic in Seattle. At this point I'm not expecting anything revolutionary to come of this analysis. It has proven to be a nice way to spend a summer evening, though.)