Tuesday, August 20, 2013

Weak Learners

Tonight's session of R bore just enough fruit that I finally am writing my sophomore entry. I was browsing the slides from a machine learning lecture given by Leo Breiman and came across a relatively simple example he used to introduce the notions of weak and strong learners en route to bagging.

Suppose we are interested in approximating a function $f(x)$ based on observed data $(x_1, y_1), \ldots, (x_n,y_n)$. Given some error in measurement due to outside influence or physical limitations, we'd like to find some method to approximate $f(x)$.

From the set of observed data, resample $m$ points in the interior. Taking these points along with the boundary values gives us a total of $m+2$ coordinate pairs. We now connect consecutive points $(x_i,y_i)$ and $(x_{i+1},y_{i+1})$ by calculating the unique line through that pair. We restrict each such line's domain to values between $x_i$ and $x_{i+1}$. This process will ultimately construct a piecewise function of the form:
$$WL(x) = \sum_{i=1}^{m+1} \left[ \left( \frac{y_{i+1}-y_i}{x_{i+1}-x_i} \right) (x-x_i)+y_i \right] \times I_{[x_i,x_{i+1}]}(x)$$
Where $I_{[x_i,x_{i+1}]}(x)$ is the indicator function, equal to 1 on the designated set and 0 elsewhere.
 Each line segment connects a pair of consecutive points from our random subset.

The above piecewise function will be called a weak learner. We can generate $k-1$ further weak learners by resampling a different $m$ interior points. To smooth the $k$ weak learners, we average their predicted values. My code appears below for you to test, play with, and critique. How does the smooth learner behave as we increase the number of weak learners? Do you notice/expect any restrictions?

(My LaTeX converter caused me some grief, so you'll have to remove the "\" from each "\$" if you want to run my code.) #Weak and Smoothed Learners weak.learner = function(x, y) { n = length(x) m = max(1, ceiling(n/5) ) s = sample(2:(n-1), size = m, replace = FALSE, prob = NULL) s = c(1,sort(s),n) sx = x[s]; sy = y[s] wl = matrix(nrow = (m+1), ncol = 4) for ( i in 1:(m+1)) { c1 = (sy[i+1]-sy[i])/(sx[i+1]-sx[i]) #slope c0 = sy[i] - c1*sx[i] #intercept wl[i,1] = c0; wl[i,2] = c1 wl[i,3] = sx[i]; wl[i,4] = sx[i+1] #end points for line } wl = data.frame(c0 = wl[,1], c1 = wl[,2], a = wl[,3], b = wl[,4]) wl.hat = matrix(nrow = n, ncol = 1) for (i in 1:n) { wl.hat[i] = sum((wl\$c0 + wl\$c1*x[i])*ifelse(wl\$a <= x[i] & x[i] < wl\$b, 1, 0)) } wl.hat[n] = sum((wl\$c0 + wl\$c1*x[n])*(ifelse(wl\$a < x[i] & x[i] <= wl\$b, 1, 0))) return(wl.hat) } smooth.learner = function(x,y,k) { n = length(x) learners = matrix(nrow = k, ncol = n) for (i in 1:k) {learners[i,] = weak.learner(x,y)} smooth = apply(X = learners, MARGIN = 2, FUN = mean) return(smooth) } # Example Usage: set.seed(1) #f = function(x){ifelse(x<.5,0,1)} f = function(x){cos(2*pi*x)} n = 100 x = sort(runif(n)) y = f(x) + rnorm(n, mean = 0, sd = .25) yhat = smooth.learner(x,y,k = 50) plot(x,y,col="BLACK",ylim=c(-2,2) ) points(x,f(x),col="RED",type="l") points(x,yhat,col="BLUE",type="l")  This example uses a cosine function with normal errors at each observation. Thursday, June 27, 2013 Fun with Fremont Bridge Bicyclists Given the title of this post and its proximity to the Solstice, you will be disappointed to know that I am not writing about naked bicyclists. I apologize for any false hope I may have instilled in you. On October 11th, 2012, the city of Seattle, WA began collecting hourly counts of cyclists crossing the Fremont bridge. They have a dedicated webpage here with interactive graphs that allow you to view the counts aggregated at daily, weekly and monthly levels, as well as explore the effects of various weather on the counts. A spreadsheet here collects the hourly counts for both Northbound and Southbound traffic. The Seattle Department of Transportation (SDOT) uses this traffic counter to help evaluate their return on investment for bicycle-related facilities, but I won't be doing anything quite as sophisticated. My first inquiry upon seeing the separate counts for North- and Southbound cyclists is, "Gosh. I wonder what the relationship between Northbound (NB) and Southbound (SB) bike traffic is?" To explore this matter, the first thing to do is fire up R and read in the CSV file downloaded from the spreadsheet link (found above). There may be a smooth way to use the API to read in the data from URL, but I am not savvy enough for that. Implementing archaic "click and save" methodology to procure the data set, the following action happens in R: > bikes = read.table( + "Fremont_Bridge_Hourly_Bicycle_Counts_by_Month_October_2012_to_present.csv", + header = TRUE, + sep = "," + ) > head(bikes) Date Fremont.Bridge.NB Fremont.Bridge.SB 1 10/02/2012 12:00:00 AM 0 0 2 10/02/2012 01:00:00 AM 0 0 3 10/02/2012 02:00:00 AM 0 0 4 10/02/2012 03:00:00 AM 0 0 5 10/02/2012 04:00:00 AM 0 0 6 10/02/2012 05:00:00 AM 0 0 At a glance, note that we've got three variables: Date, Fremont.Bridge.NB, and Fremont.Bridge.SB. All those zeros in the NB and SB columns may seem a bit worrisome, but I assure you that after the first several hours we register some cyclists. In fact, take a moment to admire the data's tail. Hopefully you will agree that something isn't quite right. > tail(bikes) Date Fremont.Bridge.NB Fremont.Bridge.SB 15211 05/31/2013 06:00:00 PM 259 159 15212 05/31/2013 07:00:00 PM 122 117 15213 05/31/2013 08:00:00 PM 92 65 15214 05/31/2013 09:00:00 PM 41 43 15215 05/31/2013 10:00:00 PM 20 39 15216 05/31/2013 11:00:00 PM 20 18 If each row represents an hour, then we have 15,216/24 = 634 days of data. Considering the counts date back a mere 8 months at present, it seems as though there may be some evil twins (triplets, even!) in the file. Fortunately, R has a built-in for sorting out duplicates. > bikes = unique(bikes) > length(bikes$Date)
[1] 5809

Our 5809 unique hourly records equate to just over 242 days, or roughly 8 months.

Now we're ready to do our first plot. We'll go with NB bike counts on the horizontal axis and SB bike counts on the vertical axis.

> plot(
+ x = bikes$Fremont.Bridge.NB, + y = bikes$Fremont.Bridge.SB,
+ xlab = "NB Bikes in 1 Hour",
+ ylab = "SB Bikes in 1 Hour",
+ main = "Fremont Bridge Hourly Bike Counts"
+ )

It appears that linear trends are on sale two-for-the-price-of-one. You could go as far as to say something cute here like, "The two trends are different as night and day." If you did say that, Congratulations! Splitting the data by AM and PM times seems like a reasonable next step, since bicyclists who travel across the bridge in the morning will theoretically return later that day. This assumption won't catch certain individuals who are just out on a short excursion, work odd hours, return across a different bridge, or take the bus for one of their crossings, etc. Despite these shortcomings, I march blindly forward in the name of science.

Identifying said AM and PM times is as easy as one, two, substring, replace.

> n = nchar(as.character(bikes$Date[1])) #all Date entries have the same length > am.pm = substring(bikes$Date, first = n-1)
> am = which(am.pm == "AM")
> pm = which(am.pm == "PM")
> am.pm[am] = 0
> am.pm[pm] = 1

Now for plots that reflect this split:

> plot(
+ x = bikes$Fremont.Bridge.NB[am.pm == 0], + y = bikes$Fremont.Bridge.SB[am.pm == 0],
+ xlab = "NB Bikes in 1 Hour",
+ ylab = "SB Bikes in 1 Hour",
+ main = "Fremont Bridge AM Hourly Bike Counts"
+ )
> plot(
+ x = bikes$Fremont.Bridge.NB[am.pm == 1], + y = bikes$Fremont.Bridge.SB[am.pm == 1],
+ xlab = "NB Bikes in 1 Hour",
+ ylab = "SB Bikes in 1 Hour",
+ main = "Fremont Bridge PM Hourly Bike Counts"
+ )

Plotting these two groups together (with the same scale) will help further untangle the two trends.

> plot(
+ x = bikes$Fremont.Bridge.NB[am.pm == 1], + y = bikes$Fremont.Bridge.SB[am.pm == 1],
+ xlab = "NB Bikes in 1 Hour",
+ ylab = "SB Bikes in 1 Hour",
+ main = "Fremont Bridge AM/PM Hourly Bike Counts",
+ xlim = c(0,700),
+ ylim = c(0,700)
+ )
> points(
+ x = bikes$Fremont.Bridge.NB[am.pm == 0], + y = bikes$Fremont.Bridge.SB[am.pm == 0],
+ col = "RED"
+ )
Note that the AM data points are colored red in the combined plot. The density of the points obscures the fact that there is a lot of overlap in the rectangle x: [0,100], y: [0,200].

At this point, it may be instructive to pull out the another classic line. Namely, y = x.

> points(
+ x = 0:700,
+ y = 0:700,
+ type = "l",
+ col = "BLUE"
+ )
This line divides the quadrant into regions where y>x and y<x. Points satisfying y>x are predominantly in the AM camp and those where y<x are mainly in camp PM. One possible explanation for this trend is that more workers bike to work in the South Lake Union/Downtown Seattle Area from Fremont/surrounding neighborhoods than vice versa.

As mentioned above, the split into morning and afternoon groups still leaves a region of overlapping points. Next time I'll look at this classification more carefully and then take a look at the two trends. Based on the scatterplot, I expect the regression lines to be (nearly) inverse functions.

(Now that we've gotten this far, you may wonder what it is I hope to accomplish by investigating the relationship between Northbound and Southbound bike traffic in Seattle. At this point I'm not expecting anything revolutionary to come of this analysis. It has proven to be a nice way to spend a summer evening, though.)