This is my first knitr document; knitr lets you combine R code and text in a single formatted document.
I wanted to have an accessible example that illustrates the bias-variance tradeoff.
An illustration of the Bias Variance Tradeoff
by Gene Leynes
http://geneorama.com/
http://www.linkedin.com/in/geneleynes
Summary
The Bias-Variance Tradeoff is an important concept in machine learning. It helps you evaluate which model will work best.
When most people think of fitting a model, something like this comes to mind:
Where you basically just draw the best straight line through some points. This paradigm makes it hard to imagine what someone would mean by “model selection”.
The bias-variance problem arises when you start to use nonlinear models that don't have to follow straight lines.
If you consider this data fit with two different smoothing parameters:
you can get a sense of the problem.
Intuitively, the plot on the left seems to do a better job of representing the information contained in the data… However, the model on the right has absolutely no error on the points it was fit to.
This is the bias variance tradeoff.
In mathematical terms, the model on the right has too much variance because it only works for that particular set of points. If you gave it a different set of points generated in the same way, you would get a significantly different model. So when your model “chases the points” of your observed data, you have too much “variance”.
Bias, not shown here, is the opposite problem: the model is too inflexible to capture the underlying relationship in any set of input data.
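If it helps to see the idea in code, here is a minimal sketch (not part of the example below; the sine-curve truth, noise level, and spans are illustrative assumptions of my own): fit the same very flexible smoother to two independent samples from one process and measure how much the two fits disagree, then do the same with a stiffer smoother.

## Illustrative sketch of "variance"; the truth, noise, and spans are assumptions
set.seed(1)
x = seq(0, 10, length.out = 200)
truth = sin(x)
y1 = truth + rnorm(length(x), sd = 0.5)  # one sample from the process
y2 = truth + rnorm(length(x), sd = 0.5)  # a second, independent sample
wiggly1 = predict(loess(y1 ~ x, span = 0.05))  # very flexible fit to sample 1
wiggly2 = predict(loess(y2 ~ x, span = 0.05))  # same model, second sample
smooth1 = predict(loess(y1 ~ x, span = 0.75))  # much stiffer fit
smooth2 = predict(loess(y2 ~ x, span = 0.75))
mean((wiggly1 - wiggly2)^2)  # large: each flexible fit chases its own points
mean((smooth1 - smooth2)^2)  # much smaller: the stiff fits mostly agree

The flexible fits disagree because each one memorizes its own noise; that sample-to-sample disagreement is exactly what “variance” measures.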
Demo using R to illustrate the Bias Variance tradeoff
Getting Started
I always start by clearing the workspace and loading libraries.
I'll be using ggplot2 to draw the graphs, and I like to use my own package, geneorama. You don't need it, but you can learn more about it here:
http://geneorama.com/geneorama-package-now-available/
rm(list = ls())     # clear the workspace
library(ggplot2)    # for the plots
library(geneorama)  # optional; my personal utility package
Example Function
To start with, I made an example function that represents some true but unobservable relationship.
This plot could illustrate my enjoyment of a 12 hour long documentary on snail farming. I might have been somewhat excited initially, but then gotten bored, and then really excited towards the end.
set.seed(5)
dat = data.frame(x = runif(1000, 0, 12))
dat$x = sort(dat$x)
dat$y = with(dat, sin(x * 1.3) * 15 + 3 * (x - 4)^2)
ggplot(dat, aes(x = x, y = y)) + geom_point() +
    ggtitle("Some Function Representing the \"Real\" Model\n") +
    theme(plot.title = element_text(size = 16, face = "bold"))
The Simulated Noisy “Real World” Data
Distribution of the “noise”
For whatever reason, there is noise in the data. Maybe I was recording my enjoyment of the documentary through electrodes attached to my scalp, and there was some measurement error. Maybe it was sunspot activity… in any event, there was noise.
Anyway, the assumed distribution of the noise is shown below. You can see clearly that the stronger positive signals have lower variance. Maybe that's happenstance, or maybe those signals come through more clearly.
sigma = with(dat, (exp(x - 5)/(1 + exp(x - 5)) -
    2 * exp(x - 7)/(1 + exp(x - 7)) + 1.4)) * 6
ggplot(dat, aes(x = x, y = sigma)) + geom_point() +
    ggtitle(paste("The Sigma of the Errors in the Real Data\n",
        "(Probably overly complicated)\n")) +
    theme(plot.title = element_text(size = 16, face = "bold"),
        panel.background = element_rect(fill = "antiquewhite3", colour = "black"))
This sigma distribution is based on a function that I completely invented to have the shape I wanted for this example:
\[ \sigma(x) = \left( \frac{e^{x-5}}{1 + e^{x-5}} - \frac{2\,e^{x-7}}{1 + e^{x-7}} + 1.4 \right) \times 6 \]
Creating the observed noisy y
This is a pretty easy part: just add noise, with standard deviation sigma, to the true values. I make the noise using the built-in rnorm function, and set the sd parameter to sigma.
You can see the lower variance in the upper right part of the graph, where the observed values are closer to the true function.
dat$yobs = dat$y + rnorm(nrow(dat), mean = 0, sd = sigma)  # noise varies with x
Now, plot the result
p = ggplot(dat, aes(x = x, y = yobs)) + geom_point(alpha = 0.65) +
    theme(plot.title = element_text(size = 16, face = "bold"))
p + geom_line(aes(x = x, y = y), size = 2) +
    ggtitle("Observed Data\n(With the true model as a solid line)\n")
Now, let's look at some fits using Local Polynomial Regression Fitting (the loess function in R)
Using the loess function, I can show some examples of fits: at one end of the spectrum I'm chasing the points, and at the other end I'm not using the data enough to represent the true curve.
This spectrum is shown using a range of colors from the RColorBrewer package: the lighter fits are clearly overfitting, and the darker fits are too smooth. With real data, it's always a major challenge to decide how smooth to make your fits.
library(RColorBrewer)
if (require(geneorama)) {
    plot(dat$x, dat$yobs, panel.first = bgfun("honeydew2"), pch = 16, cex = 0.5,
        main = "Several fits using the loess function:")
} else {
    plot(yobs ~ x, dat, pch = 16, cex = 0.5,
        main = "Several fits using the loess function:")
}
spans = seq(0.01, 1, length.out = 9)          # candidate smoothing parameters
colors = brewer.pal(length(spans), "YlGnBu")  # light = wiggly, dark = smooth
for (i in seq_along(spans)) {
    lines(dat$x, predict(loess(yobs ~ x, dat, span = spans[i])),
        lwd = 3, col = colors[i])
}
Some of the fits shown separately
It's helpful to see the two extremes by themselves, and an example of something “reasonable”.
Too much bias
This model is definitely too smooth, and doesn't capture the nuances of the curve. You would say this one has too much bias.
p + geom_line(aes(x = x, y = predict(loess(yobs ~ x, dat, span = max(spans)))),
    size = 2) + ggtitle("Too Smooth (too biased)\n")
Too much variance
This model chases the points, and would vary from one set of data to another.
p + geom_line(aes(x = x, y = predict(loess(yobs ~ x, dat, span = min(spans)))),
    size = 2) + ggtitle("Too much variance\n")
A tradeoff of bias and variance
This example is somewhere in between, and does a better job. But the trick is figuring out how to get to this picture automatically.
Note that in this example the code is a little clearer, and you can see the manual choice of span = 0.2 in the loess function.
p + geom_line(aes(x = x, y = predict(loess(yobs ~ x, dat, span = 0.2))),
    size = 2) + ggtitle("Just Right?\n")
Conclusion
Of course, the problem is that I manually picked the “good” example. It's easy for a human to pick a reasonable example, but you also want to be able to tell a machine how to pick the reasonable example. And just shooting for reasonable is one thing; how does one find the “best” model? This is the central question of machine learning.
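For what it's worth, here is a minimal sketch of one way to let the machine choose: score each candidate span by 10-fold cross-validated prediction error and keep the winner. The span grid, the fold count, and the use of loess.control(surface = "direct") (so held-out points near the edges can still be predicted) are my own assumptions, not part of the example above.

## Cross-validation sketch; grid, folds, and surface = "direct" are assumptions
set.seed(1)
cv_spans = seq(0.05, 1, by = 0.05)
folds = sample(rep(1:10, length.out = nrow(dat)))  # random fold labels
cv_mse = sapply(cv_spans, function(s) {
    fold_err = sapply(1:10, function(k) {
        fit = loess(yobs ~ x, dat[folds != k, ], span = s,
            control = loess.control(surface = "direct"))
        pred = predict(fit, newdata = dat[folds == k, ])
        mean((dat$yobs[folds == k] - pred)^2)
    })
    mean(fold_err)  # average held-out error for this span
})
cv_spans[which.min(cv_mse)]  # the span the machine would pick

The winning span moves around a little with the random fold assignment, but it gives the machine a defensible, automatic answer to the question above.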
Easy to train: a regular Joe (it's actually Joe the Plumber)
Hard to train: a computer
Next post: Machine learning basics