Jul 19

This is my first Knitr document, which lets the user combine R code and text in a single formatted document.

I wanted to have an accessible example that illustrates the bias variance tradeoff.


# An illustration of the Bias Variance Tradeoff

## Summary

The Bias Variance Tradeoff is an important concept in machine learning. This concept helps you evaluate which model will work the best.

When most people think of fitting a model, something like this comes to mind:

That is, you basically just draw the best straight line through some points. This paradigm makes it hard to imagine what someone would mean by “model selection”.

The bias variance problem arises when you start to use nonlinear models that don't have to follow straight lines.

If you consider this data fit with two different smoothing parameters:

you can get a sense of the problem.

Intuitively, the plot on the left seems to do a better job of representing the information contained in the data. However, the model on the right has absolutely no error on the points it was fit to.

This is the bias variance tradeoff.
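As a minimal sketch of the comparison above (not the original figures), the code below simulates noisy points around a sine curve and fits them two ways: a heavily smoothed spline and a spline that passes through every point. The simulated data, the `df = 5` setting, and the use of `splinefun` as a stand-in for a very small smoothing parameter are all assumptions made for illustration.

```r
set.seed(1)

# Simulated training data: a sine curve plus noise (assumed for illustration)
n <- 30
x <- sort(runif(n, 0, 2 * pi))
y <- sin(x) + rnorm(n, sd = 0.3)

# Model 1: a smooth fit with few effective degrees of freedom
smooth_fit <- smooth.spline(x, y, df = 5)

# Model 2: an interpolating spline that passes through every training point
interp_fit <- splinefun(x, y, method = "natural")

mse <- function(actual, predicted) mean((actual - predicted)^2)

# Error on the points the models were fit to
train_smooth <- mse(y, predict(smooth_fit, x)$y)
train_interp <- mse(y, interp_fit(x))

# Error on fresh points drawn from the same process
x_new <- sort(runif(200, min(x), max(x)))
y_new <- sin(x_new) + rnorm(200, sd = 0.3)
test_smooth <- mse(y_new, predict(smooth_fit, x_new)$y)
test_interp <- mse(y_new, interp_fit(x_new))

round(c(train_smooth = train_smooth, train_interp = train_interp,
        test_smooth = test_smooth, test_interp = test_interp), 3)
```

The interpolating fit has essentially zero error on the data it was fit to, but that says nothing about how it does on fresh data from the same process, which is what `test_smooth` and `test_interp` let you compare.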

May 22

## Geneorama package now available

I now have a geneorama package available.  It’s not on CRAN, because it’s not even remotely documented. I do hope to do that at some point, but not today.

You can install it by simply opening up this file!

Be warned: Opening this file will modify your Rprofile.site file (located in R\Rversion\etc).
The script will add the text “library(geneorama)” to the end of the profile file, if it doesn’t already exist, which will automatically load the geneorama package when you start R.
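To make the "add it only if it isn't already there" step concrete, here is a rough sketch of that logic. It is not the actual installer code; `add_library_line` is a hypothetical helper, and the demo writes to a temporary file rather than to a real Rprofile.site.

```r
# Hypothetical helper: append a line to a profile file only if it is missing.
# Returns TRUE if the line was added, FALSE if it was already present.
add_library_line <- function(profile_path, line = "library(geneorama)") {
  existing <- if (file.exists(profile_path)) {
    readLines(profile_path, warn = FALSE)
  } else {
    character(0)
  }
  if (line %in% trimws(existing)) {
    return(FALSE)
  }
  write(line, file = profile_path, append = TRUE)
  TRUE
}

# Demo on a throwaway file (a stand-in for Rprofile.site)
demo_profile <- tempfile(fileext = ".site")
writeLines("# existing settings", demo_profile)

first_call  <- add_library_line(demo_profile)  # adds the line
second_call <- add_library_line(demo_profile)  # does nothing the second time
readLines(demo_profile)
```

Running it twice leaves a single `library(geneorama)` line in the file, which is the behavior described above.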

http://geneorama.com/code/Install Geneorama.RData

The installation works on both a PC and a Mac. The automatic installation uses the .First function to simply copy the library files to your R program file location.

Mar 22

## The “Data Scientist” explained: more than just a buzzword

We live in a brave new world where people possess far more data than ever before in history, but the amount of information we have per unit of data has never been lower. As people struggle to make sense of all this data, several new terms emerged, such as: Big Data, Business Intelligence, Map-Reduce, and Data Scientist… but what do these new words mean?

Definitions for these terms continue to emerge, but I’d like to share what I’ve learned about the “Data Scientist”.

I was inspired to write this because of an email I just got from Kaggle. (For those of you who don’t know, Kaggle is a website that offers analytical challenges. These challenges are open to anyone, and the best answer wins prize money that ranges from hundreds to millions of dollars.)

Kaggle’s Anthony Goldbloom offers this self-promotional but awesome tidbit that helps explain the role of the data scientist:

Thus who you decide to hire as your first data scientist — a domain expert or a machine learner — might be as simple as this: could you currently prepare your data for a Kaggle competition? If so, then hire a machine learner. If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.

Recently, I was reading about Map-Reduce, and I came across another nice explanation of the data scientist. This explanation is more comprehensive, yet still concise.

Data scientists use a combination of their business and technical skills to investigate big data looking for ways to improve current business analytics and predictive analytical models, and also for possible new business opportunities. One of the biggest differences between a data scientist and a business intelligence (BI) user – such as a business analyst – is that a data scientist investigates and looks for new possibilities, while a BI user analyzes existing business situations and operations.

Data scientists require a wide range of skills:

• Business domain expertise and strong analytical skills
• Creativity and good communications
• Knowledgeable in statistics, machine learning and data visualization
• Able to develop data analysis solutions using modeling/analysis methods and languages such as MapReduce, R, SAS, etc.
• Adept at data engineering, including discovering and mashing/blending large amounts of data

People with this wide range of skills are rare, and this explains why data scientists are in short supply. In most organizations, rather than looking for individuals with all of these capabilities, it will be necessary instead to build a team of people that collectively has these skills.

Map Reduce and the Data Scientist, by Colin White (January 2012)

Granted, there’s an element of self-promotion here too, but this is a great description. I’ve had a hard time explaining my professional value proposition when I meet new people, because there are so many new concepts involved in my areas of specialization, and this description is quite helpful.

As companies recognize their need for someone to fill this role of the data scientist, they’re clearly struggling to define the role, advertise for the position, and evaluate candidates. Often they are overly focused on technical requirements, and they’re seeking a PhD in machine learning, or someone with years of database programming experience.

It seems to me that they usually need someone who understands concepts like cross validation, or decision trees, and knows more than the difference between a flat file and a relational database, but the most important thing is that they need someone who can understand business problems, communicate to business leaders, and appreciate the technical considerations for application development.

Update:

Feb 24

## Update: Sunshine in Chicago compared to Anchorage and Miami

I updated the last post to include a link to the source code, and updated the plots with attribution to the data source, timeanddate.com

Based on Jim’s comment on the last post I thought it would be easy to re-run the analysis for Anchorage.  However, the Anchorage data was more difficult to handle, due to a period of continuous twilight at various times in the year.

So, as a workaround I just downloaded the tables for sunrise and sunset. Personally, I was more curious about Miami than Anchorage… but they are both easy to run with the new code.

Here’s what we gain / give up in terms of daylight for these locations.

Also, I thought that the speed at which the days change was much more interesting when comparing cities:
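For a rough feel for that comparison without any scraping, here is a sketch that swaps in a textbook astronomical approximation in place of the timeanddate.com tables. The formula ignores refraction and twilight, so its numbers will differ somewhat from the scraped data; the function name and the approximate city latitudes are my own assumptions for illustration.

```r
# Approximate (geometric) day length in hours for a day of the year and a latitude.
# Assumption: simple cosine model of solar declination, no atmospheric refraction.
day_length_hours <- function(day_of_year, latitude_deg) {
  declination_deg <- -23.44 * cos(2 * pi / 365 * (day_of_year + 10))
  lat <- latitude_deg * pi / 180
  dec <- declination_deg * pi / 180
  # Clamp to [-1, 1] so polar day / polar night do not produce NaN
  cos_hour_angle <- pmin(pmax(-tan(lat) * tan(dec), -1), 1)
  24 * acos(cos_hour_angle) / pi
}

# Approximate latitudes in degrees north (assumed values)
latitudes <- c(Chicago = 41.88, Anchorage = 61.22, Miami = 25.76)
days <- 1:365

# One column of day lengths per city
day_lengths <- sapply(latitudes, function(lat) day_length_hours(days, lat))

# Day-over-day change in minutes, and the fastest change for each city
daily_change_min <- apply(day_lengths, 2, diff) * 60
max_change_min <- apply(abs(daily_change_min), 2, max)
round(max_change_min, 1)
```

The further north the city, the faster the days lengthen and shorten around the equinoxes, which is why Anchorage stands out against Miami.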

The way that the website deals with a

Feb 24

## Sunshine in Chicago

My favorite day of the year is December 21, because that’s the day where the days finally start getting longer.

I’ve always wondered how quickly we gain and lose time as the seasons change, and so I thought I would try “scraping” the data off the web.  Here is that result:

Although it’s interesting to see where the days are getting shorter and longer, something else grabbed my attention along the way to this graph. I was interested in the effect of daylight savings on our day.

In my younger days I loved that magical weekend when we “gained an hour”, because it felt easier to wake up for at least one Monday a year.  These days I feel much more anticipation for the spring ahead weekend, when we regain our fair share of sunshine.

Here’s what our days look like currently (with daylight savings):

Here is what our days would look like without daylight savings:

Source code: http://geneorama.com/code/SunriseSunsetExample/

Feb 17

## Getting started with R

I finally posted my guide to “getting started with R”.

Now I need to spend some time figuring out “permalinks” in WordPress to make the link simpler and less likely to change.

Jan 30

## Installing StatET

StatET is a powerful plug-in that allows you to use R inside the Integrated Development Environment (IDE) known as Eclipse. The features in Eclipse make it easier to write code in R, unless perhaps you’re already using something more sophisticated.

Eclipse has a reputation for having a “steep learning curve”. However, I have found it to be useful even if you barely know what you’re doing. The more you learn, the more useful it becomes.

StatET has a reputation for being difficult to install. There are a few things that are tricky for non-programmers. Hopefully this post will make those things more obvious.

Note: If you want something easier, just download RStudio. It has many features that are a huge improvement over the standard R GUI.

StatET is written by Stephan Wahlbrink. The official website and more detailed instructions can be found here:  www.walware.de

System Requirements

I will be showing you how I installed the plug-in for Eclipse Indigo, using R 2.14.1. I’m using a Windows XP machine. The process is similar for Windows 7.

My Steps

Jan 30

## How to upgrade to a new version of R

I updated to R 2.14.1 for the StatET instructions post (forthcoming). While doing that, I noticed some upgrading instructions in R’s Frequently Asked Questions.

I gave it a try, but the results were a little annoying.  First of all, I had to be careful to copy over only my custom libraries, and not the core libraries (like “base” and “stats”).
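One way to tell the core libraries apart from your own is the `Priority` field that `installed.packages()` reports. This is just a sketch of the idea, not something from the FAQ: packages that ship with R are marked "base" (or "recommended", like MASS), while packages you installed yourself normally have no priority at all.

```r
# Look at every package R can currently see
pkgs <- installed.packages()
priority <- pkgs[, "Priority"]

# Core packages that come with R itself (e.g. "base", "stats")
core_packages <- rownames(pkgs)[!is.na(priority) & priority == "base"]

# Everything with no priority is something that was installed separately
custom_packages <- rownames(pkgs)[is.na(priority)]

core_packages
head(custom_packages)
```

Only the names in `custom_packages` are candidates for copying over or reinstalling.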

Then, when I issued the update commands:

    ## The FAQ had ask=FALSE, but I wanted to see what was going on,
    ## so I set ask=TRUE
    update.packages(checkBuilt=TRUE, ask=TRUE)

Unfortunately, the update.packages command updated nearly every custom package, and (oddly) a few core packages as well.  Also, I was expecting “update” to mean “just update missing files”. However, “update” meant “download the whole package and install from scratch”. So it didn’t save time or bandwidth.

I found it easier to run these commands to list the folders that are in the old library, but not in the new one:

    OldFolders = list.files('C:/Documents and Settings/Gene/My Documents/R/win-library/2.13')
    NewFolders = list.files('C:/Program Files/R/R-2.14.1/library')
    OldFolders[!OldFolders %in% NewFolders]

Note that in 2.14 they seem to have gone back to storing the libraries in the “Program Folder” rather than in “My Documents”. I think the original switch to “My Documents” was a workaround to avoid needing admin privileges every time you install a new package / library.

Then I manually installed the libraries one by one using “install.packages”, e.g.:

    install.packages('earth')
    install.packages('zoo')
    install.packages('rJava')
    install.packages('tkrplot')

The manual installation is useful because:
•    Some of the libraries might not be available on CRAN
•    You might not need all your old libraries
•    Some libraries install their dependencies automatically, so you can skip installing those separately

Every so often I would rerun the oldfolders / newfolders code to check what was still needed.
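If you expect to rerun that check a lot, it can be wrapped in a small function. This is a sketch only: `missing_packages` is a name I made up, and the demo uses throwaway folders under `tempdir()` instead of real library paths.

```r
# Hypothetical helper: folders present in the old library but not in the new one
missing_packages <- function(old_library, new_library) {
  setdiff(list.files(old_library), list.files(new_library))
}

# Demo with stand-in library folders (not real R libraries)
old_library <- file.path(tempdir(), "old-library")
new_library <- file.path(tempdir(), "new-library")

for (pkg in c("base", "stats", "earth", "zoo")) {
  dir.create(file.path(old_library, pkg), recursive = TRUE, showWarnings = FALSE)
}
for (pkg in c("base", "stats")) {
  dir.create(file.path(new_library, pkg), recursive = TRUE, showWarnings = FALSE)
}

still_needed <- missing_packages(old_library, new_library)
still_needed
```

With real paths you would pass the old and new library folders from the commands above, and the result shrinks as you install each package.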

Jan 16

## Guide to using Easy Install in Python from “sadphaeton”

While casting about looking for resources to get Numpy working (more about that later), I found a cool blog.  The author  knows what he’s talking about AND has a very down to earth tone; a very rare combination.

I had Python and Numpy working for the most part, but the “easy_install” command was still a mystery. I kept seeing installation instructions for various packages that said “just use easy_install”. I’m thinking, “Thanks, jerks. Where and how do I use this ‘easy_install’?” The very name “easy_install” seemed to mock me at every encounter.

As it turns out, you use it from the command window (the DOS prompt on Windows or Terminal on Mac), not in the Python shell. Also, you just type “easy_install [package name]”, and the command finds the URL for you (apparently located somewhere in Python heaven).

Here are the kindly posted instructions that saved me:
Part 1 – Installing Python (I didn’t use this, but it’s helpful anyway)
Part 2 – Installing Modules in Python

Dec 13

## Use R to choose your secret santa partner

Ok, so you want to choose your secret santa partners, but you can’t find a hat? Well, here is an R Script that can swoop in to your rescue.

This isn’t the most elegant or efficient code, but unless you have a really huge family it won’t take long to run.

    ChooseSS = function(people, avoidmatch){
        permuteMyPeople = function(peeps){
            PeepsPermuted = sample(peeps)
            if(any(peeps==PeepsPermuted)){
                PeepsPermuted = permuteMyPeople(peeps)
            }
            return(PeepsPermuted)
        }
        cbindMyPermutedPeople = function(peeps){
            cbind(p1=people, p2=permuteMyPeople(people))
        }
        ret = cbindMyPermutedPeople(people)
        m1 = sapply(avoidmatch, match, ret[,1])
        m2 = sapply(avoidmatch, match, ret[,2])
        while(any(m1[1,]==m2[2,])|any(m1[2,]==m2[1,])){
            ret = cbindMyPermutedPeople(people)
            m1 = sapply(avoidmatch, match, ret[,1])
            m2 = sapply(avoidmatch, match, ret[,2])
        }
        ret
    }

And, you can run it with this “example” family:

    set.seed(2011)
    family = c('Dick', 'Bonnie', 'Suzy', 'Jeff', 'Amy', 'Mike',
               'Kindy','Gene','Emily','Joe', 'Courtney', 'Meghann')
    avoidmatch = list(c('Mike', 'Amy'), c('Suzy', 'Jeff'),
                      c('Courtney', 'Meghann'), c('Dick', 'Bonnie'))
    ChooseSS(family, avoidmatch)
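If you want to double check a draw before you email it to the family, here is a small sketch of a checker for the rules the script is meant to enforce: nobody draws themselves, everyone receives exactly once, and nobody in an avoid pair draws the other. `is_valid_assignment` is a hypothetical helper of mine, and the two small assignments below are hand-built for illustration rather than output from `ChooseSS`.

```r
# Hypothetical helper: check a two-column (giver, receiver) matrix against the rules
is_valid_assignment <- function(assignment, avoidmatch) {
  giver <- assignment[, 1]
  receiver <- assignment[, 2]

  no_self_match <- !any(giver == receiver)
  everyone_once <- setequal(giver, receiver) && anyDuplicated(receiver) == 0

  # TRUE for each avoid pair that ended up matched in either direction
  avoided_hit <- vapply(avoidmatch, function(pair) {
    any((giver == pair[1] & receiver == pair[2]) |
        (giver == pair[2] & receiver == pair[1]))
  }, logical(1))

  no_self_match && everyone_once && !any(avoided_hit)
}

people <- c('Mike', 'Amy', 'Suzy', 'Jeff')
avoid  <- list(c('Mike', 'Amy'), c('Suzy', 'Jeff'))

# Hand-built examples: one that respects the avoid list, one that does not
good_draw <- cbind(p1 = people, p2 = c('Suzy', 'Jeff', 'Mike', 'Amy'))
bad_draw  <- cbind(p1 = people, p2 = c('Amy', 'Mike', 'Jeff', 'Suzy'))

is_valid_assignment(good_draw, avoid)
is_valid_assignment(bad_draw, avoid)
```

The same function can be called on the matrix returned by `ChooseSS(family, avoidmatch)`, since it has the same two-column layout.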