Leaflet in R – West Nile Virus Map

Recently I finished working on a demonstration for a West Nile Virus map. I found myself referring back to my example often, so I thought that if it’s useful for me, maybe it will be useful for someone else!

Most of the data I was using was already in the public domain, and it only took a few edits to rely 100% on public data. Now I have a nice shareable example.

I’m not going to try to embed it in this post; here’s a link to the map, and here’s the source code for reference.

Also, I’ve started a GitHub project to store some of my most often used map examples, which I hope will stay updated: https://github.com/geneorama/wnv_map_demo/

This is what the map looks like:

I did the normal leaflet things: I used the values to control the circle size, customized the opacity to make it easier to see the map underneath, and plotted the red circle on top of the blue circle to give a sense of the proportion of mosquitoes affected.
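The layering described above can be sketched roughly as follows. This is a minimal illustration, not the code behind the map: the `traps` data frame, its column names, the counts, and the `scale_radius` helper are all invented for the example, and it assumes the leaflet package is installed.

```r
library(leaflet)

# Hypothetical trap data: coordinates and counts are invented for illustration
traps <- data.frame(
  trap     = c("T001", "T002", "T003"),
  lat      = c(41.95, 41.88, 41.79),
  lon      = c(-87.70, -87.65, -87.62),
  total    = c(120, 60, 200),  # all mosquitoes collected at the trap
  positive = c(30, 5, 80)      # mosquitoes counted as affected
)

# Square-root scaling keeps circle area (not radius) proportional to the count.
# The result is in meters; the factor 50 is an arbitrary choice.
scale_radius <- function(n) sqrt(n) * 50

m <- leaflet(traps) %>%
  addTiles() %>%
  # Blue circle first: everything collected at the trap
  addCircles(lng = ~lon, lat = ~lat, radius = ~scale_radius(total),
             stroke = FALSE, fillColor = "blue", fillOpacity = 0.4) %>%
  # Red circle second, so it is drawn on top of the blue one
  addCircles(lng = ~lon, lat = ~lat, radius = ~scale_radius(positive),
             stroke = FALSE, fillColor = "red", fillOpacity = 0.6)

# Print `m` at the console (or in an Rmd chunk) to view the map
```

Because the red layer is added after the blue one and both use partial opacity, the base map stays visible and the size of the red circle relative to the blue one hints at the affected share.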

However, I also learned some new tricks with this map:

  • Of course, data.table provides a fast and flexible way to manipulate and reshape the data
  • I developed my own Mapbox template to mimic the look and feel of our Opengrid application
  • I developed my own HTML pop-ups using htmltools::HTML
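As a rough sketch of how the data.table reshaping and the HTML pop-ups in the list above might fit together (this is not the code from the Rmd file; the table `dt`, its columns, and the pop-up wording are hypothetical, and it assumes the data.table and htmltools packages are installed):

```r
library(data.table)
library(htmltools)

# Hypothetical long-format results: one row per trap and test result
dt <- data.table(
  trap       = c("T001", "T001", "T002", "T002"),
  result     = c("positive", "negative", "positive", "negative"),
  mosquitoes = c(30L, 90L, 5L, 55L)
)

# Reshape to one row per trap, with one column per result
wide <- dcast(dt, trap ~ result, value.var = "mosquitoes", fun.aggregate = sum)
wide[, total := positive + negative]

# Assemble the pop-up markup, escaping the text field before inserting it
wide[, popup := sprintf("<b>Trap %s</b><br/>%d of %d mosquitoes positive",
                        htmlEscape(trap), positive, total)]

# Mark each string as HTML so htmltools-aware functions treat it as markup
popups <- lapply(wide$popup, HTML)
```

The resulting `popup` column can then be passed to a leaflet layer through its `popup` argument, for example `popup = ~popup`.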

Please use the Rmd file and adapt it for your own purposes!

Happy mapping, and remember to use your DEET-based bug repellent.

Bias Variance Tradeoff

This is my first Knitr document, which lets the user combine R code and text in a single formatted document.

I wanted to have an accessible example that illustrates the bias variance tradeoff.


An illustration of the Bias Variance Tradeoff


by Gene Leynes
http://geneorama.com/
http://www.linkedin.com/in/geneleynes

Summary

The Bias Variance Tradeoff is an important concept in machine learning. This concept helps you evaluate which model will work the best.

When most people think of fitting a model, something like this comes to mind:
[Figure: a straight line fit through a scatter of points]

Here you basically just draw the best straight line through some points. This paradigm makes it hard to imagine what someone would mean by “model selection”.

The bias variance problem arises when you start to use nonlinear models that don't have to follow straight lines.

If you consider this data fit with two different smoothing parameters:
[Figure: the same data fit with two different smoothing parameters, shown side by side]

you can get a sense of the problem.

Intuitively, the plot on the left seems to do a better job of representing the information contained in the data… However, the model on the right has absolutely no error on the points it was fit to.

This is the bias variance tradeoff.
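To make the contrast concrete, here is a minimal sketch in base R. It is not the code behind the plots above: the sine curve, the noise level, the sample sizes, and the choice of `df = 8` are all assumptions made for illustration. It compares a smooth fit with one that passes through every point, first on the data used for fitting and then on fresh data from the same process.

```r
set.seed(1)

# Hypothetical data: a sine curve plus noise (both chosen for illustration)
f <- function(x) sin(x)
n <- 40
x <- sort(runif(n, 0, 10))
y <- f(x) + rnorm(n, sd = 0.4)

# A smooth fit: a smoothing spline with a modest number of degrees of freedom
smooth_fit <- smooth.spline(x, y, df = 8)

# A wiggly fit: an interpolating spline that passes through every point
wiggly_fit <- splinefun(x, y, method = "natural")

# Error on the data used for fitting
train_mse_smooth <- mean((predict(smooth_fit, x)$y - y)^2)
train_mse_wiggly <- mean((wiggly_fit(x) - y)^2)

# Error on new data from the same process, inside the original x range
x_new <- sort(runif(200, min(x), max(x)))
y_new <- f(x_new) + rnorm(200, sd = 0.4)
test_mse_smooth <- mean((predict(smooth_fit, x_new)$y - y_new)^2)
test_mse_wiggly <- mean((wiggly_fit(x_new) - y_new)^2)

print(c(train_smooth = train_mse_smooth, train_wiggly = train_mse_wiggly))
print(c(test_smooth = test_mse_smooth, test_wiggly = test_mse_wiggly))
```

The interpolating fit has essentially zero error on the points it was fit to, because it has memorized the noise along with the signal. On new data that advantage disappears and the smoother fit should do better.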


The “Data Scientist” explained: more than just a buzzword

We live in a brave new world where people possess far more data than ever before in history, but the amount of information we have per unit of data has never been lower. As people struggle to make sense of all this data, several new terms have emerged, such as Big Data, Business Intelligence, Map-Reduce, and Data Scientist… but what do these new words mean?

Definitions for these terms continue to emerge, but I’d like to share what I’ve learned about the “Data Scientist”.

I was inspired to write this because of an email I just got from Kaggle. (For those of you who don’t know, Kaggle is a website that offers analytical challenges. These challenges are open to anyone, and the best answer wins prize money that ranges from hundreds to millions of dollars.)

Kaggle’s Anthony Goldbloom offers this self-promotional but awesome tidbit that helps explain the role of the data scientist:

Thus who you decide to hire as your first data scientist — a domain expert or a machine learner — might be as simple as this: could you currently prepare your data for a Kaggle competition? If so, then hire a machine learner. If not, hire a data scientist who has the domain expertise and the data hacking skills to get you there.

Recently, I was reading about Map-Reduce, and I came across another nice explanation of the data scientist. This explanation is more comprehensive, yet still concise.

Data scientists use a combination of their business and technical skills to investigate big data looking for ways to improve current business analytics and predictive analytical models, and also for possible new business opportunities. One of the biggest differences between a data scientist and a business intelligence (BI) user – such as a business analyst – is that a data scientist investigates and looks for new possibilities, while a BI user analyzes existing business situations and operations.

Data scientists require a wide range of skills:

  • Business domain expertise and strong analytical skills
  • Creativity and good communications
  • Knowledgeable in statistics, machine learning and data visualization
  • Able to develop data analysis solutions using modeling/analysis methods and languages such as MapReduce, R, SAS, etc.
  • Adept at data engineering, including discovering and mashing/blending large amounts of data

People with this wide range of skills are rare, and this explains why data scientists are in short supply. In most organizations, rather than looking for individuals with all of these capabilities, it will be necessary instead to build a team of people that collectively has these skills.

Map Reduce and the Data Scientist, by Colin White (January 2012)

Granted, there’s an element of self-promotion here too, but this is a great description. I’ve had a hard time explaining my professional value proposition when I meet new people, because there are so many new concepts involved in my areas of specialization, and this description is quite helpful.

As companies recognize their need for someone to fill the role of the data scientist, they’re clearly struggling to define the role, advertise the position, and evaluate candidates. Often they are overly focused on technical requirements, and they’re seeking a PhD in machine learning, or someone with years of database programming experience.

It seems to me that they usually need someone who understands concepts like cross validation, or decision trees, and knows more than the difference between a flat file and a relational database, but the most important thing is that they need someone who can understand business problems, communicate to business leaders, and appreciate the technical considerations for application development.

Update:
Better link and explanation here
http://radar.oreilly.com/2010/06/what-is-data-science.html