Business Insights

Harvard Business School Online's Business Insights Blog provides the career insights you need to achieve your goals and gain confidence in your business skills.

Filter Results

Going Beyond the Stats: Cinderella Teams and the Anscombe Quartet

16 Mar 2017

Jenny Gutbezahl Staff

As we head into March, watercooler discussions naturally turn to the NCAA basketball championship and who's going to win the office bracket pool. The popular statistics site FiveThirtyEight has generated win likelihoods for all teams as they have in the past, but this year predictions are less certain than ever.

Even the top-ranked Villanova is given only a 15% chance of winning. As you may recall, last fall FiveThirtyEight garnered a lot of attention by being relatively uncertain about a Clinton win in the general election – this skepticism proved to be well-founded.

To make datasets more comprehensible, statistics summarize datasets with one or two numbers; this can obscure patterns. In college basketball, the entire complexity of a team's season performance can be reduced to the Rating Percentage Index (RPI), a ratio based on the team's wins and losses, and the strength of the teams played (based on those teams' wins and losses, and the wins and losses of the teams they played). Interested readers can find a fuller explanation, including the computational formula here.

Last year, Michigan State's impressive RPI of .6272 led to a No. 2 seed position. Middle Tennessee State on the other hand, with an RPI of .5562, was seeded at No. 15. And yet, on March 18, Middle Tennessee won 90-81 against Michigan State. Middle Tennessee turned out to be a Cinderella team; while it looked like they might end up as the belle of the ball, at the last moment, they choked and their carriage turned back into a pumpkin.

This year, Middle Tennessee is seeded as the underdog in the 12 spot, with a somewhat stronger RPI of .5960. Many pundits (though not FiveThirtyEight) are looking at them to do better than expected in the postseason, which would not be too unusual for a No. 12 Seed.

However, just looking at RPI (and last year's performance), may not provide enough information. While looking at all the specifics of a sports season, a national election, or a data distribution can be daunting, it is often necessary to do so if you want a full understanding of what's going on.

In 1973, the English statistician Francis Anscombe came up with an elegant way of demonstrating this. He created four data sets, each containing 11 data points with two values. In all sets the means and sample variances for the two variables are identical, as are the correlation between the two and the regression line predicting y from x:

anscombe table

You'd think these four distributions would be pretty similar, with only minor differences due to random variability. You might also think that linear regression would be an effective model to help make predictions based on the data, in all cases.

But you'd be wrong. These are the four data sets:

Source: Wikipedia

The upper left graph shows a distribution which is about what we'd expect from the statistics, and the linear regression model is a good fit for this data set. The upper right graph clearly has a curvature. There's likely to be a great model for prediction, but it's NOT linear (the best model probably includes x²).

The two bottom models are more problematic. The one on the left shows a linear relationship – but that one outlier near the top is pulling the regression line away from the rest of the data. And the one on the bottom right is a real challenge. It looks as though, in general, x is a poor predictor for y. That is, almost all cases have an x-value of 8 and the y-values vary quite a bit, untethered to x. And one strange case, all by itself, is driving the entire model.

A similar situation exists in sports. Looking at summary stats for the season might make for a good model, or might overlook a non-linear relationship, or might be slightly (or greatly) misleading, due to a few unusual games, players, or plays, which may not translate to the post-season. That's why creating a bracket is so much fun!

Credential of Readiness | Master the fundamentals of business | Learn More

About the Author

Jenny is a former member of the Course Delivery team who worked on the Business Analytics course for the Credential of Readiness (CORe) program. Jenny holds a B.F.A. in theater from New York University and a Ph.D. in Social Psychology from the University of Massachusetts at Amherst. She's active in the greater Boston arts and theater community and enjoys solving and creating diabolically difficult word puzzles.

Business Insights

Topics

Topics

Courses

Courses

Topics

Topics

Courses

Courses

Going Beyond the Stats: Cinderella Teams and the Anscombe Quartet

About the Author

Top FAQs

How are HBS Online courses delivered?

Are HBS Online programs available in languages other than English?

Do I need to come to campus to participate in HBS Online programs?

How do I enroll in a course?

Does Harvard Business School Online offer an online MBA?

What are my payment options?

What are the policies for refunds and deferrals?

Subject Areas

Quick Links

About

Legal

Business Insights

Filters

Going Beyond the Stats: Cinderella Teams and the Anscombe Quartet

About the Author

How are HBS Online courses delivered?

Are HBS Online programs available in languages other than English?

Do I need to come to campus to participate in HBS Online programs?

How do I enroll in a course?

Does Harvard Business School Online offer an online MBA?

What are my payment options?

What are the policies for refunds and deferrals?