NCAA Championship Bracket Recommendation Engine

March Madness is one of the rare moments when it pays off socially to be a data geek. It’s a rare concord of well-documented data and pop culture.

What makes it most interesting is this guy:

Neo4j March Madness

Warren Buffett, the Oracle of Omaha, each year makes a “billion dollar bet” with the world, with the prize going to whoever can build the perfect bracket. With the hubris that belongs to anyone who works with big data, I assumed I could do better than most. How hard could it possibly be? While I’m not a huge basketball fan, I’ve got a good grasp of the basics.

This bet grabbed the interest of everyone from Wall St. quants to Silicon Valley engineers to armchair Moneyballers everywhere. How bad could the odds be? Really bad. I’m more likely to win the California Lottery 650 BILLION times than get this bracket correct. The perfect bracket is one of 9.2 quintillion possible brackets.
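The 9.2 quintillion figure follows from the shape of the bracket: a single-elimination field needs one game per eliminated team, and each game has two possible winners. A quick check in Python, assuming the 64-team main bracket and ignoring the play-in games:

```python
# Assumption: 64-team main bracket, play-in games ignored.
TEAMS = 64
games = TEAMS - 1        # every game eliminates exactly one team
brackets = 2 ** games    # each game has two possible winners

print(f"{games} games, {brackets:,} possible brackets")
```

This treats every outcome as equally likely, which is the worst case; any real knowledge about the teams improves on it.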

graph database neo4j recommendations

Those aren’t great odds. But, ever the eternal optimist, I noticed that lonely “1” on the far side of things.

It’s All Relative

What’s my unfair advantage? Or what do I at least hope it is? Relationships. To me, sports performance is fundamentally relative: how we fare against most teams may not always be the best metric for how we’ll do against teams at the championship level. There are of course teams that are unambiguously better than others. However, there is nearly always some sort of relative performance bias, wherein a team performs better or worse than its average performance would project due to some confluence of factors, whether it’s a team with an infamously brutal crowd of fans, a point guard who dissects your league-leading zone, or a decades-long rivalry that motivates your players to dig just a little more. These statistics are difficult to track across a single season and often incredibly difficult to track across time.

Secondly, being able to iterate on that model is taxing, both in terms of writing the queries and in maintaining any reasonable performance. I accrued a mountain of data from the past 4 seasons (~50,000 games), including points scored, location, date, and so on. We could easily add more granular information or more historic data. I decided that in my model this relative performance metric should churn almost entirely every four years (as current players graduate and move on).

Let’s start by talking about the underlying data model.

From Idea —> Graph Model

I am not a clever boy.

However, I have clever tools, chief among them Neo4j. So I started as I do all of my graphy projects: with the questions I planned to ask most frequently and a whiteboard (or a piece of paper in this case).

The Idea:

whiteboard
Which after a little effort becomes:

Just in TIME!

Before I loaded any data into Neo4j, I first needed to create the time tree seen in the above model. One of Neo4j’s brilliant engineers (thanks, Mark!) did the heavy lifting for me and wrote a short Cypher snippet to generate the time model I needed. I stole this snippet from his blog post on creating time trees.
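Mark’s Cypher snippet isn’t reproduced here. As a rough illustration of the shape of a time tree, the Python sketch below enumerates the year, month, and day paths such a tree would contain, one per Day leaf. The function name and the 2013 to 2016 range are assumptions made for illustration, not taken from his post.

```python
import calendar

def time_tree_paths(first_year, last_year):
    """Yield one (year, month, day) tuple per Day leaf of the tree."""
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of day 1, number of days in month)
            days_in_month = calendar.monthrange(year, month)[1]
            for day in range(1, days_in_month + 1):
                yield (year, month, day)

# Hypothetical range covering four seasons of games.
paths = list(time_tree_paths(2013, 2016))
print(len(paths), paths[0], paths[-1])
```

In the graph, each tuple corresponds to a Year node linked to a Month node linked to a Day node, which is the path the load script matches on when it attaches a game to its date.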

Spreadsheets.csv —> Graph.db

Neo4j ships with a great ETL tool called LOAD CSV. We’re going to use that. I downloaded a mess of NCAA scores, then surreptitiously converted the data from Excel spreadsheets into CSV format. The reason I like LOAD CSV is that no matter where your data currently lives (MySQL, HDFS, and so on), you can dump it down to a CSV. It is the data world’s lowest common denominator. Secondly, it’s easy to map a CSV with headers to a graph model.

If you’re coding along with me, I’ve hosted the files in a public Dropbox, with links found in the repo linked above. We’ll be bringing in several CSV files, each one representing a given season, and then sewing them all together based on team names instead of an actual UUID, as I would in a production system. Here’s a sample of what those load scripts look like:

//2016
LOAD CSV WITH HEADERS from 'https://dl.dropboxusercontent.com/u/313565755/ncaa2016.csv' AS line
WITH line, toINT(line.Year) as Year, toINT(line.Month) as Month, toINT(line.Day) as Day
WHERE line.ignore IS NOT NULL
MATCH (:Year {year:Year})-[:HAS_MONTH]->(:Month {month:Month})-[:HAS_DAY]->(t:Day {day:Day})
CREATE (game:Game {winner:line.winnerName})-[:OCCURED_ON]->(t)
WITH line, game
MERGE (team:Team {name:line.Team})
WITH line, game, team
MERGE(opp:Team {name:line.Opponent})
WITH line, game, team, opp
WITH line, toINT(line.opponentScore) as oppScore, toINT(line.teamScore) as teamScore, team, opp, game, toFLOAT(line.teamDiff) as teamDiff, toFLOAT(line.oppDiff) as oppDiff
CREATE (team)-[:PLAYED_IN {scored:teamScore, differential:teamDiff, location:line.teamLocation}]->(game)<-[:PLAYED_IN {scored:oppScore, differential:oppDiff, location:line.oppLocation}]-(opp);

Step 4: History, Victory, and a little Math

Now that we’ve transformed our CSV into a graph, the important next step is determining how to predict performance between teams.

Pythagorean Expectation ~= Win Power?

I needed a method that allowed me to focus on relative performance. A statistician named Bill James created a method to estimate the number of games a given baseball team would win in a season as a function of the runs they score and the runs they allow.
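For reference, here is a minimal Python sketch of Bill James’s Pythagorean expectation in its classic form, with an exponent of 2. The function name and the sample season totals are made up for illustration.

```python
def pythagorean_expectation(scored, allowed, exponent=2.0):
    """Expected win fraction from runs (or points) scored and allowed."""
    s = scored ** exponent
    a = allowed ** exponent
    return s / (s + a)

# Hypothetical team that scores 800 runs and allows 700 over a season.
print(round(pythagorean_expectation(800, 700), 3))  # 0.566
```

A team that scores exactly as much as it allows comes out at 0.5, and the value climbs toward 1 as the scoring margin grows.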

I expanded on this in the hope that we can estimate the games a basketball team WILL win as a function of the points they’ve scored against a given opponent (in total) and how many they’ve allowed. I’ve created a relationship between each pair of teams called :WINPOWER to represent this new function. In computing its winPower property, I added a decay factor to weight more recent games more heavily than those played long ago.

winPower

This math looks like this in Cypher:

//Assigning Pythagorean Expectation (Direct Win Power)
MATCH (a:Team)-[aa:PLAYED_IN]->(game)<-[bb:PLAYED_IN]-(b:Team)
WHERE a<>b
WITH toFloat(aa.scored*aa.scored) as team2, toFloat(bb.scored*bb.scored) as opp2, game, a,b
WITH ((team2)/(team2+opp2)) as PyEx, game,a,b
MATCH (game)-[:OCCURED_ON]->(day)<-[:HAS_DAY]-(month)<-[:HAS_MONTH]-(year)
WITH (365*2016 + 2016 /4 - 2016 /100 + 2016 /400 + 15 + (153*3+8)/5) as dayBeforeTournament,
(365*(year.year) + (year.year)/4 - (year.year)/100 + (year.year)/400 + (day.day) + (153*(month.month)+8)/5) as oldYear, PyEx,a,b
WITH ((4*365.25)-(dayBeforeTournament-oldYear))/(4*365.25) as weight, PyEx, a, b
WITH SUM(weight*PyEx) as winPower, a, b
MERGE (a)-[w:WINPOWER]->(b)
SET w.winPower = winPower, w.simulated = FALSE;
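To make the arithmetic easier to follow, here is a Python sketch that mirrors the query step by step: an approximate day number built with integer division, a linear decay over a four-year window ending on March 15, 2016, and a weighted sum of per-game Pythagorean expectations. The function names and the tuple layout for games are assumptions for illustration; only the formulas come from the query.

```python
def day_number(year, month, day):
    """Approximate day count, using the same integer arithmetic as the query."""
    return (365 * year + year // 4 - year // 100 + year // 400
            + day + (153 * month + 8) // 5)

WINDOW_DAYS = 4 * 365.25                       # four-season window
DAY_BEFORE_TOURNAMENT = day_number(2016, 3, 15)

def decay_weight(year, month, day):
    """1.0 for a game on the reference date, falling to 0.0 four years back."""
    age = DAY_BEFORE_TOURNAMENT - day_number(year, month, day)
    return (WINDOW_DAYS - age) / WINDOW_DAYS

def win_power(games):
    """games: iterable of (team_score, opp_score, year, month, day) tuples."""
    total = 0.0
    for team_score, opp_score, year, month, day in games:
        py_ex = team_score ** 2 / (team_score ** 2 + opp_score ** 2)
        total += decay_weight(year, month, day) * py_ex
    return total
```

As in the query, the result is a sum rather than an average, so pairs of teams that meet often accumulate larger values than pairs that met once.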

This works well, but a great many teams in the NCAA tournament haven’t previously played one another, so we need a way to approximate WINPOWER for teams without any direct contact.

inferredWPModel
So, we’ll use their intermediate opponents, and the relative WINPOWERs between them, as a proxy for how we might expect them to perform against one another. That is:

//Estimating Win Power (Indirect Win Power)
MATCH (a:Team)-[aa:WINPOWER]->(intermediate:Team)-[bb:WINPOWER]->(other:Team)
WITH a, other, SUM(aa.winPower) + SUM(bb.winPower) AS numer, SIZE(COLLECT(intermediate)) * 2 AS denom
WHERE a <> other AND SIZE((a)-[:WINPOWER]-(other)) = 0
MERGE (a)-[w:WINPOWER]->(other)
SET w.winPower = numer * 1.0 / denom, w.simulated = TRUE;
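The same idea can be sketched in Python over a plain dictionary of direct win powers. For each pair of teams with no direct relationship in either direction, it averages the win powers along every two-hop path through a shared opponent. The function name, the dictionary layout, and the team names are assumptions for illustration.

```python
from collections import defaultdict

def infer_win_power(direct):
    """direct: dict mapping (team, opponent) -> winPower from played games."""
    outgoing = defaultdict(dict)
    for (a, b), wp in direct.items():
        outgoing[a][b] = wp

    sums = defaultdict(float)
    counts = defaultdict(int)
    for a, mids in list(outgoing.items()):
        for mid, wp_a_mid in mids.items():
            for other, wp_mid_other in outgoing.get(mid, {}).items():
                if other == a:
                    continue  # skip paths that loop back to the same team
                if (a, other) in direct or (other, a) in direct:
                    continue  # these teams already have a direct win power
                sums[(a, other)] += wp_a_mid + wp_mid_other
                counts[(a, other)] += 2  # two relationships per path
    return {pair: sums[pair] / counts[pair] for pair in sums}

# Hypothetical teams: A and C never met, but both played B.
direct = {("A", "B"): 0.6, ("B", "A"): 0.4, ("B", "C"): 0.7, ("C", "B"): 0.3}
print(infer_win_power(direct))
```

Here A’s estimated win power against C is the average of A against B and B against C, and C’s estimate against A is built the same way from the reverse relationships.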

Testing the Stupid Thing…

Using the play-in games, I decided to see if I could at least predict those:

Screen Shot 2016-03-17 at 1.24.40 PM

So far, we’re 4 out of 4.

Here are my predictions for Round 1:

Screen Shot 2016-03-17 at 1.32.40 PM

I’ll keep you all updated on the efficacy of my recommendation engine as the tournament progresses.

Github Link to the Project 

Results through the Sweet Sixteen!
