Statistics may not have been the most popular subject in school, but it’s a VIP in technology. You’re witnessing statistical correlations every time you see suggestions based on purchase history or recommendations based on “likes.”
Netflix is perhaps the all-star when it comes to correlations. How many times have you asked yourself, “How did Netflix know I would love that show?!” To understand how Netflix uses correlations to recommend shows we may enjoy, we’ll break down the statistical concepts behind their technology.
Wait, what is a correlation again?
A correlation measures how closely 2 things move together. One of the best examples is summer temperatures and ice cream sales. Ice cream sales are positively correlated with the temperature: as the temperature goes up, so do ice cream sales.
Correlations are so powerful because you can compare 2 things with completely different units of measurement. In the ice cream example, sales are measured in dollars and temperature is measured in degrees Fahrenheit.
Positive vs Negative Correlations
- Positive correlation: a change in one is associated with a change in the other in the same direction.
- Example: Weight and height. Taller people weigh more on average than shorter people.
- Negative correlation: a change in one is associated with a change in the other in the opposite direction.
- Example: Exercise and weight. The more you exercise, the less you weigh on average.
How to Measure It
- A correlation goes from -1 to 1.
- A correlation of 1 is often described as a perfect positive correlation. That means that every change in one variable is associated with a proportional change in the other variable in the same direction.
- A correlation of -1 is a perfect negative correlation. That means that every change in one variable is associated with a proportional change in the other variable in the opposite direction.
- A correlation of 0 means that the variables have no meaningful linear association with one another. Example: The relationship between shoe size and SAT scores.
- The closer to -1 or 1, the stronger the association.
How to Calculate a Correlation
1. Convert the first measurement to standard units: (measurement – mean) / standard deviation
2. Convert the second measurement to standard units: (measurement – mean) / standard deviation
3. For each observation, multiply the result from #1 by the result from #2.
4. Calculate the correlation coefficient by using the sum of the products calculated above divided by the number of observations.
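The four steps above can be condensed into a single formula. The notation is our own shorthand rather than anything from the sources: $x_i$ and $y_i$ are the paired measurements, $\bar{x}$ and $\bar{y}$ are their means, $\sigma_x$ and $\sigma_y$ are their (population) standard deviations, and $n$ is the number of observations.

```latex
r = \frac{1}{n} \sum_{i=1}^{n}
    \left( \frac{x_i - \bar{x}}{\sigma_x} \right)
    \left( \frac{y_i - \bar{y}}{\sigma_y} \right)
```

Each parenthesized term is a measurement converted to standard units, so $r$ is simply the average of the products.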
How Netflix uses this to tell you which movies you will like
Disclaimer: This is a simplified version of what Netflix actually uses. Netflix offered a $1 million prize to the team that came up with the best algorithm, but the result is still a fancy version of using correlations. As a result, the new rating system has two options: thumbs up or thumbs down (and they can do this because they have a ginormous dataset).
Netflix basically calculates a correlation between individuals based on how they rate the same movies. The more positively correlated you are with someone, the more likely you are to like a movie they have rated positively that you haven’t rated yet. The more negatively correlated you are with someone, the more likely you are to dislike a movie they have rated positively, so that movie shouldn’t show up in your feed of suggestions.
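One way to picture this idea is a correlation-weighted vote. The sketch below is our own illustration in Python, not Netflix's method: the function name `predicted_lean` is made up, and the ratings and averages in the example are hypothetical. Only Adam's two correlations (0.97 with Austin, -0.54 with Lindsay) come from the worked example in this article.

```python
def predicted_lean(correlations, ratings, means):
    """Correlation-weighted vote on one movie you have not rated.

    correlations: your correlation with each other person
    ratings:      each person's rating of the movie (absent if unrated)
    means:        each person's average rating across their movies
    Returns a positive number if the movie leans "like",
    a negative number if it leans "dislike".
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for person, r in correlations.items():
        if person not in ratings:
            continue  # this person has not rated the movie
        # How far above or below their own average they rated it.
        deviation = ratings[person] - means[person]
        weighted_sum += r * deviation
        total_weight += abs(r)
    return weighted_sum / total_weight if total_weight else 0.0


# Adam's correlations with Austin and Lindsay, as computed in this article.
adam_correlations = {"Austin": 0.97, "Lindsay": -0.54}
# Hypothetical ratings and averages, for illustration only.
new_movie_ratings = {"Austin": 5, "Lindsay": 2}
average_ratings = {"Austin": 3.5, "Lindsay": 3.0}

lean = predicted_lean(adam_correlations, new_movie_ratings, average_ratings)
print(lean > 0)  # Austin rated it high and Lindsay rated it low
```

A negative correlation flips the sign of that person's opinion, which is why a low rating from Lindsay counts in the movie's favor for Adam.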
Suppose we have 4 people who have rated movies on a scale of 1 to 5 stars, where 1 means they disliked the movie and 5 means they loved it.
|Message in a Bottle||1||4||1||5|
|Sleepless in Seattle||1||5||1||1|
Correlate Adam to 3 others
Let’s say we want to know how closely Adam’s ratings correlate with those of the other 3 people.
First, we need to calculate the mean (average) of Adam’s ratings
Note: In Excel, you can use the AVERAGE function.
1. Sum all the ratings:
4 + 5 + 5 + 1 + 1 + 4 + 5 + 5 + 5 = 35
2. Divide by number of ratings:
35 / 9 = 3.89
Next, we need to calculate the standard deviation of Adam’s ratings
Note: In Excel, you can use the STDEVP function.
1. For each rating, subtract the mean and square the result:
(4 - 3.89)^2 = 0.01
(5 - 3.89)^2 = 1.23
...
2. Calculate the average of the results:
= (0.01 + 1.23 + 1.23 + 8.35 + 8.35 + 0.01 + 1.23 + 1.23 + 1.23) / 9 ≈ 22.89 / 9 ≈ 2.54
3. Take the square root of the average:
= √(2.54) = 1.59
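If you would rather check these numbers in code than in Excel, here is a minimal Python sketch using the standard library. The list is Adam's nine ratings in the order they appear in the sum above. `pstdev` is the population standard deviation, which divides by the number of ratings just like the steps shown here.

```python
from statistics import fmean, pstdev

# Adam's nine ratings, in the order used in the sum above.
adam = [4, 5, 5, 1, 1, 4, 5, 5, 5]

mean = fmean(adam)  # plays the role of Excel's AVERAGE
std = pstdev(adam)  # population standard deviation, like Excel's STDEVP

print(round(mean, 2))  # 3.89
print(round(std, 2))   # 1.59
```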
Convert Adam’s rating of each movie to standard units
This calculation is (rating – mean) / standard deviation.
Top Gun = (4 - 3.89) / 1.59 = 0.07
Jurassic Park = (5 - 3.89) / 1.59 = 0.70
...
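The same conversion can be sketched in Python for all nine of Adam's ratings at once. The helper name `standard_units` is our own. It uses the population standard deviation, so the results should match the hand calculation up to rounding.

```python
from statistics import fmean, pstdev


def standard_units(ratings):
    """Convert each rating to standard units: (rating - mean) / std."""
    mean = fmean(ratings)
    std = pstdev(ratings)  # population standard deviation
    return [(rating - mean) / std for rating in ratings]


# Adam's ratings; the first two are Top Gun and Jurassic Park.
adam = [4, 5, 5, 1, 1, 4, 5, 5, 5]
adam_units = standard_units(adam)

print([round(z, 2) for z in adam_units])
```

A handy sanity check is that standard units always average out to zero, because every rating is measured relative to the person's own mean.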
Next, we need to follow the previous 3 steps for each person
I won’t bore you by doing this over and over. I’ll just show you the results we have so far. Note: S/U = Std. Units
|Message in a Bottle||1||-1.81||4||0.93||1||-1.87||5||2.31|
|Sleepless in Seattle||1||-1.81||5||1.69||1||-1.87||1||-1.15|
Multiply the standard units of each person together
Adam's Top Gun Standard Unit (0.07) * Lindsay's Top Gun Standard Unit (-1.35) = 0.07 * -1.35 = -0.09
Again, I don’t want to bore you so let’s just show the results of Adam compared to Lindsay so far.
| Movie | Adam | Lindsay | Adam: Std. Units | Lindsay: Std. Units | Product |
| --- | --- | --- | --- | --- | --- |
| Message in a Bottle | 1 | 4 | -1.81 | 0.93 | -1.68 |
| Sleepless in Seattle | 1 | 5 | -1.81 | 1.69 | -3.06 |
Finally: Calculate the correlation coefficient
The correlation coefficient is the final number we need to compare 2 people. You simply take the average of the products.
= (-0.09 + -0.41 + 0.12 + -1.68 + -3.06 + -0.09 + -0.41 + 0.12 + 0.65) / 9 ≈ -4.88 / 9 ≈ -0.54
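Here is a small Python sketch of the multiply-then-average step for Adam and Lindsay. Adam's ratings come from the sum shown earlier. Lindsay's full list does not appear in the text above, so the one used here is an assumption: it was worked backward from the standard units and products shown, and should be treated as illustrative.

```python
from statistics import fmean, pstdev


def standard_units(ratings):
    """Convert each rating to standard units: (rating - mean) / std."""
    mean, std = fmean(ratings), pstdev(ratings)
    return [(rating - mean) / std for rating in ratings]


adam = [4, 5, 5, 1, 1, 4, 5, 5, 5]
# Assumption: inferred from the standard units and products shown above.
lindsay = [1, 2, 3, 4, 5, 1, 2, 3, 4]

# Multiply the two standard units movie by movie, then average.
products = [a * l for a, l in zip(standard_units(adam), standard_units(lindsay))]
r = fmean(products)

print([round(p, 2) for p in products])
print(round(r, 2))  # -0.54
```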
What does this mean?
Our correlation coefficient between Adam and Lindsay ended up being -0.54. This means that, most of the time, Adam and Lindsay are not going to like the same movies. If this had been -1, they would have a perfect negative correlation: any time Lindsay likes a movie, Adam will more than likely not like it.
You should be able to build a Matrix from these numbers to more easily compare 2 people.
From this, you can see that Adam and Austin are very positively correlated (0.97) so they should like very similar things.
Step-by-Step to Compare 2 People
1. Get the mean rating for all movies for Person 1
2. Get the standard deviation for all movies for Person 1
3. Convert the rating of each movie to standard units for Person 1: (rating – mean) / standard deviation
4. Get the mean rating for all movies for Person 2
5. Get the standard deviation for all movies for Person 2
6. Convert the rating of each movie to standard units for Person 2: (rating – mean) / standard deviation
7. Calculate the product for each movie rating (Standard Unit for Person 1’s rating of the movie * Standard Unit for Person 2’s rating of the movie)
8. Calculate the correlation coefficient (Sum of the products from #7 divided by the number of movies they rated)
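The eight steps above can be sketched as one reusable Python function. This is a minimal illustration rather than production code: the name `compare_people` is our own, it assumes both people rated the same movies in the same order, and the toy ratings in the usage lines are made up.

```python
from math import sqrt


def compare_people(ratings_1, ratings_2):
    """Correlation between two people's ratings of the same movies."""
    if not ratings_1 or len(ratings_1) != len(ratings_2):
        raise ValueError("both people must rate the same, non-empty set of movies")
    n = len(ratings_1)

    # Steps 1 and 4: mean rating for each person.
    mean_1 = sum(ratings_1) / n
    mean_2 = sum(ratings_2) / n

    # Steps 2 and 5: population standard deviation for each person.
    std_1 = sqrt(sum((r - mean_1) ** 2 for r in ratings_1) / n)
    std_2 = sqrt(sum((r - mean_2) ** 2 for r in ratings_2) / n)
    if std_1 == 0 or std_2 == 0:
        raise ValueError("undefined when someone gives every movie the same rating")

    # Steps 3 and 6: convert each rating to standard units.
    units_1 = [(r - mean_1) / std_1 for r in ratings_1]
    units_2 = [(r - mean_2) / std_2 for r in ratings_2]

    # Step 7: product of the two standard units for each movie.
    products = [u1 * u2 for u1, u2 in zip(units_1, units_2)]

    # Step 8: average of the products.
    return sum(products) / n


# Made-up ratings for three movies, for illustration only.
same_taste = compare_people([1, 3, 5], [2, 3, 4])
opposite_taste = compare_people([1, 3, 5], [5, 3, 1])
print(round(same_taste, 2), round(opposite_taste, 2))
```

The guard against a zero standard deviation matters in practice: someone who gives every movie 5 stars has no variation to correlate with, and the division would otherwise fail.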
Hopefully, this is of some use to you. Try to think of some scenarios where you might be able to use correlations to compare 2 things. We have a few projects where we need to recommend something or predict whether someone will like it. This is perfect for correlations.
If you want to see the full calculations, we’ve created a Google Sheets file with calculations – courtesy of Airship!
Naked Statistics: Stripping the Dread from the Data – Charles Wheelan (2014)
Netflix Prize – Wikipedia
Google Sheets file with calculations – From Airship!
Netflix knows what I like by Jana Vembunarayanan, Oct 201