Monday, October 30, 2006

Python stats functions for the Netflix Prize

A coworker and I entered the Netflix Prize, a competition to improve Netflix's recommendation engine. The grand prize is 1 million dollars if your algorithm can beat Netflix's by 10 percent. The task is to predict how a user will rate a movie based on previous ratings. The training dataset is around 2 GB uncompressed and works out to 100,480,507 ratings for 17,770 movies. I won't disclose our algorithms just yet, but here's something I will share: a couple of Python statistical functions I whipped up that calculate the mean, standard deviation, skewness, and kurtosis of a list. The only requirement is sqrt from the math module (from math import sqrt); pow is also available as a built-in.

There's already a Python stats module that you should use if you plan to do some heavy lifting with stats in Python, but if you're just looking for something a bit simpler, here are the functions. They make use of a neat Python feature called list comprehensions, which makes the code a lot easier to write and understand. Perl has no built-in counterpart, but the same can be accomplished with map and an anonymous function. One of my main motivations for working on this project is learning Python.
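To make the comparison concrete, here is a small sketch of the same computation written both ways in Python itself. It assumes Python 3, where map returns an iterator and needs list() around it; the ratings list is made up for illustration.

```python
ratings = [5, 4, 4, 3, 1]           # made-up sample ratings
avg = sum(ratings) / len(ratings)

# List comprehension: squared deviation of each rating from the mean.
squared_comp = [(val - avg) ** 2 for val in ratings]

# Same result with map() and an anonymous function (lambda).
squared_map = list(map(lambda val: (val - avg) ** 2, ratings))

print(squared_comp == squared_map)  # prints True
```

The comprehension reads left to right as "this expression, for each value", which is why it tends to be easier to follow than the map version.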

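from math import sqrt

def mean(x):
        """Find mean of a list"""
        # float() avoids integer division when x holds ints (e.g. 1-5 star ratings)
        return float(sum(x))/len(x)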


def stddev(x,mean):
        """Find standard deviation of a list"""
        return sqrt(sum([pow((val-mean),2) for val in x])/len(x))

def skewness(x,mean,stddev):
        """Find skewness of a list"""
        return sum([pow(val-mean,3) for val in x])/(len(x)*pow(stddev,3))


def kurtosis(x,mean,stddev):
        """Find excess kurtosis of a list"""
        # divide by len(x), not len(x)-1, to match stddev and skewness above
        return (sum([pow(val-mean,4) for val in x]) / (len(x) * pow(stddev,4))) - 3
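As a rough usage sketch, the four statistics can be chained in one helper. The name describe and the sample list are hypothetical, and the sketch assumes Python 3 and the population convention (dividing by the number of values) for every moment.

```python
from math import sqrt

def describe(x):
    """Return mean, population std dev, skewness and excess kurtosis of x."""
    n = len(x)
    mean = sum(x) / n
    stddev = sqrt(sum([(val - mean) ** 2 for val in x]) / n)
    skewness = sum([(val - mean) ** 3 for val in x]) / (n * stddev ** 3)
    kurtosis = sum([(val - mean) ** 4 for val in x]) / (n * stddev ** 4) - 3
    return {"mean": mean, "stddev": stddev,
            "skewness": skewness, "kurtosis": kurtosis}

# Made-up ratings for one hypothetical movie (1 to 5 stars).
sample = [1, 2, 3, 4, 5]
stats = describe(sample)
for key in ("mean", "stddev", "skewness", "kurtosis"):
    print(key, stats[key])
```

Note that a list whose values are all identical has a standard deviation of zero, so the skewness and kurtosis lines would raise ZeroDivisionError; a real run over the dataset would need to guard against that.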

After I calculated these stats for the movies, I loaded the information into a database and played around with the data.
The fifteen most rated movies are:
mysql> select name,ratings from movie order by ratings desc limit 15;
+--------------------------------------------------------+---------+
| name                                                   | ratings |
+--------------------------------------------------------+---------+
| Miss Congeniality                                      |  232944 |
| Independence Day                                       |  216596 |
| The Patriot                                            |  200832 |
| The Day After Tomorrow                                 |  196397 |
| Pirates of the Caribbean: The Curse of the Black Pearl |  193941 |
| Pretty Woman                                           |  193295 |
| Forrest Gump                                           |  181508 |
| The Green Mile                                         |  181426 |
| Con Air                                                |  178068 |
| Twister                                                |  177556 |
| Sweet Home Alabama                                     |  176539 |
| Pearl Harbor                                           |  173596 |
| Armageddon                                             |  171991 |
| The Rock                                               |  164792 |
| What Women Want                                        |  162597 |
+--------------------------------------------------------+---------+
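The load-and-query step could be sketched roughly as follows. This uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs without a server; the column names come from the queries in this post, while the column types and the three rows are invented for illustration.

```python
import sqlite3

# Hypothetical per-movie rows: (name, ratings, mean, std_dev, skewness, kurtosis).
# The numbers are invented, not taken from the dataset.
rows = [
    ("Movie A", 1200, 3.9, 1.0, -0.8, 0.4),
    ("Movie B", 300, 2.1, 1.3, 0.9, -0.2),
    ("Movie C", 5400, 4.4, 0.8, -1.5, 2.6),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE movie (name TEXT, ratings INTEGER, mean REAL,"
    " std_dev REAL, skewness REAL, kurtosis REAL)"
)
conn.executemany("INSERT INTO movie VALUES (?, ?, ?, ?, ?, ?)", rows)

# Same shape as the query above: most-rated movies first.
top = conn.execute(
    "SELECT name, ratings FROM movie ORDER BY ratings DESC LIMIT 15"
).fetchall()
for name, ratings in top:
    print(name, ratings)
conn.close()
```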

The movies with the highest mean are:
mysql> select name,mean from movie order by mean desc limit 15;
+---------------------------------------------------------------------+---------------+
| name                                                                | mean          |
+---------------------------------------------------------------------+---------------+
| Lord of the Rings: The Return of the King: Extended Edition         | 4.72326993942 |
| The Lord of the Rings: The Fellowship of the Ring: Extended Edition | 4.71661090851 |
| Lord of the Rings: The Two Towers: Extended Edition                 | 4.70261096954 |
| Lost: Season 1                                                      | 4.67098903656 |
| Battlestar Galactica: Season 1                                      | 4.63880920410 |
| Fullmetal Alchemist                                                 | 4.60502147675 |
| Trailer Park Boys: Season 4                                         | 4.59999990463 |
| Trailer Park Boys: Season 3                                         | 4.59999990463 |
| Tenchi Muyo! Ryo Ohki                                               | 4.59550571442 |
| The Shawshank Redemption: Special Edition                           | 4.59338378906 |
| Veronica Mars: Season 1                                             | 4.59208393097 |
| Ghost in the Shell: Stand Alone Complex: 2nd Gig                    | 4.58636379242 |
| Arrested Development: Season 2                                      | 4.58238935471 |
| The Simpsons: Season 6                                              | 4.58129596710 |
| Inu-Yasha                                                           | 4.55443429947 |
+---------------------------------------------------------------------+---------------+

The movies with the lowest mean are:
mysql> select name,mean from movie order by mean asc limit 15;
+----------------------------------+---------------+
| name                             | mean          |
+----------------------------------+---------------+
| Avia Vampire Hunter              | 1.28787875175 |
| Zodiac Killer                    | 1.34602081776 |
| Alone in a Haunted House         | 1.37560975552 |
| Vampire Assassins                | 1.39676117897 |
| Absolution                       | 1.39999997616 |
| The Worst Horror Movie Ever Made | 1.39999997616 |
| Ax 'Em                           | 1.42222225666 |
| Dark Harvest 2: The Maize        | 1.45238089561 |
| Half-Caste                       | 1.48739492893 |
| The Horror Within                | 1.49624061584 |
| Vampires vs. Zombies             | 1.49659860134 |
| The Bogus Witch Project          | 1.49775779247 |
| Rise of the Undead               | 1.50295853615 |
| Vampiyaz                         | 1.50391650200 |
| Underground Comedy Movie         | 1.50503361225 |
+----------------------------------+---------------+

The movies with the highest standard deviation (among those with more than 500 ratings) are:
mysql> select name,std_dev from movie where ratings > 500 order by std_dev desc limit 15;
+----------------------------------------+---------------+
| name                                   | std_dev       |
+----------------------------------------+---------------+
| 'N Sync: Live at Madison Square Garden | 1.61672019958 |
| Dragon Ball Z: Vol. 17: Super Saiyan   | 1.58626019955 |
| Dragon Ball Z: Trunks Saga             | 1.58625769615 |
| Sailor Moon R: The Promise of the Rose | 1.54338657856 |
| Princess Nine                          | 1.54320299625 |
| Big Brother 3                          | 1.52889740467 |
| Family Guy: Live in Las Vegas          | 1.51085150242 |
| Dragon Ball Z: Imperfect Cell Saga     | 1.51082730293 |
| Cher: Live in Concert                  | 1.50970792770 |
| Cardcaptor Sakura                      | 1.50106477737 |
| Tupac Shakur: Before I Wake            | 1.49814224243 |
| Slayers Try DVD Collection             | 1.48955285549 |
| Dragon Ball Z: The World's Strongest   | 1.48337769508 |
| Dragon Ball: The Saga of Goku          | 1.48244822025 |
| Dragon Ball Z: Super Android 13        | 1.47919940948 |
+----------------------------------------+---------------+

The movies with the highest kurtosis (a tight cluster of ratings plus a few extreme values far from the mean) are:
mysql> select name,kurtosis from movie where ratings > 500 order by kurtosis desc limit 15;
+---------------------------------------------------------------------+---------------+
| name                                                                | kurtosis      |
+---------------------------------------------------------------------+---------------+
| Lord of the Rings: The Return of the King: Extended Edition         | 8.98888015747 |
| The Lord of the Rings: The Fellowship of the Ring: Extended Edition | 8.81478309631 |
| Lost: Season 1                                                      | 8.58331871033 |
| Lord of the Rings: The Two Towers: Extended Edition                 | 7.96247339249 |
| Veronica Mars: Season 1                                             | 7.01007556915 |
| Battlestar Galactica: Season 1                                      | 6.57994556427 |
| Arrested Development: Season 2                                      | 6.34450483322 |
| Fullmetal Alchemist                                                 | 6.30700731277 |
| The Simpsons: Season 6                                              | 5.94859170914 |
| Inu-Yasha                                                           | 5.37345218658 |
| The Simpsons: Season 5                                              | 5.32154130936 |
| Fruits Basket                                                       | 5.32094144821 |
| House                                                               | 5.26964759827 |
| Family Guy: Vol. 2: Season 3                                        | 5.23487997055 |
| The Simpsons: Season 4                                              | 5.13775539398 |
+---------------------------------------------------------------------+---------------+
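One way to see why heavily favored titles can top this list is the following sketch, using invented rating lists and assuming Python 3. A movie where nearly everyone gives 5 stars and a handful give 1 star has most of its spread coming from those few outliers, which pushes excess kurtosis up; an evenly spread list comes out negative. The helper name excess_kurtosis is hypothetical.

```python
from math import sqrt

def excess_kurtosis(x):
    """Excess kurtosis using population moments (divide by len(x))."""
    n = len(x)
    mean = sum(x) / n
    stddev = sqrt(sum([(val - mean) ** 2 for val in x]) / n)
    return sum([(val - mean) ** 4 for val in x]) / (n * stddev ** 4) - 3

# Invented data: 95 five-star ratings and 5 one-star ratings.
peaked = [5] * 95 + [1] * 5
# Invented data: ratings spread evenly over 1 to 5 stars.
flat = [1, 2, 3, 4, 5] * 20

print(excess_kurtosis(peaked))  # large and positive
print(excess_kurtosis(flat))    # negative
```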

The movies with the highest skewness (long positive tail = a few people who rated a bad movie highly) are:
mysql> select name,skewness from movie where ratings > 500 order by skewness desc limit 15;
+--------------------------------------------------------------------------+---------------+
| name                                                                     | skewness      |
+--------------------------------------------------------------------------+---------------+
| Underground Comedy Movie                                                 | 2.18097925186 |
| Ben & Arthur                                                             | 2.11108279228 |
| Visions of Sugarplums                                                    | 1.73864912987 |
| Dracula 3000                                                             | 1.56867814064 |
| Guilty by Association                                                    | 1.49578213692 |
| Survivor Exposed                                                         | 1.48025572300 |
| Leonard Part 6                                                           | 1.40650367737 |
| Homo Heights                                                             | 1.36938285828 |
| Going Overboard                                                          | 1.35962140560 |
| Shanghai Surprise                                                        | 1.27664005756 |
| Sopranos Unauthorized: Shooting Sites Uncovered                          | 1.27566981316 |
| Twentynine Palms                                                         | 1.20583796501 |
| The Life                                                                 | 1.15580248833 |
| National Lampoon's Christmas Vacation 2: Cousin Eddie's Island Adventure | 1.05684077740 |
| Glitter                                                                  | 1.04486155510 |
+--------------------------------------------------------------------------+---------------+
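The sign of the skewness can be checked on invented data with a short sketch, assuming Python 3. A list of mostly 1-star ratings with a few 5-star ratings has a long tail on the high side and comes out positive; the mirror image comes out negative by the same amount. The helper name skew is hypothetical.

```python
from math import sqrt

def skew(x):
    """Skewness using population moments (divide by len(x))."""
    n = len(x)
    mean = sum(x) / n
    stddev = sqrt(sum([(val - mean) ** 2 for val in x]) / n)
    return sum([(val - mean) ** 3 for val in x]) / (n * stddev ** 3)

# Invented data: a "bad" movie with a few enthusiastic outliers.
mostly_low = [1] * 90 + [5] * 10
# Invented data: a "good" movie with a few detractors.
mostly_high = [5] * 90 + [1] * 10

print(skew(mostly_low))   # positive: tail toward high ratings
print(skew(mostly_high))  # negative: tail toward low ratings
```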
