datamining_881/data/raw/netflix/README

SUMMARY
================================================================================

This dataset was constructed to support participants in the Netflix Prize.  See
http://www.netflixprize.com for details about the prize.

The movie rating files contain over 100 million ratings from 480 thousand
randomly-chosen, anonymous Netflix customers over 17 thousand movie titles.  The
data were collected between October, 1998 and December, 2005 and reflect the
distribution of all ratings received during this period.  The ratings are on a
scale from 1 to 5 (integral) stars. To protect customer privacy, each customer
id has been replaced with a randomly-assigned id.  The date of each rating and
the title and year of release for each movie id are also provided.


USAGE LICENSE
================================================================================

Netflix can not guarantee the correctness of the data, its suitability for any
particular purpose, or the validity of results based on the use of the data set.
The data set may be used for any research purposes under the following
conditions:

     * The user may not state or imply any endorsement from Netflix.

     * The user must acknowledge the use of the data set in
       publications resulting from the use of the data set, and must
       send us an electronic or paper copy of those publications.

     * The user may not redistribute the data without separate
       permission.

     * The user may not use this information for any commercial or
       revenue-bearing purposes without first obtaining permission
       from Netflix.

If you have any further questions or comments, please contact the Prize
administrator <prizemaster@netflix.com>


TRAINING DATASET FILE DESCRIPTION
================================================================================

The file "training_set.tar" is a tar of a directory containing 17770 files, one
per movie.  The first line of each file contains the movie id followed by a
colon.  Each subsequent line in the file corresponds to a rating from a customer
and its date in the following format:

CustomerID,Rating,Date

- MovieIDs range from 1 to 17770 sequentially.
- CustomerIDs range from 1 to 2649429, with gaps. There are 480189 users.
- Ratings are on a five star (integral) scale from 1 to 5.
- Dates have the format YYYY-MM-DD.

MOVIES FILE DESCRIPTION
================================================================================

Movie information in "movie_titles.txt" is in the following format:

MovieID,YearOfRelease,Title

- MovieID do not correspond to actual Netflix movie ids or IMDB movie ids.
- YearOfRelease can range from 1890 to 2005 and may correspond to the release of
  corresponding DVD, not necessarily its theaterical release.
- Title is the Netflix movie title and may not correspond to
  titles used on other sites.  Titles are in English.


QUALIFYING AND PREDICTION DATASET FILE DESCRIPTION
================================================================================

The qualifying dataset for the Netflix Prize is contained in the text file
"qualifying.txt".  It consists of lines indicating a movie id, followed by a
colon, and then customer ids and rating dates, one per line for that movie id.
The movie and customer ids are contained in the training set.  Of course the
ratings are withheld. There are no empty lines in the file.

MovieID1:
CustomerID11,Date11
CustomerID12,Date12
...
MovieID2:
CustomerID21,Date21
CustomerID22,Date22

For the Netflix Prize, your program must predict the all ratings the customers
gave the movies in the qualifying dataset based on the information in the
training dataset.

The format of your submitted prediction file follows the movie and customer id,
date order of the qualifying dataset.  However, your predicted rating takes the
place of the corresponding customer id (and date), one per line.

For example, if the qualifying dataset looked like:

111:
3245,2005-12-19
5666,2005-12-23
6789,2005-03-14
225:
1234,2005-05-26
3456,2005-11-07

then a prediction file should look something like:
111:
3.0
3.4
4.0
225:
1.0
2.0

which predicts that customer 3245 would have rated movie 111 3.0 stars on the
19th of Decemeber, 2005, that customer 5666 would have rated it slightly higher
at 3.4 stars on the 23rd of Decemeber, 2005, etc.

You must make predictions for all customers for all movies in the qualifying
dataset.

THE PROBE DATASET FILE DESCRIPTION
================================================================================

To allow you to test your system before you submit a prediction set based on the
qualifying dataset, we have provided a probe dataset in the file "probe.txt".
This text file contains lines indicating a movie id, followed by a colon, and
then customer ids, one per line for that movie id.

MovieID1:
CustomerID11
CustomerID12
...
MovieID2:
CustomerID21
CustomerID22

Like the qualifying dataset, the movie and customer id pairs are contained in
the training set.  However, unlike the qualifying dataset, the ratings (and
dates) for each pair are contained in the training dataset.

If you wish, you may calculate the RMSE of your predictions against those
ratings and compare your RMSE against the Cinematch RMSE on the same data.  See
http://www.netflixprize.com/faq#probe for that value.


Good luck!


MD5 SIGNATURES AND FILE SIZES
================================================================================

d2b86d3d9ba8b491d62a85c9cf6aea39        577547 movie_titles.txt
ed843ae92adbc70db64edbf825024514      10782692 probe.txt
88be8340ad7b3c31dfd7b6f87e7b9022      52452386 qualifying.txt
0e13d39f97b93e2534104afc3408c68c           567 rmse.pl
0098ee8997ffda361a59bc0dd1bdad8b    2081556480 training_set.tar