The Netflix Prize is, in the company’s own words, the “quest” for “substantially improve(ing) the accuracy of predictions about how much someone is going to love a movie based on their movie preferences”.
I read about the prize last February on Michael Trick’s blog, and the first thing I saw was the $1 million for the winner. However, although we’re in it for the money (YES!), we don’t think we’re going to get it. So, let’s mess about with it!
For all of you who are, like me, amateur OR-ers, I’m starting a series of posts showing where the heck I am.
1) The data: the training set (the data you have to use to create the model) is made up of 17,770 text files. So, although some experts on Netflix’s forums advise against grouping them, I will.
Following my own weaknesses and economist-like mind, I’m going to group the data into a single file in order to dump it into a database (PostgreSQL, probably). What’s more, as I don’t have time to learn any other language, I’ll be using VBA for Excel.
Here we go…
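Sub MergeTrainingSet()
    Dim N As Long
    Dim TextoArchivo As String
    ' Merged output file
    Open "C:\training_set.txt" For Output As #1
    For N = 1 To 17770
        ' Input files are named mv_0000001.txt to mv_0017770.txt
        Open "C:\training_set\mv_00" & Format(N, "00000") & ".txt" For Input As #2
        Do While Not EOF(2)
            Line Input #2, TextoArchivo
            Print #1, TextoArchivo
        Loop
        Close #2
    Next N
    Close #1
End Sub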
The module above takes about 30 minutes (Pentium 1.73 GHz, 1 GB RAM) to process the data into a single file of 1.92 GB.
Next, the database.