Netflix Prize for Dummies [ I.b ]

Yes, I wasn’t happy at all with the previous code, so I changed it. Processing time improved: aggregating all the files into a single one now takes 13 minutes and 59 seconds (though the output grew to 2.62GB). I have also modified the structure so it’ll be easier to load the data into a database. The new file is a CSV with 4 columns: movieid, userid, rating, date.
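To make the four-column layout concrete, here is a minimal sketch that reads back the first few rows of the aggregated file and splits each one into its fields. It assumes the file already exists at the path used by the macro in this post; the names ParseRow and PreviewRows are hypothetical and only serve the illustration.

```vba
Option Explicit

' Hypothetical helper: split one "movieid,userid,rating,date" row into its fields.
' Split returns a zero-based array, so movieid is index 0 and date is index 3.
Function ParseRow(ByVal Row As String) As String()
    ParseRow = Split(Row, ",")
End Function

' Hypothetical preview routine: prints the first five rows to the Immediate window.
Sub PreviewRows()
    Dim Row As String
    Dim Fields() As String
    Dim Count As Long

    Open "C:\Netflix\training_set.txt" For Input Access Read As #1

    ' Stop after five rows; the full file is far too large to print.
    Do While Not EOF(1) And Count < 5
        Line Input #1, Row
        Fields = ParseRow(Row)
        Debug.Print "movie " & Fields(0) & ", user " & Fields(1) & _
                    ", rating " & Fields(2) & ", date " & Fields(3)
        Count = Count + 1
    Loop

    Close #1
End Sub
```

Keeping one rating per row with the movie id repeated is what makes the file straightforward to bulk-load into a single table later.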

Here’s the VBA code:

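Sub GroupData()

    ' Start time, used to report the elapsed time at the end.
    Dim T As Date
    T = Now

    Dim N As Long
    Dim Text1 As String
    Dim Text2 As String
    Dim Text3 As String

    ' Aggregated output: one "movieid,userid,rating,date" row per rating.
    Open "C:\Netflix\training_set.txt" For Output Access Write As #1

    For N = 1 To 17770
        Open "C:\Netflix\training_set\mv_00" & Format(N, "00000") & ".txt" For Input Access Read As #2

        ' For the first line.
        ' The slicing assumes Text1 holds the "N:" header plus the first user id,
        ' and Text3 holds the 10-character date followed by the next user id.
        Input #2, Text1, Text2, Text3
        Print #1, N & "," & Right(Text1, Len(Text1) - (Len(CStr(N)) + 2)) & "," & Text2 & "," & Left(Text3, 10)

        ' For the rest of the lines.
        ' The user id comes from the tail of the field kept in Text3 on the previous read.
        Do While Not EOF(2)
            Input #2, Text1, Text2
            Print #1, N & "," & Right(Text3, Len(Text3) - 11) & "," & Text1 & "," & Left(Text2, 10)
            Text3 = Text2
        Loop

        Close #2
    Next N

    Close #1

    ' Show the elapsed time.
    MsgBox Format(Now - T, "hh:mm:ss")

End Sub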
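
The Left/Right slicing is easier to follow in isolation. The macro only works as written if Input # hands back a date and the following user id glued together in one field, which would happen if the per-movie files use bare line feeds that VBA does not treat as record separators. That is an inference from the code, not something stated in the post. A minimal sketch under that assumption, with hypothetical helper names (LeadingDate, TrailingUserId) and made-up sample values:

```vba
Option Explicit

' Hypothetical helper: the date is the first 10 characters (yyyy-mm-dd).
Function LeadingDate(ByVal Field As String) As String
    LeadingDate = Left(Field, 10)
End Function

' Hypothetical helper: skip the 10-character date plus the 1-character
' line feed; whatever remains is the user id of the next rating.
Function TrailingUserId(ByVal Field As String) As String
    TrailingUserId = Right(Field, Len(Field) - 11)
End Function

Sub ShowSlicing()
    Dim Field As String

    ' Made-up sample, assuming the field arrives as date + line feed + user id.
    Field = "2005-01-31" & vbLf & "1234567"

    Debug.Print LeadingDate(Field)      ' prints 2005-01-31
    Debug.Print TrailingUserId(Field)   ' prints 1234567
End Sub
```

Running ShowSlicing prints both values to the Immediate window. Neither helper guards against fields shorter than 11 characters; the macro avoids that case because the last date read from each file is never sliced for a user id.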

  • http://www.kproductivity.com/ Francisco Marco-Serrano

    Remember, this is just for dummies. Don’t start on me with “it could be optimised!”, “what crappy code!”, blah, blah, blah…

    Of course, I would accept suggestions! ; )

  • http://www.kproductivity.com/ Francisco Marco-Serrano

    Also, consider adapting the above code to transform the other files: “probe.txt” and “qualifying.txt”.