World's most popular travel blog for travel bloggers.

# How to compare two objects for percentage of equivalence

, ,
Problem Detail:

I'm trying to create a nodeJS application. It allows users to rate a bunch of songs and it stores them in their user profiles. I use this information to compare them to other users, and try to find users with similar interests in songs, and I suggest them new songs based on that.

Each user profile basically looks like this

userID: [the userid of the user] songs: [the list of songs that the user has rated] ratings: [the corresponding ratings each user gave to the song] 

Each song is represented by a 9 digit whole number and each rating is a whole number from 1 to 6. Basically put, I have to compare one user to the rest of the users to determine which of them match this user the best. By match, I mean gave the same songs similar ratings. To do this, I created a simple algorithm.

Step 1) Create a list, in which each entry maps our target user to each of the other          users.  Step 2) Now consider each entry, determine which songs both of the users have rated, and          store those songs (along with the corresponding ratings each user gave them) in          this entry itself  Step 3) Now iterate through each entry and perform the following operations           a) let percentage = 0         b) let num = [the number of songs that both users rated]         c) iterate through each song (that both the users rated) and perform the            following operations                  i)    determine the score our target user gave this song and store it in                        variable a                 ii)   determing the score the other user gave this song and store it in                        variable b                 iii)  map a and b to new values based on this                          1 --> -3                         2 --> -2                         3 --> -1                         4 -->  1                         5 -->  2                         6 -->  3                 iv)   now calculate sum as                      sum = |a| + |b|                     where || is the absolute values                 v)    now calculate degree as                      degree = sum/2                 vi)   now if a * b is less than 0 then                             calculate (percentage - degree) and store that value                             again in percentage                           if not then                              calculate (percentage - degree) and store that value                              again in percentage          d) now  calculate (50 + (percentage / (6*num))*100) and store that value             in this entry as match  Step 4) Now that I have my list of entries (along with the match between each pair          of target user and other user) I sort the list in descending order of match          and from that, I can determine which users have the closest match in taste by          selecting from the first entries 

Now for each user pair, I am completely neglecting the songs that one user rated and the other didn't, and that is okay for me.

There are several problems with this method though, the biggest one being that this algorithm is very time consuming for a large set of users (say around 1,000,000). And also that I have to load in all the users, every time I need to find a set of matching users for just 1 user. And I need to do this repeatedly for that user, to update his/her list.

Is there a way to make this more efficient? Can I assign a value to each user that takes into account all the songs that they rated, and use that number to compare the users? Is that even possible? I guess what I'm asking is, how can I compare and contrast this data, mathematically, to find similar users, efficiently. Also, can someone suggest a proper tag for this type of question?

EDIT

D.W. has suggested a solution, but how do I implement it? Can you provide more detail?

One approach is to use low-rank matrix factorization to approximate the ratings matrix, then use a nearest neighbors data structure.

In particular, let $M$ be the $m \times n$ ratings matrix, where $M_{ij}$ is the rating that user $i$ has provided to user $j$. Look for a $m\times r$ matrix $U$ and a $r \times n$ matrix $V$ such that $M$ is well-approximated by $UV$ (e.g., $||M-UV||$ is minimized, except that missing entries of $M$ are ignored).

Once you've done this, each user can be mapped to a $r$-dimensional vector that summarizes their scores. Store the set of all of these vectors in a data structure that supports nearest neighbor search, e.g., a k-d tree. If you have chosen $r$ small enough (e.g., 10 or so), then nearest neighbor search should be very efficient. This will let you find the users who are most similar to a given user, and thus make a recommendation (e.g., using the k-nearest neighbors classifier).

There are many other approaches you could use: read the extensive literature on recommender systems and collaborative filtering.

As far as your other issue:

I have to load in all the users, every time I need to find a set of matching users for just 1 user

Keep the data structure in memory, rather than re-loading it into memory each time.