Making the implementation of a recsystem much simpler
We have to do the groundwork for the competitors (in French we would say "mâcher le travail").
You may think this is too simplistic; in my opinion it is not, it is just an easy way to do recommendations. Later it can be made much smarter (in terms of latency, computational efficiency, ...), for example by properly using streams and API calls as you proposed.
The idea is to provide students (competitors) with a single instance of something that does everything.
Here is a fake Python script that does recommendations:

```python
import random

from renewal.recsystools import RecsysTools

rectools = RecsysTools(token="XXX")  # Init of the RecsysTools instance

while True:
    # Wait for bucket consumption from assigned users
    users = rectools.wait_for_not_served_users(min_pending_reclists=1)
    for user in users:
        candidates = rectools.get_candidates(user)  # Get data using the API behind
        history = rectools.get_history(user)
        reclist = random.sample(candidates, 30)  # Competitor algorithm (random)
        rectools.put_reclist(user, reclist)
```
Init of the RecsysTools instance
Here we do everything to initialize the streams and the API connection, plus load the assigned users, cache the candidate news, and so on. These data are kept up to date through the streaming queues.
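To make this concrete, here is a minimal sketch of what the init could look like. Everything here is an assumption (the class internals, the `_load_assigned_users` helper, and the user names are made up for illustration), not the actual implementation:

```python
from collections import deque


class RecsysTools:
    """Hypothetical sketch of the all-in-one client (internals are assumptions).

    On init it would authenticate against the API with the token, load the
    list of assigned users, and prepare per-user state; a background thread
    consuming the streaming queues would keep the candidate cache fresh.
    """

    def __init__(self, token):
        self.token = token
        # Assumed state: assigned users, a per-user candidate cache,
        # and a per-user queue of pending (not yet consumed) reclists.
        self.assigned_users = self._load_assigned_users()
        self.candidate_cache = {user: [] for user in self.assigned_users}
        self.pending_reclists = {user: deque() for user in self.assigned_users}
        # A background consumer of the streaming queue would be started here,
        # e.g. with threading.Thread(target=..., daemon=True).start()

    def _load_assigned_users(self):
        # Placeholder: in reality this would call the competition API.
        return ["user_a", "user_b"]
```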
We wait for bucket consumption from assigned users
Behind wait_for_not_served_users, there is a reclist queue for each assigned user.
I originally specified that this queue must be on the server side (it was a big debate last December), but for now it can only be on the client side; we'll see about that later.
So, for example, if the min_pending_reclists parameter is set to 2, each queue is kept topped up to a length of 2 by adding new reclists.
Here min_pending_reclists is set to 1, so as soon as a user has no buckets left, we add a reclist directly.
The more reclists we push into the queues, the more quickly the competitor's recsys can answer rec requests...
To better understand this mechanism, take the case where a user refreshes their home page 3 times: if min_pending_reclists is set to 1, the competitor's recsys won't be able to answer all these requests, because there are not enough reclists in the local queue...
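The threshold logic described above can be sketched in a few lines. This is only an illustration of the semantics (the `users_needing_reclists` helper and the user names are assumptions, not part of the proposed API):

```python
from collections import deque


def users_needing_reclists(pending, min_pending_reclists):
    """Return users whose pending-reclist queue fell below the threshold.

    `pending` maps each user to a deque of not-yet-consumed reclists;
    this is an assumed representation of the client-side queue.
    """
    return [u for u, q in pending.items() if len(q) < min_pending_reclists]


pending = {
    "alice": deque([["news1"] * 30]),  # one reclist still buffered
    "bob": deque(),                    # queue drained by page refreshes
}

# With a threshold of 1, only bob needs a new reclist:
assert users_needing_reclists(pending, 1) == ["bob"]
# With a threshold of 2, even alice's single buffered reclist is not enough:
assert users_needing_reclists(pending, 2) == ["alice", "bob"]
```

This shows the latency trade-off: a higher min_pending_reclists means more pre-computed reclists buffered per user, so repeated refreshes can be served immediately.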
We get data using the API behind it
It will be easy to get data quickly thanks to a caching mechanism behind it...
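As one possible shape for that caching mechanism, here is a minimal TTL cache sketch. It is an assumption, not the actual design: entries are re-fetched from the API when they are older than the TTL (a streaming update could also invalidate them):

```python
import time


class CandidateCache:
    """Minimal TTL cache sketch for candidate news (illustrative only)."""

    def __init__(self, fetch, ttl_seconds=60.0):
        self.fetch = fetch          # callable(user) -> list of candidates
        self.ttl = ttl_seconds
        self._store = {}            # user -> (timestamp, candidates)

    def get_candidates(self, user):
        now = time.monotonic()
        entry = self._store.get(user)
        if entry is None or now - entry[0] > self.ttl:
            # Stale or missing: fetch from the API and refresh the entry.
            self._store[user] = (now, self.fetch(user))
        return self._store[user][1]


calls = []

def fake_fetch(user):
    calls.append(user)  # record each simulated API call
    return [f"{user}-news-{i}" for i in range(3)]

cache = CandidateCache(fake_fetch, ttl_seconds=60.0)
cache.get_candidates("alice")
cache.get_candidates("alice")  # served from cache, no second API call
assert calls == ["alice"]
```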
Conclusion
There are a lot of things to "challenge" in this proposal, and I have a lot to say about the mechanisms behind it (in terms of caching, ...), so this issue will let us discuss it; but don't hesitate to talk with me on Discord / BBB.
As a note: I don't have as much knowledge as you do about websockets and asyncio, so you'll tell me whether it's possible to do this... I think so, but I don't know if it's easy.
Anyway, this point is crucial for the CentraleSupelec competition.