Simpler implementation of recsystems

Here is a partial counter-proposal to the ideas in #11.

I believe the proposal in #11 actually makes things more complicated in a way. For example, it proposes that contestants write a loop in which they take recommendation requests from a queue, like this (my reinterpretation):

while True:
    pending_recommendation_requests = recommendation_queue.get()
    for user in pending_recommendation_requests:
        history = get_user_history(user)
        articles = get_candidate_articles(user)
        # get_recommendations has to be implemented by the contestant
        recommendations = get_recommendations(user, history, articles)
        send_recommendations(recommendations)

This is more complicated because it is, in effect, how the recsystems implemented by this package already work, except that the details of the while True: ... loop are already abstracted away by the websocket and JSON-RPC handling. For the most part (this will be expanded on later), what a contestant has to write is:

def recommend(user_id, min_articles, max_articles):
    # here get_recommendations is written by the contestant; min_articles
    # and max_articles are hints given by the backend about how many articles
    # to return.
    return get_recommendations(user_id, min_articles, max_articles)

The return value of the recommend function is simply a list of article IDs. It is documented in more detail here.
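
To illustrate why no contestant-written loop is needed, here is a minimal sketch of how a JSON-RPC request could be routed to the contestant's recommend function. This is not the package's actual dispatch code: HANDLERS and handle_message are hypothetical names, the request layout is assumed to follow plain JSON-RPC 2.0, and the websocket transport is left out entirely.

```python
import json


def recommend(user_id, min_articles, max_articles):
    # Stand-in for the contestant's code: returns placeholder article IDs.
    return list(range(max_articles))


# Hypothetical registry mapping JSON-RPC method names to handler functions.
HANDLERS = {"recommend": recommend}


def handle_message(raw):
    """Decode one JSON-RPC 2.0 request and return the encoded response."""
    request = json.loads(raw)
    handler = HANDLERS[request["method"]]
    params = request.get("params", {})
    # JSON-RPC allows parameters by name (object) or by position (array).
    if isinstance(params, dict):
        result = handler(**params)
    else:
        result = handler(*params)
    return json.dumps(
        {"jsonrpc": "2.0", "result": result, "id": request.get("id")}
    )
```

The framework would call something like handle_message for every incoming message, so the only part the contestant touches is the body of recommend.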

Another problem with the approach in #11 is that it does not address other events that recsystems might want to respond to. The most important task of a recsystem is to respond to requests for recommendations. But to help recsystems build their models in response to user activity, the plan from the beginning was to give them access to an event stream of that activity. That does not seem to be addressed at all by the plan in #11.

The current situation

The above example of how a contestant implements a recommend function is schematic. The current reality is a bit more complicated than that, and ripe for many improvements, many of which also come up in #11.

The "real" way of implementing a recsystem (in the simplest case) is currently a matter of subclassing BaselineRecsystem and providing a "recommendation strategy" by defining a method in the subclass, as well as some other bits.

from renewal_recsystem.baseline import BaselineRecsystem

class MyRecsystem(BaselineRecsystem):
    RECOMMENDATION_STRATS = (['my_strat'] +
                             BaselineRecsystem.RECOMMENDATION_STRATS)

    async def recommend_my_strat(self, user_id, max_articles, min_articles):
        recommendations = ...  # placeholder for the contestant's recommender system code
        return recommendations

The class may implement additional methods to update its model in response to user interactions with articles, for example by implementing:

async def article_interaction(self, interaction):
    # calling the base class's implementation of this method is important
    # since it already provides code for updating statistics on the
    # article that was interacted with
    await super().article_interaction(interaction)

    # Here the contestant adds their own code for updating their
    # model on the user who performed the interaction (e.g. they
    # clicked an article on a certain topic, so we need to update
    # their topic preferences).

This class also provides some currently very simple access to things like a collection of recent articles scraped by the backend. This is ripe for improvement as I discussed in #11 here.

It also currently lacks easy access to more details about the user and their browsing history. Currently, some of this is provided by the BaselineRecsystem class, but it's not enough. @hayj discussed this in #11 and I completely agree we need to make those details more accessible. I discussed this some more here.

The problem with the current approach

In addition to the shortcomings above (which could be addressed even with no other changes), I think this is still too hard, even if we give contestants boilerplate to copy from:

It requires at least a little knowledge of object-oriented programming in Python, including writing a subclass and its methods. Contestants should be expected to be able to use Python objects (e.g. call methods on objects, look up attributes on objects), but they should not necessarily be expected to write their own classes or methods on classes.
It requires at least a little knowledge of asynchronous programming in Python. At the very least, the recommend_my_strat method in the example above has to be declared async def, or the contestant may run into very confusing errors. I wrote a handy primer on asynchronous programming to help with some of the most basic issues. But I think it's still too much to ask.

Improved idea

I think we can streamline all this a little better as follows. It should be possible to implement a recsystem by writing at a minimum a single function that acts as the bridge between the "full" recsystem implementation, and the contestant's recommender system algorithm. Something like:

def recommend(user : User, article_collection : ArticleCollection,
              min_articles : int, max_articles : int) -> List[int]:
     # ...
     return recommendations

Here user is an instance of a simple User class that contains everything the implementer needs to know about the user, including user details from the app (if any), the user's past history, and access to a recsystem-specific data structure representing its model for that user.

article_collection is a database of candidate articles to recommend to the user. It would consist of some large number (adjustable by the contestant) of articles pre-filtered to only include articles not already recommended to the user. It would also have additional methods for simple querying of the article collection to easily find articles to recommend to the user given the contestant's recommender algorithm.

As in the current implementation, max_articles and min_articles are just hints as to how many recommendations should be returned.

The return value, as before, is a list of article_ids for the articles to return to the user.

In addition to a recommend function, contestants can also implement an article_interaction function much like the existing one (https://renewal-recsystems.readthedocs.io/en/latest/backend.html#renewal_recsystem.backend.article_interaction). The difference from the current API is mostly that it would be passed the User object again, so the contestant can easily update their recsystem-specific model for the user.
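
As a rough sketch of what this could look like from the contestant's side, the example below defines stand-in User and ArticleCollection types and the two functions. Everything here is an assumption for illustration: the field names (history, model), the by_topic query method, the shape of the interaction argument (a dict with "article_id" and "topic" keys), and the toy "favourite topic" model are not part of any existing API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class User:
    # Hypothetical fields; the real class would be defined by the framework.
    user_id: int
    history: List[int] = field(default_factory=list)     # article IDs seen so far
    model: Dict[str, Any] = field(default_factory=dict)  # recsystem-specific state


@dataclass
class ArticleCollection:
    # Maps article_id -> metadata; only a "topic" key is used in this sketch.
    articles: Dict[int, Dict[str, Any]] = field(default_factory=dict)

    def by_topic(self, topic: str) -> List[int]:
        """Hypothetical query helper: IDs of all articles on a topic."""
        return [aid for aid, art in self.articles.items()
                if art.get("topic") == topic]


def recommend(user: User, article_collection: ArticleCollection,
              min_articles: int, max_articles: int) -> List[int]:
    # Prefer the user's favourite topic, then pad with any other articles.
    favourite = user.model.get("favourite_topic")
    picks = article_collection.by_topic(favourite) if favourite else []
    for article_id in article_collection.articles:
        if len(picks) >= max_articles:
            break
        if article_id not in picks:
            picks.append(article_id)
    return picks[:max_articles]


def article_interaction(user: User, interaction: Dict[str, Any]) -> None:
    # Update the per-user model: count topics and remember the most frequent.
    counts = user.model.setdefault("topic_counts", {})
    topic = interaction["topic"]
    counts[topic] = counts.get(topic, 0) + 1
    user.model["favourite_topic"] = max(counts, key=counts.get)
    user.history.append(interaction["article_id"])
```

Note that neither function is a method and neither is declared async; all the state they need arrives through their arguments.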

These functions would be written in a Python module whose path contestants pass when starting the recommendation system front-end, like:

$ renewal-recsys --token=<token> my_recsys.py

where my_recsys.py contains implementations of the recommend function, optionally the article_interaction function, and possibly other interfaces we define. It can also contain any code for the contestant's recommendation algorithm, or just import that code from another module.
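
One way the front-end could pick up these functions from the given path is sketched below, using the standard library's importlib. This is only an assumption about how the renewal-recsys command might work internally; load_hooks, HOOK_NAMES, and the module name "contestant_recsys" are hypothetical.

```python
import importlib.util

# Hypothetical list of function names the front-end would look for.
HOOK_NAMES = ("recommend", "article_interaction")


def load_hooks(path):
    """Import the module at `path` and collect the hook functions it defines."""
    spec = importlib.util.spec_from_file_location("contestant_recsys", path)
    if spec is None or spec.loader is None:
        raise ValueError(f"{path} is not an importable Python file")
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)

    hooks = {}
    for name in HOOK_NAMES:
        func = getattr(module, name, None)
        if callable(func):
            hooks[name] = func

    # recommend is the one function every recsystem must provide.
    if "recommend" not in hooks:
        raise ValueError(f"{path} does not define a recommend function")
    return hooks
```

Optional hooks such as article_interaction are simply absent from the returned dict when the module does not define them, so the front-end can fall back to its default behaviour.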

Addendum: Two more things that need to be taken into consideration:

In addition to the "active" user (the user targeted by the current event), we should give each function the full list of assigned users so that it can run algorithms that use user-based filtering.
We should probably also provide a hook function that allows recsystems to perform some background processing task. This is of course already possible when implementing a recsystem as a subclass of BaselineRecsystem, but we should provide an easy way for contestants to supply their own background processing tasks to run. As a technical note, this should use loop.run_in_executor, by default with a ProcessPoolExecutor, since such background tasks will typically be CPU-bound.
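
The technical note above could be sketched roughly as follows. run_background and retrain_model are hypothetical names, and the optional executor argument is an assumption added so that callers can share a pool or substitute a different executor.

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor


def retrain_model(n):
    """Stand-in for a CPU-bound job, such as refitting a model."""
    return sum(i * i for i in range(n))


async def run_background(func, *args, executor=None):
    """Run a plain function off the event loop and await its result.

    Uses a ProcessPoolExecutor by default, since the work is assumed to be
    CPU-bound; pass `executor` to reuse an existing pool instead.
    """
    loop = asyncio.get_running_loop()
    owns_executor = executor is None
    if owns_executor:
        executor = ProcessPoolExecutor(max_workers=1)
    try:
        return await loop.run_in_executor(executor, func, *args)
    finally:
        if owns_executor:
            executor.shutdown()
```

With the default process pool, the function and its arguments have to be picklable, and on platforms that spawn worker processes the calling script should be protected by an if __name__ == "__main__": guard. Creating a new pool on every call is wasteful, so a real implementation would more likely keep one pool for the lifetime of the recsystem.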

Summary

The main differences between the current implementation and this proposal are summarized as follows:

Contestants no longer have to implement a class, or write any asyncio code. They write a simple Python module defining one or two functions with special names and signatures that we define for them.
The functions contestants implement are not methods of some class. Instead, they receive as arguments all the data structures the contestant needs to update their recommendation models and run those models to provide recommendations.
Contrary to #11, contestants do not need to write any loops or queue-related code. They simply define functions that are called on their behalf by our recsystem code in response to websocket events.

What won't change:

We still allow contestants to write entire recsystems from scratch if they want to, and provide all the necessary API documentation for how recsystems work.
The base implementation is still a class much like the existing BaselineRecsystem implementation. The difference is that we are providing an easier-to-implement front-end to it, so that contestants can get started on hooking up their recommendation algorithm with minimal additional coding.
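
To make the "front-end" idea concrete, here is a minimal sketch of a class that wraps a contestant's plain function behind an async method. FunctionRecsystem is a hypothetical name; in practice it would presumably derive from BaselineRecsystem, which is left out here so the example stays self-contained.

```python
import asyncio


class FunctionRecsystem:
    """Hypothetical class-based recsystem that delegates to a plain function."""

    def __init__(self, recommend_func):
        self._recommend_func = recommend_func

    async def recommend(self, user, article_collection,
                        min_articles, max_articles):
        loop = asyncio.get_running_loop()
        # Run the contestant's synchronous function in the default thread
        # pool so that a slow recommender does not block the event loop.
        return await loop.run_in_executor(
            None, self._recommend_func,
            user, article_collection, min_articles, max_articles,
        )
```

The contestant never sees this class: they write an ordinary function, and the asyncio handling stays on the framework's side.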

Edited Apr 19, 2021 by Julien Hay