Collect articles in a Pandas DataFrame
Been thinking about how, in the reference recsystem implementation, we want to make articles available to the user.
Currently the ArticleCollection class is used, which is just a simple container I put together for earlier versions of the recsystems that made it easy to get slices of articles by a range of article_ids.
This implementation is not so useful anymore though, especially since we are not going to enforce that articles are returned in any particular order related to their article_id (my earlier designs demanded this, but it never really made sense).
I've thought of enhancing ArticleCollection with various methods to make it easier to query for articles that match some condition (e.g. "give me all articles that contain these words in their text").
But then I got to thinking we could just collect the articles into a pandas DataFrame, which is already more-or-less the industry standard and comes with plenty of built-in capabilities for querying as well as indexing. We can include indexes on article_id and publication datetime, and maybe some others, though users could also add their own indexes.
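To make the idea concrete, here is a rough sketch of what querying such a DataFrame could look like. The column names (`published`, `title`, `text`) and the sample records are made up for illustration and are not the recsystem's actual article schema.

```python
import pandas as pd

# Hypothetical article records; field names are assumptions for illustration.
records = [
    {"article_id": 101, "published": "2021-03-01T08:00:00",
     "title": "Election results", "text": "Votes were counted overnight."},
    {"article_id": 102, "published": "2021-03-02T09:30:00",
     "title": "Football final", "text": "The home team won the cup."},
    {"article_id": 103, "published": "2021-03-03T12:15:00",
     "title": "Election aftermath", "text": "Parties begin coalition talks."},
]

articles = pd.DataFrame.from_records(records)
articles["published"] = pd.to_datetime(articles["published"])

# Index on article_id so lookups by id go through the index.
articles = articles.set_index("article_id").sort_index()

# Lookup by article_id.
one = articles.loc[102]

# Range query on publication datetime.
recent = articles[articles["published"] >= pd.Timestamp("2021-03-02")]

# Simple case-insensitive substring match on the article text.
matches = articles[articles["text"].str.contains("coalition", case=False, regex=False)]
```

A substring match like this is only a stand-in for real full-text search, but it covers the simple "articles that contain these words" case without any extra methods on our side.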
If they want to do something like full text search they can use a library like Whoosh.
One challenge will be to manage the size and growth of the DataFrame, and to think about how many articles we want to include by default. If a recsystem runs for a long time, it will constantly get new articles from the backend. The ArticleCollection class also has a size limit, set by default to 10000 articles (which can be customized). So maybe we'll use some combination of a pre-allocated DataFrame and an LRU cache for articles. If users interact with old articles that are not already cached by the recsystem, we can also automatically use the backend API to fetch those articles and add them to the DataFrame.
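As a starting point for discussion, the LRU part could look something like the sketch below. `ArticleCache`, `fetch_article`, and the `article_id` dict key are hypothetical names, not existing recsystem or backend API. For simplicity it keeps articles in an `OrderedDict` and builds the DataFrame on demand rather than pre-allocating one.

```python
from collections import OrderedDict

import pandas as pd


class ArticleCache:
    """Hypothetical LRU-bounded article store (not the existing ArticleCollection)."""

    def __init__(self, fetch_article, max_size=10000):
        # fetch_article is an assumed callable: article_id -> article dict,
        # standing in for a call to the backend API.
        self._fetch_article = fetch_article
        self._max_size = max_size
        self._articles = OrderedDict()  # ordered oldest -> most recently used

    def add(self, article):
        article_id = article["article_id"]
        self._articles[article_id] = article
        self._articles.move_to_end(article_id)
        # Evict least recently used articles once over the size limit.
        while len(self._articles) > self._max_size:
            self._articles.popitem(last=False)

    def get(self, article_id):
        if article_id in self._articles:
            self._articles.move_to_end(article_id)  # mark as recently used
            return self._articles[article_id]
        # Cache miss: fetch the old article from the backend and cache it.
        article = self._fetch_article(article_id)
        self.add(article)
        return article

    def to_frame(self):
        # Build a DataFrame view of whatever is currently cached.
        if not self._articles:
            return pd.DataFrame()
        return pd.DataFrame.from_records(
            list(self._articles.values()), index="article_id"
        )
```

Rebuilding the DataFrame on every `to_frame()` call is the obvious cost of this version; a pre-allocated DataFrame would avoid that but needs more bookkeeping when rows are evicted.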
For reference, this SO question has a lot of good information on approaches to growing and managing the size of DataFrames: https://stackoverflow.com/questions/10715965/create-pandas-dataframe-by-appending-one-row-at-a-time
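One commonly recommended pattern for this kind of growth is to buffer incoming rows in a plain list and add them to the DataFrame in a single batch, rather than appending one row at a time. A minimal sketch, with made-up columns:

```python
import pandas as pd

# Existing articles, indexed by article_id (sample data for illustration).
articles = pd.DataFrame(
    {"article_id": [1], "title": ["first"]}
).set_index("article_id")

# Buffer newly received articles in a plain Python list...
pending = []
for article_id, title in [(2, "second"), (3, "third")]:
    pending.append({"article_id": article_id, "title": title})

# ...then add them with one concat call instead of one call per row.
new_rows = pd.DataFrame.from_records(pending, index="article_id")
articles = pd.concat([articles, new_rows])
```

Each `pd.concat` copies the data into a new DataFrame, so batching keeps the number of copies down when articles arrive continuously.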