Incremental Algolia Indexing
============================

Status
------
Draft

Context
-------
The Enterprise Catalog Service produces an Algolia-based search index of its Content Metadata and
Course Catalog database. This index is rebuilt in its entirety at least nightly, resulting in a
wholesale replacement of the prior Algolia index. The rebuild job is time consuming and memory
intensive, and it relies heavily on the `search/all` functionality of Course Discovery to determine
content metadata catalog membership. The job is also brittle: it either succeeds entirely or fails
entirely.

Solution Approach
-----------------
The goals should include:

- Run alongside and augment the existing indexer until we are able to cut over entirely
- Support all current metadata types, though not necessarily all of them on day 1
- Support multiple methods of triggering: event-bus events, on-demand from Django admin, on a
  schedule, from the existing `update_content_metadata` job, etc.
- Achieve a higher parallelization factor, i.e. one content item per celery task worker (with no
  task group coordination required)
- Provide a content-oriented method of determining content catalog membership

Decision
--------
We want to follow updates to catalogs and content, then incrementally apply those updates to
Algolia. To do this, I propose we both create new functionality and reuse some existing
functionality of our Algolia indexing infrastructure.

First, the existing indexing process begins by executing catalog queries against `search/all` to
determine which courses exist and which catalogs they belong to. For incremental updates to work,
we first need to provide the opposite semantic: the ability to determine catalog membership from a
given course (rather than courses from a given catalog). We can make use of the new
`apps.catalog.filters` Python implementation, which can take a catalog query and a piece of content
metadata and determine whether the content matches the query (without the use of Course Discovery).

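The inverted semantic can be sketched as follows. Note this is illustrative only: `content_matches_query` and `catalogs_containing` are stand-ins for the real `apps.catalog.filters` entry points, and the exact-match comparison is a simplification of the actual catalog query language.

```python
def content_matches_query(query, content_metadata):
    """Return True if a piece of content metadata satisfies a catalog query.

    Simplified stand-in: treats the query as a dict of required field values.
    """
    return all(content_metadata.get(field) == expected
               for field, expected in query.items())


def catalogs_containing(content_metadata, catalog_queries):
    """The content-oriented direction: given one piece of content, find the
    catalogs it belongs to (rather than the courses in a given catalog)."""
    return [name for name, query in catalog_queries.items()
            if content_matches_query(query, content_metadata)]
```

This keeps membership determination local to the service, with no call out to Course Discovery.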
Second, in order to incrementally update the Algolia index, we need to build the ability to replace
individual object-shard documents in the index (today we just replace the whole index). This can be
implemented by creating methods to determine which Algolia object-shards exist for a piece of
content. Once we have those IDs, we can determine whether a create, update, or delete is required.
For simplicity's sake, an update will likely be a delete followed by the creation of new objects.

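Deciding which per-shard operations are needed amounts to diffing the shard IDs Algolia currently holds against the shard IDs we want. A minimal sketch, with an illustrative function name rather than the service's real API:

```python
def plan_shard_actions(existing_ids, desired_ids):
    """Compare the object-shard IDs present in Algolia with the IDs the
    content should produce, and return the required actions per shard."""
    existing, desired = set(existing_ids), set(desired_ids)
    return {
        "delete": sorted(existing - desired),   # stale shards to remove
        "create": sorted(desired - existing),   # brand-new shards to add
        # Per the ADR, an "update" is modeled as a delete followed by a
        # re-create of the same object ID:
        "replace": sorted(existing & desired),
    }
```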
Third, we need to provide new methods of indexing based on an individual object change. This method
will determine whether a content metadata change should result in a create, update, or delete of
object-shards in Algolia. If a create or update is required, it will determine catalog membership
via the new `apps.catalog.filters` tooling. It will then reuse much of the existing Algolia
indexing code to build the new set of document object-shards to send to Algolia. Finally, it will
issue any required deletes of existing objects and creates of any new or updated objects.

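Putting the pieces together, the per-change flow might look like the sketch below, using a plain dict as a stand-in for the Algolia index. Every name here is an assumption for illustration; the real implementation would reuse the existing shard-building code and the Algolia client's batch operations.

```python
def handle_content_change(content, catalog_queries, index):
    """Process one content-metadata change incrementally.

    `index` is a dict standing in for Algolia, keyed by object ID.
    An update is modeled as delete-then-create, per the ADR.
    """
    uuid = content["uuid"]

    # Delete any existing object-shards for this piece of content.
    for object_id in [oid for oid in index if oid.startswith(uuid)]:
        del index[object_id]

    # Determine catalog membership content-first (no Course Discovery call).
    matching = [name for name, query in catalog_queries.items()
                if all(content.get(k) == v for k, v in query.items())]

    # No matching catalogs means the change was a pure delete.
    if matching:
        # Real sharding splits large catalog lists across several objects;
        # a single shard suffices for this sketch.
        index[f"{uuid}-0"] = {"uuid": uuid, "catalogs": matching}
```

Because each invocation touches only one content item, this maps naturally onto one celery task per item, with no task-group coordination.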
Lastly, incremental updates will need to be triggered by something, such as polling Course
Discovery for updated content, consuming event-bus events, and/or triggering from a nightly Course
Discovery crawl or a Django admin button.

Consequences
------------
Ideally this incremental process will allow us to provide a closer-to-real-time index using fewer
resources. It will also give us more flexibility to include non-course-discovery content in
catalogs, because we will no longer rely on `search/all`.

Alternatives Considered
-----------------------
No alternatives were considered.