Incremental Algolia Indexing
============================

Status
------
Draft

Context
-------
The Enterprise Catalog Service produces an Algolia-based search index of its Content Metadata and
Course Catalog database. This index is rebuilt in its entirety at least nightly, resulting in a
wholesale replacement of the prior Algolia index. The rebuild job is time consuming and memory
intensive, and it relies heavily on the `search/all` functionality of Course Discovery to determine
content metadata catalog membership. The job is also brittle: it either succeeds entirely or fails
entirely.

Solution Approach
-----------------
The goals should include:

- Run alongside and augment the existing indexer until we are able to cut over entirely
- Support all current metadata types, though not necessarily all of them on day 1
- Support multiple methods of triggering: event-bus events, on-demand from Django admin, on a
  schedule, from the existing `update_content_metadata` job, etc.
- Achieve a higher parallelization factor, i.e. one content item per celery task worker (with no
  task group coordination required)
- Provide a content-oriented method of determining content catalog membership

Decision
--------
We want to follow updates to catalogs and content, then incrementally apply those updates to
Algolia. To do this, I propose we both create new functionality and reuse some existing
functionality of our Algolia indexing infrastructure.

First, the existing indexing process begins by executing catalog queries against `search/all` to
determine which courses exist and which catalogs they belong to. For incremental updates to work,
we first need to provide the opposite semantic: the ability to determine catalog membership from a
given course (rather than courses from a given catalog). We can make use of the new
`apps.catalog.filters` Python implementation, which can take a catalog query and a piece of content
metadata and determine whether the content matches the query (without the use of Course Discovery).

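The inverted semantic can be sketched as follows. Note this is illustrative only: `content_matches_query` and `catalogs_containing` are stand-ins for the real `apps.catalog.filters` entry points, and the exact-match comparison is a simplification of the actual catalog query language.

```python
def content_matches_query(query, content_metadata):
    """Return True if a piece of content metadata satisfies a catalog query.

    Simplified stand-in: treats the query as a dict of required field values.
    """
    return all(content_metadata.get(field) == expected
               for field, expected in query.items())


def catalogs_containing(content_metadata, catalog_queries):
    """The content-oriented direction: given one piece of content, find the
    catalogs it belongs to (rather than the courses in a given catalog)."""
    return [name for name, query in catalog_queries.items()
            if content_matches_query(query, content_metadata)]
```

This keeps membership determination local to the service, with no call out to Course Discovery.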
Second, in order to incrementally update the Algolia index, we need to build the ability to replace
individual object-shard documents in the index (today we just replace the whole index). This can be
implemented by creating methods to determine which Algolia object-shards exist for a piece of
content. Once we have those IDs, we can determine whether a create, update, or delete is required.
For simplicity's sake, an update will likely be a delete followed by the creation of new objects.

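Deciding which per-shard operations are needed amounts to diffing the shard IDs Algolia currently holds against the shard IDs we want. A minimal sketch, with an illustrative function name rather than the service's real API:

```python
def plan_shard_actions(existing_ids, desired_ids):
    """Compare the object-shard IDs present in Algolia with the IDs the
    content should produce, and return the required actions per shard."""
    existing, desired = set(existing_ids), set(desired_ids)
    return {
        "delete": sorted(existing - desired),   # stale shards to remove
        "create": sorted(desired - existing),   # brand-new shards to add
        # Per the ADR, an "update" is modeled as a delete followed by a
        # re-create of the same object ID:
        "replace": sorted(existing & desired),
    }
```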
Third, we need to provide new methods of indexing based on an individual object change. This method
will determine whether a content metadata change should result in a create, update, or delete of
object-shards in Algolia. If a create or update is required, it will determine catalog membership
via the new `apps.catalog.filters` tooling. It will then reuse much of the existing Algolia
indexing code to build the new set of document object-shards to send to Algolia. Finally, it will
issue any required deletes of existing objects and creates of any new or updated objects.

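Putting the pieces together, the per-change flow might look like the sketch below, using a plain dict as a stand-in for the Algolia index. Every name here is an assumption for illustration; the real implementation would reuse the existing shard-building code and the Algolia client's batch operations.

```python
def handle_content_change(content, catalog_queries, index):
    """Process one content-metadata change incrementally.

    `index` is a dict standing in for Algolia, keyed by object ID.
    An update is modeled as delete-then-create, per the ADR.
    """
    uuid = content["uuid"]

    # Delete any existing object-shards for this piece of content.
    for object_id in [oid for oid in index if oid.startswith(uuid)]:
        del index[object_id]

    # Determine catalog membership content-first (no Course Discovery call).
    matching = [name for name, query in catalog_queries.items()
                if all(content.get(k) == v for k, v in query.items())]

    # No matching catalogs means the change was a pure delete.
    if matching:
        # Real sharding splits large catalog lists across several objects;
        # a single shard suffices for this sketch.
        index[f"{uuid}-0"] = {"uuid": uuid, "catalogs": matching}
```

Because each invocation touches only one content item, this maps naturally onto one celery task per item, with no task-group coordination.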
Lastly, incremental updates will need to be triggered by something, such as polling Course
Discovery for updated content, consuming event-bus events, and/or triggering from a nightly Course
Discovery crawl or a Django admin button.

Consequences
------------
Ideally this incremental process will allow us to provide a closer-to-real-time index using fewer
resources. It will also give us more flexibility to include non-course-discovery content in
catalogs, because we will no longer rely on `search/all`.

Alternatives Considered
-----------------------
No alternatives were considered.