This repo contains tools to extract content from a Discourse forum and load it into Algolia in a shape that is compatible with Algolia DocSearch.
To satisfy DocSearch, the objects created in Algolia are of two types:
content
- Text from a paragraph in a post. Hierarchical information is included in the object lvl0, lvl1, lvl2, lvl3. (see below)lvl3
- Contextual objects for headers (h2, h3, etc) within the content.
These types are based on the types created by the open source docsearch-scraper which is a very useful tool for scraping static sites.
./setup # install dependencies
./main-etl all
This tool was developed with python 3.9 or later. It may work with earlier versions of python 3 but it has not been tested.
The configuration is done via environment variables. The following variables must be set:
# The Discourse API needs read access to the forum you're trying to index.
export DISCOURSE_API_KEY=...
export DISCOURSE_URL=...
export DISCOURSE_USERNAME=...
# The Algolia API needs write access to the index you're trying to update.
export ALGOLIA_API_KEY=...
export ALGOLIA_APP_ID=...
export ALGOLIA_INDEX_NAME=...
export ALGOLIA_LVL0=... # (default: Forum)
The ALGOLIA_LVL0 is the top level name for when results show up in DocSearch. For example, if you set ALGOLIA_LVL0 to "Forum", then all results will show up under the "Forum" category.
```text
Forum > {Category Name} > {Topic Name} > {Section Name, h1, h2, etc.}
e.g.
Forum > Hardware > What antenna should I use? > Cellular
```
export ALGOLIA_TAG=... # (default: community)
The ALGOLIA_TAG is a tag that will be added to all objects in Algolia. This is useful in the DocSearch UI for filtering or tagging results as being from a certain source.
answered
All posts in a Discourse marked 'Answered' will also be tagged "answered", in addition to the ALGOLIA_TAG. Posts in unanswered topics will not get an extra tag. This is not yet configurable but a developer could follow the lead of the ALGOILA_TAG and add a new environment variable to control this.
Algolia limits objects to 10kb, so if we find a large paragraph, we split it in half repeatedly until it is small enough to fit. This is done in the transform step.
To do a subset of the steps, use one of:
./main-etl extract
./main-etl transform
./main-etl load
./main-etl extract transform
./main-etl transform load
The Extract step creates a file called discourse.json
This
file contains the raw json from the Discourse API.
The Transform step creates a file called algolia.json
This
file contains the json that will be sent to Algolia.
If you use the vscode devcontainer, you'll get a python environment with the correct version of python. Otherwise, you'll want to install python 3.9+.
./setup # install dependencies
The tests were written with the python unittest
framework. The easiest way to
run them is from the command line.
./setup # install dependencies
python3 -m unittest discover
It's also possible to use the VSCode Test Explorer to run the tests.
Debug the tests from top to bottom in the
tests/test_transform.py
The src/extract_discourse.py file contains the extraction logic.
$ src/extract_discourse.py --help
Extract posts and categories from Discourse to stdout.
Usage:
discourse-extract
Environment Variables:
DISCOURSE_API_KEY The API key to use for the Discourse API.
DISCOURSE_URL The URL of the Discourse instance.
DISCOURSE_USERNAME The username to use for the Discourse API.
The src/transform_discourse_to_algolia.py file contains the transformation logic.
$ src/transform_discourse_to_algolia.py --help
Transform posts from discourse to algolia-style. Input is expected to be json on
stdin and output is json on stdout. Allow multiple tags to be specified.
Usage:
transform-discourse-to-algolia --discourse-url=<discourse-url> --lvl0=<lvl0> --tag=<tag>...
Options:
--discourse-url=<discourse-url> The base url of the discourse forum.
--lvl0=<lvl0> The top level category name to nest all search results under. [default: Forum]
--tag=<tag> The tags to add to all algolia objects. [default: community]
The src/load_algolia.py file contains the loading logic.
$ src/load_algolia.py --help
Load objects into Algolia from a file via the Algolia API.
Usage:
load-algolia <algolia-json-file> <algolia-index-name>
Environment Variables:
ALGOLIA_APP_ID
ALGOLIA_API_KEY
Credits
Hats off to github copilot for translating my thoughts into python.