This project implements a simple REST server and command line utility for retrieving real-time information about public transport in Ireland (at least, for services operated by Dublin Bus, Bus Éireann, and Go-Ahead Ireland).
This project is inspired by Sean Rees's GTFS Upcoming. I started from scratch because I wanted to significantly optimise memory consumption so I could host the API on a single board computer. I found the full dataset consumed up to 10 gigabytes of RAM when using GTFS Upcoming, while I've managed to get it down to less than 200 megabytes after a significant rewrite. Both projects allow you to reduce RAM consumption by discarding data that does not pertain to a list of specific transport stops.
There's a Dublin Smart Home google group. Consider joining it!
The National Transport Authority (NTA) of Ireland operates a public-transport brand called Transport for Ireland or TFI, which pulls together all information related to public transport. From the point of view of the average commuter, TFI is in charge of buses and trains. The NTA previously provided a real-time passenger information (RTPI) REST API that allowed the status of routes serving particular stops to be easily queried, but this API was discontinued in September 2020 (perhaps due to scalability issues and breaches of fair use), and was replaced with a GTFS-R API. General Transit Feed Specification (GTFS) is a protocol designed by Google to allow transit operators to communicate static and real-time schedule information to Google Maps. The static data consists of a zip file at a well-known URL that containing metadata files describing operators, routes, stops, and the schedule (static data is available for most public transport services, as described here). The real-time part is an API endpoint that returns the status of the entire fleet of vehicles that are currently on the road/tracks (real-time information is currently only provided for a subset of bus operators: Dublin Bus, Bus Éireann, and Go-Ahead Ireland). This real-time feed needs to be interpreted in conjunction with the metadata from the static zip file. This architecture seems convenient for Google, who are interested in mass-syncing all available information. However, it is inconvenient if you are an average user who has a specific query about upcoming arrivals at a particular stop or station. This project bridges that gap.
This project is a GTFS-R client, which reads all static and realtime transport fleet information into RAM. It then provides a simple REST API to allow querying of upcoming scheduled and real-time arrivals at any particular stop.
On startup, it downloads the static data and parses it into memory (by default, it will re-download this whenever the data is updated). It also periodically queries the real-time API, and stores received information about arrival delays, cancelations and additions into memory (by default, it will do this every minute). The in-memory information can then be efficiently queried to return a list of all scheduled and real-time arrivals at any particular stop.
You can run this project either as a python program, as a Docker container or as a Home Assistant Addon. When running as a python program or a docker container you can choose whether to run the REST API HTTP server, or whether to directly invoke the gtfs.py module as a command-line utility.
Before you start, you will need your own NTA API key, which you can get for free by signing up to the NTA developer portal.
The API key, and all other settings, can be alternatively specified as (in order of precedence):
- environment variables
- settings file variables
- command-line arguments
If using the settings file, you might find it more convenient not to directly modify
settings.py
, but instead to create alocal_settings.py
file where you can override just the settings you care about.local_settings.py
is ignored by git.
You can view the available settings and default values by running server.py
or gtfs.py
with the --help
argument.
For reference, the available settings are:
GTFS_STATIC_URL
. URL of the static NTA data. Defaults to "https://www.transportforireland.ie/transitData/Data/GTFS_Realtime.zip"GTFS_LIVE_URL
. URL of the realtime NTA data. Defaults to "https://api.nationaltransport.ie/gtfsr/v2/TripUpdates"API_KEY
. Your NTA API key. Either your "primary" or "secondary" key should work.REDIS_URL
. The URL of a redis instance to use as a memory store for the purposes of memory optimisation or horizontal scalability. Typically something likeredis://localhost:6379
. Defaults toNone
, i.e., uses in-process memory instead.POLLING_PERIOD
. How over to query the real-time API in seconds. Defaults to 60.MAX_MINUTES
. The maximum number of minutes into the future that arrivals returned in results are expected to arrive before. Defaults to 60 minutes.HOST
. The host to run the API server at. Defaults to "localhost".PORT
. The port to run the API server on. Defaults to "7341".LOG_LEVEL
. The verbosity of output. Possible values areDEBUG
,INFO
,WARN
,ERR
. Defaults toINFO
.FILTER_STOPS
. A list of stop numbers that should be filtered for. Information received not pertaining to these stop numbers will be discarded, yielding a significant RAM saving. Defaults toNone
, meaning that information about all stops will be kept in memory.
The exact name of the corresponding command-line arguments might vary, so please run with --help
to check the correct form. Please also run with --help
to confirm the default values.
You can specify your API key in either of the following ways:
Export an environment variable called API_KEY
:
export API_KEY=abcdefghijklmnopqrstuvwxyz1234567890
python3 server.py
or
API_KEY=abcdefghijklmnopqrstuvwxyz1234567890 python3 server.py
Specify it in the settings file (preferably in local_settings.py
):
API_KEY = "abcdefghijklmnopqrstuvwxyz1234567890"
Or specify it as a command-line argument:
python3 server.py --api-key=abcdefghijklmnopqrstuvwxyz1234567890
Create a python virtual environment and install requirements:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
Run a web server:
python3 server.py --help
python3 server.py --api_key=abcdefghijklmnopqrstuvwxyz1234567890
Run as a command-line interface:
python3 gtfs.py --help
Download the static database and exit:
python3 gtfs.py --download
Query specific stops:
python3 gtfs.py 1358 7581
Note that running
gtfs.py
in this way is slow because a lot of data needs to be loaded from disk into memory every time that it is invoked. It is convenient for testing or very occasional use, but in production, you should runserver.py
, which only has to load data once at startup.
A dockerfile is provided to allow you to easily build and run the project in a Docker container. This container is configured to start and use an internal redis instance, producing optimal memory efficiency.
To build the docker container, change into the project directory where the Dockerfile
is and run:
docker build --tag tfi-gtfs .
You can run the docker container as follows:
docker run -p 7341:7341 tfi-gtfs
You can specify arguments as follows:
docker run -p 7341:7341 tfi-gtfs --help
docker run -p 7341:7341 tfi-gtfs --api-key=abcdefghijklmnopqrstuvwxyz1234567890
You can pass a local_settings.py
file into the container as follows:
docker run -p 7341:7341 -v ./settings.py:/app/settings.py:ro tfi-gtfs
In the above examples, 7341
is the default port number used by the API server. You could map port 8000
on the host to 7341
in the container by instead specifying -p 8000:7341
.
This project is also available as a Home Assistant addon. Visit my Home Assistant Addons repository for more information on how to add that repository to your Home Assistant installation and install the addon.
The API is hosted at the path /api/v1/arrivals
. You can query it by supplying one or more "stop" query parameters.
For example, if the server is running on localhost, visit http://localhost:7341/api/v1/arrivals?stop=1358 in your web browser to receive a table of upcoming arrivals at stop 1358
.
Alternatively, using cURL
, run the following command:
curl "http://localhost:7341/api/v1/arrivals?stop=1358"
Multiple stops can be queried by providing multiple stop
parameters. For example:
curl "http://localhost:7341/api/v1/arrivals?stop=1358&stop=7581"
Stop numbers are printed on bus stops. You can also find relevant stops on the official TFI journey planner. Click on a stop to see its stop number.
The server returns responses in JSON format by default. It also supports YAML, CSV and HTML. Here are some commands that test this using cURL:
# JSON (default)
curl "http://localhost:7341/api/v1/arrivals?stop=1358" -H "Accept: application/json"
# YAML
curl "http://localhost:7341/api/v1/arrivals?stop=1358" -H "Accept: application/yaml"
# CSV
curl "http://localhost:7341/api/v1/arrivals?stop=1358" -H "Accept: text/csv"
# Plain text also returns CSV
curl "http://localhost:7341/api/v1/arrivals?stop=1358" -H "Accept: text/plain"
# HTML
curl "http://localhost:7341/api/v1/arrivals?stop=1358" -H "Accept: text/html"
If you visit the URL from your web browser, you web browser will automatically send an Accept: text/html
header, so you should receive the response as a HTML table.
If you are running this project directly as python and memory consumption is an issue, you can use the REDIS_URL
option to specify an external redis instance to use as a more efficient data store. Redis is a highly-performant distributed data store written in C, and has very efficient storage. If you don't have a redis instance, you can start one using Docker as follows:
docker run --name redis-gtfs -p 127.0.0.1:6379:6379/tcp -d redis
You may see warnings from redis saying that "Memory overcommit must be enabled!". This issue is discussed in this docker issue and to get rid of the warning you must currently modify a setting on the host:
sudo sysctl vm.overcommit_memory=1
Then, pass that redis instance to the python program passing the --redis
argument or setting the REDIS_URL
setting, for example:
python3 server.py --redis redis://localhost:6379
When run directly, the default behaviour is to parse schedule data into local in-process data structures. This is convenient, but python structures are space-inefficient, resulting in a lot of system memory being consumed.
Another way to reduce memory consumption is to specify a list of stops that you are solely interested in, so that data pertaining to all other stops can be discarded. You can do this by using the FILTER_STOPS
option or --filter
argument.
Note: "Total" columns below refer to the RSS (Resident Set Size) reported by the
ps
command. This includes shared system libraries, so it's not really an accurate reflection of the marginal cost of running the process, but it is a decent comparative guideline. The "data" columns were measured using thistotal_size.py
gist. This is bundled in this project. You can generate a report on the size of python data structures (and any data in redis) if you pass the--profile
argument.
In a comparitive test on a 64 bit laptop, with all stops loaded I observed the following memory consumption patterns while running server.py
(comparative values
for GTFS-Upcoming are also given).
Test | Python data | Python Total | Redis data | Redis Total |
---|---|---|---|---|
Without redis | 317MB | 439MB | N/A | N/A |
With redis | 1MB | 39MB | 131MB | 141MB |
GTFS-Upcoming | 7,909MB | 8,687MB | N/A | N/A |
If FILTER_STOPS
is supplied, data that does not pertain to the given stops will be discarded, allowing memory use to be reduced. Repeating the above test while running the server with a single stop:
Test | Python data | Python Total | Redis data | Redis Total |
---|---|---|---|---|
Without redis | 3.4MB | 42.7MB | N/A | N/A |
With redis | 1MB | 39.9MB | 3.4MB | 13MB |
GTFS-Upcoming | 71MB | 123MB | N/A | N/A |
Conclusions:
-
We can subtract the "data" size from the "total" to get a rough estimate of the base memory consumed by each process. In this way, we see that the base memory required by Redis seems to be less than 10 MB, while
server.py
requires about 38 MB. -
Broadly speaking, we see that storing all the data in python is about 2.4 times less space-efficient than storing it in Redis.
-
Storing data for one or two stops (using the
FILTER_STOPS
option) results in negligible memory use beyond the base requirements of the program, regardless of whether redis is used or not. -
The on-disk cache (
cache.pickle
) file expands by a factor of approximately 3 in RAM. For example, a 119MB cache file will translate into an additional 384MB RAM usage.
When first run, or whenever the cache file doesn't exist or is old or invalid (e.g., was generated for a different set of filter stops), it will be rebuilt at startup by a sub-process. This sub-process is short-lived but memory intensive, and in my testing grows to up to 1.5 gigabytes before finishing.
The gtfs.py
module can be invoked directly as a command line utility, and runs as a single-threaded process. However, server.py
starts multiple threads and subprocesses.
Internally, server.py
uses Waitress to serve HTTP API requests. Waitress starts a pool of worker threads to handle requests. The default number of threads is specified by the WORKERS
setting or --workers
argument, and defaults to 1
.
server.py
also starts a long-lived thread to handle scheduled tasks like polling the live API, or redownloading the static schedule data.
Actual downloading and parsing of static schedule data is handled in sub-processes, as it is a memory-intensive operation, and we want to allow the system to reclaim that memory after the new schedule has been processed. These sub-processes are simply instances of gtfs.py
. server.py
will launch gtfs.py
in this way on startup (if the current downloaded schedule is out of date, or if the current cache is out of data or invalid). It will also check every hour if there is new static GTFS data (by performing a HTTP HEAD
request) available and if necessary will launch gtfs.py
to download it.
server.py
runs gtfs.py
with the --rebuild-cache
argument, which causes it to re-parse the static GTFS data (which may consume in the region of 1.5 gigabytes of RAM) and write a new cache.pickle
file (which may take a minute or more depending on your hardware). After writing the pickle file, the gtfs.py
process ends, its memory is released, and server.py
continues execution, by loading or reloading that pickle file, which is a fast operation.
- Workers will generally only be blocked on network I/O with redis, which is minimal. To compensate for this, consider increasing the number of requests that can be simultaneously served by
server.py
by increasingWORKERS
to 2 or 3. - To allow multiple CPU cores to be used, you will need to launch multiple instances of
server.py
. This is due to the python Global Interpreter Lock (GIL). - If launching multiple instances, use Redis to avoid duplicating all the schedule data in each process. Also, avoid duplicating work of parsing schedule data and polling the live API from each process by configuring just one of the processes with the desired
POLLING_INTERVAL
, and set it to a very high value in all the other processes (3153600000 == 100 years in seconds).
The project consists of the following modules:
-
settings.py
is a simple settings file. -
size.py
is the memory-counting function from this gist -
store.py
is a data store, which is backed by either redis or an internaldict
depending on configuration. It supports key-value styleget
/set
operations, andSet
-likeadd
/remove
/has
operations. Everything is added to a "namespace", and a configdict
can be passed in at initialization with optional rules for how items in each namespace should be expired. -
gtfs.py
contains all code related to interacting with the GTFS static schedule data and GTFS-R live feed. It provides functions to check, download and extract the static GTFS data, and provides aGTFS
class that loads that data, can query the live GTFS feed, and allows the data to be queried for upcoming arrivals at any given stop. It usesstore.py
to record all GTFS data, making it agnostic to whether data is being stored in-process or in redis. It also exposes an entrypoint so it can be run as a standalone command line utility. -
server.py
:- runs
gtfs.py
in a sub-process as-required to download static data and rebuild the cache. - creates an instance of the
gtfs.GTFS
class that it uses to fulfil API requests - starts a thread to manage scheduled tasks
- runs the HTTP server
- runs
Static GTFS data is downloaded to the /data
directory.
The following tips are based on VSCode, but you should be able to adapt them to other IDEs.
Make sure you have set up the virtual environment as follows:
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
If VSCode does not automatically detect this virtual environment, pop open the command palette (CTRL-SHIFT-P) and run "Python: Select Interpreter".
Consider creating a local_settings.py
file, so you don't have to always passing common arguments every time. This file is ignored by git. For example:
API_KEY = "abcdefghijklmnopqrstuvwxyz1234567890"
LOG_LEVEL = "DEBUG"
A launch.json
file is provided for VSCode, which contains a configuration to launch either server.py
or gtfs.py
. You should be able to run either of them at this point.
If you want to debug with redis, you will need a redis instance. See the "Running with Redis" section for info on starting a docker redis container. For debugging purposes I suggest you use your local_settings.py
file to pass a REDIS_URL
(to avoid accidentally committing changes to launch.json
).
Further development and PRs are very welcome.
Unit tests are provided in tests.py
, with test fixtures in test_data/
. Run tests as follows:
python3 -m unittest test