pdf2gtfs can be used to extract schedule data from PDF timetables and turn it into valid GTFS.
It was created as a Bachelor's project and thesis at the chair of Algorithms and Data Structures at the University of Freiburg.
The Bachelor's thesis, which goes into more detail and adds an evaluation, can be found here. A (shorter) blogpost detailing its usage can be found here, though some parts are outdated.
The master branch contains all the latest changes and is unstable. The release branch usually points to the latest tag, though it may contain some additional fixes.
- Linux (Windows should work as well, but I currently do not test this)
- python3.10 or higher (required)
- ghostscript >= 9.56.1-1 (recommended)
Older versions may work as well, but only the versions given above are officially supported.
Note: Using pip won't install those dependencies that are required only for development.
1. Clone the repository:
git clone --recursive https://github.com/heijul/pdf2gtfs.git
cd pdf2gtfs
2. (Optional) Create a venv and activate it (more info):
python3.11 -m venv venv
source venv/bin/activate
Under Windows, activate the venv using `venv\Scripts\activate` instead.
Note: With pip you will have to manually install the development requirements (defined in pyproject.toml).
Note: If pip/poetry complains that no pyproject.toml exists for custom_conf, you forgot to add the --recursive flag. To fix this, simply run:
git submodule update --init --recursive
Using pip:
pip install .
Using poetry (requires poetry, of course):
poetry install
Using poetry, but also install the development requirements:
poetry install --with=dev
Using unittest:
python -m unittest discover test
Using pytest:
pytest test
pdf2gtfs -h
This will provide help on the usage of pdf2gtfs.
pdf2gtfs reads the configuration files in order: the default configuration is read first, followed by any provided config files in the order they were given. Later configurations override earlier ones.
For more information on the config keys and their possible values, check out the default configuration.
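The override order can be pictured with a minimal sketch. It assumes each configuration has already been parsed into a flat dictionary. The function name merge_configs and all example values are hypothetical and not taken from pdf2gtfs, which may handle nested keys differently.

```python
def merge_configs(default, *overrides):
    """Return a new dict in which later configs override earlier ones."""
    merged = dict(default)
    for config in overrides:
        # Keys present in a later config replace the earlier values.
        merged.update(config)
    return merged


# Hypothetical values, for illustration only.
default_config = {"time_format": "%H.%M", "max_row_distance": 3}
base_config = {"close_node_check": True}
line_config = {"max_row_distance": 8, "close_node_check": False}

merged = merge_configs(default_config, base_config, line_config)
print(merged)
```

In this sketch, the value of max_row_distance from the last config wins, while time_format keeps its default because no later config sets it.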
The following examples can be run from the examples directory and show how some config values change the accuracy of the detected locations, as well as whether the PDF can be read at all. The base.yaml config only contains some basic output settings, used by all examples.
Before you run these, switch to the examples directory:
cd examples
Example 1: Tram Line 1 of the VAG
Uses the default configuration, with the exception of the routetype.
pdf2gtfs --config=base.yaml --config=vag_1.yaml vag_1.pdf
Example 2: Subway Line S1 of the KVV
The max_row_distance needs to be adjusted to read this PDF properly.
pdf2gtfs --config=base.yaml --config=kvv_s1.yaml kvv_s1.pdf
Example 3: RegionalExpress Lines RE2/RE3 of the GVH
The close_node_check needs to be disabled, because it incorrectly disregards valid locations that seem too far away.
Note: This example uses the legacy table extraction, because the new one (currently) results in errors.
pdf2gtfs --config=base.yaml --config=gvh_re2_re3.yaml gvh_re2_re3.pdf
Example 4: Bus Line 680 of the Havelbus
Here, disabling the close_node_check leads to far better results as well.
Note that the config also contains some other settings, which lead to a similar result.
Note: This example uses the legacy table extraction, because the new one (currently) results in errors.
pdf2gtfs --config=base.yaml --config=havelbus_680.yaml havelbus_680.pdf
Example 5: Line G10 of the RMV
Reading page 4 currently fails, and reading more than one page leads to worse results in the location detection. This can happen because the average of all locations for a specific stop is used.
pdf2gtfs --config=base.yaml --config=rmv_g10.yaml rmv_g10.pdf
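To see why averaging can make the result worse, consider a rough sketch in which one stop is located on three pages and one of the three detections is wrong. The function name average_location and the coordinates are made up for illustration; this is not the code pdf2gtfs uses.

```python
from statistics import fmean


def average_location(locations):
    """Return the mean (lat, lon) of a list of (lat, lon) pairs."""
    lats, lons = zip(*locations)
    return fmean(lats), fmean(lons)


# Two pages agree on the location; the third page yields a wrong match.
detections = [(50.0, 8.0), (50.0, 8.0), (50.3, 8.6)]
lat, lon = average_location(detections)
print(lat, lon)
```

The single wrong detection pulls the averaged location away from the two correct ones, so the final stop position matches none of the detected nodes.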
In principle, pdf2gtfs works in 3 steps:
- Extract the timetable data from the PDF
- Create the GTFS in memory
- Detect the locations of the stops using the stop names and their order.
Finally, the GTFS feed is saved on disk, after adding the locations.
The following are rough descriptions of how each of these steps is performed.
- Use ghostscript to remove all images and vector drawings from the PDF
- Use pdfminer.six to extract the text from the PDF
- Split the LTTextLine objects of pdfminer.six into words
- Detect the words that are times using the time_format config-key
- Define the body of the table using the times
- Add cells to the table that overlap with its rows/columns
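The word splitting and time detection can be sketched as follows. This is a simplified illustration, assuming that time_format is a strptime-style format string and that words are separated by whitespace. The helper names is_time and split_times, the default format and the sample line are hypothetical; the real extraction works on LTTextLine objects and their coordinates.

```python
from datetime import datetime


def is_time(word, time_format="%H.%M"):
    """Return True if the word can be parsed with the given format."""
    try:
        datetime.strptime(word, time_format)
    except ValueError:
        return False
    return True


def split_times(line, time_format="%H.%M"):
    """Split a text line into words and separate times from other words."""
    times, others = [], []
    for word in line.split():
        if is_time(word, time_format):
            times.append(word)
        else:
            others.append(word)
    return times, others


times, others = split_times("Hauptbahnhof 05.12 05.42 06.12")
print(times, others)
```

In the actual table extraction, the positions of the detected times would then define the rows and columns of the table body.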
- If an agency.txt is given using the input_files option and it contains a single entry, use that agency by default. If it contains multiple entries, ask the user to choose which agency should be used.
- If a stops.txt is given using the input_files option, search it for the stops.
- Create a basic skeleton of the required GTFS files
- In case the tables contain annotations, create a new calendar.txt entry for each annotation and date combination.
- Ask the user to input dates on which there is an exception in the service; these are added to the calendar_dates.txt
- Iterate through the TimeTableEntries of all TimeTables and add their data to the stop_times.txt.
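As a rough illustration of the last step, the sketch below turns a list of trips into the text of a stop_times.txt file. The column names are the required fields from the GTFS specification. The data layout (one list of stop/time pairs per trip), the function names and the generated trip IDs are assumptions made for this example and do not mirror the TimeTableEntries class.

```python
import csv
import io

FIELDS = ["trip_id", "arrival_time", "departure_time",
          "stop_id", "stop_sequence"]


def stop_times_rows(trip_id, entry):
    """Yield one stop_times row per (stop_id, time) pair of a single trip."""
    for sequence, (stop_id, time) in enumerate(entry):
        yield {
            "trip_id": trip_id,
            # The sketch assumes arrival and departure are identical.
            "arrival_time": time,
            "departure_time": time,
            "stop_id": stop_id,
            "stop_sequence": sequence,
        }


def write_stop_times(entries):
    """Return the stop_times.txt content for a list of trips as a string."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for trip_number, entry in enumerate(entries):
        writer.writerows(stop_times_rows(f"trip_{trip_number}", entry))
    return buffer.getvalue()


entries = [[("s1", "05:12:00"), ("s2", "05:15:00")]]
stop_times_txt = write_stop_times(entries)
print(stop_times_txt)
```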
This is only done if there is no stops.txt input file, or if the given file does not contain all necessary stops.
- Get a list of all stop locations (nodes) along with their name, type and some attributes from OpenStreetMap (OSM) using QLever.
- Normalize the names of the nodes by stripping any non-letter symbols and expanding any abbreviations
- For each stop of the detected tables, find those nodes that contain every word of the (normalized) stop name.
- Add basic costs:
- Name costs, based on the difference in length between a stop's name and any of the node's names. (This works because of the normalization.)
- Node costs, based on the selected gtfs_routetype and the attributes of the node.
- Use Dijkstra's algorithm to find the nodes with the lowest cost. The cost of a node is simply the sum of its name, node and travel costs. The travel cost is calculated using either a "closer-is-better" approach or a "closer-to-expected-distance-is-better" approach.
- If any of the stops was found in the stops.txt file (if given), its location will be used instead of checking the OSM data.
- If the location of a stop was not found, it is interpolated using the surrounding stop locations.
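The name normalization, the search for matching nodes and the name cost can be sketched as follows. This is a simplified illustration, not the actual implementation: the abbreviation table, the set of allowed letters and the function names are assumptions, and the name cost here is simply the absolute difference in length of the normalized names.

```python
import re

# Hypothetical abbreviation table; the real list is much longer.
ABBREVIATIONS = {"hbf": "hauptbahnhof", "str": "strasse"}


def normalize(name):
    """Lowercase, drop non-letter symbols and expand abbreviations."""
    words = re.sub(r"[^a-zäöüß ]", " ", name.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)


def candidates(stop_name, node_names):
    """Return the node names that contain every word of the stop name."""
    stop_words = set(normalize(stop_name).split())
    return [name for name in node_names
            if stop_words <= set(normalize(name).split())]


def name_cost(stop_name, node_name):
    """Cost based on the length difference of the normalized names."""
    return abs(len(normalize(node_name)) - len(normalize(stop_name)))


nodes = ["Freiburg (Breisgau) Hauptbahnhof", "Freiburg Messe"]
matches = candidates("Freiburg Hbf.", nodes)
costs = {name: name_cost("Freiburg Hbf.", name) for name in matches}
print(costs)
```

Because both names are normalized first, a node whose name has exactly the same words as the stop gets a name cost of zero, while additional words in the node name increase the cost.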
The first two steps are generally the slowest steps of the location detection. Therefore, we cache the result and use the cache, if possible.
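A cache of this kind could look roughly like the sketch below. It assumes the result is JSON-serializable and keyed by the query string; the function cached_fetch, the file layout and the stand-in fetch function are hypothetical and do not reflect how pdf2gtfs actually stores its cache.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_fetch(query, fetch, cache_dir):
    """Return the cached result for query, calling fetch only on a miss."""
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    cache_file = Path(cache_dir) / f"{key}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))
    result = fetch(query)
    cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result


calls = []


def fake_fetch(query):
    """Stand-in for the slow OSM request; records how often it is called."""
    calls.append(query)
    return [{"name": "Hauptbahnhof", "lat": 47.9977, "lon": 7.8421}]


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_fetch("all stop nodes", fake_fetch, cache_dir)
    second = cached_fetch("all stop nodes", fake_fetch, cache_dir)

print(len(calls))
```

The second call is answered from the cache file, so the slow fetch only runs once per distinct query.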
The new table extraction, as well as the overall process and evaluation of pdf2gtfs are detailed in my Bachelor's thesis. There is also a blogpost, which describes the previously used table extraction and provides a shorter overview on how pdf2gtfs works.
If something is not working or is missing, feel free to create an issue.
Copyright 2022 Julius Heinzinger
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.