Inference Engine Traces

Introduction

As described on the Inference Engine Debugging page, debugging or testing of the Inference Engine requires “traces” — samples of data simulating that which would come from a bus (i.e. the inputs) and, ideally, the desired results produced by the Inference Engine (i.e. the outputs). Traces that have the desired/specified outputs are referred to as “labeled” traces, because the inputs are “labeled” with the desired outputs.

This page describes the format of trace files, and the process/guidelines by which to create a trace file.

There are many examples of traces created according to the format and conventions described in this page already in the onebusaway-nyc repository: https://github.com/camsys/onebusaway-nyc/tree/master/onebusaway-nyc-integration-tests/src/integration-test/resources/traces

To create an integration test from the trace file, see Creating an Integration Test.

Trace File Format

Each trace is embodied by a single plain text CSV file, with the columns described below.

Input Columns:

These are the columns that define the inputs to the Inference Engine.

Column	Description	Required	Example	Notes
vid	Vehicle ID, fully qualified with agency ID	Required	MTA NYCT_7564
lat	Latitude	Required	40.553357
lon	Longitude	Required	-74.117308
operator_id	Numeric employee ID	Required	123456	Actual value is irrelevant, can use arbitrary value.
reported_run_id	Run ID transmitted from the bus. Not agency-qualified.	Optional?	63-101	This column is not expected to match directly to run IDs in the bundle; it is fuzzy-matched.
assigned_run_id	Run ID assigned to the operator whose ID was received from the bus. Not agency-qualified.	Optional	B63-101	This column must match exactly a run in the bundle.
timestamp	Timestamp, in YYYY-MM-DD HH:MM:SS	Required	2012-01-11 09:13:52	This is assumed to be in the same timezone as the location of the bundle.
dsc	Destination Sign Code, numeric	Optional	4630
direction_deg	Bearing of the bus, decimal	Optional	80.28	North is… 0?
speed	Speed, in integer MPH	Optional	35
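
Before a trace is used, it can help to check that the required input columns are present and well formed. The following is a minimal Python sketch of such a check, not part of OBA-NYC: the function name check_input_columns and the wording of the messages are hypothetical, and the list of required columns and the timestamp format are taken from the table above.

```python
import csv
import io
from datetime import datetime

# Required input columns, taken from the table above.
REQUIRED_INPUT_COLUMNS = ["vid", "lat", "lon", "operator_id", "timestamp"]
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # YYYY-MM-DD HH:MM:SS


def check_input_columns(csv_text):
    """Return a list of human-readable problems found in the trace inputs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    problems = [f"missing column: {name}"
                for name in REQUIRED_INPUT_COLUMNS if name not in header]
    if problems:
        return problems
    # Data rows start on line 2 of the file (line 1 is the header).
    for line_no, row in enumerate(reader, start=2):
        for name in REQUIRED_INPUT_COLUMNS:
            if not (row[name] or "").strip():
                problems.append(f"line {line_no}: empty {name}")
        try:
            float(row["lat"])
            float(row["lon"])
        except (TypeError, ValueError):
            problems.append(f"line {line_no}: lat/lon is not a decimal number")
        try:
            datetime.strptime((row["timestamp"] or "").strip(), TIMESTAMP_FORMAT)
        except ValueError:
            problems.append(f"line {line_no}: bad timestamp {row['timestamp']!r}")
    return problems


# The second data row has a timestamp in a format Excel might produce.
sample = (
    "vid,lat,lon,operator_id,timestamp,dsc\n"
    "MTA NYCT_7564,40.553357,-74.117308,123456,2012-01-11 09:13:52,4630\n"
    "MTA NYCT_7564,40.553400,-74.117100,123456,1/11/2012 9:14,4630\n"
)
print(check_input_columns(sample))
```

The check only covers the columns marked Required; optional columns such as dsc or speed are passed over untouched.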

Output Columns (aka “labels”):

These are the columns that define the expected outputs of the Inference Engine. Integration tests check the inferred outputs only against those columns that are provided in the trace.

Column	Description	Required	Example	Notes
actual_is_run_formal	Boolean indicator for formal inference: TRUE or FALSE	Required	FALSE
actual_run_id	Run ID, not agency-qualified	Required only if actual_is_run_formal = TRUE	B63-101
actual_trip_id	Trip ID, fully agency-qualified	Optional	MTA NYCT_JG_C3-Weekday-SDon-080200_B35_27
actual_block_id	Block ID, fully agency-qualified	Optional	MTA NYCT_JG_C3-Weekday-SDon_E_JG_46920_B35-27
actual_dsc	Destination Sign Code	Optional	4630
actual_phase	Operational Phase (see Inference Engine page)	Optional	IN_PROGRESS	Also supports prefixes (e.g. “LAYOVER_” meaning any layover phase) and/or multiple values separated by “+” character (e.g. “IN_PROGRESS+LAYOVER_”)
actual_status	Operational Status (see Inference Engine page)	Optional	default	Also supports multiple values separated by ‘+’ character (e.g. “default+stalled”)
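
To make the prefix and “+” notation for actual_phase concrete, here is a minimal Python sketch of how such a label could be interpreted. It is not the Inference Engine's own matching code. It assumes that matching any one of the “+”-separated alternatives is sufficient, that an alternative ending in “_” is a prefix, that matching is case-sensitive, and that an empty cell means the column is not tested for that row. The function name phase_matches is hypothetical, and phase names such as LAYOVER_DURING are used only for illustration.

```python
def phase_matches(expected, inferred):
    """Return True if an inferred phase satisfies an actual_phase label.

    Assumptions (not confirmed by the Inference Engine source):
    - an empty label means the phase is not tested for this row;
    - alternatives are separated by "+", and any one may match;
    - an alternative ending in "_" matches any phase with that prefix.
    """
    expected = expected.strip()
    if not expected:
        return True
    for alternative in expected.split("+"):
        alternative = alternative.strip()
        if alternative.endswith("_"):
            if inferred.startswith(alternative):
                return True
        elif inferred == alternative:
            return True
    return False


# Illustrative phase names; see the Inference Engine page for the real list.
print(phase_matches("IN_PROGRESS+LAYOVER_", "LAYOVER_DURING"))
print(phase_matches("IN_PROGRESS", "LAYOVER_DURING"))
```

Under these assumptions the first call accepts the inferred phase through the LAYOVER_ prefix, while the second rejects it because only an exact IN_PROGRESS is allowed.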

Creating Traces

Since traces are CSV files, there are any number of ways to create them. They can be synthesized completely from scratch, for example if there are no actual vehicles installed with tracking equipment. More commonly, traces are created because an OBA-NYC system is up and running, but some bug or erroneous behavior in the Inference Engine needs to be investigated or changed.

To generate a trace from an existing OBA-NYC system, the easiest way to start is to create the trace from the database. This usually includes both the input fields and output columns described above. Typically the results that were inferred in actual operation are the starting point for creating the desired/actual results.

TODO: Document which columns from the OBA-NYC databases (obanyc_cclocationreport + obanyc_inferredlocation, or obanyc_reporting) are typically used to populate the input and (initial) output/actual columns.

Given a trace file that has been populated from the OBA-NYC databases, the trace file is typically modified using Microsoft Excel according to the following procedures.

Ensure that the timestamp column is in the correct format (which it typically will not be after Excel reads the CSV file). Change it to a Custom format, with format string ‘yyyy-mm-dd hh:mm:ss’.
If the file has column names of ‘inferred_*’ (e.g. ‘inferred_run_id’), change them (e.g. using Find/Replace) to ‘actual_*’ (e.g. ‘actual_run_id’)
Remove any ‘actual_*’ columns that are not accepted in the trace format described above (e.g. actual_service_date, actual_distance_along_block, actual_distance_along_trip, actual_block_lat, actual_block_lon)
If missing, add the ‘actual_is_run_formal’ column, as it is required.
Remove, add, or modify the ‘actual_*’ columns according to what is actually being tested by this particular trace, as discussed below.
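
The mechanical parts of the procedure above (renaming, removing unsupported columns, adding the required column) can also be scripted. The following is a minimal Python sketch under stated assumptions, not an existing OBA-NYC tool: the function name clean_trace is hypothetical, and filling a newly added actual_is_run_formal column with FALSE is only a placeholder that still has to be reviewed by hand. Because the file never passes through Excel, timestamps are left exactly as exported.

```python
import csv
import io

# Output columns accepted by the trace format described on this page.
ACCEPTED_ACTUAL_COLUMNS = {
    "actual_is_run_formal", "actual_run_id", "actual_trip_id",
    "actual_block_id", "actual_dsc", "actual_phase", "actual_status",
}


def clean_trace(csv_text):
    """Rename inferred_* columns to actual_*, drop unsupported actual_*
    columns, and add actual_is_run_formal if it is missing."""
    reader = csv.DictReader(io.StringIO(csv_text))
    renamed = {}
    for name in reader.fieldnames or []:
        new_name = name
        if name.startswith("inferred_"):
            new_name = "actual_" + name[len("inferred_"):]
        renamed[name] = new_name
    # Keep every input column, and only the accepted actual_* columns.
    kept = [old for old, new in renamed.items()
            if not new.startswith("actual_") or new in ACCEPTED_ACTUAL_COLUMNS]
    header = [renamed[old] for old in kept]
    add_formal = "actual_is_run_formal" not in header
    if add_formal:
        header.append("actual_is_run_formal")

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(header)
    for row in reader:
        values = [row[old] for old in kept]
        if add_formal:
            values.append("FALSE")  # assumed placeholder; review by hand
        writer.writerow(values)
    return out.getvalue()


raw = (
    "vid,timestamp,inferred_phase,inferred_distance_along_trip\n"
    "MTA NYCT_7564,2012-01-11 09:13:52,IN_PROGRESS,1234.5\n"
)
print(clean_trace(raw))
```

The last step of the procedure, deciding which actual_* columns the trace should really constrain, is a judgment call and is deliberately left out of the script.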

Prototype SQL query

This SQL query can be used as a prototype for generating a trace file. If executed in a SQL tool (e.g. DbVisualizer) the results can generally then be exported as a CSV in the right format. Obviously the specifics of the WHERE clause need to be adjusted to get the exact records for a trace; this is just an example.

SELECT
    COALESCE(cc.vehicle_id, '')               AS vid,
    COALESCE(cc.latitude, '')                 AS lat,
    COALESCE(cc.longitude, '')                AS lon,
    COALESCE(cc.operator_id_designator, '')   AS operator_id,
    COALESCE(cc.run_id_designator, '')        AS reported_run_id,
    COALESCE(inf.assigned_run_number, '')     AS assigned_run_id,
    COALESCE(cc.time_reported, '')            AS timestamp,
    COALESCE(cc.dest_sign_code, '')           AS dsc,
    COALESCE(cc.direction_deg, '')            AS direction_deg,
    COALESCE(cc.speed, '')                    AS speed,
    COALESCE('', '')                          AS assigned_block_id,
    inf.inference_is_formal                   AS actual_is_run_formal,
    COALESCE(inf.inferred_run_id, '')         AS actual_run_id,
    COALESCE(inf.inferred_trip_id, '')        AS actual_trip_id,
    COALESCE(inf.inferred_block_id, '')       AS actual_block_id,
    COALESCE(inf.inferred_dest_sign_code, '') AS actual_dsc,
    COALESCE(inf.inferred_phase, '')          AS actual_phase,
    COALESCE(inf.inferred_status, '')         AS actual_status
FROM
    (
        SELECT
            *
        FROM
            obanyc_cclocationreport
        WHERE
            vehicle_id=423
        AND time_reported>='2015-01-26'
        AND time_reported <= '2015-01-27') cc
LEFT OUTER JOIN
    (
        SELECT
            *
        FROM
            obanyc_inferredlocation
        WHERE
            vehicle_id=423
        AND time_reported>='2015-01-26'
        AND time_reported <= '2015-01-27') inf
ON
    cc.uuid=inf.uuid
ORDER BY
    cc.time_reported

Labeling Traces

Deciding (a) which ‘actual_*’ columns should be in the trace, and (b) the values of those columns, is the most subtle part of this process. It depends on understanding exactly what the trace is attempting to accomplish in terms of constraining the behavior of the Inference Engine in a desired yet feasible manner. As such, it is not possible to thoroughly document all ways in which the output/actual columns would be populated.

Nevertheless, below are some of the common guidelines for populating the output/actual columns of a trace file given experience to date.

Typically the first 2-3 rows of a trace do not have any actual_* values (except actual_is_run_formal=FALSE). This is to give the Inference Engine time to ‘warm up’ when it starts the trace.
With the exception of the first 2-3 rows, it is good practice to always specify actual_phase and actual_status.
It is typical to insert some amount of ‘slop’ in the actual_phase column during transitions to/from LAYOVER_ states, unless the point of the trace is to specifically enforce the timing of those transitions down to the single-update level. This ‘slop’ consists of having the actual_phase be a combined value of DEADHEAD_+LAYOVER_ for a couple of updates surrounding the transition between a deadhead and a layover state (or vice versa). Likewise for LAYOVER_+IN_PROGRESS surrounding the transition between a layover and an in progress state (or vice versa).
actual_block_id or actual_run_id should only be specified if actual_is_run_formal is TRUE.
actual_run_id and actual_block_id are typically not both required, as there is a certain equivalency between blocks and runs. Beware, however, of scheduled mid-route and terminal reliefs, during which a run would change but a block would not.
actual_trip_id is rarely used. Either it is implied by the run or block ID in a formal inference case, or it is overly specific for informal inference (in which case actual_dsc is preferable, see below).
For informal inference, actual_dsc is typically used to constrain the inference to a certain route and direction (since the traces do not explicitly accommodate route_id or direction).
actual_dsc (or actual_trip_id) is typically specified only for trace rows with actual_phase of IN_PROGRESS. The exception to this would be a trace that specifically enforces when during a layover the Inference Engine changes the inferred trip.
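
Some of the guidelines above are mechanical enough to check with a script. The following is a minimal Python sketch of such a check, not an existing OBA-NYC tool. The names lint_labels and WARM_UP_ROWS are hypothetical, the warm-up length of 3 rows is an assumption taken from the upper end of the “2-3 rows” guideline, and the results are warnings only, since the guidelines describe typical practice and have legitimate exceptions.

```python
import csv
import io

WARM_UP_ROWS = 3  # assumption: upper end of the "2-3 rows" guideline


def _cell(row, name):
    """Return a trimmed cell value, or "" if the column is absent or empty."""
    return (row.get(name) or "").strip()


def lint_labels(csv_text):
    """Return warnings for rows that depart from the labeling guidelines."""
    warnings = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for index, row in enumerate(reader):
        if index < WARM_UP_ROWS:
            continue  # warm-up rows are not checked
        line_no = index + 2  # line 1 of the file is the header
        formal = _cell(row, "actual_is_run_formal").upper() == "TRUE"
        phase = _cell(row, "actual_phase")
        if not phase or not _cell(row, "actual_status"):
            warnings.append(f"line {line_no}: actual_phase/actual_status not set")
        if not formal and (_cell(row, "actual_run_id") or _cell(row, "actual_block_id")):
            warnings.append(f"line {line_no}: run/block id given but inference is not formal")
        has_dsc_or_trip = _cell(row, "actual_dsc") or _cell(row, "actual_trip_id")
        if has_dsc_or_trip and "IN_PROGRESS" not in phase.split("+"):
            warnings.append(f"line {line_no}: dsc/trip id given outside IN_PROGRESS")
    return warnings


# Input columns are omitted here for brevity; a real trace also needs them.
labels = (
    "actual_is_run_formal,actual_run_id,actual_dsc,actual_phase,actual_status\n"
    "FALSE,,,,\n"
    "FALSE,,,,\n"
    "FALSE,,,,\n"
    "FALSE,,4630,IN_PROGRESS,default\n"
    "FALSE,B63-101,,LAYOVER_+IN_PROGRESS,default\n"
    "TRUE,B63-101,4630,LAYOVER_,default\n"
)
for warning in lint_labels(labels):
    print(warning)
```

A warning on a layover row that carries actual_dsc may be intentional, for example in a trace that enforces when the inferred trip changes during a layover.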

Integration Tests

With a trace created, consider adding it to an integration test to continuously verify the intended behaviour. See Creating an Integration Test.
