This repository contains pipelines for finding anchorages and associated port-visit events.
You only need Docker and Docker Compose installed on your machine to run the pipeline. No other dependencies are required.
The pipeline reads its input from BigQuery, so you first need to authenticate with your Google Cloud account inside the Docker images. To do that, run this command and follow the instructions:
docker-compose run gcloud auth application-default login
The pipeline includes a CLI that can be used to start both local test runs and remote full runs. Just run
docker-compose run [anchorages|name_anchorages|port_events] --help
and follow the instructions there.
The most common manual task is updating the named anchorages, which needs to be done whenever the anchorage overrides are edited. If you need to build and upload a new Docker container, use the following commands:
docker build -f Dockerfile-worker -t gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test .
docker push gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test
Then run:
docker-compose run name_anchorages \
--job_name name-anchorages \
--input_table CURRENT_UNNAMED_ANCHORAGES \
--output_table TARGET_DATASET.TARGET_TABLE \
--config ./name_anchorages_cfg.yaml \
--max_num_workers 100 \
--fishing_ssvid_list gs://machine-learning-dev-ttl-120d/fishing_mmsi.txt \
--project world-fishing-827 \
--staging_location gs://machine-learning-dev-ttl-120d/anchorages/anchorages/output/staging \
--temp_location gs://machine-learning-dev-ttl-120d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1 \
--sdk_container_image gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test \
--experiments=use_runner_v2
where CURRENT_UNNAMED_ANCHORAGES is the current (typically most recent) unnamed anchorages table and TARGET_DATASET.TARGET_TABLE is where the named anchorages are stored. I often put this in a temporary table for inspection, then copy it to its final destination.
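The inspect-then-copy step can be sketched with the bq command-line tool. This is a minimal sketch, not part of the pipeline: both table names are hypothetical placeholders, and the helper only prints the command so it can be reviewed before running it by hand.

```shell
# Hypothetical table names; replace them with your own.
TMP_TABLE="machine_learning_dev_ttl_120d.named_anchorages_tmp"
FINAL_TABLE="TARGET_DATASET.TARGET_TABLE"

copy_table_cmd() {
  # Print the bq command that copies table $1 to table $2.
  # -n (no clobber) refuses to overwrite an existing destination table.
  printf 'bq cp -n %s %s\n' "$1" "$2"
}

copy_table_cmd "$TMP_TABLE" "$FINAL_TABLE"
```

Run the printed command once the temporary table looks right.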
Run:
docker-compose run anchorages \
--job_name JOB-NAME \
--start_date YYYY-MM-DD \
--end_date YYYY-MM-DD \
--messages_thinned_table DATASET.messages_thinned_ \
--output_table OUTPUT_DATASET_NAME.OUTPUT_TABLE_NAME \
--config anchorage_cfg.yaml \
--max_num_workers MAX_WORKER \
--fishing_ssvid_list GCS_LOCATION_OF_FISHING_SSVID_LIST \
--project PROJECT-NAME \
--requirements_file requirements-worker-frozen.txt \
--staging_location GCS_STAGING_LOCATION \
--temp_location GCS_TEMP_LOCATION \
--setup_file ./setup.py \
--runner DataflowRunner \
--worker_machine_type=custom-1-13312-ext \
--disk_size_gb 200
Standard Dataflow options must also be specified.
For example, to run all years:
docker-compose run anchorages \
--job_name unnamed-anchorages \
--start_date 2012-01-01 \
--end_date 2019-06-30 \
--messages_thinned_table pipe_production_b.messages_thinned_ \
--output_table machine_learning_dev_ttl_120d.unnamed_anchorages_v20190816 \
--config anchorage_cfg.yaml \
--max_num_workers 300 \
--fishing_ssvid_list gs://machine-learning-dev-ttl-120d/fishing_mmsi.txt \
--project world-fishing-827 \
--requirements_file requirements-worker-frozen.txt \
--staging_location gs://machine-learning-dev-ttl-120d/anchorages/anchorages/output/staging \
--temp_location gs://machine-learning-dev-ttl-120d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--worker_machine_type=custom-1-13312-ext \
--disk_size_gb 200
Or to run a minimal testing run:
docker-compose run anchorages \
--job_name unnamed-anchorages \
--start_date 2017-01-01 \
--end_date 2017-01-31 \
--messages_thinned_table machine_learning_dev_ttl_120d.messages_segmented_ \
--output_table machine_learning_dev_ttl_120d.unnamed_anchorages_test \
--config anchorage_cfg.yaml \
--max_num_workers 200 \
--fishing_ssvid_list gs://machine-learning-dev-ttl-120d/fishing_mmsi.txt \
--project world-fishing-827 \
--requirements_file requirements-worker-frozen.txt \
--staging_location gs://machine-learning-dev-ttl-120d/anchorages/anchorages/output/staging \
--temp_location gs://machine-learning-dev-ttl-120d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--worker_machine_type=custom-1-13312-ext \
--disk_size_gb 200
Note that fishing_ssvid_list should refer to a file on GCS.
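A quick pre-flight check can catch a local path or a badly formatted date before a Dataflow job is launched. The sketch below is illustrative only: the helper names is_gcs_path and is_iso_date are hypothetical, and the date check tests only the YYYY-MM-DD shape, not whether the date exists.

```shell
is_gcs_path() {
  # Succeed only if $1 starts with gs:// and has something after it.
  case "$1" in
    gs://?*) return 0 ;;
    *) return 1 ;;
  esac
}

is_iso_date() {
  # Succeed only if $1 has the zero-padded shape YYYY-MM-DD.
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

FISHING_SSVID_LIST="gs://machine-learning-dev-ttl-120d/fishing_mmsi.txt"
START_DATE="2017-01-01"

is_gcs_path "$FISHING_SSVID_LIST" || echo "fishing_ssvid_list must be a gs:// path" >&2
is_iso_date "$START_DATE" || echo "start_date must be YYYY-MM-DD" >&2
```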
After a set of anchorages is created, names are assigned using name_anchorages_main.
For example:
docker-compose run name_anchorages \
--job_name name-anchorages \
--input_table machine_learning_dev_ttl_120d.unnamed_anchorages_test \
--output_table machine_learning_dev_ttl_120d.named_anchorages_test \
--config ./name_anchorages_cfg.yaml \
--max_num_workers 200 \
--fishing_ssvid_list gs://machine-learning-dev-ttl-120d/fishing_mmsi.txt \
--project world-fishing-827 \
--requirements_file requirements-worker-frozen.txt \
--staging_location gs://machine-learning-dev-ttl-120d/anchorages/anchorages/output/staging \
--temp_location gs://machine-learning-dev-ttl-120d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100
or
docker-compose run name_anchorages \
--job_name name-anchorages \
--input_table anchorages.unnamed_anchorages_v20190816 \
--output_table machine_learning_dev_ttl_120d.named_anchorages_v20210429 \
--config ./name_anchorages_cfg.yaml \
--max_num_workers 100 \
--fishing_ssvid_list gs://machine-learning-dev-ttl-120d/fishing_mmsi.txt \
--project world-fishing-827 \
--requirements_file requirements-worker-frozen.txt \
--staging_location gs://machine-learning-dev-ttl-120d/anchorages/anchorages/output/staging \
--temp_location gs://machine-learning-dev-ttl-120d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1
To update a day of events, run, for example:
docker-compose run thin_port_messages \
--job_name porteventstest \
--input_table pipe_production_v20201001.position_messages_ \
--anchorage_table anchorages.named_anchorages_v20201104 \
--start_date 2018-01-01 \
--end_date 2018-01-07 \
--output_table machine_learning_dev_ttl_120d.port_visit_msgs_v20220927_ \
--project world-fishing-827 \
--max_num_workers 100 \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1 \
--sdk_container_image gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test \
--experiments=use_runner_v2
docker-compose run port_visits \
--job_name portmessagestest \
--thinned_message_table machine_learning_dev_ttl_120d.port_visit_msgs_v20220927_ \
--end_date 2018-01-07 \
--vessel_id_table pipe_production_v20201001.segment_info \
--anchorage_table anchorages.named_anchorages_v20201104 \
--output_table machine_learning_dev_ttl_120d.port_visits_v20220927_ \
--project world-fishing-827 \
--max_num_workers 100 \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1 \
--sdk_container_image gcr.io/world-fishing-827/pipe-anchorage/worker:tim_test \
--experiments=use_runner_v2 \
--bad_segs "(SELECT DISTINCT seg_id FROM world-fishing-827.gfw_research.pipe_v20201001_segs WHERE overlapping_and_short)"
docker-compose run port_events \
--job_name porteventstest \
--input_table pipe_production_v20201001.position_messages_ \
--anchorage_table anchorages.named_anchorages_v20201104 \
--start_date 2018-01-01 \
--end_date 2018-12-31 \
--output_table machine_learning_dev_ttl_120d.raw_port_events_v20210506_ \
--state_table machine_learning_dev_ttl_120d.port_port_state_v20210506_ \
--project world-fishing-827 \
--max_num_workers 100 \
--requirements_file requirements-worker-frozen.txt \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1 \
--ssvid_filter '(select cast(vi_ssvid as string) from machine_learning_dev_ttl_120d.vessel_list_new_visits_5_6_21)'
docker-compose run port_events \
--job_name porteventstest \
--input_table pipe_production_v20201001.position_messages_ \
--anchorage_table anchorages.named_anchorages_v20201104 \
--start_date 2017-01-01 \
--end_date 2021-04-30 \
--output_table machine_learning_dev_ttl_120d.port_event_test_v20210506_events_ \
--state_table machine_learning_dev_ttl_120d.port_event__test_v20210506_batch_state_ \
--project world-fishing-827 \
--max_num_workers 100 \
--requirements_file requirements-worker-frozen.txt \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1
For a full list of options run:
python -m port_events -h
To create a corresponding day of visits do:
docker-compose run port_visits \
--job_name portvisitstest \
--events_table machine_learning_dev_ttl_120d.raw_port_events_v20210506_ \
--vessel_id_table pipe_production_v20201001.segment_info \
--bad_segs_table "(SELECT DISTINCT seg_id FROM world-fishing-827.gfw_research.pipe_v20201001_segs WHERE overlapping_and_short)" \
--start_date 2018-01-01 \
--end_date 2018-12-31 \
--output_table machine_learning_dev_ttl_120d.port_visit_test_v20210506_stableid \
--project world-fishing-827 \
--max_num_workers 50 \
--requirements_file requirements-worker-frozen.txt \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1
docker-compose run port_visits \
--job_name portvisitstest \
--events_table machine_learning_dev_ttl_120d.port_event_test_v20210506_events_ \
--vessel_id_table pipe_production_v20201001.segment_info \
--bad_segs_table "(SELECT DISTINCT seg_id FROM world-fishing-827.gfw_research.pipe_v20201001_segs WHERE overlapping_and_short)" \
--start_date 2017-01-01 \
--end_date 2021-04-30 \
--output_table machine_learning_dev_ttl_120d.port_visit_test_v20210506_stableid \
--project world-fishing-827 \
--max_num_workers 50 \
--requirements_file requirements-worker-frozen.txt \
--staging_location gs://machine-learning-dev-ttl-30d/anchorages/portevents/output/staging \
--temp_location gs://machine-learning-dev-ttl-30d/anchorages/temp \
--setup_file ./setup.py \
--runner DataflowRunner \
--disk_size_gb 100 \
--region us-central1
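The commands above repeat the same handful of Dataflow options. One way to cut the repetition is a small wrapper that appends the shared options to the job-specific ones. This is only a sketch: dataflow_cmd and DATAFLOW_OPTS are hypothetical names, the option values are taken from the examples in this README, and the function echoes the command instead of running it.

```shell
# Options shared by the examples in this README; adjust to your project.
DATAFLOW_OPTS="--project world-fishing-827 --runner DataflowRunner --setup_file ./setup.py --region us-central1"

dataflow_cmd() {
  # $1 is the docker-compose service; the remaining arguments are job-specific flags.
  service=$1
  shift
  # DATAFLOW_OPTS is left unquoted on purpose so it splits into separate words.
  echo docker-compose run "$service" "$@" $DATAFLOW_OPTS
}

dataflow_cmd port_visits --job_name portvisitstest --max_num_workers 50
```

Remove the echo to execute the command instead of printing it.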
Parameters controlling the generation of anchorages and port_visits are stored in a .yaml file. By default this information is read from config.yaml, but a different configuration can be specified using the --config flag.
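In a wrapper script, the override can be made optional so that the default is used when nothing is supplied. The helper below is a hypothetical sketch, not part of the pipeline: it prints a --config flag only when a path is given.

```shell
config_flag() {
  # Print "--config PATH" when $1 is non-empty; print nothing otherwise,
  # so the pipeline falls back to its default configuration file.
  if [ -n "$1" ]; then
    printf '%s %s\n' "--config" "$1"
  fi
}

config_flag "./name_anchorages_cfg.yaml"
```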
Please perform the following steps.
- Branch the git repo.
- Create a pull request (PR).
- Add a description of the changes with a link to the PR in CHANGES.md. In some cases multiple PRs may be involved in a single version change; list them all.
- Bump the version in VERSION.
- Merge PR after review. Note that if using squash merge, you may need to fiddle with the commit references.
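The version bump in the steps above can be scripted. The sketch below assumes that VERSION holds a plain MAJOR.MINOR.PATCH string, which this README does not confirm, and bump_patch is a hypothetical helper name.

```shell
bump_patch() {
  # Split "MAJOR.MINOR.PATCH" with parameter expansion and increment PATCH.
  major=${1%%.*}
  rest=${1#*.}
  minor=${rest%%.*}
  patch=${rest#*.}
  printf '%s.%s.%s\n' "$major" "$minor" "$((patch + 1))"
}

bump_patch "1.2.3"   # prints 1.2.4
```

To apply it, pass the current contents of VERSION to bump_patch and write the result back to the file.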
Copyright 2017 Global Fishing Watch
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.