Input Data

Arya Massarat edited this page Sep 11, 2020 · 15 revisions

Our pipeline uses overlapping drone imagery taken by a DJI Phantom drone. There are several ways to provide this input data to the pipeline. For this example, we illustrate only the simplest directory structure.

Note first that the pipeline expects the drone images from each surveyed region to be in a separate directory, like this:

flower_map/
├── config.yml
├── data/
|   ├── models/
|   ├── samples.tsv
|   ├── region1/
|   |   ├── DJI_001.JPG
|   |   ├── DJI_002.JPG
|   |   ├── DJI_003.JPG
|   ├── region2/
|   |   ├── DJI_001.PNG
|   |   ├── DJI_002.PNG
|   |   ├── DJI_003.PNG
|   ├── region3/
|   |   ├── DJI_001.JPG
|   |   ├── DJI_002.PNG
|   |   ├── DJI_003.JPG
... (not shown: the rest of the files in this repository)

We've placed all of our data in a git-ignored data/ folder within the project root. If your data lives elsewhere on your filesystem, you can symlink it into the data/ directory or symlink the data/ directory itself. You may even choose not to have a data/ directory at all. The only requirement is that each region must have its own directory of drone image files.

Inside the data/ directory, we created a samples.tsv file describing the paths to these datasets:

region1    data/region1
region2    data/region2
region3    data/region3    .PNG

The samples.tsv file has up to three tab-separated columns (the third is optional) and one line for each dataset that you'd like to analyze. The first column is a unique identifier that you assign to the dataset. The pipeline uses this identifier when it creates its output, so you should avoid spaces in it. The second column is the path to the dataset's directory, relative to the root of the project directory.
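To make the column layout concrete, here is a minimal Python sketch of how such a file could be read. It is not the pipeline's own parser: the function name `read_samples` and the choice to reject duplicate identifiers are assumptions made for illustration.

```python
import csv
import io


def read_samples(handle):
    """Parse samples.tsv-style rows into {identifier: (path, extension or None)}.

    Hypothetical helper for illustration; the pipeline's real parser may differ.
    """
    samples = {}
    for row in csv.reader(handle, delimiter="\t"):
        # Skip completely blank lines.
        if not any(field.strip() for field in row):
            continue
        if len(row) < 2:
            raise ValueError(f"expected at least two columns, got: {row!r}")
        sample_id, path = row[0].strip(), row[1].strip()
        # The third column is optional; treat a missing or empty value as None.
        ext = row[2].strip() if len(row) > 2 and row[2].strip() else None
        # Assumption: identifiers must be unique, so duplicates are an error.
        if sample_id in samples:
            raise ValueError(f"duplicate identifier: {sample_id}")
        samples[sample_id] = (path, ext)
    return samples


# The example file from this page, with real tab characters between columns.
example = (
    "region1\tdata/region1\n"
    "region2\tdata/region2\n"
    "region3\tdata/region3\t.PNG\n"
)
samples = read_samples(io.StringIO(example))
print(samples["region3"])  # ('data/region3', '.PNG')
```

Rows without a third column get `None` for the extension, which leaves the choice of extension to the pipeline's default behavior.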

The third column is optional and denotes the extension of the image files in the dataset's directory. If it is not specified, the pipeline uses the extension that occurs most often in that directory. In our example, the pipeline would default to .JPG for region3, since data/region3 has only one .PNG file. But by specifying .PNG in our samples.tsv file, we instruct the pipeline to use only the .PNG files in data/region3.
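The default described above can be sketched in a few lines of Python. This is only an illustration of the "most common extension" rule, not the pipeline's implementation: the names `pick_extension` and `select_images` are hypothetical, and the case-sensitive comparison and the tie-breaking behavior (first extension seen wins) are assumptions.

```python
from collections import Counter
from pathlib import PurePath


def pick_extension(filenames, requested=None):
    """Return the requested extension, or else the most common one."""
    if requested:
        return requested
    # Count extensions, ignoring files that have none.
    suffixes = (PurePath(name).suffix for name in filenames)
    counts = Counter(suffix for suffix in suffixes if suffix)
    if not counts:
        raise ValueError("no files with an extension were found")
    # Assumption: on a tie, the extension encountered first is chosen.
    return counts.most_common(1)[0][0]


def select_images(filenames, requested=None):
    """Keep only the files whose extension matches the chosen one."""
    ext = pick_extension(filenames, requested)
    # Assumption: the comparison is case-sensitive (.JPG is not .jpg).
    return [name for name in filenames if PurePath(name).suffix == ext]


region3 = ["DJI_001.JPG", "DJI_002.PNG", "DJI_003.JPG"]
print(select_images(region3))          # ['DJI_001.JPG', 'DJI_003.JPG']
print(select_images(region3, ".PNG"))  # ['DJI_002.PNG']
```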

Once you're done constructing your samples.tsv file, you should specify the path to it in your config.yml configuration file.

SAMP_NAMES

It is best to specify all of your datasets in the samples.tsv file even if you only plan to use a few of them at first. A separate configuration option in config.yml called SAMP_NAMES allows you to use only a subset of the datasets at once.

SAMP_NAMES should be set to a list of dataset IDs, like this:

SAMP_NAMES: [region1, region3]

If you'd like to use all of the datasets in the samples.tsv file, set SAMP_NAMES to a falsy value.
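The subsetting behavior can be pictured with the following Python sketch. It assumes the datasets have already been loaded into a dictionary keyed by ID. The function name `select_samples` is hypothetical, and raising an error for an unknown ID is an assumption rather than documented pipeline behavior.

```python
def select_samples(samples, samp_names=None):
    """Return the datasets named in samp_names, or all of them if it is falsy."""
    # A falsy value (None, an empty list, False, ...) means "use everything".
    if not samp_names:
        return dict(samples)
    # Assumption: an ID that is not in samples.tsv is treated as an error.
    missing = [name for name in samp_names if name not in samples]
    if missing:
        raise KeyError(f"unknown dataset IDs: {missing}")
    return {name: samples[name] for name in samp_names}


all_samples = {
    "region1": "data/region1",
    "region2": "data/region2",
    "region3": "data/region3",
}
print(select_samples(all_samples, ["region1", "region3"]))
# {'region1': 'data/region1', 'region3': 'data/region3'}
print(len(select_samples(all_samples, [])))  # 3
```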
