# Processing big datasets #10

I don't know if this needs to be part of the odm-benchmarks project, but I have a particularly large dataset to process, so I am doing some monitoring of individual stages so that I can do a better job predicting processing time over the life of the project. I thought I would document that here in case it is useful to see.

---
Each stage has a last-written file. We can use that file to understand when the stage completed. It's possible to do something similar for each sub-stage of OpenSfM as well, but for now we will restrict ourselves to the overall stage times. The following is with a ~7400-image dataset on a 20-core machine with 768 GB of RAM:
[table: per-stage completion timestamps]
If we do a bit of calculation against these dates, we can see the progression:
[table: computed per-stage durations]
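A minimal sketch of how those "last written" timestamps can be pulled (the project path and the stage file names are assumptions based on a typical ODM project layout, and vary by version):

```bash
# print the last-modified time of each stage's final artifact
cd /datasets/project   # hypothetical project directory
for f in opensfm/reconstruction.json odm_filterpoints/point_cloud.ply \
         odm_georeferencing/odm_georeferenced_model.laz \
         odm_dem/dsm.tif odm_orthophoto/odm_orthophoto.tif; do
  stat -c '%y  %n' "$f" 2>/dev/null
done
```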
Honestly, I expected OpenSfM to be a more expensive piece, and props to @pierotofy for bringing down that odm_dem number so much.

---
Now I am testing with the same 7400-image dataset, but with the images resized first using https://github.com/pierotofy/exifimageresize. First, log in to the docker instance:
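The login command itself wasn't preserved here; a minimal sketch, assuming a running container (the name `great_golick` shows up later in this thread):

```bash
# attach a shell to the running ODM container
docker exec -ti great_golick bash
```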
Then we run the run.py script directly. This gives us some flexibility in resuming or tweaking as we go:
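A sketch of a direct invocation; the flags here are illustrative, with `--rerun-from` being the standard ODM way to resume at a given stage:

```bash
# from inside the container: run the pipeline, resuming from a chosen stage
python run.py --project-path /datasets resize --rerun-from odm_meshing
```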
I expect the OpenSfM step to take about the same length of time (30+ hours), but depthmaps to be 4x as fast, which should get this done over the weekend instead of in 9 days. 🤞 This time I am also monitoring with Glances, so I will have the full curve of resource usage through the run of the project.

---
I am going to keep this thread going and expand a bit to include info about how to process monster datasets. This will be messy and riddled with contradictions, but it will ultimately probably end up as a section in docs.opendronemap.org, at least the how-to bits.

---
We should probably add a flag for this, but I have removed the align portion of split-merge. For massive datasets, the combination of SfM, error distribution, and OpenDroneMap's use of DOP and other accuracy tags seems to make the alignment step not just unnecessary but a source of error. Here's my branch which removes that step in a brutish way...:
---
Also, for massive datasets, perhaps the objective (as it is in my case) is to capture elevation models, not orthophotos. So one trick we can do is to resample the data down quite a bit, and then adjust the output resolutions to save on memory usage and processing time. A good tool for that is UAV4Geo's exifimageresize (https://github.com/pierotofy/exifimageresize), which will resize an entire directory of images but keep the EXIF the same (aside from the EXIF on image dimensions). I resized my dataset to 1280, which should leave enough features for matching and enough for depthmaps as well. When I ran my 7400-image subset at the default depthmap resolution of 640, it took 9 days, 5 hours to complete. Reducing that resolution to 320, we get completion in 3 days, 18 hours.
---
Also, docker is cool, but when things fail (as they often do with big datasets), reconnecting to the docker machine to inspect can be difficult, since the ODM docker image is meant to shut down automatically when done (whether it failed or not). So, instead of the typical one-shot `docker run`, we can start the container with a shell:
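Neither command survived the extraction; a sketch of both, assuming the stock `opendronemap/odm` image (the paths match the full command later in this thread):

```bash
# typical one-shot: the container exits when the pipeline finishes or fails
docker run -ti --rm -v /home/gisuser/outdir/znz:/datasets opendronemap/odm \
  --project-path /datasets resize

# alternative: override the entrypoint to get a shell inside the container
docker run -ti -v /home/gisuser/outdir/znz:/datasets --entrypoint bash opendronemap/odm
```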
Now we are logged in and can run our OpenDroneMap command:
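That command wasn't preserved here; inside the container it is just run.py with the project flags, something like (abbreviated from the full command later in this thread):

```bash
# hypothetical: the same flags you would pass to the one-shot docker run
python run.py --project-path /datasets resize --depthmap-resolution 320
```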
If we need to reconnect, we can look for our docker image:
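For example:

```bash
# list containers (including stopped ones) to find ours
docker ps -a
```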
And then use the container ID to connect:
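The ID here is illustrative; use whatever `docker ps` reported:

```bash
# if the container is still running, attach a shell
docker exec -ti 62440df4f007 bash
# if it has exited, start it again first
docker start 62440df4f007
```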
We can also use `docker logs` to follow the output without attaching:
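For example (the container name is whatever docker assigned; `great_golick` appears later in this thread):

```bash
# stream the container's stdout/stderr
docker logs -f great_golick
```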
---
For particularly large datasets, like the 80,000+ image dataset I am currently attempting, we need to invoke split-merge by using the `--split` option.
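In context, that looks something like this (values illustrative; my full command appears later in this thread):

```bash
docker run -ti -v /home/gisuser/outdir/znz:/datasets opendronemap/odm \
  --project-path /datasets resize \
  --split 7000 --split-overlap 150
```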
---
A lot of folks don't necessarily understand the relationship between depthmap resolution and ground sampling resolution, so I should explain. Suppose we have a 6400x4200 image with a ground sampling distance (pixel size) of 10cm. The maximum depthmap we would want to use is half the linear image resolution, i.e. 3200x2100, which could be specified with `--depthmap-resolution 3200`. I often aim for 1/4 the resolution, as it is a bit more robust to noise (especially on specular reflectors like metal rooftops) and saves a lot of computation time; in this case, I would set `--depthmap-resolution 1600`. In our case above, I have also resized the input images. Since I am not too interested in the orthophotos, this saves on memory usage in the texturing step, and as I want this to run as fast as possible, I have optimized for a depthmap resolution of 320. Most of the time, OpenDroneMap uses good defaults and good optimization, and chooses these settings for us. But for massive datasets, we probably need to be more thoughtful than using the defaults, at least until we have good standards for these larger datasets and can embody some of this rationale in the code base.
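To make the arithmetic concrete, a back-of-the-envelope check with the hypothetical numbers above:

```bash
image_width=6400    # linear image resolution, px
gsd=10              # ground sampling distance, cm per image pixel
depthmap=1600       # depthmap resolution at 1/4 of the image width
# each depthmap pixel spans image_width/depthmap image pixels:
echo "$image_width / $depthmap * $gsd" | bc -l    # -> 40 cm per depthmap pixel
```

---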
A monitoring script which displays a number each time feature extraction progresses by 1/10th of a percent:

```bash
#!/bin/bash
# Find out how many images there are
totallines=`wc -l ~/outdir/znz/resize/img_list.txt | awk '{print $1}'`
previouspermillage=0
while true
do
# Count the number of files in the OpenSfM features directory
num1=`find ~/outdir/znz/resize/opensfm/features/ -type f | wc -l `
# Calculate percentage of images that have had features extracted
percentage=`echo $num1 / $totallines \* 100| bc -l`
# We calculate permillage for the sake of determining whether we are
# going to display a new number
permillage=`echo $num1 / $totallines \* 1000| bc -l`
# If our rounded permillage number has increased, then display a new number
if [ `printf "%.0f\n" $permillage` -gt `printf "%.0f\n" $previouspermillage` ]
then
# echo
echo -n `printf "%.1f\n" $percentage`
fi
# Speak less. Smile more.
echo -n "."
sleep 5
# Check if we are done
if [ `printf "%.0f\n" $percentage` -gt 99 ]
then
# Bing!
echo -ne '\007'
echo "All done with feature extraction!"
break
fi
previouspermillage=$permillage
done
```

---
Just wanted to pop in and say I'm glad you're including this info here @smathermather.

---
Hey thanks! I was hoping I wasn't creating unneeded noise. But I need to track this, and I figure it can become a platform for further use and understanding.

---
I have stopped using the above approach of logging into the docker instance and running the script. It seems to be enough to just remove the `--rm` flag (which deletes the container on exit) from the usual invocation:

```bash
docker run -ti -v /home/gisuser/outdir/znz:/datasets odmnolign \
  --project-path /datasets resize \
  --dtm --dsm --smrf-scalar 0 --smrf-slope 0.06 --smrf-threshold 1.05 \
  --smrf-window 100 --dem-resolution 413 --depthmap-resolution 320 \
  --split 7000 --use-hybrid-bundle-adjustment
```

---
One of the big challenges with massive datasets is catching issues before they are issues. In my case, that is finding small pockets of bad data. We probably need more robust error checking in the chain. For example, I had 4 files with some issue.
I still need to take a look at the 4 bad files and find out why they are failing, but it seems we have uncaught errors in the feature extraction step, which then stop the toolchain at the matching stage.

---
Now we are in the matching stage for these data. As noted above, I forced bag-of-words matching to get a speed bump. So, how do we track progress here? We can look at the logs and see how far into matching we are:

```bash
docker logs great_golick | grep "Matching" | wc -l
102155
```

But that just gives us some relative progress. How many matches should we have? The total number of images helps inform this:

```bash
wc -l ~/outdir/znz/resize/img_list.txt | awk '{print $1}'
82087
```

We can assume the total number of matches will be roughly the number of images multiplied by the number of neighbors each image is matched against (8, per the script below).

---
We can wrap the above in a monitoring script which displays a number each time we progress by 1/10th of a percent:

```bash
#!/bin/bash
# Find out how many images there are
totallines=`wc -l ~/outdir/znz/resize/img_list.txt | awk '{print $1}'`
previouspermillage=0
while true
do
# Count the number of matches
num1=`docker logs great_golick | grep "Matching" | wc -l`
# Calculate the percentage of matching that is complete
percentage=`echo $num1 / $totallines / 8 \* 100| bc -l`
# We calculate permillage for the sake of determining whether we are
# going to display a new number
permillage=`echo $num1 / $totallines / 8 \* 1000| bc -l`
# If our rounded permillage number has increased, then display a new number
if [ `printf "%.0f\n" $permillage` -gt `printf "%.0f\n" $previouspermillage` ]
then
# echo
echo -n `printf "%.1f\n" $percentage`
fi
# Speak less. Smile more.
echo -n "."
sleep 5
# Check if we are done
if [ `printf "%.0f\n" $percentage` -gt 99 ]
then
# Bing!
echo -ne '\007'
echo "All done with matching!"
break
fi
previouspermillage=$permillage
done
```

---
👆🏻 This doesn't really work, btw. I don't know how to estimate how many matches we are supposed to have.

---
## Buying instead of Renting

Usually if we do split-merge, we are renting some infrastructure, and the autoscaler in ClusterODM is the way to go. It's a fantastic tool. But sometimes we have infrastructure we own, so I will document a bit of the management of owned infrastructure here. It will eventually make its way into docs.opendronemap.org as it matures, but I want to capture it here while it's fresh.

---
Parallel shell use: the management below leans on `parallel-ssh`, run against groups of hosts defined in simple files.
### Connecting to hosts

Logins to the machines are handled in my `~/.ssh/config`:

```
Host proxyhost
    User useruser
    Hostname proxyhost.someaddress.com
    Port 22

Host webodm
    User loseruser
    Hostname 192.168.0.10
    Port 22
    ProxyCommand ssh -q -W %h:%p proxyhost

Host node1
    User winorloser
    Hostname 192.168.0.11
    Port 22
    ProxyCommand ssh -q -W %h:%p proxyhost

Host node2
    User bob
    Hostname 192.168.0.12
    Port 22
    ProxyCommand ssh -q -W %h:%p proxyhost

....
```
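With those ProxyCommand entries, a plain `ssh` reaches any node through the proxy host, which is what keeps the group files below simple:

```bash
# hops through proxyhost transparently, per the config above
ssh node1
```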
### Clusters of hosts

Clusters of hosts which need to be controlled together are then contained in hosts files, as follows. First, all the machines, for when we want to run commands on everything:

```
more .ssh/odmall
webodm
node1
node2
node3
node4
node5
```

Then the groups of hosts that we want to control. For simplicity, I do have groups of one, just to simplify my workflow; alternatively, I could use plain `ssh` for those. The parent host file contains just one machine:

```
more .ssh/odmrent
webodm
```

The big machine is in a host file all its own:

```
more .ssh/odmbigs
node5
```

All the smaller machines are together in a file:

```
more .ssh/odmsmoll
node1
node2
node3
node4
```

And finally, we have a hosts file with just the child nodes:

```
more .ssh/odmhosts
node1
node2
node3
node4
node5
```

Now we can do things with these groups. Let's dive in.

### Using clusters of hosts

#### Killing all the NodeODM Instances

We might want to kill all the NodeODM docker machines on the hosts when there are failures. We can do so as follows:

```
parallel-ssh --timeout 15 -i -h ~/.ssh/odmall 'docker kill $(docker ps -a -q)'
[1] 17:27:42 [FAILURE] webodm Exited with error code 1
Stderr: "docker kill" requires at least 1 argument.
See 'docker kill --help'.
Usage: docker kill [OPTIONS] CONTAINER [CONTAINER...]
Kill one or more running containers
[2] 17:27:42 [SUCCESS] node5
43b423c3aa25
[3] 17:27:42 [SUCCESS] node4
70452bf0691e
[4] 17:27:42 [SUCCESS] node2
20da8ae8b876
[5] 17:27:43 [SUCCESS] node1
62440df4f007
[6] 17:27:43 [SUCCESS] node3
b202fd6258ed
```

(The webodm failure just means no containers were running there, so `docker kill` got no arguments.)

#### Running NodeODM on the hosts

Since we have 3 sizes of machines, we need a different command to start NodeODM on each, setting `--max_concurrency` and `--max_images` to match each machine's capacity:

```
parallel-ssh --timeout 15 -i -h ~/.ssh/odmrent "docker run -p 3000:3000 opendronemap/nodeodm --max_concurrency 20 --max_images 1000000&"
[1] 17:32:14 [SUCCESS] webodm
parallel-ssh --timeout 15 -i -h ~/.ssh/odmbigs "docker run -p 3000:3000 opendronemap/nodeodm --max_concurrency 12 --max_images 10000&"
[1] 17:32:44 [FAILURE] node5 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000
parallel-ssh --timeout 15 -i -h ~/.ssh/odmsmoll "docker run -p 3000:3000 opendronemap/nodeodm --max_concurrency 8 --max_images 5000&"
[1] 17:33:23 [FAILURE] node1 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000
[2] 17:33:23 [FAILURE] node2 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000
[3] 17:33:23 [FAILURE] node3 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000
[4] 17:33:23 [FAILURE] node4 Timed out, Killed by signal 9
info: Authentication using NoTokenRequired
info: Listening on 0.0.0.0:6367 UDP for progress updates
info: No tasks dump found
info: Checking for orphaned directories to be removed...
info: Server has started on port 3000
```

I set a short timeout here, but the first time this is run we probably want a much longer timeout so that docker has time to download the images.

---
### Stitching the cluster together

Now we need a ClusterODM instance to use these nodes. It would be really easy to do this as a docker instance, but then we can't use our host names so easily, and ClusterODM is easy enough to run locally, so we'll do that. First we add the nodes to our hosts file:

```
sudo vi /etc/hosts
```

Then we get and install ClusterODM:

```
git clone https://github.com/OpenDroneMap/ClusterODM
cd ClusterODM
npm install
```

Then we run our ClusterODM node, in this case with a limit on upload speed to avoid packet losses. It'll still move data along pretty fast:

```
node index.js --public-address http://192.168.0.10:3000 --upload-max-speed 200000000 &
```

ClusterODM needs the nodes attached. For this we telnet in:

```
telnet localhost 8080
Trying ::1...
Connected to localhost.localdomain.
Escape character is '^]'.
Welcome ::1:33106 ClusterODM:1.4.3
HELP for help
QUIT to quit
#>
```

And we need to add our hosts:
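The exact commands weren't captured above; a sketch using ClusterODM's console syntax (`NODE ADD <hostname> <port>`) with our host names:

```
#> NODE ADD node1 3000
#> NODE ADD node2 3000
#> NODE ADD node3 3000
#> NODE ADD node4 3000
#> NODE ADD node5 3000
```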
We can list our available hosts:
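Also not captured; `NODE LIST` is the console command, with output along these lines (queue numbers and formatting illustrative):

```
#> NODE LIST
1) node1:3000 [online] [0/8]
2) node2:3000 [online] [0/8]
3) node3:3000 [online] [0/8]
4) node4:3000 [online] [0/8]
5) node5:3000 [online] [0/12]
```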
And the settings we used when starting these nodes will help the load balancing of any split-merge work. We can check this with `NODE LIST` as well.