Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore stateful crawling support #864

Merged
merged 22 commits into from
Mar 29, 2021

Conversation

boolean5
Copy link
Contributor

@boolean5 boolean5 commented Mar 3, 2021

This PR restores support for stateful crawling by addressing the core issue that broke it, i.e., the fact that geckodriver would delete the browser profile when closing or crashing before we had a chance to archive it. It does so by using a custom profile instead of creating the profile via the FirefoxProfile Selenium class. The custom profile is set via the arguments of the Options class.

Note:
Geckodriver has a bug that makes it write the browser preferences we set, as well as its own default browser preferences, to a user.js file in the wrong profile directory when using a custom profile: mozilla/geckodriver#1844. As a temporary workaround until this issue gets fixed, we create the user.js file ourselves. In order to do this, we keep a copy of geckodriver's default preferences in configure_firefox.py.

This PR also:

Closes #423, Closes #383, Closes #161, Closes #253, Closes #384, Closes #680

boolean5 added 4 commits March 1, 2021 17:18
Reenable stateful crawling and profile tests. Also, update the docs now
that stateful crawling is supported. Currently, stateful crawling is
broken, as geckodriver deletes the browser profile when closing or
crashing before we can archive it.
Use a custom profile by setting it as an argument via the Options class,
instead of using the FirefoxProfile class. This way geckodriver does not
delete it when crashing or closing. Also, remove some unused arguments
from the function that configures privacy settings in Firefox. Finally,
remove the code that clears driver.profile before calling driver.quit(),
as driver.profile is always None when using a custom profile.
Fix a bug in PatchedGeckoDriverService that caused geckodriver not to
receive the service_args passed when starting the browser.
PatchedGeckoDriverService is a modified version of Selenium's Service
class and this bug has been fixed in the original version.
Geckodriver has a bug that makes it write the browser preferences we
set, as well as its own default browser preferences, to a user.js file
in the wrong profile directory when using a custom profile:
mozilla/geckodriver#1844. As a temporary
workaround until this issue gets fixed, we create the user.js file
ourselves. In order to do this, we keep a copy of geckodriver's default
preferences in our code.

Closes openwpm#423
@codecov
Copy link

codecov bot commented Mar 3, 2021

Codecov Report

Merging #864 (37271ba) into master (cb95ecc) will increase coverage by 0.52%.
The diff coverage is 32.27%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #864      +/-   ##
==========================================
+ Coverage   49.86%   50.38%   +0.52%     
==========================================
  Files          35       35              
  Lines        3309     3384      +75     
==========================================
+ Hits         1650     1705      +55     
- Misses       1659     1679      +20     
Impacted Files Coverage Δ
openwpm/task_manager.py 80.06% <ø> (+2.10%) ⬆️
openwpm/deploy_browsers/configure_firefox.py 8.77% <9.17%> (+6.57%) ⬆️
openwpm/deploy_browsers/deploy_firefox.py 21.97% <11.11%> (-0.25%) ⬇️
openwpm/command_sequence.py 89.23% <66.66%> (+1.73%) ⬆️
openwpm/commands/profile_commands.py 64.51% <75.00%> (+39.90%) ⬆️
openwpm/browser_manager.py 63.05% <83.33%> (+4.05%) ⬆️
openwpm/config.py 96.63% <100.00%> (ø)
openwpm/deploy_browsers/selenium_firefox.py 28.81% <100.00%> (+0.02%) ⬆️
openwpm/commands/browser_commands.py 26.87% <0.00%> (-0.80%) ⬇️
openwpm/commands/utils/webdriver_utils.py 21.59% <0.00%> (-0.57%) ⬇️
... and 7 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cb95ecc...37271ba. Read the comment docs.

@englehardt englehardt requested review from englehardt and vringar March 8, 2021 16:51
openwpm/deploy_browsers/deploy_firefox.py Outdated Show resolved Hide resolved
test/test_profile.py Show resolved Hide resolved
openwpm/browser_manager.py Outdated Show resolved Hide resolved
openwpm/browser_manager.py Outdated Show resolved Hide resolved
@boolean5
Copy link
Contributor Author

boolean5 commented Mar 9, 2021

manual_test.py still sets the browser profile via the FirefoxProfile class. Should we change it to use a custom profile instead?
My impression is that since we're not interested in being able to archive the profile in the context of manual_test.py, there's no need to change it, but I'd like a confirmation on this.

@vringar
Copy link
Contributor

vringar commented Mar 9, 2021

Yes, we should use a custom profile Here.
Good job for remembering manual_test.py!

openwpm/browser_manager.py Outdated Show resolved Hide resolved
openwpm/browser_manager.py Outdated Show resolved Hide resolved
openwpm/deploy_browsers/configure_firefox.py Outdated Show resolved Hide resolved
openwpm/deploy_browsers/configure_firefox.py Outdated Show resolved Hide resolved
openwpm/deploy_browsers/deploy_firefox.py Outdated Show resolved Hide resolved
test/test_profile.py Show resolved Hide resolved
openwpm/browser_manager.py Outdated Show resolved Hide resolved
@vringar vringar self-assigned this Mar 11, 2021
Copy link
Collaborator

@englehardt englehardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thank you Georgia! I left a bunch of comments, but nothing major.

openwpm/browser_manager.py Outdated Show resolved Hide resolved
openwpm/browser_manager.py Show resolved Hide resolved
openwpm/browser_manager.py Outdated Show resolved Hide resolved
openwpm/deploy_browsers/configure_firefox.py Outdated Show resolved Hide resolved
openwpm/deploy_browsers/deploy_firefox.py Outdated Show resolved Hide resolved
openwpm/deploy_browsers/deploy_firefox.py Outdated Show resolved Hide resolved
openwpm/deploy_browsers/deploy_firefox.py Outdated Show resolved Hide resolved
test/test_crawl.py Show resolved Hide resolved
test/test_crawl.py Outdated Show resolved Hide resolved
test/test_profile.py Show resolved Hide resolved
1. In `deploy_firefox` do not use `driver.capabilities["moz:profile"]`
to get the profile location. Custom profiles, unlike profiles created
via `FirefoxProfile`, are used in-place, so we already know the
location.

2. In `launch_browser_manager`, `spawned_profile_path` and
`driver_profile_path` point to the same location now that we are using a
custom profile. Replace them with a single `browser_profile_path`
variable.

3. Rename `prof_folder` and `browser_profile_folder` to
`browser_profile_path` for consistency.

4. Improve naming of the temporary Firefox profile.
Running manual_test.py resulted in an error because the `xpi()` fixture
was called directly. Apply the fix suggested in
https://docs.pytest.org/en/stable/deprecations.html#calling-fixtures-directly
Also, use a custom profile instead of `FirefoxProfile` and update some
docstrings.
@boolean5
Copy link
Contributor Author

Yes, we should use a custom profile Here.
Good job for remembering manual_test.py!

manual_test.py happened to be broken for a different reason and also needed another update (check f5bacae)

Move the core implementation of profile dumping into a `dump_profile`
function, which can be used both internally when closing or restarting a
crashed browser and from the `execute()` method of `DumpProfileCommand`.
Also, make compression the default in `DumpProfileCommand`. Finally, do
not compress the tar archive of the crashed browser's profile when
restarting from a crash. We should avoid the extra compression/
decompression step as this is a short-lived tar file.
@boolean5
Copy link
Contributor Author

I assume that in these three cases [1] [2] [3] reset=True was used because stateful crawling was disabled and we can now let it take the default False value. Is there a reason that I'm missing to keep reset=True for any of them?

@englehardt
Copy link
Collaborator

I assume that in these three cases [1] [2] [3] reset=True was used because stateful crawling was disabled and we can now let it take the default False value. Is there a reason that I'm missing to keep reset=True for any of them?

For [1] we should keep reset=True. We want those crawls to be stateless (of course other users can change as they see fit). For [2] and [3] I'm not sure it matters, but I'll defer to @vringar just in case.

@vringar
Copy link
Contributor

vringar commented Mar 16, 2021

[2] and [3] can be removed as well.

Use the public suffix + 1 instead of the public suffix when comparing
the domains in the crawl database with those in the profile history.
Also, update an incorrectly formed query to the crawl database.
@boolean5
Copy link
Contributor Author

As we discussed, I checked and the original tar file is kept intact after extracting, so I removed the intermediate step of copying it from load_profile().

@boolean5
Copy link
Contributor Author

@vringar, @englehardt, thanks for reviewing! This is ready, I've covered all the requested changes. You can take another look if you'd like.

@boolean5 boolean5 requested review from vringar and englehardt March 22, 2021 10:00
test/test_profile.py Outdated Show resolved Hide resolved
Copy link
Contributor

@vringar vringar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we delete the profile folders on a normal shutdown? I don't think we need to but maybe we should still do it.

@boolean5
Copy link
Contributor Author

Do we delete the profile folders on a normal shutdown? I don't think we need to but maybe we should still do it.

I think we do that here, in shutdown_browser(). Does that cover what you had in mind?

@vringar
Copy link
Contributor

vringar commented Mar 22, 2021

Do we delete the profile folders on a normal shutdown? I don't think we need to but maybe we should still do it.

I think we do that here, in shutdown_browser(). Does that cover what you had in mind?

Yeah. Sorry for missing this and thanks for pointing it out.

Make `PatchedGeckoDriverService` class subclass
selenium.webdriver.firefox.service.Service instead of
selenium.webdriver.common.service.Service, so that we only have to keep
track of the changes in the `__init__()` method of the former class.
@englehardt englehardt merged commit bfc4644 into openwpm:master Mar 29, 2021
@birdsarah
Copy link
Contributor

WOOOOTTTT!!!!

@vringar
Copy link
Contributor

vringar commented Apr 12, 2021

Not officially released as of yet though. Gonna publish it with the next FF release.
Although it seems that FF 89 is in the process of being released see the tags
So I might just do a release tomorrow.

Zaxeli pushed a commit to Zaxeli/OpenWPM that referenced this pull request Aug 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants