-
Notifications
You must be signed in to change notification settings - Fork 315
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Restore stateful crawling support #864
Conversation
Reenable stateful crawling and profile tests. Also, update the docs now that stateful crawling is supported. Currently, stateful crawling is broken, as geckodriver deletes the browser profile when closing or crashing before we can archive it.
Use a custom profile by setting it as an argument via the Options class, instead of using the FirefoxProfile class. This way geckodriver does not delete it when crashing or closing. Also, remove some unused arguments from the function that configures privacy settings in Firefox. Finally, remove the code that clears driver.profile before calling driver.quit(), as driver.profile is always None when using a custom profile.
Fix a bug in PatchedGeckoDriverService that caused geckodriver not to receive the service_args passed when starting the browser. PatchedGeckoDriverService is a modified version of Selenium's Service class and this bug has been fixed in the original version.
Geckodriver has a bug that makes it write the browser preferences we set, as well as its own default browser preferences, to a user.js file in the wrong profile directory when using a custom profile: mozilla/geckodriver#1844. As a temporary workaround until this issue gets fixed, we create the user.js file ourselves. In order to do this, we keep a copy of geckodriver's default preferences in our code. Closes openwpm#423
Codecov Report
@@ Coverage Diff @@
## master #864 +/- ##
==========================================
+ Coverage 49.86% 50.38% +0.52%
==========================================
Files 35 35
Lines 3309 3384 +75
==========================================
+ Hits 1650 1705 +55
- Misses 1659 1679 +20
Continue to review full report at Codecov.
|
|
Yes, we should use a custom profile Here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thank you Georgia! I left a bunch of comments, but nothing major.
1. In `deploy_firefox` do not use `driver.capabilities["moz:profile"]` to get the profile location. Custom profiles, unlike profiles created via `FirefoxProfile`, are used in-place, so we already know the location. 2. In `launch_browser_manager`, `spawned_profile_path` and `driver_profile_path` point to the same location now that we are using a custom profile. Replace them with a single `browser_profile_path` variable. 3. Rename `prof_folder` and `browser_profile_folder` to `browser_profile_path` for consistency. 4. Improve naming of the temporary Firefox profile.
Running manual_test.py resulted in an error because the `xpi()` fixture was called directly. Apply the fix suggested in https://docs.pytest.org/en/stable/deprecations.html#calling-fixtures-directly Also, use a custom profile instead of `FirefoxProfile` and update some docstrings.
|
Move the core implementation of profile dumping into a `dump_profile` function, which can be used both internally when closing or restarting a crashed browser and from the `execute()` method of `DumpProfileCommand`. Also, make compression the default in `DumpProfileCommand`. Finally, do not compress the tar archive of the crashed browser's profile when restarting from a crash. We should avoid the extra compression/ decompression step as this is a short-lived tar file.
For [1] we should keep reset=True. We want those crawls to be stateless (of course other users can change as they see fit). For [2] and [3] I'm not sure it matters, but I'll defer to @vringar just in case. |
[2] and [3] can be removed as well. |
Use the public suffix + 1 instead of the public suffix when comparing the domains in the crawl database with those in the profile history. Also, update an incorrectly formed query to the crawl database.
As we discussed, I checked and the original tar file is kept intact after extracting, so I removed the intermediate step of copying it from |
@vringar, @englehardt, thanks for reviewing! This is ready, I've covered all the requested changes. You can take another look if you'd like. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we delete the profile folders on a normal shutdown? I don't think we need to but maybe we should still do it.
I think we do that here, in |
Yeah. Sorry for missing this and thanks for pointing it out. |
Make `PatchedGeckoDriverService` class subclass selenium.webdriver.firefox.service.Service instead of selenium.webdriver.common.service.Service, so that we only have to keep track of the changes in the `__init__()` method of the former class.
WOOOOTTTT!!!! |
Not officially released as of yet though. Gonna publish it with the next FF release. |
Restore stateful crawling support
This PR restores support for stateful crawling by addressing the core issue that broke it, i.e., the fact that geckodriver would delete the browser profile when closing or crashing before we had a chance to archive it. It does so by using a custom profile instead of creating the profile via the
FirefoxProfile
Selenium class. The custom profile is set via the arguments of theOptions
class.Note:
Geckodriver has a bug that makes it write the browser preferences we set, as well as its own default browser preferences, to a
user.js
file in the wrong profile directory when using a custom profile: mozilla/geckodriver#1844. As a temporary workaround until this issue gets fixed, we create theuser.js
file ourselves. In order to do this, we keep a copy of geckodriver's default preferences inconfigure_firefox.py
.This PR also:
PatchedGeckoDriverService
class that caused geckodriver not to receive theservice_args
passed when starting the browser.manual_test.py
Closes #423, Closes #383, Closes #161, Closes #253, Closes #384, Closes #680