Restore stateful crawling support #864

Merged
merged 22 commits into from
Mar 29, 2021
3 changes: 1 addition & 2 deletions demo.py
@@ -65,7 +65,6 @@ def callback(success: bool, val: str = site) -> None:
command_sequence = CommandSequence(
site,
site_rank=index,
reset=True,
callback=callback,
)

@@ -74,5 +73,5 @@ def callback(success: bool, val: str = site) -> None:
# Have a look at custom_command.py to see how to implement your own command
command_sequence.append_command(LinkCountingCommand())

# Run commands across the three browsers (simple parallelization)
# Run commands across all browsers (simple parallelization)
manager.execute_command_sequence(command_sequence)
6 changes: 0 additions & 6 deletions docs/Configuration.md
@@ -249,11 +249,6 @@ TODO

# Browser Profile Support

**WARNING: Stateful crawls are currently not supported. Attempts to run
stateful crawls will throw `NotImplementedError`s. The work required to
restore support is tracked in
[this project](https://github.com/mozilla/OpenWPM/projects/2).**

## Stateful vs Stateless crawls

By default OpenWPM performs a "stateful" crawl, in that it keeps a consistent
@@ -323,7 +318,6 @@ but will not be used during crash recovery. Specifically:
profile specified by `seed_tar`. If OpenWPM determines that Firefox needs to
restart for some reason during the crawl, it will use the profile from
the most recent page visit (pre-crash) rather than the `seed_tar` profile.
Note that stateful crawls are currently [unsupported](https://github.com/mozilla/OpenWPM/projects/2)).
* For stateless crawls, the initial `seed_tar` will be loaded during each
new page visit. Note that this means the profile will very likely be
_incomplete_, as cookies or storage may have been set or changed during the
13 changes: 7 additions & 6 deletions docs/Release-Checklist.md
@@ -1,6 +1,6 @@
# Release Checklist

We aim to release a new version of OpenWPM with each new Firefox release (~1 release per month). The following steps are necessary for a release
We aim to release a new version of OpenWPM with each new Firefox release (~1 release per month). The following steps are necessary for a release:

1. Upgrade Firefox to the newest version.
1. Go to: https://hg.mozilla.org/releases/mozilla-release/tags.
@@ -10,12 +10,13 @@ We aim to release a new version of OpenWPM with each new Firefox release (~1 rel
1. Run `npm update` in `openwpm/Extension/firefox`.
2. Run `npm update` in `openwpm/Extension/webext-instrumentation`.
3. Update python and system dependencies by following the ["managing requirements" instructions](../CONTRIBUTING.md#managing-requirements).
4. Increment the version number in [VERSION](../VERSION)
5. Add a summary of changes since the last version to [CHANGELOG](../CHANGELOG.md)
6. Squash and merge the release PR to master.
7. Publish a new release from https://github.com/mozilla/OpenWPM/releases:
4. If a new version of geckodriver is used, check whether the default geckodriver browser preferences in [`openwpm/deploy_browsers/configure_firefox.py`](../openwpm/deploy_browsers/configure_firefox.py#L8-L65) need to be updated.
5. Increment the version number in [VERSION](../VERSION)
6. Add a summary of changes since the last version to [CHANGELOG](../CHANGELOG.md)
7. Squash and merge the release PR to master.
8. Publish a new release from https://github.com/mozilla/OpenWPM/releases:
1. Click "Draft a new release".
2. Enter the "Tag version" and "Release title" as `vX.X.X`.
3. In the description:
1. Include the text `Updates OpenWPM to Firefox X` if this release is also a new FF version.
2. Include a link to the CHANGELOG, e.g. `See the [CHANGELOG]() for details.`.
2. Include a link to the CHANGELOG, e.g. `See the [CHANGELOG]() for details.`.
117 changes: 51 additions & 66 deletions openwpm/browser_manager.py
@@ -5,9 +5,11 @@
import shutil
import signal
import sys
import tempfile
import threading
import time
import traceback
from pathlib import Path
from queue import Empty as EmptyQueue
from typing import Optional, Union

@@ -16,6 +18,7 @@
from selenium.common.exceptions import WebDriverException
from tblib import pickling_support

from .commands.profile_commands import dump_profile
from .commands.types import BaseCommand, ShutdownSignal
from .config import BrowserParamsInternal, ManagerParamsInternal
from .deploy_browsers import deploy_firefox
@@ -33,7 +36,7 @@

class Browser:
"""
The Browser class is responsbile for holding all of the
The Browser class is responsible for holding all of the
configuration and status information on BrowserManager process
it corresponds to. It also includes a set of methods for managing
the BrowserManager process and its child processes/threads.
@@ -52,7 +55,7 @@ def __init__(
self._UNSUCCESSFUL_SPAWN_LIMIT = 4

# manager parameters
self.current_profile_path = None
self.current_profile_path: Optional[Path] = None
self.db_socket_address = manager_params.storage_controller_address
assert browser_params.browser_id is not None
self.browser_id: BrowserId = browser_params.browser_id
@@ -62,7 +65,7 @@ def __init__(

# Queues and process IDs for BrowserManager

# thread to run commands issues from TaskManager
# thread to run commands issued from TaskManager
self.command_thread: Optional[threading.Thread] = None
# queue for passing command tuples to BrowserManager
self.command_queue: Optional[Queue] = None
@@ -75,7 +78,7 @@ def __init__(
# the port of the display for the Xvfb display (if it exists)
self.display_port: Optional[int] = None

# boolean that says if the BrowserManager new (to optimize restarts)
# boolean that says if the BrowserManager is new (to optimize restarts)
self.is_fresh = True
# boolean indicating if the browser should be restarted
self.restart_required = False
@@ -97,29 +100,29 @@ def launch_browser_manager(self):
sets up the BrowserManager and gets the process id, browser pid and,
if applicable, screen pid. loads associated user profile if necessary
"""
# Unsupported. See https://github.com/mozilla/OpenWPM/projects/2
# if this is restarting from a crash, update the tar location
# to be a tar of the crashed browser's history
"""
if self.current_profile_path is not None:
# tar contents of crashed profile to a temp dir
tempdir = tempfile.mkdtemp(prefix="owpm_profile_archive_") + "/"
profile_commands.dump_profile(
self.current_profile_path,
self.manager_params,
self.browser_params,
tempdir,
close_webdriver=False,
tempdir = tempfile.mkdtemp(prefix="openwpm_profile_archive_")
tar_path = Path(tempdir) / "profile.tar"

dump_profile(
browser_profile_path=self.current_profile_path,
tar_path=tar_path,
compress=False,
browser_params=self.browser_params,
)

# make sure browser loads crashed profile
self.browser_params.recovery_tar = tempdir
self.browser_params.recovery_tar = tar_path

crash_recovery = True
else:
"""
tempdir = None
crash_recovery = False

self.logger.info("BROWSER %i: Launching browser..." % self.browser_id)
tempdir = None
crash_recovery = False
self.is_fresh = not crash_recovery

# Try to spawn the browser within the timelimit
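The crash-recovery branch in this hunk tars the crashed profile into a fresh temporary directory and points `recovery_tar` at the resulting file. The following is a minimal standalone sketch of that pattern using only the standard library. `archive_profile` and `snapshot_for_recovery` are hypothetical helpers written for illustration; they are not OpenWPM's `dump_profile`, whose body is not part of this diff.

```python
import tarfile
import tempfile
from pathlib import Path


def archive_profile(profile_dir: Path, tar_path: Path, compress: bool = False) -> Path:
    """Hypothetical helper: pack the contents of profile_dir into tar_path."""
    mode = "w:gz" if compress else "w"
    with tarfile.open(tar_path, mode) as tar:
        # Store entries relative to the profile root so the archive can be
        # unpacked into any target directory.
        for child in sorted(profile_dir.iterdir()):
            tar.add(str(child), arcname=child.name)
    return tar_path


def snapshot_for_recovery(profile_dir: Path) -> Path:
    """Hypothetical helper: write an uncompressed snapshot to a new temp dir."""
    tempdir = tempfile.mkdtemp(prefix="profile_archive_")
    return archive_profile(profile_dir, Path(tempdir) / "profile.tar")
```

Returning a `Path` to the tar file, rather than a directory string with a trailing slash, matches the direction this hunk takes with `tar_path`.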
@@ -159,8 +162,8 @@ def check_queue(launch_status):
# Read success status of browser manager
launch_status = dict()
try:
# 1. Selenium profile created
spawned_profile_path = check_queue(launch_status)
# 1. Browser profile created
browser_profile_path = check_queue(launch_status)
# 2. Profile tar loaded (if necessary)
check_queue(launch_status)
# 3. Display launched (if necessary)
@@ -170,7 +173,7 @@ def check_queue(launch_status):
# 5. Browser launched
self.geckodriver_pid = check_queue(launch_status)

(driver_profile_path, ready) = check_queue(launch_status)
ready = check_queue(launch_status)
if ready != "READY":
self.logger.error(
"BROWSER %i: Mismatch of status queue return values, "
@@ -183,7 +186,6 @@ def check_queue(launch_status):
unsuccessful_spawns += 1
error_string = ""
status_strings = [
"Proxy Ready",
"Profile Created",
"Profile Tar",
"Display",
@@ -202,17 +204,15 @@ def check_queue(launch_status):
)
self.close_browser_manager()
if "Profile Created" in launch_status:
shutil.rmtree(spawned_profile_path, ignore_errors=True)
shutil.rmtree(browser_profile_path, ignore_errors=True)

# If the browser spawned successfully, we should update the
# current profile path class variable and clean up the tempdir
# and previous profile path.
if success:
self.logger.debug("BROWSER %i: Browser spawn sucessful!" % self.browser_id)
self.logger.debug("BROWSER %i: Browser spawn successful!" % self.browser_id)
previous_profile_path = self.current_profile_path
self.current_profile_path = driver_profile_path
if driver_profile_path != spawned_profile_path:
shutil.rmtree(spawned_profile_path, ignore_errors=True)
self.current_profile_path = browser_profile_path
if previous_profile_path is not None:
shutil.rmtree(previous_profile_path, ignore_errors=True)
if tempdir is not None:
@@ -360,15 +360,15 @@ def kill_browser_manager(self):
os.kill(self.display_pid, signal.SIGKILL)
except OSError:
self.logger.debug(
"BROWSER %i: Display process does not " "exit" % self.browser_id
"BROWSER %i: Display process does not exist" % self.browser_id
)
pass
except TypeError:
self.logger.error(
"BROWSER %i: PID may not be the correct "
"type %s" % (self.browser_id, str(self.display_pid))
)
if self.display_port is not None: # xvfb diplay lock
if self.display_port is not None: # xvfb display lock
lockfile = "/tmp/.X%s-lock" % self.display_port
try:
os.remove(lockfile)
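The cleanup in this hunk kills the display process and then removes the Xvfb lock file, tolerating the case where either is already gone. The sketch below shows one way to express that with narrower exception types. It assumes a POSIX system; `kill_if_running`, `remove_lockfile`, and the `lock_dir` parameter are hypothetical names introduced for illustration and are not part of OpenWPM.

```python
import os
import signal
from pathlib import Path


def kill_if_running(pid: int) -> bool:
    """Send SIGKILL to pid; return False if no such process exists."""
    try:
        os.kill(pid, signal.SIGKILL)
        return True
    except ProcessLookupError:
        return False


def remove_lockfile(display_port: int, lock_dir: Path = Path("/tmp")) -> bool:
    """Remove the .X<port>-lock file; return False if it was already gone."""
    lockfile = lock_dir / f".X{display_port}-lock"
    try:
        lockfile.unlink()
        return True
    except FileNotFoundError:
        return False
```

Catching `ProcessLookupError` and `FileNotFoundError` instead of the broader `OSError` lets permission problems surface rather than being logged as a missing process.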
@@ -394,33 +394,27 @@ def shutdown_browser(self, during_init: bool, force: bool = False) -> None:
self.close_browser_manager(force=force)

# Archive browser profile (if requested)
if not during_init and self.browser_params.profile_archive_dir is not None:
self.logger.warning(
"BROWSER %i: Archiving the browser profile directory is "
"currently unsupported. "
"See: https://github.com/mozilla/OpenWPM/projects/2" % self.browser_id
)
"""
self.logger.debug(
"BROWSER %i: during_init=%s | profile_archive_dir=%s" % (
self.browser_id, str(during_init),
self.browser_params.profile_archive_dir)
"BROWSER %i: during_init=%s | profile_archive_dir=%s"
% (
self.browser_id,
str(during_init),
self.browser_params.profile_archive_dir,
)
)
if (not during_init and
self.browser_params.profile_archive_dir is not None):
if not during_init and self.browser_params.profile_archive_dir is not None:
self.logger.debug(
"BROWSER %i: Archiving browser profile directory to %s" % (
self.browser_id,
self.browser_params.profile_archive_dir))
profile_commands.dump_profile(
self.current_profile_path,
self.manager_params,
self.browser_params,
self.browser_params.profile_archive_dir,
close_webdriver=False,
compress=True
"BROWSER %i: Archiving browser profile directory to %s"
% (self.browser_id, self.browser_params.profile_archive_dir)
)
tar_path = self.browser_params.profile_archive_dir / "profile.tar.gz"
assert self.current_profile_path is not None
dump_profile(
browser_profile_path=self.current_profile_path,
tar_path=tar_path,
compress=True,
browser_params=self.browser_params,
)
"""

# Clean up temporary files
if self.current_profile_path is not None:
@@ -441,22 +435,20 @@ def BrowserManager(
display = None
try:
# Start Xvfb (if necessary), webdriver, and browser
driver, prof_folder, display = deploy_firefox.deploy_firefox(
driver, browser_profile_path, display = deploy_firefox.deploy_firefox(
status_queue, browser_params, manager_params, crash_recovery
)
if prof_folder[-1] != "/":
prof_folder += "/"

# Read the extension port -- if extension is enabled
# TODO: Initial communication from extension to TM should use sockets
if browser_params.extension_enabled:
logger.debug(
"BROWSER %i: Looking for extension port information "
"in %s" % (browser_params.browser_id, prof_folder)
"in %s" % (browser_params.browser_id, browser_profile_path)
)
elapsed = 0
port = None
ep_filename = os.path.join(prof_folder, "extension_port.txt")
ep_filename = browser_profile_path / "extension_port.txt"
while elapsed < 5:
try:
with open(ep_filename, "rt") as f:
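The loop in this hunk polls `extension_port.txt` inside the browser profile for up to five seconds. A minimal sketch of the same polling idea with `pathlib` follows. The file name and the five-second limit come from the diff; the `wait_for_port` name, the polling interval, the monotonic deadline, and returning `None` on timeout are assumptions made for this example.

```python
import time
from pathlib import Path
from typing import Optional


def wait_for_port(
    profile_dir: Path, timeout: float = 5.0, interval: float = 0.1
) -> Optional[int]:
    """Poll profile_dir/extension_port.txt until it holds an integer."""
    port_file = profile_dir / "extension_port.txt"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            return int(port_file.read_text().strip())
        except (FileNotFoundError, ValueError):
            # File not written yet, or only partially written; retry.
            time.sleep(interval)
    return None
```

Because `browser_profile_path` is now a `Path`, the `/` operator replaces `os.path.join` and the trailing-slash fix-up that the old code needed.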
@@ -483,10 +475,9 @@ def BrowserManager(

logger.debug("BROWSER %i: BrowserManager ready." % browser_params.browser_id)

# passes the profile folder back to the
# TaskManager to signal a successful startup
status_queue.put(("STATUS", "Browser Ready", (prof_folder, "READY")))
browser_params.profile_path = prof_folder
# passes "READY" to the TaskManager to signal a successful startup
status_queue.put(("STATUS", "Browser Ready", "READY"))
browser_params.profile_path = browser_profile_path

# starts accepting arguments until told to die
while True:
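After this change the final startup message carries only `"READY"`, and the profile path travels in the earlier "Profile Created" step. The sketch below illustrates that handshake with a thread standing in for the BrowserManager process. The `("STATUS", step, payload)` tuple shape mirrors the `status_queue.put` call in this hunk; the body of `check_queue`, the timeout value, and the `fake_browser_manager` and `launch` names are assumptions for illustration only.

```python
import queue
import threading
from typing import Any, Dict

SPAWN_TIMEOUT = 5  # seconds; assumed value for this sketch


def check_queue(status_queue: "queue.Queue", launch_status: Dict[str, bool]) -> Any:
    """Read one message, mark its step as done, and return its payload."""
    kind, step, payload = status_queue.get(timeout=SPAWN_TIMEOUT)
    if kind != "STATUS":
        raise RuntimeError(f"unexpected message type: {kind}")
    launch_status[step] = True
    return payload


def fake_browser_manager(status_queue: "queue.Queue") -> None:
    """Stand-in for the child process: reports startup steps in order."""
    status_queue.put(("STATUS", "Profile Created", "/tmp/profile"))
    status_queue.put(("STATUS", "Browser Ready", "READY"))


def launch() -> Dict[str, bool]:
    status_queue: "queue.Queue" = queue.Queue()
    launch_status: Dict[str, bool] = {}
    worker = threading.Thread(target=fake_browser_manager, args=(status_queue,))
    worker.start()
    check_queue(status_queue, launch_status)  # profile path arrives here
    ready = check_queue(status_queue, launch_status)
    worker.join()
    if ready != "READY":
        raise RuntimeError("mismatch of status queue return values")
    return launch_status
```

Recording each completed step in `launch_status` is what lets the caller report which step was missing when a spawn attempt times out.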
@@ -498,12 +489,6 @@ def BrowserManager(
command: Union[ShutdownSignal, BaseCommand] = command_queue.get()

if type(command) is ShutdownSignal:
# Geckodriver creates a copy of the profile (and the original
# temp file created by FirefoxProfile() is deleted).
# We clear the profile attribute here to prevent prints from:
# https://github.com/SeleniumHQ/selenium/blob/4e4160dd3d2f93757cafb87e2a1c20d6266f5554/py/selenium/webdriver/firefox/webdriver.py#L193-L199
if driver.profile and not os.path.isdir(driver.profile.path):
driver.profile = None
driver.quit()
status_queue.put("OK")
return