
Design protocol for determining crawl accuracy over time #136

Open
SebastianZimmeck opened this issue Sep 18, 2024 · 2 comments
Assignees
Labels
crawl (Perform crawl or crawl feature-related)

Comments

@SebastianZimmeck
Member

@katehausladen provided an initial accuracy analysis as shown in our draft paper (section 3.5). Starting with the September crawl (#118), we should come up with a protocol for each crawl going forward: manually check 100 randomly selected sites to determine whether the crawl results are accurate. As we crawl over longer periods of time, we might otherwise see a drift in accuracy, for example, due to code changes or site changes, so we should keep an eye on it.

I am particularly concerned about the following:

  • uspapi_before_gpc, uspapi_after_gpc
  • usp_cookies_before_gpc, usp_cookies_after_gpc
  • OptanonConsent_before_gpc, OptanonConsent_after_gpc
  • gpp_before_gpc, gpp_after_gpc
  • usps_before_gpc, usps_after_gpc
  • USPS implementation
  • error
  • Well-known

A few comments:

  • @katehausladen came up with a strategy to check the ground truth while the crawl was running. That was useful because site loads may differ in which ad networks and other third parties a site loads from one run to another. Now, if we do a post-hoc analysis after the crawl, we cannot do that since the crawl has already happened. But I think that is OK because I am less concerned about the urlClassification results and more about the fields above. Question: do the above change from load to load? I do not think that should be the case since, for example, a site should set the OptanonConsent cookie on every load. But we need to check that (for instance, with a repeated-load check like the sketch after this list).
  • Also, how do we ensure that our random selection covers all the cases? Having one instance of the GPP implementation behaving correctly is not meaningful; we should aim for around ten instances per condition. So, how do we select sites randomly while ensuring sufficient instances? Our set of sites is skewed. For example, most sites do not have a GPP string, so a fully random selection may not lead to sufficient coverage. Maybe we randomly select from the set of sites that have a GPP implementation per the crawl to confirm 10 positive instances, and randomly select 10 from the set of non-GPP sites to confirm negative instances (see the sampling sketch further below). This analysis is further complicated because there are different sub-conditions, for example, the USP API opting out after receiving a GPC signal vs. the USP API already being opted out before receiving a GPC signal. In the first case we have an instance of a string change that our crawl needs to get right.
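To answer the load-to-load stability question, a small repeated-load script could compare what a site reports across several fresh visits. This is only a sketch, not part of our crawler: it assumes a Python/Selenium setup with a local geckodriver, and the site list, wait time, and field names are placeholders rather than our actual crawl configuration.

```python
# Sketch: load a site several times with a fresh browser each time and compare
# the US Privacy signals it reports, to see whether they vary across loads.
# Assumes Selenium with a local geckodriver; the site list is a placeholder.
import time
from selenium import webdriver

SITES = ["https://example.com"]  # placeholder list of sites to spot-check
LOADS = 3                        # number of repeated loads per site

GET_SIGNALS_JS = """
const done = arguments[arguments.length - 1];
const result = {uspapi: null, optanonConsent: null};
const match = document.cookie.match(/OptanonConsent=([^;]*)/);  // OptanonConsent cookie, if set
if (match) result.optanonConsent = match[1];
if (typeof window.__uspapi === "function") {                    // USP API, if implemented
    window.__uspapi("getUSPData", 1, (data, ok) => {
        if (ok && data) result.uspapi = data.uspString;
        done(result);
    });
} else {
    done(result);
}
"""

def observe(site):
    """Load the site in a fresh browser profile and return the signals it reports."""
    driver = webdriver.Firefox()
    driver.set_script_timeout(10)
    try:
        driver.get(site)
        time.sleep(10)  # give ads and other third parties time to load
        return driver.execute_async_script(GET_SIGNALS_JS)
    finally:
        driver.quit()

for site in SITES:
    observations = [observe(site) for _ in range(LOADS)]
    stable = all(obs == observations[0] for obs in observations)
    print(site, "stable" if stable else "differs across loads", observations)
```

If the reported values are identical across loads for a handful of sites per condition, a post-hoc manual check should be just as informative as checking during the crawl.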

The bottom line is that we need a protocol that allows us to check the analysis accuracy of our different conditions (including sub-conditions) for every crawl so that we can track analysis accuracy over time. Since we need to do it for every crawl and it involves manual work, it should be manageable time-wise but still meaningful.
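The stratified selection step could look roughly like the sketch below. It is only an illustration: it assumes the crawl results are available as a CSV with a site_url column and one column per condition (the column names here are placeholders and may not match the real crawl output), and it simply draws up to 10 random positives and 10 random negatives per condition for manual review.

```python
# Sketch: stratified random selection of sites for manual accuracy checks.
# Assumes a crawl-results CSV with a site_url column plus one column per condition;
# the file path and column names are placeholders, not the real crawl schema.
import csv
import random

CRAWL_RESULTS = "crawl_results.csv"   # hypothetical path to the crawl output
CONDITIONS = ["gpp_after_gpc", "uspapi_after_gpc", "usp_cookies_after_gpc"]
PER_GROUP = 10                        # ~10 positives and ~10 negatives per condition

with open(CRAWL_RESULTS, newline="") as f:
    rows = list(csv.DictReader(f))

sample = {}
for condition in CONDITIONS:
    positives = [r["site_url"] for r in rows if r.get(condition)]      # field present/non-empty
    negatives = [r["site_url"] for r in rows if not r.get(condition)]  # field absent/empty
    sample[condition] = {
        "positive": random.sample(positives, min(PER_GROUP, len(positives))),
        "negative": random.sample(negatives, min(PER_GROUP, len(negatives))),
    }

for condition, groups in sample.items():
    print(condition, "positive:", groups["positive"])
    print(condition, "negative:", groups["negative"])
```

Sites drawn for multiple conditions could be de-duplicated afterwards so that the total manual workload stays near 100 sites per crawl.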

@natelevinson10 will take the lead here and work with @franciscawijaya and @eakubilo before starting the next crawl.

@SebastianZimmeck added the crawl (Perform crawl or crawl feature-related) label on Sep 18, 2024
@natelevinson10
Member

I did a quick review of the manual data we collected a couple of weeks ago and targeted instances of a mismatch (marked in red) signaling that our ground truth differed from the crawl data. I used several VPN locations (California, multiple Colorado, Virginia, and no VPN (CT)) and gave ample time to let all of the site content load.

I was not able to find a single instance of our manual data changing from what we had reported, except for bumble.com's USPapi_before being "1YNN" instead of the reported "1YYN", and I would chalk that up to a manual error on our end. It seems that for the mismatches between crawl and manual data, the manual data is more accurate.

As for the comments left by @SebastianZimmeck above, cookies do seem to load on every refresh; I have yet to find an instance where an OptanonConsent cookie does not load where it should. I plan to do some more testing on this over the next few days to be certain. As for our site sample skew, I believe it could be worth having a subset of websites we know to have GPP / OTGPPConsent data. One thought is to compile a list of websites we know to exhibit all of our needed behaviors (e.g., USP API opts out after receiving a GPC signal vs. USP API already opted out before receiving a GPC signal, etc.), as these are crucial to include in our crawl list to get a holistic representation of results. I plan to see if there is a list or directory of websites with certain attributes that could simplify the search for these websites, if that is something we choose to do. Classifying sites by sub-condition from the existing crawl data, as in the sketch below, could be a starting point.
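As a rough illustration of that classification step (not a finished tool), the sketch below buckets sites by USP API sub-condition using the before/after fields. It assumes the crawl output is a CSV with uspapi_before_gpc and uspapi_after_gpc columns holding US Privacy strings like "1YNN"; in a v1 US Privacy string the third character is the opt-out-of-sale flag. The file path and column names are assumptions.

```python
# Sketch: classify sites into USP API sub-conditions from before/after GPC values.
# Assumes columns uspapi_before_gpc / uspapi_after_gpc with US Privacy strings;
# the path and column names are placeholders for the real crawl output.
import csv

CRAWL_RESULTS = "crawl_results.csv"  # hypothetical path to the crawl output

def opted_out(usp_string):
    """True if the US Privacy string signals an opt-out of sale (third character 'Y')."""
    return bool(usp_string) and len(usp_string) >= 3 and usp_string[2] == "Y"

buckets = {"opted_out_after_gpc": [], "already_opted_out": [], "never_opted_out": [], "no_uspapi": []}
with open(CRAWL_RESULTS, newline="") as f:
    for row in csv.DictReader(f):
        before = row.get("uspapi_before_gpc", "")
        after = row.get("uspapi_after_gpc", "")
        if not before and not after:
            buckets["no_uspapi"].append(row["site_url"])
        elif opted_out(before):
            buckets["already_opted_out"].append(row["site_url"])
        elif opted_out(after):
            buckets["opted_out_after_gpc"].append(row["site_url"])
        else:
            buckets["never_opted_out"].append(row["site_url"])

for name, sites in buckets.items():
    print(name, len(sites))
```

A curated list of sites per bucket would then give us known positives for each sub-condition to re-check in future crawls.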

@SebastianZimmeck
Member Author

Thanks, @natelevinson10!
