
Evaluate whether increased timeouts or other parameter changes improve crawl analysis accuracy #135

Open
SebastianZimmeck opened this issue Sep 18, 2024 · 5 comments


SebastianZimmeck commented Sep 18, 2024

Before the next crawl (#118), we should look into what caused the divergence of crawl results from the manual analysis results in our most recent analysis. See the red-labeled fields:

[Screenshot: crawl vs. manual analysis results with mismatched fields labeled in red]

Do increased timeouts help? Maybe some sites were not fully loaded before the data was captured (a timeout-tuning sketch follows below). Are there other parameters that we can fine-tune to improve accuracy?
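For concreteness, here is a minimal sketch of the Selenium timeout knobs we could experiment with. This is not our crawler's actual code; the Firefox driver and the 60/15-second values are placeholders for illustration, not tested recommendations:

```python
# Hedged sketch: generic Python Selenium timeout tuning; Firefox driver
# and all timeout values are assumptions, not the crawler's real settings.
import time

from selenium import webdriver

driver = webdriver.Firefox(options=webdriver.FirefoxOptions())

driver.set_page_load_timeout(60)  # max seconds for driver.get() to return
driver.set_script_timeout(60)     # max seconds for async scripts to finish
driver.implicitly_wait(10)        # max seconds element lookups will poll

try:
    driver.get("https://example.com")
    # Fixed grace period so late-loading consent scripts (e.g., OneTrust,
    # which writes the OptanonConsent cookie) can finish before we read data.
    time.sleep(15)
finally:
    driver.quit()
```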

@franciscawijaya will take the lead here and work with @natelevinson10 and @eakubilo before starting the next crawl.

@SebastianZimmeck SebastianZimmeck added the crawl Perform crawl or crawl feature-related label Sep 18, 2024

franciscawijaya commented Sep 24, 2024

Some of my findings:

  1. OptanonConsent after GPC
  • An increased timeout did not change the crawl data (especially for OptanonConsent after GPC) because there is a human check error.
  • I noticed that the websites with discrepancies from the manual check for Optanonconsent_after_gpc all showed the same human check error powered by PerimeterX (see the screenshot attached below).
  • I tried to bypass the human check by pressing and holding during the crawl but was not able to. My conclusion is that this particular mechanism recognized that we were accessing the site with a Selenium crawler (the page states that access may be limited because JavaScript is disabled or the browser does not support cookies, which suggests the check detected possible browser automation).
  • I also inspected storage while the crawl was ongoing with the human check present and realized that the cookie value there is also zero. Hence, we can conclude that the human check page does not reflect the cookie value of the actual site detecting GPC.
  2. Pridecounseling.com redirect
  • The crawl did not capture the data of betterhelp.com (the site pridecounseling.com redirects to when visited manually).
  • It instead logged an InsecureCertificateError in the error logging.
  • While a manual check redirects us to betterhelp.com, the crawler only reaches a blank page when trying to access pridecounseling.com (a Selenium sketch after this list illustrates both findings).
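To make the two findings above concrete, here is a hedged Selenium sketch: accepting the bad certificate that raises InsecureCertificateError, checking where the navigation actually lands, and reading the OptanonConsent cookie directly from the browser. The site names come from the findings above; everything else is illustrative and not our crawler's actual code.

```python
# Illustrative only: plain Python Selenium, not the crawler's real code.
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.accept_insecure_certs = True  # avoids InsecureCertificateError on bad certs

driver = webdriver.Firefox(options=options)
try:
    driver.get("https://pridecounseling.com")
    # Manually this redirects to betterhelp.com; the crawler saw a blank page.
    print("landed on:", driver.current_url)

    # Read the cookie as stored in the browser, not as rendered on the page;
    # on a human check page this value describes the challenge page rather
    # than the real site, which matches the zero values we observed.
    cookie = driver.get_cookie("OptanonConsent")
    print("OptanonConsent:", cookie["value"] if cookie else None)
finally:
    driver.quit()
```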

The good news is that

  • all the human check errors that caused OptanonConsent after GPC to still be 0 are correctly logged as human check errors in the error logging.

Given that this error is correctly logged in the error logging and the affected sites can only be accessed manually, I think it is still okay to start the crawl. Should we encounter this again, we are now better informed about a possible reason why OptanonConsent cookies are 0 both before and after GPC for sites that are also flagged with human check errors. I will begin the California crawl shortly.
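As a sketch of how the crawler could keep flagging these pages, one simple heuristic is to scan the rendered page for challenge markers and log the site as a human check error. The marker strings below are assumptions based on the screenshot, not a vetted detection list, and the HumanCheckError label is only meant to mirror the existing error logging:

```python
# Hedged sketch: heuristic detection of PerimeterX-style challenge pages.
# Marker strings are assumptions for illustration, not a vetted list.
HUMAN_CHECK_MARKERS = [
    "press & hold",                   # PerimeterX challenge button text
    "px-captcha",                     # element id often seen on challenge pages
    "please verify you are a human",  # common challenge wording
]

def looks_like_human_check(page_source: str) -> bool:
    """Heuristic: any known challenge marker appears in the rendered page."""
    lowered = page_source.lower()
    return any(marker in lowered for marker in HUMAN_CHECK_MARKERS)

def classify(driver, site: str, error_log: list) -> None:
    # Cookie values read on a challenge page (e.g., OptanonConsent = 0)
    # describe the challenge page, not the real site, so log and move on.
    if looks_like_human_check(driver.page_source):
        error_log.append({"site": site, "error": "HumanCheckError"})
```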

[Screenshot: PerimeterX "Press & Hold" human verification page]

@SebastianZimmeck SebastianZimmeck added the top priority Important and as soon as possible to address label Sep 24, 2024
SebastianZimmeck commented

@franciscawijaya, for the OptanonConsent after GPC:

  1. Were some of the sites also in our previous crawls? If so, what were the results then?
  2. Do you get the same results for Colorado, California, another VPN? In other words, is this VPN-dependent?

@natelevinson10 and @eakubilo, if you have a chance, please also look into this issue. It is the most important one at the moment, and it would be good to have a solid understanding before we start the next crawl.

SebastianZimmeck commented

@natelevinson10 is saying:

Did a quick review of the manual data we collected a couple of weeks ago and targeted instances of a mismatch (marked in red) signaling that our ground truth differed from the crawl data. I used several VPN locations (California, multiple Colorado, Virginia, and no VPN (CT)) and gave ample time to let all of the site content load.

I was not able to find a single instance of our manual data changing from what we had reported, except for bumble.com's USPapi_before being "1YNN" instead of the reported "1YYN", which I would chalk up to a manual error on our end. It would seem that for the crawl-to-manual mismatches, the manual data is the more accurate.


franciscawijaya commented Sep 26, 2024

  1. Were some of the sites also in our previous crawls? If so, what were the results then?

I wanted to be precise and check the progression of the data across the different crawl rounds we did; these are the findings:

  1. Dickies.com: Dec-April (isGpcEnabled=1) [matches manual data], June crawl (isGpcEnabled=0) [matches recent crawl data] --> high priority: sudden change in the crawl output
  2. Altrarunning.com: Dec-June (isGpcEnabled=1) [matches manual data] --> high priority: the crawl output has been accurate since the first crawl, so it is strange that we are getting isGpcEnabled=0 now
  3. Smartwool.com: Dec-June (isGpcEnabled=1) [matches manual data] --> high priority: the crawl output has been accurate since the first crawl, so it is strange that we are getting isGpcEnabled=0 now
  4. Goodrx.com: NULL for uspapiaftergpc [matches recent crawl data] --> low priority: the difference between crawl and manual has been present since the first crawl
  5. Redrobin.com: Dec-Feb (isGpcEnabled=0) [matches recent crawl data], April-June (isGpcEnabled=1) [matches manual data] --> high priority: it is strange that the crawl output was isGpcEnabled=1 after previously outputting 0 but is now back to 0

Analysis:
While there are only 5 overlapping sites, their outputs in the past crawls have mostly been correct, with the exception of Dickies.com, so it is pretty strange that we got the opposite in our recent small crawl (a cross-round comparison sketch follows below).
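For reference, this kind of cross-round comparison can be automated. The sketch below assumes each crawl round was exported to a CSV with hypothetical site and isGpcEnabled_after columns; the file names and column names are made up for illustration:

```python
# Hedged sketch: line up isGpcEnabled across crawl rounds so flips stand out.
# File and column names are assumptions, not our actual export format.
import pandas as pd

ROUNDS = {"Dec": "crawl_dec.csv", "Feb": "crawl_feb.csv", "April": "crawl_april.csv",
          "June": "crawl_june.csv", "Sept": "crawl_sept.csv"}  # hypothetical files
FOCUS_SITES = ["dickies.com", "altrarunning.com", "smartwool.com",
               "goodrx.com", "redrobin.com"]

merged = None
for label, path in ROUNDS.items():
    df = pd.read_csv(path)  # assumed columns: site, isGpcEnabled_after
    df = df[df["site"].isin(FOCUS_SITES)][["site", "isGpcEnabled_after"]]
    df = df.rename(columns={"isGpcEnabled_after": label})
    merged = df if merged is None else merged.merge(df, on="site", how="outer")

# One row per site, one column per crawl round.
print(merged.to_string(index=False))
```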

2. Do you get the same results for Colorado, California, another VPN? In other words, is this VPN-dependent?

Since I used a Colorado VPN for the recent small crawl, I have also crawled these focus sites with a California VPN. After analyzing the results, I realized that the VPN could be a potential problem, as it produced minor differences in the outputs. Hence, I will redo these focused mini crawls with more VPN IP addresses (i.e., more than one Colorado and one California VPN) to have more data to compare and confirm before writing up the analysis here.
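A sketch of what that comparison could look like, with run_mini_crawl() as a hypothetical stand-in for the crawler; the VPN exit names and returned values are placeholders, not real crawl data:

```python
# Hedged sketch: flag sites whose GPC result differs across VPN exits.
from collections import defaultdict

VPN_EXITS = ["colorado-1", "colorado-2", "california-1", "california-2"]

def run_mini_crawl(vpn_exit: str) -> dict:
    """Hypothetical stand-in for the crawler: returns {site: isGpcEnabled}
    for the focus sites crawled through the given VPN exit."""
    # Placeholder values for illustration only -- not real crawl output.
    return {"dickies.com": 0, "altrarunning.com": 0, "smartwool.com": 0}

results = defaultdict(dict)  # site -> {vpn_exit: isGpcEnabled}
for exit_node in VPN_EXITS:
    for site, value in run_mini_crawl(exit_node).items():
        results[site][exit_node] = value

for site, per_vpn in results.items():
    if len(set(per_vpn.values())) > 1:  # output differs by VPN exit
        print(f"VPN-dependent: {site}: {per_vpn}")
```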

SebastianZimmeck commented

Thanks, @franciscawijaya!

While there are only 5 overlapping sites, their outputs in the past crawls have mostly been correct, with the exception of Dickies.com, so it is pretty strange that we got the opposite in our recent small crawl.

That is helpful to know! So, before starting the next crawl we should try to understand what the reasons for these performance drops are.

I realized that the VPN could be a potential problem, as it produced minor differences in the outputs. Hence, I will redo these focused mini crawls with more VPN IP addresses to have more data to compare

Yes, that is a good point to try.

@natelevinson10, can you coordinate with @franciscawijaya and also look into this as a team?
