
Deal with null entries in post-crawl data analysis #121

Open
franciscawijaya opened this issue Jul 12, 2024 · 9 comments
@franciscawijaya (Member)

As we have identified from the April and June crawls, there have always been sites with empty entries (No Data). For these two months, there are around 900+ empty entries.

I have been brainstorming about what to do with these entries, which have been present in past crawls as well.
I was initially thinking of including them on the error page, but I'm not sure that would be the best move, given that the null entries for one site could be caused by a different reason than those for another site (i.e., they do not map to a definite, known error such as HumanCheckError, InsecureCertificateError, or WebDriverError).
[Screenshot attached]

Right now, I am thinking of just showing these empty results in our data analysis, maybe creating a column and figures for the percentage of sites in our data set that give empty entries for the month's crawl?

@franciscawijaya added the documentation and data analysis labels on Jul 12, 2024
@franciscawijaya self-assigned this on Jul 12, 2024
@SebastianZimmeck (Member)

> Right now, I am thinking of just showing these empty results in our data analysis, maybe creating a column and figures for the percentage of sites in our data set that give empty entries for the month's crawl?

Yes, that works! We can deal with the null entries at the data analysis step. We do not need to create figures for the null failures, but we should be able to say that x% of a given crawl had all null values, just like we are able to say that y% of sites had a certain error.

For both null values and any errors, we should not include these sites in any analysis statistics or figures we present, as those do not have meaning for the analysis.

Do you have a thought on what causes the null entries? If we do a second attempt, would that lead to fewer null entries (i.e., apply the same approach as for errors)?

@franciscawijaya (Member, Author)

> Yes, that works! We can deal with the null entries at the data analysis step. We do not need to create figures for the null failures, but we should be able to say that x% of a given crawl had all null values, just like we are able to say that y% of sites had a certain error.

Got it! I'll be working on creating the code to do the data analytics for the percentage of null errors.

> Do you have a thought on what causes the null entries? If we do a second attempt, would that lead to fewer null entries (i.e., apply the same approach as for errors)?

Based on my manual observation of the list of sites with null entries, I don't think there's a way to generalize one specific cause for the null entries. However, as discussed in the previous issue, we concluded that yelp.com was a VPN issue (potentially because we have been accessing the site from the same LA VPN IP address): when I did another crawl after the June crawl with the same VPN it still failed, but it succeeded when I changed to another LA VPN with a different IP address.

While this is the case for yelp.com, we can't say for sure that the cause of the null entries is the same for all the other null sites. Nevertheless, looking at the null entries that have been consistently present in previous crawls, especially some big names like meta.com and apple.com, I suspect that the cause is something along the same lines, that is, they recognize or block the VPN IP address. Although it is curious that they just give a blank page instead of an explicit 'Access Denied' page, as discussed in issue 51.

Since I have successfully collected the sites that gave empty entries in June (a slightly shorter list than the April one), I will also do another crawl on just this list of sites that gave null entries; so far I have only tried this for yelp.com to troubleshoot what was causing the null entries. Since I tried yelp.com with a different LA VPN IP address last time to check my hypothesis, I will follow the same methodology for the re-crawl today.

@SebastianZimmeck (Member)

Sounds good!

We do not necessarily need to figure out the reason for the null entries for sure. But it is nice for the paper to be able to say that VPN IP address blocking is the reason for at least some of them.

As we discussed, maybe you will find that a different LA VPN address for the next crawl results in fewer null entries. We may also update the crawl protocol slightly by doing a second crawl for all null entries (just like we do for other errors) with a different LA VPN.
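
For reference, a minimal sketch of how the re-crawl list for such a second pass could be generated, assuming the crawl results are loaded into a pandas DataFrame with columns like the ones shown later in this thread (the file names and the exact subset of data columns are placeholders, not the project's actual paths):

```python
import pandas as pd

# Hypothetical file name; the real crawl output may be stored elsewhere.
df = pd.read_csv("june_crawl_results.csv")

# Data columns that are empty for the affected sites (placeholder subset).
data_cols = [
    "uspapi_before_gpc", "uspapi_after_gpc",
    "usp_cookies_before_gpc", "usp_cookies_after_gpc",
    "gpp_before_gpc", "gpp_after_gpc",
]

# Sites where every data column is null are candidates for the second crawl
# with a different LA VPN IP address.
all_null = df[df[data_cols].isna().all(axis=1)]
all_null["Site URL"].to_csv("null_recrawl_sites.csv", index=False)
```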

@franciscawijaya (Member, Author)

franciscawijaya commented Jul 15, 2024

I finished the re-crawl specifically for the sites that output null values for all columns and manually looked through the results.

Some important observations:

  • While doing the crawl, I realized that not all of these sites gave a blank page like yelp.com did during the original June crawl (e.g., I waited in front of the computer for a while during the crawl and saw that the crawler was able to access and visit apple.com); some sites also gave a login page or even an explicit error page. [Screenshot attached] From this, my first conclusion is that not all of the sites that gave null entries in the June crawl (and I suspect the same for the April crawl) did so because of a VPN error, as my first hypothesis assumed.
  • After the crawl finished, I manually checked analysis.json, and all of the sites still have the same null values, with the exception of these 4 sites. However, I'm not sure if these exceptions give meaningful information, so I would love to clarify this. [Screenshots of the 4 sites attached] The only notable output that is certain is yelp.com, which was confirmed in the previous issue in a crawl just for yelp.com [screenshot attached] and again in this re-crawl.
  • I also manually looked through error-logging.json and found something interesting: some of the sites that were previously flagged only as null now have their errors explicitly identified (i.e., "WebDriverError: Reached Error Page", "HumanCheckError", "InsecureCertificateError", "TimeoutError"). This could explain the earlier question of what causes the null entries. It might be the case that in the previous crawl a site had not yet redirected to the error page, login page, or human check within the crawler's timeout period. In other words, a null entry is a good sign that the site errored; whether the crawl identifies exactly what the error is within the timeout period is arbitrary. This is because (in reference to my second observation) the data itself does not change (i.e., the entries are still null); the only difference is that the exact reason is identified in the new re-crawl.

I double-checked this conclusion by looking through our results from previous crawls for sites with specific errors identified, like HumanCheckError, and all of their entries also have null outputs; only the error column is filled with the identified errors.

> We do not necessarily need to figure out the reason for the null entries for sure. But it is nice for the paper to be able to say that VPN IP address blocking is the reason for at least some of them.

I think the cause of the null entries is an amalgamation of possible reasons. For example, yelp.com seems to be an explicit VPN issue because of the experiment I did a few weeks ago and reconfirmed with this crawl (it went from a blank page to being accessible after changing the IP address). There are other forms of such VPN errors. For instance, "Access Denied" could also be a result of the VPN IP address being blocked.
[Screenshot attached]
However, there are also other causes of the errors, like "WebDriverError: Reached Error Page", "HumanCheckError", "InsecureCertificateError", and "TimeoutError", that were identified for these sites in the re-crawl, as mentioned in my third observation above.
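
A rough sketch of how this cross-check could be automated, assuming analysis.json is a list of per-site records and error-logging.json maps site URLs to error strings (both layouts and the field names are assumptions based on the file names mentioned above, not the crawler's actual schema):

```python
import json
from collections import Counter

# Assumed file layouts; the real schemas may differ.
with open("analysis.json") as f:
    analysis = json.load(f)      # assumed: list of per-site dicts
with open("error-logging.json") as f:
    errors = json.load(f)        # assumed: {site_url: error_string}

# Placeholder subset of the data fields that are null for affected sites.
data_keys = ["uspapi_before_gpc", "uspapi_after_gpc", "gpp_before_gpc", "gpp_after_gpc"]

# Sites whose data fields are all null in the re-crawl.
null_sites = [
    entry["Site URL"] for entry in analysis
    if all(entry.get(k) is None for k in data_keys)
]

# Tally which explicit error (if any) the re-crawl attached to each null site.
error_counts = Counter(errors.get(site, "no explicit error") for site in null_sites)
print(error_counts)
```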

@franciscawijaya (Member, Author)

franciscawijaya commented Jul 16, 2024

Outcome: We wanted to explicitly identify the sites with only null entries. We noticed that these null entries are present in the previous crawls with roughly similar numbers of sites.

We found that these null entries are similar to an error. After doing a re-crawl of sites with previously null entries, we found that null entries indicate a precursor to an error; in the re-crawl, our crawler identified and flagged some of these sites with "WebDriverError: Reached Error Page", "HumanCheckError", "InsecureCertificateError", or "TimeoutError", which may have prevented us from accessing these sites' data and thus returned null entries.

We also found that, for some of the other sites, it could be a case of a VPN error. For instance, after doing the re-crawl with a different VPN IP address, we managed to get data for the previously empty entries for yelp.com. At the same time, we also noticed sites that still blocked access because they potentially recognized our VPN IP address.

@SebastianZimmeck (Member)

Well said, @franciscawijaya!

For the future we will:

  • Use a different LA VPN address than the one that possibly caused yelp.com to have all null values
  • Add some code in the Colab to calculate the percentage of sites with all null values
  • Determine for which of the figures that @katehausladen created, if any, we need to take out the null values and/or error sites

(If any of these warrant more discussion, feel free to open a new issue. But it is also OK to address these points here if the answers are straightforward.)

@franciscawijaya (Member, Author)

franciscawijaya commented Jul 20, 2024

Update: I have added the code in the Colab to calculate the percentage of sites with all null values and have also made sure the figures for the monthly data analysis do not use any of the null values and/or error sites in their calculations. (For June, these null values and/or error sites made up 8.47% of our crawl list of 11,708 sites.)

Misc. notes:
In my calculation of the percentage of sites with all null values, I identified and included both (1) the sites with empty entries that we have been discussing (all null values, but GPC was sent and the status was 'added') and (2) sites that have null values due to an explicit error (GPC was not sent from the start and the status was 'not added').

Examples of these two types of null values for reference:

Column | https://apple.com (1) | https://sprint.com (2)
site_id | 18 | 84
status | added | not added
domain | apple.com | null
sent_gpc | 1 | null
uspapi_before_gpc | null | null
uspapi_after_gpc | null | null
usp_cookies_before_gpc | null | null
usp_cookies_after_gpc | null | null
OptanonConsent_before_gpc | null | null
OptanonConsent_after_gpc | null | null
gpp_before_gpc | null | null
gpp_after_gpc | null | null
urlClassification | {"firstParty":{},"thirdParty":{}} | null
OneTrustWPCCPAGoogleOptOut_before_gpc_x | null | null
OneTrustWPCCPAGoogleOptOut_after_gpc_x | null | null
OTGPPConsent_before_gpc_x | null | null
OTGPPConsent_after_gpc_x | null | null
usps_before_gpc | null | null
usps_after_gpc | null | null
decoded_gpp_before_gpc | None | None
decoded_gpp_after_gpc | None | None
USPS implementation | neither | null
error | null | WebDriverError: Reached Error Page, singleTimeoutError
Well-known | None | None
Tranco | 42 | 152
OneTrustWPCCPAGoogleOptOut_before_gpc_y | null | null
OneTrustWPCCPAGoogleOptOut_after_gpc_y | null | null
OTGPPConsent_before_gpc_y | null | null
OTGPPConsent_after_gpc_y | null | null
third_party_count | 0 | null
third_party_urls | {} | nan
unique_ad_networks | [] | nan
num_unique_ad_networks | 0 | null

I counted both of these types of null values/error sites in my percentage calculation and, accordingly, excluded them from the figures.
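
For reference, a minimal sketch of this kind of calculation (not the actual Colab code; the file name, the subset of data columns, and the exact status/GPC checks are assumptions based on the table above):

```python
import pandas as pd

df = pd.read_csv("crawl_data_june.csv")  # hypothetical file name

# Data columns that are null for the affected sites (placeholder subset).
data_cols = [
    "uspapi_before_gpc", "uspapi_after_gpc",
    "usp_cookies_before_gpc", "usp_cookies_after_gpc",
    "gpp_before_gpc", "gpp_after_gpc",
]
all_null = df[data_cols].isna().all(axis=1)

# Type (1): all null values, but GPC was sent and the status is 'added'.
type1 = all_null & (df["sent_gpc"] == 1) & (df["status"] == "added")
# Type (2): null values due to an explicit error; GPC was never sent.
type2 = all_null & (df["status"] == "not added")

pct_null_or_error = (type1 | type2).mean() * 100
print(f"{pct_null_or_error:.2f}% of the {len(df)} crawled sites have all null values or errors")

# Exclude both types from any statistics or figures we present.
clean_df = df[~(type1 | type2)]
```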

Next: I'll be working on updating the code for the Crawl_Data_Over_Time (though this might take more time as I'm still working on fully understanding this Colab file).

@SebastianZimmeck (Member)

As we discussed today, at this point this issue is purely about the data processing after the crawl. @franciscawijaya mentioned that the Crawl_Data_Over_Time Colab still needs to be updated. Both @franciscawijaya and @natelevinson10 will work on adapting this and the other scripts as necessary.

@SebastianZimmeck changed the title from "Deal with null entries" to "Deal with null entries in post-crawl data analysis" on Sep 16, 2024
@SebastianZimmeck (Member)

@eakubilo will take the lead on this issue, explore it a bit more, and report findings.
