This is a basic template which can be used to write a data extractor. The extraction logic should be placed in the `process` function within `data_extractor/__init__.py`. An example has been provided. The argument that the `process` function receives is a file-like object, so it can be used with most regular Python libraries. The example demonstrates this by using the `zipfile` module.
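As a minimal sketch (the field names and return structure are illustrative, not the project's required interface), a `process` function reading a ZIP archive from the file-like object could look like this:

```python
import json
import zipfile


def process(file_data):
    """List the JSON files inside the uploaded ZIP and count their entries.

    `file_data` is the file-like object passed in by the web application.
    """
    results = []
    with zipfile.ZipFile(file_data) as archive:
        for name in archive.namelist():
            if not name.endswith(".json"):
                continue
            with archive.open(name) as json_file:
                content = json.load(json_file)
                results.append({"file": name, "entries": len(content)})
    return results
```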
This project makes use of Poetry, which makes creating the required Python wheel a straightforward process. Install Poetry with the following command: `pip install poetry`. Then install the required Python packages with `poetry install`.
The behavior of the `process` function can be verified by running the tests. The tests are located in the `tests` folder. To run the tests, execute: `poetry run pytest`.
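A test for the example extractor might look like the sketch below; the in-memory ZIP and the assertion are assumptions for illustration and do not reflect the existing test suite:

```python
import zipfile
from io import BytesIO

from data_extractor import process


def test_process_accepts_file_like_object():
    # Build a small in-memory ZIP so the test does not depend on fixture files.
    buffer = BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("example.json", "[]")
    buffer.seek(0)

    result = process(buffer)

    # The exact shape of the result depends on the extraction logic.
    assert result is not None
```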
To run the extraction code from the browser, run: `python3 -m http.server` from the project root folder (the one containing `.git`). This will start a web server on localhost (port 8000 by default, i.e. http://localhost:8000). Opening that URL in a browser will initialize the application. After it has been loaded, a file can be selected. The output of the `process` function will be displayed after a while (depending on the amount of processing required and the speed of the machine).
In this example, we first create a simulated Google Semantic Location History (GSLH) data download package (DDP). Subsequently, we extract relevant information from the simulated DDP.
Command: `poetry run python google_semantic_location_history/simulation_gslh.py`

This creates a ZIP file with the simulated Google Semantic Location History data in `tests/data/Location History.zip`.
GSLH data is simulated using the Python libraries GenSON, Faker, and faker-schema. The simulation script first generates a JSON schema from a JSON object using GenSON's `SchemaBuilder` class. The JSON object is derived from an example GSLH data package, downloaded from Google Takeout. The GSLH data package consists of monthly JSON files with information on e.g. geolocations, addresses, and time spent in places and in activities. The JSON schema describes the format of the JSON files and can be used to generate fake data with the same format. This is done by converting the JSON schema to a custom schema expected by faker-schema, in the form of a dictionary where the keys are field names and the values are the types of the fields. The values represent available data types in Faker, packed in so-called providers. Faker provides a wide variety of data types via providers, for example for names, addresses, and geographical data. This allows us to easily customize the faked data to our specifications.
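The sketch below illustrates both steps under simplified assumptions: the example object and the hand-written mapping from field names to Faker providers are illustrative, not the actual GSLH schema conversion.

```python
from genson import SchemaBuilder
from faker_schema.faker_schema import FakerSchema

# Step 1: infer a JSON schema from an example object (in the real script this
# object comes from a GSLH file downloaded via Google Takeout).
example = {
    "placeVisit": {
        "location": {"address": "Some Street 1", "name": "Home"},
        "duration": {"startTimestampMs": "1577836800000"},
    }
}
builder = SchemaBuilder()
builder.add_object(example)
json_schema = builder.to_schema()

# Step 2: map field names to Faker provider names (a simplified, hand-written
# stand-in for the schema conversion) and generate a fake record.
custom_schema = {"address": "address", "name": "company", "startTimestampMs": "unix_time"}
fake_record = FakerSchema().generate_fake(custom_schema)

print(json_schema)
print(fake_record)
```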
Command: `poetry run python google_semantic_location_history/main.py`

This extracts and displays the relevant data and summary. It calls the same `process` function as the web application would. To use the GSLH data extraction script in the web application, one needs to specify this in `pyworker.js` by changing `data_extractor/__init__.py` into `google_semantic_location_history/__init__.py`.
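In essence (a rough sketch, not the exact contents of `main.py`), this amounts to calling `process` on the simulated ZIP directly:

```python
from google_semantic_location_history import process

# The ZIP created by simulation_gslh.py (see above).
with open("tests/data/Location History.zip", "rb") as file_data:
    result = process(file_data)

print(result)
```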
In this example, we first create a simulated Google Search History (GSH) DDP. Subsequently, we extract relevant information from the simulated DDP.
Command: `poetry run python google_search_history/simulation_gsh.py`

This generates a dummy Google Takeout ZIP file containing a simulated `BrowserHistory.json` file and stores it in `tests/data/takeout.zip`.
The `BrowserHistory.json` file mainly describes which websites were visited by the user and when. To simulate (news) web pages, you can either base them on real URLs (see `tests/data/urldata.csv`) or create entirely fake ones using Faker. The timestamp of each web visit is set to be before, during, or after a certain period (in this case, the Dutch COVID-19 related curfew), and is randomly generated using the `datetime` and `random` libraries.
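The following sketch shows one way such timestamps could be drawn with `datetime` and `random`; the period boundaries and window sizes are assumptions for illustration, not the exact values used by the script.

```python
import random
from datetime import datetime, timedelta

# Illustrative boundaries of the curfew period (not necessarily the script's values).
CURFEW_START = datetime(2021, 1, 23)
CURFEW_END = datetime(2021, 4, 28)


def random_visit_time(period: str) -> datetime:
    """Draw a random timestamp before, during, or after the curfew period."""
    if period == "before":
        start, end = CURFEW_START - timedelta(days=90), CURFEW_START
    elif period == "during":
        start, end = CURFEW_START, CURFEW_END
    else:  # "after"
        start, end = CURFEW_END, CURFEW_END + timedelta(days=90)
    offset = random.uniform(0, (end - start).total_seconds())
    return start + timedelta(seconds=offset)


random.seed(0)
print(random_visit_time("during"))
```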
Note that, even though the script is seeded and will therefore always yield the same outcome, there are various options to adapt the output depending on your personal (research) goal. These options are:
- `n`: integer, size of `BrowserHistory.json` (i.e. number of web visits). Default=1000
- `site_diff`: float, percentage of generated websites that should be 'news' sites. Default=0.15
- `time_diff`: float, minimal percentage of web searches that were specifically made in the evening during the curfew period. Default=0.20
- `seed`: integer, sets the seed. Default=0
- `fake`: boolean, determines whether URLs are based on true URLs (False) or are entirely fake (True). Default=False
Command: `poetry run python google_search_history/main.py`

After running this script, the relevant data (i.e., an overview of the number of visits to news vs. other websites before, during, and after the curfew, and the corresponding time of day of the visits) is extracted from the (simulated) `takeout.zip` and displayed in a dataframe together with a textual summary. It calls the same `process` function as the web application would. To use the GSH data extraction script in the web application, specify this in `pyworker.js` by changing `data_extractor/__init__.py` into `google_search_history/__init__.py`.