Tomás Costa - 89016
You need to install the requirements with pip:
pip install -r requirements.txt
To test with a larger dataset, replace content/metadata.csv with the file you want to index. Since the upload limit is 50 MB, the metadata included could not be too large; a full version is available at:
https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/2020-12-01/metadata.csv
cd code
python3 main.py -h
First run (the -z flag cleans the hidden indexes):
cd code
python3 main.py -c 10000 -t complex -r bm25 -a -p -z
Subsequent runs (fetch the previously indexed values):
python3 main.py -c 10000 -t complex -r bm25 -a -p
Usage: python3 main.py
-t <tokenizer_mode: complex/simple>
-c <chunksize: int>
-p <positional_boosting: boolean>
-n <limit of docs returned: int>
-r <ranking_mode: tf_idf/bm25>
-a <analyze_table: boolean>
-z <reset .tmp dir: boolean>
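For reference, here is a minimal sketch of how these options could be parsed with Python's argparse; the actual main.py may parse them differently, and the default values below are assumptions:

# Minimal sketch of the command-line interface listed above, using
# argparse (which also provides the -h help flag). The real main.py
# may parse options differently; defaults here are assumptions.
import argparse

parser = argparse.ArgumentParser(description="CORD-19 indexer and ranker")
parser.add_argument("-t", dest="tokenizer_mode", choices=["simple", "complex"],
                    default="complex", help="tokenizer mode")
parser.add_argument("-c", dest="chunksize", type=int, default=10000,
                    help="number of CSV lines read per chunk")
parser.add_argument("-p", dest="positional_boosting", action="store_true",
                    help="enable positional boosting")
parser.add_argument("-n", dest="doc_limit", type=int, default=10,
                    help="limit of documents returned")
parser.add_argument("-r", dest="ranking_mode", choices=["tf_idf", "bm25"],
                    default="bm25", help="ranking mode")
parser.add_argument("-a", dest="analyze_table", action="store_true",
                    help="print the analysis table")
parser.add_argument("-z", dest="reset_tmp", action="store_true",
                    help="reset the hidden .tmp index directory")

args = parser.parse_args()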
-
The tokenizer mode selects between the simple and the complex tokenizer. The complex one analyzes the text better, since it removes pronouns and other commonly used words (stopwords) that are unrelated to the theme of the corpus.
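As an illustration only (the real tokenizer may differ, and the stopword list below is a tiny stand-in for a full list such as NLTK's), the two modes behave roughly like this:

# Illustrative contrast between the two tokenizer modes. The actual
# implementation may differ; the stopword set is a small stand-in.
import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "he", "she", "they", "we", "you", "that", "this"}

def simple_tokenize(text):
    # Lowercase and split on non-alphabetic characters, keep everything.
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def complex_tokenize(text):
    # Same split, but drop stopwords (pronouns and other common words
    # unrelated to the corpus theme) and very short tokens.
    return [t for t in simple_tokenize(text)
            if t not in STOPWORDS and len(t) > 2]

print(simple_tokenize("The virus spreads in the population"))
# ['the', 'virus', 'spreads', 'in', 'the', 'population']
print(complex_tokenize("The virus spreads in the population"))
# ['virus', 'spreads', 'population']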
-
The chunk size (-c) defines the number of lines read at once; we recommend 8000-10000 for this document, since that barely slows things down while loading far less data into memory.
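Assuming the metadata is read with pandas, chunked reading looks roughly like this (the column names follow the CORD-19 metadata.csv schema; the indexing step is a placeholder):

# Sketch of chunked reading with pandas: only `chunksize` rows are
# held in memory at a time. Column names follow the CORD-19 schema.
import pandas as pd

CHUNKSIZE = 10000  # the -c value; 8000-10000 recommended

for chunk in pd.read_csv("content/metadata.csv", chunksize=CHUNKSIZE,
                         usecols=["cord_uid", "title", "abstract"]):
    chunk = chunk.dropna(subset=["title", "abstract"])
    for row in chunk.itertuples(index=False):
        document = f"{row.title} {row.abstract}"
        # ... tokenize and index this document ...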
The code is in the /code folder; the answers to the questions are printed by the code when the -a option is passed.
/content contains the datasets and texts used.
/output contains the indexed map as a .txt file.
Indexing takes quite some time, since the collection is very large and we use a SPIMI approach, but it only needs to be done once. The results are written to a hidden .tmp folder, where the blocks and indexes remain for later runs.
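For intuition, here is a condensed sketch of the SPIMI idea: build an in-memory inverted index, flush it to a sorted block file in .tmp whenever it grows too large, and merge the blocks at the end. File names, the memory threshold, and the postings format are illustrative, not the exact ones used:

# Condensed SPIMI sketch: accumulate an in-memory inverted index,
# flush sorted blocks to the hidden .tmp directory, merge at the end.
import os
from collections import defaultdict

TMP_DIR = ".tmp"

def flush_block(index, block_no):
    os.makedirs(TMP_DIR, exist_ok=True)
    path = os.path.join(TMP_DIR, f"block_{block_no}.txt")
    with open(path, "w") as f:
        for term in sorted(index):  # sorted terms make blocks mergeable
            postings = ",".join(map(str, index[term]))
            f.write(f"{term}:{postings}\n")
    return path

def spimi_invert(doc_stream):
    block_no, index, paths = 0, defaultdict(list), []
    for doc_id, tokens in doc_stream:
        for token in tokens:
            index[token].append(doc_id)
        if len(index) > 100000:  # memory threshold, illustrative
            paths.append(flush_block(index, block_no))
            index, block_no = defaultdict(list), block_no + 1
    if index:
        paths.append(flush_block(index, block_no))
    return paths  # combine these sorted blocks with a k-way merge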
Loading indexes into memory should not cause memory problems: the loaded indexes occupy at most 75% of the available memory, and when that budget would be exceeded, indexes that were loaded but have not been used recently are evicted.
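A sketch of that eviction policy, assuming psutil is used to measure available memory and sys.getsizeof to approximate index size (both are assumptions, not necessarily what the code does):

# Sketch of the eviction policy described above: keep loaded index
# partitions under 75% of available memory and evict the least
# recently used ones when the budget is exceeded. psutil and
# sys.getsizeof (a shallow, rough size estimate) are assumptions.
import sys
from collections import OrderedDict

import psutil

def memory_budget():
    return 0.75 * psutil.virtual_memory().available

class IndexCache:
    def __init__(self):
        self.loaded = OrderedDict()  # partition name -> postings dict
        self.used = 0

    def get(self, name, loader):
        if name in self.loaded:
            self.loaded.move_to_end(name)  # mark as recently used
            return self.loaded[name]
        postings = loader(name)            # read partition from disk
        size = sys.getsizeof(postings)
        while self.loaded and self.used + size > memory_budget():
            _, evicted = self.loaded.popitem(last=False)  # drop LRU entry
            self.used -= sys.getsizeof(evicted)
        self.loaded[name] = postings
        self.used += size
        return postings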