For Devopedia articles, the citation style adopted is Chicago style. More on the style is at the following link:
https://www.citationmachine.net/chicago
In this project we aim to construct the citation for a website given its URL. Examples are at the following link:
https://www.chicagomanualofstyle.org/tools_citationguide/citation-guide-1.html#cg-website
The standard format is as follows:
Author, year_of_publishing, Title, granular time details, granular content details
Given the URL, we attempt to predict the Author, year_of_publishing, and Title.
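For illustration, a Chicago-style website citation looks like the following (the author name and dates here are only a hypothetical sample, not data from the page):

    Doe, Jane. 2019. "10 Django Packages You Should Know." Sunscrapers Blog. https://sunscrapers.com/blog/10-django-packages-you-should-know/.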
- Total Devopedia articles: about 10K
- Articles referencing websites: about 7K
- Used articles:
  - For Authors: 5064 files
  - For Titles: 6220 files
  - For YoPs: 3532 files
- Train articles:
  - For Authors: 4477 files
  - For Titles: 5633 files
  - For YoPs: 2945 files
- Test articles: 587 files for each field
Challenges:

- Plenty of .htm files had no relevant content (for example, just script and meta tags).
- Some .htm files could not be decoded with the 'utf-8' or 'iso' encodings (perhaps they were in another language script); a decoding sketch follows this list.
- .htm files that did not yield even a single extracted tag-text row labelled 1 for Title/Author/YoP were not used in the respective field datasets.
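A minimal sketch of how such undecodable files can be detected and skipped, assuming the two encodings named above are the ones tried; the helper name read_html is our own:

```python
def read_html(path):
    """Try the encodings used in this project; None means the file is skipped."""
    for encoding in ("utf-8", "iso-8859-1"):
        try:
            with open(path, encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    return None  # undecodable: exclude from the field datasets
```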
The pipeline is as follows:

- Crawl the URL.
- Extract each tag and its text content.
- Label a tag-and-content row as Author/YoP/Title if the content contains the Author/YoP/Title respectively (see the extraction and labelling sketch after this list).
- Build features:
  - Number of tokens in the text.
  - Weights for tags: say, if the title appears in an 'h1' tag 10% of the time, the weight for 'h1' is 0.10.
  - Number of commas as a percentage of the number of tokens; this is useful for finding the Author.
  - Number of tokens whose first letter is a capital letter.
  - Index (position) weight of the tag/content: the earlier the tag/content appears, the higher the weight assigned.
  - Whether the text contains a year (a 4-digit number with first digit either 1 or 2).
- Split the URLs (and their content) into train and test sets.
- Train distinct models for Author/Title/YoP, up-sampling/down-sampling to balance the data (a resampling sketch also follows this list).
- Since the models predict probable tags, some post-processing is done to extract the Author/Title/YoP (see the sketch further below).
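A minimal sketch of the extract, label, and build-features steps, assuming BeautifulSoup for parsing; the function names, the substring matching rule, and the exact feature formulas are our simplified illustrations, not the project's exact code:

```python
import re
from bs4 import BeautifulSoup

def extract_rows(html):
    """Extract (tag_name, text) rows from a crawled HTML document."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for element in soup.find_all(True):   # every tag, in document order
        text = element.string             # tags that directly wrap one text node
        if text and text.strip():
            rows.append((element.name, text.strip()))
    return rows

def label_row(text, known_value):
    """1 if the row's text contains the known Author/Title/YoP, else 0."""
    return int(known_value.lower() in text.lower())

def build_features(tag, text, position, n_rows, tag_weights):
    """Hand-crafted features for one tag-text row, as listed above."""
    tokens = text.split()
    n = len(tokens)
    return {
        "n_tokens": n,
        "tag_weight": tag_weights.get(tag, 0.0),           # e.g. 0.10 for 'h1'
        "comma_ratio": text.count(",") / n if n else 0.0,  # helps spot Authors
        "capitalized": sum(t[0].isupper() for t in tokens),
        "index_weight": 1.0 - position / n_rows,           # earlier rows weigh more
        "has_year": int(bool(re.search(r"\b[12]\d{3}\b", text))),
    }
```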
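And a minimal sketch of balancing the labelled rows before training, using scikit-learn's resample; up-sampling of the minority class is shown here, though the project also mentions down-sampling:

```python
import pandas as pd
from sklearn.utils import resample

def upsample_minority(df, label_col="label"):
    """Up-sample the positive (minority) rows to match the majority count."""
    majority = df[df[label_col] == 0]
    minority = df[df[label_col] == 1]
    minority_up = resample(minority, replace=True,
                           n_samples=len(majority), random_state=42)
    # Shuffle so the up-sampled rows are not clustered at the end
    return pd.concat([majority, minority_up]).sample(frac=1, random_state=42)
```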
The dataset is imbalanced because only a few tags can be labelled as Author/Title/YoP. Hence false negatives are minimized and false positives are tolerated: the idea is to increase recall and compromise on precision. We publish the top 3 predictions and expect the exact prediction to be among them. Finally, the reference string is constructed; a sketch of the top-3 selection and the assembly follows.
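A minimal sketch of picking the top 3 candidates from model scores and assembling the reference string; the field order follows the standard format given at the top, and both helper names are our own illustrative choices:

```python
def top3(texts, scores):
    """Return the 3 candidate texts with the highest predicted probabilities."""
    ranked = sorted(zip(scores, texts), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:3]]

def build_reference(author, year, title, url):
    """Assemble a Chicago-style reference string from the predicted fields."""
    fields = [author, year, f'"{title}"' if title else None, url]
    return ". ".join(str(f) for f in fields if f) + "."
```

For example, build_reference("Doe, Jane", 2019, "10 Django Packages You Should Know", "https://sunscrapers.com/...") yields a string in the Author, year, Title, URL order described above.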
pip install -r requirements.txt

- Modify the train_inputs.csv file to configure:
  - html_path: the location where the crawled HTML files are present
  - articles: the location where the train data is made available in JSON format
  - meta_data: the location where the metadata is present
- You can also configure the number of layers, the number of nodes in each layer, the learning rate, the number of epochs, and the batch size towards the end of the file.
- The output files are named in the configuration; please leave them as they are to avoid errors.
python commands.py preprocess train_inputs.csv (files_batch_size)
python commands.py train train_inputs.csv save (model_name)
python commands.py postprocess train_inputs.csv (model_name)

- The preprocess step generates files in the Train_preprocess folder.
- The model files are generated in the root folder as .pkl files.
- The postprocess step generates files in the Train_diagnostics folder.
Since we ignore false positives on purpose and depend on the top 3 predictions, a custom metric (based on similarity with the original reference string) is worked out in post-processing to arrive at accuracy. This needs further attention.
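A minimal sketch of one way such a similarity metric can be computed, using difflib from the standard library; the actual metric used by the project may differ:

```python
from difflib import SequenceMatcher

def reference_similarity(predicted, original):
    """Similarity ratio in [0, 1] between predicted and original ref strings."""
    return SequenceMatcher(None, predicted.lower(), original.lower()).ratio()
```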
python commands.py predict (url) (model_name)
- The input URL is crawled and saved as file.htm in the Prediction_outputs folder (which is created if not already present in the root directory).
- The results are displayed on screen, and the generated files will be present in the Prediction_outputs folder.
- The results are also stored in the Prediction_results.txt file in the Prediction_outputs folder.

For example:

python commands.py predict https://sunscrapers.com/blog/10-django-packages-you-should-know/ ANN
- Reference string construction for PDF/text files.
- Eliminate redundant code between train and predict.
- A more robust mechanism to define accuracy.
- The model file is big and takes a long time to load at prediction time.
- Exploring NERs to narrow down predictions and improve accuracy: we attempted this, but it slows down training/prediction.
- Improve the features (tag weight and index weight): investigate the data and devise math that represents tag and index with appropriate weights.
- Constructing the reference string in the case of multiple authors.