a simple and extensible Indonesian Index News Crawler
- Detik.com : "https://news.detik.com/indeks"
- Liputan6.com : "https://www.liputan6.com/indeks"
- Kompas.com : "https://indeks.kompas.com" (only the news site is being indexed)
- CNNIndonesia.com : "https://www.cnnindonesia.com/nasional/indeks/3" (only the nasional site is being indexed)
- Tempo.co : "https://www.tempo.co/indeks" (only the nasional site is being indexed)
- TurnBackHoax : "https://turnbackhoax.id" (HIGHLY EXPERIMENTAL, USE WITH CAUTION)
- BeautifulSoup4
- requests
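As a rough, hypothetical sketch of what these two libraries are used for here (this is not the project's actual code), an index page can be fetched and its links collected like this:

```python
# Hypothetical sketch only: the real crawlers use their own site-specific parsing.
import requests
from bs4 import BeautifulSoup

INDEX_URL = "https://news.detik.com/indeks"  # one of the supported index pages

response = requests.get(INDEX_URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# Print every link found on the index page; a real crawler would filter
# these down to article URLs and their publish dates.
for anchor in soup.find_all("a", href=True):
    print(anchor["href"])
```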
I recommend using virtualenv, but this is completely optional.
-
using the virtualenv package:
virtualenv main
on Python 3.3+ (built-in venv module):
python -m venv main
-
activate the virtualenv:
source main/bin/activate
-
deactivate the virtualenv:
deactivate
-
Install Requirements
pip install -r requirements.txt
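For reference, a minimal requirements.txt covering the two libraries listed above would look like this (the repository's own file may pin specific versions or include more):

```
beautifulsoup4
requests
```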
-
Open / Create
main.json
This file is used to determine what tasks the program will run (see the configuration section below).
-
Run the program
python -m main
This program uses a simple JSON file to determine which sources it should crawl.
A template named main.json is already available; you can use it as a base.
"tasks" : [
{
"src":"detik",
"target_length":5,
"start_date": "29/3/2021"
}
]
Available parameters:

params | required | options | description | default |
---|---|---|---|---|
src | required | "detik", "liputan6", "cnnIndo", "kompas", "tempo", "turnbackhoax" | the source of the news | |
target_length | required | | how many news articles are crawled | |
start_date | optional | | the start date of the news (format dd/mm/yyyy) | system date (datetime.now()) |
end_date | optional | | the maximum date of the news (format dd/mm/yyyy) | system date (datetime.now()) |
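For example, a task that restricts crawling to a date range could look like the sketch below (the source, length, and dates are illustrative values, not defaults):

```json
"tasks": [
    {
        "src": "kompas",
        "target_length": 10,
        "start_date": "1/3/2021",
        "end_date": "29/3/2021"
    }
]
```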
"output": {
"type": "csv",
"name": "result.csv"
}
Available parameters:

params | required | options | description | default |
---|---|---|---|---|
type | required | "csv" | the output file type | |
name | optional | | the output file name | "output.csv" |
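Putting it together, a complete main.json could look like the following sketch, assuming "tasks" and "output" are top-level keys of the same object as the snippets above suggest (all values here are illustrative):

```json
{
    "tasks": [
        {
            "src": "detik",
            "target_length": 5,
            "start_date": "29/3/2021"
        }
    ],
    "output": {
        "type": "csv",
        "name": "result.csv"
    }
}
```

Run the crawler with python -m main as described above and the results will be written to the configured CSV file.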