declaratii.integritate.eu scraper
declaratii-integritate.py
- fetches html rows and csvs (Exporta resultate)consolidate_dl_csvs.py
- consolidates dowloaded (Exporta resultate) csvsconsolidate-all.py
- consolidates Exported csv w parsed html rows
big pages wo pagination? count page turns max 10k go for bucuresti, bucuresti si un fel de declaratie then difference per judet, log errors
-
get data from result table
-
follow pagination
-
loop dates
-
log errors
-
log results & meta/ count results
-
prevent overwrite (update untill found)
-
scrape from today to the latest logged day
-
split into functions
-
update missing values
- break query in parts - date, tip declaratie, judet
-
download csvs
- rename to target date
-
consolidate declaratii din tabel
-
consolidate downloaded csvs
- remove duplicates
-
compare results - between downloaded csvs and scraped table pages / rows
-
build stats
-
download pdfs
-
parse pdfs
open page click advanced
loop (while start_date < earliest_date)
add dates
see if max 10000 - if so previou day log err
see if none - if so previous range log err
if resultsTable
read table (function)
check if next
append table
advance next
write to csv
advance days
- open declaratii.integritate.eu
- click 'Căutare avansată' button (
<a id="form:showAdvancedSearch">
) found here:<div class="srch_switch"><a href="#" id="form:showAdvancedSearch">Căutare avansată</a><span></span></div>
- wait for
<div id="form:advanced-search-panel"> ... <div id="form:advanced-search-panel_content">
to load - input current date (ex: 03.10.2023) in this field, also change 'value' parameter accordingly
<input id="form:endDate_input" name="form:endDate_input" role="textbox" type="text" value="03.10.2023">
input current date - 3 working days (ex: 29.09.2023) in<input id="form:startDate_input" name="form:startDate_input" role="textbox" size="10" type="text" value="29.09.2023">
- also change 'value' attribute inside tag - click on
<input id="form:submitButtonAS">
inside<div class="advanced_srch_subm_right"><input class="button" id="form:submitButtonAS" name="form:submitButtonAS" type="submit" value="caută>"><span></span></div>
Nume Prenume | Institutie | Functie | Localitate | Judet | Data completare declaratie | Tip declaratie |
---|---|---|---|---|---|---|
GRECU P PETRUȚA AURORA | Liceul Tehnologic Energetic Dragomir Hurmuzescu Deva | Director adjunct | Deva | Hunedoara | 30.09.2023 | Declaraţie de avere |
DAMIAN D IULIA | Administratia Nationala Apele Romane - Cod Caen 3600, 3900, 4291, 6203 | Șef birou | Sectorul 1 | Bucuresti | 02.10.2023 | Declaraţie de avere |
TEODORESCU Gh GEORGETA | Inspectoratul Scolar Al Judetului Arges | Inspector scolar | Pitesti | Arges | 30.09.2023 | Declaraţie de interese |
Gorcitz V Lucica | Directia Generala Regionala A Finantelor Publice Brasov | Inspector | Targu Mures | Mures | 02.10.2023 | Declaraţie de avere |
SELECT
COALESCE(d.NumePrenume, t.NumePrenume) AS NumePrenume,
COALESCE(d.Institutie, t.Institutie) AS Institutie,
COALESCE(d.Datacompletaredeclaratie, t.Datacompletaredeclaratie) AS Datacompletaredeclaratie,
COALESCE(d.Tipdeclaratie, t.Tipdeclaratie) AS Tipdeclaratie,
d.Functie_x,
d.Localitate_x,
d.Judet_x,
d.Functie_y,
d.Localitate_y,
d.Judet_y,
d.Vezideclaratie AS Vezideclaratie_d,
d."Vezideclaratie.1" AS Vezideclaratie1_d,
t.Functie AS Functie_t,
t.Localitate AS Localitate_t,
t.Judet AS Judet_t,
t.Vezideclaratie AS Vezideclaratie_t,
t."Vezideclaratie2" AS Vezideclaratie2_t,
COALESCE(d.page, t.page) AS page,
COALESCE(d.rezultate, t.rezultate) AS rezultate,
COALESCE(d.start_date, t.start_date) AS start_date,
COALESCE(d.end_date, t.end_date) AS end_date,
CASE
WHEN d.NumePrenume IS NOT NULL AND t.NumePrenume IS NOT NULL THEN 'both'
WHEN d.NumePrenume IS NOT NULL THEN 'dlcsv'
ELSE 'tblz'
END AS merged_status
FROM dlcsv d
LEFT JOIN tblz t ON d.NumePrenume = t.NumePrenume
AND d.Institutie = t.Institutie
AND d.Datacompletaredeclaratie = t.Datacompletaredeclaratie
UNION ALL
SELECT
COALESCE(t.NumePrenume, d.NumePrenume),
COALESCE(t.Institutie, d.Institutie),
COALESCE(t.Datacompletaredeclaratie, d.Datacompletaredeclaratie),
COALESCE(t.Tipdeclaratie, d.Tipdeclaratie),
d.Functie_x,
d.Localitate_x,
d.Judet_x,
d.Functie_y,
d.Localitate_y,
d.Judet_y,
d.Vezideclaratie,
d."Vezideclaratie.1",
t.Functie,
t.Localitate,
t.Judet,
t.Vezideclaratie,
t."Vezideclaratie2",
COALESCE(t.page, d.page),
COALESCE(t.rezultate, d.rezultate),
COALESCE(t.start_date, d.start_date),
COALESCE(t.end_date, d.end_date),
CASE
WHEN t.NumePrenume IS NOT NULL AND d.NumePrenume IS NULL THEN 'tblz'
ELSE 'dlcsv'
END AS merged_status
FROM tblz t
LEFT JOIN dlcsv d ON t.NumePrenume = d.NumePrenume
AND t.Institutie = d.Institutie
AND t.Datacompletaredeclaratie = d.Datacompletaredeclaratie
WHERE d.NumePrenume IS NULL;
CREATE VIEW stats1 AS
SELECT
SUBSTR(Datacompletaredeclaratie, 7, 4) AS year,
SUBSTR(Datacompletaredeclaratie, 4, 2) AS month,
Institutie,
COALESCE(Localitate_x, Localitate_t) AS localitate,
COALESCE(Judet_x, Judet_y, Judet_t) AS judet,
COUNT(*) AS count
FROM match_all
WHERE Institutie IS NOT NULL
AND Datacompletaredeclaratie IS NOT NULL
GROUP BY year,
month,
Institutie,
localitate,
judet
HAVING count > 0;