jupyter to streamlit workflow #55

Closed · wants to merge 64 commits

Changes from all commits (64)
73adb1b
Merge pull request #10 from TheDataRideAlongs/fix(50m)
lmeyerov Mar 31, 2020
ec478f3
refactor(print): now via Python logger
lmeyerov Apr 1, 2020
eb4e7c6
fix(fh missing cols): handle
lmeyerov Apr 1, 2020
771c1fb
feat(fixed fh arrow schema): and jsonify nested cols bc pq writer can…
lmeyerov Apr 1, 2020
5c1d0ff
Merge pull request #11 from TheDataRideAlongs/dev/fix-fh
lmeyerov Apr 1, 2020
4bab6db
Changed the print() statements to actual Python logging statements
bechbd Apr 1, 2020
fc851da
Merge pull request #12 from TheDataRideAlongs/logging_fix
bechbd Apr 1, 2020
262e8d8
adding gitignore
007vasy Apr 2, 2020
f2b6543
Merge branch 'master' into Issue-#1
007vasy Apr 2, 2020
c331889
prints changed to logger
007vasy Apr 2, 2020
cecb571
Added methods to get_from_neo and get_tweets_by_id
bechbd Apr 2, 2020
1436b8d
Merge pull request #14 from TheDataRideAlongs/logging_fix
bechbd Apr 2, 2020
207d32c
Use pandas when cudf does not exist
ZiyaoWei Apr 2, 2020
14cb467
Merge pull request #15 from TheDataRideAlongs/wzy/no_gpu
ZiyaoWei Apr 2, 2020
f1426b6
Prototype rehydrate pipeline
ZiyaoWei Apr 2, 2020
f14dffe
Fix bug
ZiyaoWei Apr 2, 2020
325db35
docs(README): add calendar
lmeyerov Apr 2, 2020
c48ef17
Merge branch 'master' into Issue-#1
007vasy Apr 2, 2020
3c24774
Fixed issue with limit on get_from_neo, added timeout parameter, and …
bechbd Apr 2, 2020
a323358
Merge pull request #13 from TheDataRideAlongs/Issue-#1
007vasy Apr 2, 2020
83e7dcd
Merge branch 'master' into switch_python_neo_driver
bechbd Apr 2, 2020
ba2ea00
Merge pull request #17 from TheDataRideAlongs/switch_python_neo_driver
bechbd Apr 2, 2020
83a2294
docs(tightening)
lmeyerov Apr 3, 2020
328c997
Skip when there is no data
ZiyaoWei Apr 2, 2020
b72eb75
Fix logging, add parameter for saving to Neo4j
ZiyaoWei Apr 3, 2020
256df48
Merge pull request #20 from TheDataRideAlongs/wzy/rehydratePipeline
ZiyaoWei Apr 3, 2020
4c2b1ab
docs(issue tracker): link gh projects on README
lmeyerov Apr 3, 2020
a89528a
docs(volunteers): Add legal
lmeyerov Apr 3, 2020
8ec9061
docs(README): project tracker links
lmeyerov Apr 3, 2020
80c8622
Got initial version of the unit tests working for Neo
bechbd Apr 3, 2020
609b95a
Add docker-compose.yml for prefect UI
ZiyaoWei Apr 4, 2020
8e7a88b
Merge pull request #37 from TheDataRideAlongs/wzy/dockerizePrefect
ZiyaoWei Apr 4, 2020
dea4f03
fixed getting international trials, with utf8 encoding
007vasy Apr 6, 2020
55958d2
urls to config
007vasy Apr 6, 2020
51ab12f
removed console usage from scraping data
007vasy Apr 6, 2020
22d418a
Dockerize pipeline and add instructions
ZiyaoWei Apr 5, 2020
70f196e
Merge pull request #40 from TheDataRideAlongs/wzy/dockerizePipelines
ZiyaoWei Apr 6, 2020
07409e4
comfortable neo4j import setup
007vasy Apr 6, 2020
8cf111b
comfortable edge inserting into neo4j
007vasy Apr 6, 2020
4e6c3ec
flexible insertion into neo4j
007vasy Apr 6, 2020
a4aa10f
add config
007vasy Apr 6, 2020
f1831f6
Made minor tweaks to get the prefect ui stuff to run correctly on the…
bechbd Apr 7, 2020
4a4970e
Merge pull request #48 from TheDataRideAlongs/update-prefect-ui-files
bechbd Apr 7, 2020
c96adf3
data scraping into class
007vasy Apr 7, 2020
c9c89f1
updated gitignore
007vasy Apr 7, 2020
3f6f5c5
all drugs and synonyms are imported
007vasy Apr 7, 2020
52be179
cleanup
007vasy Apr 7, 2020
6e554ef
refactor
007vasy Apr 7, 2020
7a030d6
filtering international studies
007vasy Apr 7, 2020
31db102
docs(README.md): infra links
lmeyerov Apr 7, 2020
530c087
Merge pull request #52 from TheDataRideAlongs/add_neo_unit_tests
bechbd Apr 8, 2020
9b137f1
Added metrics configuration
bechbd Apr 8, 2020
d73186b
Added metrics configuration
bechbd Apr 8, 2020
0d7e2e5
drug analysis
007vasy Apr 8, 2020
6df6b15
table merging WIP
007vasy Apr 9, 2020
1567927
studies normalized into one table from 2 different sources
007vasy Apr 9, 2020
98c9e19
#39 - Added method to allow for adding enrichment properties to a node
bechbd Apr 9, 2020
2a9afb0
studies to neo4j is done
007vasy Apr 9, 2020
4c17c8c
WF
007vasy Apr 9, 2020
6c4bc2b
study import fix
007vasy Apr 9, 2020
46fdf2b
drug-study links
007vasy Apr 9, 2020
9aa0a13
cudf set up
007vasy Apr 9, 2020
212a9e2
dict to cypher property code merge to make it available for others
007vasy Apr 9, 2020
d3f626e
Merge pull request #56 from TheDataRideAlongs/DictToCypherProperties
bechbd Apr 9, 2020
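
Two recurring themes in the commit log above are the print()-to-logger refactor (ec478f3, 4bab6db) and falling back to pandas when cudf is unavailable (207d32c). A minimal sketch of both patterns, with illustrative names only; the project's actual modules may differ:

```python
import logging

# GPU/CPU fallback as described in commit 207d32c ("Use pandas when cudf
# does not exist"): prefer RAPIDS cudf, degrade gracefully to pandas.
try:
    import cudf as xdf
    HAS_GPU = True
except ImportError:
    import pandas as xdf
    HAS_GPU = False

# Module-level logger replacing bare print() calls (commits ec478f3, 4bab6db).
logger = logging.getLogger(__name__)

def load_tweets(records):
    """Build a dataframe of tweet records with whichever backend imported."""
    logger.info("building frame with %s", "cudf" if HAS_GPU else "pandas")
    return xdf.DataFrame(records)
```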
1 change: 1 addition & 0 deletions .dockerignore
@@ -0,0 +1 @@
**/.git
153 changes: 153 additions & 0 deletions .gitignore
@@ -0,0 +1,153 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# static files generated from Django application using `collectstatic`
media
static

# Env files
.DominoEnv/
.vscode/
data/
.Domino/
.XLS2CSV/
modules/TempNB/all_US_studies_by_keyword.json
modules/TempNB/drug_vocab.csv.zip
modules/TempNB/vocab.csv
30 changes: 21 additions & 9 deletions README.md
@@ -2,24 +2,34 @@

## Scaling COVID public behavior change and anti-misinformation

* [Private repository](https://github.com/graphistry/ProjectDomino-internal)

* [Community Slack channel: #COVID](https://thedataridealongs.slack.com/) via an [open invite link](https://join.slack.com/t/thedataridealongs/shared_invite/zt-d06nq64h-P1_3sENXG4Gg0MjWh1jPEw)

* **Graph4good contributors:** We're excited to work with you! Check out the subprojects below we are actively seeking help on, and ping on Slack for which you're curious about tackling. We can then share our current thoughts and tips for getting started. Most will be useful as pure Python [Google Collab notebook](https://colab.research.google.com) proveouts and local Neo4j Docker + Python proveouts: You can move quickly, and we can more easily integrate into our automation pipelines.
* [Meetings: Google calendar](https://calendar.google.com/calendar?cid=Z3JhcGhpc3RyeS5jb21fdTQ3bmQ3YTdiZzB0aTJtaW9kYTJybGx2cTBAZ3JvdXAuY2FsZW5kYXIuZ29vZ2xlLmNvbQ)

* Project Trackers: [Core](https://github.com/TheDataRideAlongs/ProjectDomino/projects/1), [Interventions](https://github.com/TheDataRideAlongs/ProjectDomino/projects/2), and [Issues](https://github.com/TheDataRideAlongs/ProjectDomino/issues)

* Infra - ask for account access in Slack
  * [Private repository](https://github.com/graphistry/ProjectDomino-internal)
  * Observability - [Grafana](http://13.68.225.97:10007/d/n__-I5jZk/neo4j-4-dashboard?orgId=1)
  * DB - Neo4j
  * Orchestration - Prefect
  * Key sharing - Keybase team
  * CI - Github Actions

* **Graph4good contributors:** We're excited to work with you! Check out the subprojects below that we are actively seeking help on, look at some of the Github Issues, and ping on Slack about whichever you're curious to tackle, including ideas not listed here. We can then share our current thoughts and tips for getting started. Most will be useful as pure Python [Google Collab notebook](https://colab.research.google.com) proveouts and local Neo4j Docker + Python proveouts: You can move quickly, and we can more easily integrate into our automation pipelines.

**One of the most important steps in stopping the COVID-19 pandemic is influencing mass behavior change for citizens to take appropriate, swift action on mitigating infection and human-to-human contact.** Government officials at all levels have advocated misinformed practices such as dining out or participating in outdoor gatherings that have contributed to amplifying the curve rather than flattening it. At the time of writing, the result of poor crisis emergency risk communication has led to over 14,000 US citizens testing positive, 2-20X more are likely untested, and over 200 deaths. The need to influence appropriate behavior and mitigation actions is extreme: The US has shot up from untouched to become the 6th most infected nation.

**Project Domino accelerates research on developing capabilities for information hygiene at the mass scale necessary for the current national disaster and for future ones.** We develop and enable the use of 3 key data capabilities for modern social discourse:
* Identifying at-risk behavior groups and viable local behavior change influencers
* Detecting misinformation campaigns
* Identifying at-risk behavior groups and viable local behavior change influencers
* Automating high-precision interventions

## The interventions

We are working with ethics groups to identify safe interventions along the following lines:

* **Targeting of specific underserved issues**: Primary COVID public health issues such as unsafe social behavior, unsafe medicine, unsafe science, dangerous government policy influence, and adjacent issues such as fake charities, phishing, malware, and hate groups
* **Targeting of specific underserved issues**: Primary COVID public health issues such as unsafe social behavior, unsafe medicine, unsafe science, dangerous government policy influence, and adjacent issues such as fake charities, phishing, malware, and hate group propaganda

* **Help top social platforms harden themselves**: Trust and safety teams at top social networks need to be able to warn users about misinformation, de-trend it, and potentially take it down before it has served its purpose. The status quo is handling incidents months after the fact. We will provide real-time alert feeds and scoring APIs to help take action during the critical minutes before misinformation gains significant reach.

@@ -36,21 +46,23 @@ We are working with ethics groups to identify safe interventions along the following lines:
* Twitter firehose monitor
* Data integration pipeline for sources of known scams, fraud, lies, bots, propaganda, extremism, and other misinformation sources
* Misinformation knowledge graph connecting accounts, posts, reports, and models
* Automated GPU / graph / machine learning pipeline
* Automated GPU / graph / machine learning pipeline: general classification (bot, community, ...) and targeted (clinical disinformation, ...)
* Automated alerting & reporting pipeline
* Interactive visual analytics environment for data scientists and analysts: GPU, graph, notebooks, ML, ...
* Intervention bots

## How to help

We are actively seeking several forms of support:

* **Volunteers**: Most immediate priority is on data engineering and advisors on marketing/public health
  * **Data engineers: Orchestration (Airflow, Prefect.io, Nifi, ...), streaming (Kafka, ...), graph (Neo4j, cuGraph), GPU (RAPIDS), ML (NLP libs), and databases**
  * Analysts: OSINT, threat intel, campaign tracking, ...
  * Data scientists: especially around graph, misinformation, neural networks, NLP, with backgrounds such as security, fraud, misinformation, marketing
  * **Analysts: OSINT, threat intel, campaign tracking, ...**
  * **Data scientists: especially around graph, misinformation, neural networks, NLP, with backgrounds such as security, fraud, misinformation, marketing**
  * Developers & designers: intelligence integrations, website for search & reports, automations, intervention bots, API
  * Marketing: Strategy & implementation
  * Public health and communications: Especially around intervention design
  * Public health and communications: Especially around safe and effective intervention design
  * Legal: Risk & legal analysis for various interventions

* **APIs and Data**:
  * Feeds & enriching APIs: Lists and intel on URLs, domains, keywords, emails, topics, blockchain accounts, social media accounts & content, clinical trials, esp. if tunable on topic
44 changes: 44 additions & 0 deletions infra/metrics/docker-compose.yml
@@ -0,0 +1,44 @@
version: '3.3'

networks:
  lan:
    external:
      name: i-already-created-this

services:
  prometheus:
    image: prom/prometheus
    network_mode: 'bridge'
    ports:
      - 10005:9090
    volumes:
      - /datadrive/prometheus:/etc/prometheus
      - /datadrive/prometheus/storage:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: always

  cadvisor:
    image: google/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - 10006:8080
    network_mode: 'bridge'
    restart: always

  grafana:
    image: grafana/grafana
    depends_on:
      - prometheus
    network_mode: 'bridge'
    ports:
      - 10007:3000
    volumes:
      - /datadrive/grafana/storage:/var/lib/grafana
      - /datadrive/grafana/provisioning/:/etc/grafana/provisioning/
    restart: always
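
As a quick smoke test after `docker-compose up -d`, the three services can be probed from Python on their mapped host ports. This is a sketch assuming localhost and each tool's stock health endpoint:

```python
import urllib.request

# Host ports from the compose file above; paths are the default
# health-check endpoints shipped by each tool.
ENDPOINTS = {
    "prometheus": "http://localhost:10005/-/ready",
    "cadvisor": "http://localhost:10006/healthz",
    "grafana": "http://localhost:10007/api/health",
}

for name, url in ENDPOINTS.items():
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            print(f"{name}: HTTP {resp.status}")
    except OSError as exc:
        print(f"{name}: unreachable ({exc})")
```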
30 changes: 30 additions & 0 deletions infra/metrics/prometheus/prometheus.yml
@@ -0,0 +1,30 @@
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
alerting:
  alertmanagers:
  - static_configs:
    - targets: []
    scheme: http
    timeout: 10s
    api_version: v1
scrape_configs:
- job_name: prometheus
  honor_timestamps: true
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - localhost:9090
- job_name: neo4j-prometheus
  honor_timestamps: true
  scrape_interval: 10s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  static_configs:
  - targets:
    - 172.17.0.3:2004
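
With the config above loaded, the `neo4j-prometheus` job can be verified through Prometheus's HTTP API. A sketch, assuming Prometheus is exposed on host port 10005 as in the compose files:

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS = "http://localhost:10005"

# `up` is a built-in per-target metric: 1 if the last scrape succeeded.
params = urllib.parse.urlencode({"query": 'up{job="neo4j-prometheus"}'})

with urllib.request.urlopen(f"{PROMETHEUS}/api/v1/query?{params}", timeout=5) as resp:
    payload = json.load(resp)

for series in payload["data"]["result"]:
    instance = series["metric"].get("instance", "?")
    timestamp, value = series["value"]
    print(f"{instance}: up={value}")
```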
53 changes: 48 additions & 5 deletions infra/neo4j/docker/docker-compose.yml
@@ -1,17 +1,22 @@
version: '3.3'

services:
  neo4j:
    image: neo4j:4.0.2-enterprise
    network_mode: "bridge"
    container_name: neo4j-server
    network_mode: 'bridge'
    ports:
      - "10001:7687"
      - "10002:7473"
      - "10003:7474"
      - '10001:7687'
      - '10002:7473'
      - '10003:7474'
      - '2004:2004'
    restart: unless-stopped
    volumes:
      - /datadrive/neo4j/plugins:/plugins
      - /datadrive/neo4j/data:/data
      - /datadrive/neo4j/import:/import
      - /datadrive/neo4j/logs:/logs
      - /datadrive/neo4j/conf:/conf
    environment:
      - NEO4JLABS_PLUGINS=["apoc"]
      - NEO4J_AUTH=neo4j/neo123
@@ -20,6 +25,44 @@ services:
      - NEO4J_apoc_export_file_enabled=true
      - NEO4J_dbms_backup_enabled=true
      - NEO4J_dbms_transaction_timeout=60s
      - NEO4J_apoc_trigger_enabled=true
    logging:
      options:
        tag: "ImageName:{{.ImageName}}/Name:{{.Name}}/ID:{{.ID}}/ImageFullID:{{.ImageFullID}}"
        tag: 'ImageName:{{.ImageName}}/Name:{{.Name}}/ID:{{.ID}}/ImageFullID:{{.ImageFullID}}'

  prometheus:
    image: prom/prometheus
    network_mode: 'bridge'
    ports:
      - 10005:9090
    volumes:
      - /datadrive/prometheus:/etc/prometheus
      - /datadrive/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - /datadrive/prometheus/storage:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
    restart: always

  cadvisor:
    image: google/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - 10006:8080
    network_mode: 'bridge'
    restart: always

  grafana:
    image: grafana/grafana
    depends_on:
      - prometheus
    network_mode: 'bridge'
    ports:
      - 10007:3000
    volumes:
      - /datadrive/grafana/storage:/var/lib/grafana
      - /datadrive/grafana/provisioning/:/etc/grafana/provisioning/
    restart: always
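
For reference, a minimal Python connection to this container with the official `neo4j` driver, using the bolt port and `NEO4J_AUTH` value defined above (development credentials; replace them in any real deployment):

```python
from neo4j import GraphDatabase

# Bolt is mapped to host port 10001; auth matches NEO4J_AUTH=neo4j/neo123.
driver = GraphDatabase.driver("bolt://localhost:10001",
                              auth=("neo4j", "neo123"))

with driver.session() as session:
    record = session.run("MATCH (t:Tweet) RETURN count(t) AS n").single()
    print(f"{record['n']} Tweet nodes in the graph")

driver.close()
```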
21 changes: 20 additions & 1 deletion infra/neo4j/scripts/neo4j-indexes.cypher
@@ -9,4 +9,23 @@ ON (n:Url) ASSERT n.full_url IS UNIQUE

CREATE INDEX tweet_by_type
FOR (n:Tweet)
ON (n.tweet_type)
ON (n.tweet_type)

CREATE INDEX tweet_by_hydrated
FOR (n:Tweet)
ON (n.hydrated)

CREATE INDEX tweet_by_text
FOR (n:Tweet)
ON (n.text)

CREATE INDEX tweet_by_created_at
FOR (n:Tweet)
ON (n.created_at)

CREATE INDEX tweet_by_record_created_at
FOR (n:Tweet)
ON (n.record_created_at)

CALL db.index.fulltext.createNodeIndex("tweet_by_text_fulltext",["Tweet"],["text"])
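
The full-text index created above is queried via `db.index.fulltext.queryNodes`, which yields matching nodes with a Lucene relevance score. A sketch from Python, reusing the connection settings from the compose file; the search terms are purely illustrative:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:10001",
                              auth=("neo4j", "neo123"))

# Query the tweet_by_text_fulltext index created in the Cypher above.
QUERY = """
CALL db.index.fulltext.queryNodes("tweet_by_text_fulltext", $terms)
YIELD node, score
RETURN node.text AS text, score
ORDER BY score DESC
LIMIT 5
"""

with driver.session() as session:
    for rec in session.run(QUERY, terms="vaccine OR cure"):
        print(f"{rec['score']:.2f}  {rec['text'][:80]}")

driver.close()
```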
