This code reads an XML file and extracts data from it to create nodes and relationships in a Neo4j graph database. It uses the py2neo library to connect to the database and the xmltodict library to parse the XML file.
👤 **Rodrigo Wurdig**
- Linkedin: @rodrigo-soares-wurdig
- Github: @rwurdig
- Docker and Docker-Compose
- Neo4j (docker container)
- Python (docker container)
- Airflow (docker container)
- Bash scripting
This is how to set up the project locally.
- To get a local copy up and running, follow the steps below.
- Airflow 2.5.2
- Neo4j 5.6.0
- pendulum 2.1.2
- pip 23.0.1
- postgres:14.0
- Python 3.x
- py2neo 2021.2.3
- xmltodict 0.13.0
After installing Python and pip, run the following command to install the necessary Python packages:
pip install apache-airflow==2.5.2 py2neo==2021.2.3 xmltodict==0.13.0 pendulum==2.1.2
git clone https://github.com/rwurdig/Databio_project.git
cd Databio_project
chmod +x build.sh
./build.sh
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
- Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the Branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Distributed under the MIT License. See License for more information.
👤 Rwurdig: E-mail
Project Link: https://github.com/rwurdig/Databio_project
tl;dr: The objective is to create a data pipeline that will ingest a UniProt XML file (`data/Q9Y261.xml`) and store the data in a Neo4j graph database.
Read the XML file `Q9Y261.xml` located in the `data` directory. The XML file contains information about a protein. The task is to create a data pipeline that will ingest the XML file and store as much information as possible in a Neo4j graph database.
- Use Apache Airflow or a similar workflow management tool to orchestrate the pipeline
- The pipeline should run on a local machine
- Use open-source tools as much as possible
Please use the XML file provided in the `data` directory. The XML file is a subset of the UniProt Knowledgebase.
The XML contains information about proteins, associated genes, and other biological entities. The root element of the XML is `uniprot`. Each `uniprot` element contains an `entry` element. Each `entry` element contains various elements such as `protein`, `gene`, `organism`, and `reference`. Use this for the graph data model.
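As a rough illustration of the structure described above, the sketch below walks a tiny hand-written document with the standard library's `xml.etree.ElementTree` (swapped in for xmltodict here so the example runs without extra packages). The namespace URI, the `name` child elements, the `key` attribute, and all sample values are assumptions made for illustration; check them against the real file and the schema.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI; confirm it against the root element of the real file.
NS = {"up": "http://uniprot.org/uniprot"}

# Hand-written sample in the shape described above. All values are placeholders.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <protein>
      <recommendedName>
        <fullName>Example protein</fullName>
      </recommendedName>
    </protein>
    <gene>
      <name type="primary">EXMP1</name>
    </gene>
    <organism>
      <name type="scientific">Example organism</name>
    </organism>
    <reference key="1"/>
  </entry>
</uniprot>
"""


def parse_entries(xml_text):
    """Return one dict per <entry> with the fields used for the graph model."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("up:entry", NS):
        entries.append({
            # Path follows protein -> recommendedName -> fullName.
            "full_name": entry.findtext(
                "up:protein/up:recommendedName/up:fullName", namespaces=NS
            ),
            "genes": [n.text for n in entry.findall("up:gene/up:name", NS)],
            "organisms": [n.text for n in entry.findall("up:organism/up:name", NS)],
            "reference_keys": [r.get("key") for r in entry.findall("up:reference", NS)],
        })
    return entries


if __name__ == "__main__":
    for parsed in parse_entries(SAMPLE_XML):
        print(parsed)
```

To try it on the provided file, the text could be read from `data/Q9Y261.xml` and passed to `parse_entries`; any elements the sample does not cover would need their own lookups.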
The full XML schema is available here.
Please run a Neo4j database locally. You can download Neo4j from https://neo4j.com/download-center/ or run it in Docker:
docker run \
--publish=7474:7474 --publish=7687:7687 \
--volume=$HOME/neo4j/data:/data \
neo4j:latest
Getting Started with Neo4j: https://neo4j.com/docs/getting-started/current/
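Once the container is up, the pipeline needs to know where to find it. The following is a minimal sketch of collecting connection settings in one place, with the Bolt port taken from the `docker run` command above. The environment variable names (`NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD`) and the helper name are hypothetical, not part of the project.

```python
import os


def neo4j_settings(env=None):
    """Collect Neo4j connection settings, falling back to local defaults.

    The variable names are illustrative assumptions; adjust them to whatever
    your deployment actually uses.
    """
    env = os.environ if env is None else env
    return {
        # 7687 is the Bolt port published by the docker run command.
        "uri": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "user": env.get("NEO4J_USER", "neo4j"),
        # No password default on purpose: it should be supplied explicitly.
        "password": env.get("NEO4J_PASSWORD", ""),
    }


if __name__ == "__main__":
    settings = neo4j_settings()
    print(settings["uri"], settings["user"])
```

With py2neo, these values could then be passed along the lines of `Graph(settings["uri"], auth=(settings["user"], settings["password"]))`.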
The data model should contain nodes for proteins, genes, organisms, references, and more. The graph should contain edges for the relationships between these nodes. The relationships should be based on the XML schema. For example, the `protein` element contains a `recommendedName` element. The `recommendedName` element contains a `fullName` element. The `fullName` element contains the full name of the protein. The graph should contain an edge between the `protein` node and the `fullName` node.
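One way to picture the protein-to-full-name edge is as a set of parameterized Cypher `MERGE` statements. The sketch below only builds the statements and their parameters, so it runs without a database. The labels and relationship types (`Protein`, `FullName`, `Gene`, `HAS_FULL_NAME`, `FROM_GENE`), the input dictionary shape, and the use of the file's accession as the protein id are all illustrative assumptions rather than the project's actual model.

```python
def build_statements(entry):
    """Turn one parsed entry into (cypher, parameters) pairs.

    Labels and relationship types are hypothetical names chosen for this
    example; rename them to match your own data model.
    """
    statements = [(
        # MERGE avoids duplicate nodes and edges when the pipeline is re-run.
        "MERGE (p:Protein {id: $id}) "
        "MERGE (n:FullName {name: $name}) "
        "MERGE (p)-[:HAS_FULL_NAME]->(n)",
        {"id": entry["id"], "name": entry["full_name"]},
    )]
    for gene in entry.get("genes", []):
        statements.append((
            "MATCH (p:Protein {id: $id}) "
            "MERGE (g:Gene {name: $gene}) "
            "MERGE (p)-[:FROM_GENE]->(g)",
            {"id": entry["id"], "gene": gene},
        ))
    return statements


if __name__ == "__main__":
    # Placeholder values; only the accession comes from the file name.
    example = {"id": "Q9Y261", "full_name": "Example protein", "genes": ["EXMP1"]}
    for cypher, params in build_statements(example):
        print(cypher, params)
```

With py2neo, each pair could be executed with something like `graph.run(cypher, params)`. Keeping statement construction separate from execution makes the mapping easy to inspect before anything is written to the database.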
Here is an example for the target data model:
In the `example_code` directory, you will find some example Python code for loading data to Neo4j.