Documentation of the Biomedical Engineering Project

Databio Project

This code reads an XML file and extracts data from it to create nodes and relationships in a Neo4j graph database. It uses the py2neo library to connect to the database and the xmltodict library to parse the XML file.

Table of Contents

About The Author
About The Project
- Built With
Getting Started
- Installation and Prerequisites
Contributing
License
Contact
Documentation

About The Author

👤 Rodrigo Wurdig

LinkedIn: @rodrigo-soares-wurdig
GitHub: @rwurdig

About The Project

This code reads an XML file and extracts data from it to create nodes and relationships in a Neo4j graph database. It uses the py2neo library to connect to the database and the xmltodict library to parse the XML file.

(back to top)

Built With

Docker and Docker-Compose
Neo4j (docker container)
Python (docker container)
Airflow (docker container)
Bash scripting

(back to top)

Getting Started

To get a local copy up and running, follow the steps below.

(back to top)

Installation and Prerequisites

To run this code, you need the following software and libraries installed:

Airflow 2.5.2
Neo4j 5.6.0
pendulum 2.1.2
pip 23.0.1
postgres:14.0
Python 3.x
py2neo 2021.2.3
xmltodict 0.13.0

After installing Python and pip, follow these steps:

1. Install the necessary Python packages:

  pip install apache-airflow==2.5.2 pendulum==2.1.2 py2neo==2021.2.3 xmltodict==0.13.0

2. Clone the repository

   git clone https://github.com/rwurdig/Databio_project.git

   cd Databio_project

3. Run the build.sh file with admin privileges.

  chmod +x build.sh
  ./build.sh

4. The script builds all the images defined in docker-compose.yaml and starts the containers.

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

1. Fork the Project
2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
3. Commit your Changes (git commit -m 'Add some AmazingFeature')
4. Push to the Branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See the LICENSE file for more information.

(back to top)

Contact

👤 Rwurdig: E-mail

Project Link: https://github.com/rwurdig/Databio_project

(back to top)

Documentation

Documentation of the Biomedical Engineering Project

tl;dr: The objective is to create a data pipeline that will ingest a UniProt XML file (data/Q9Y261.xml) and store the data in a Neo4j graph database.

Task

Read the XML file Q9Y261.xml located in the data directory. The XML file contains information about a protein. The task is to create a data pipeline that will ingest the XML file and store as much information as possible in a Neo4j graph database.

Requirements & Tools

Use Apache Airflow or a similar workflow management tool to orchestrate the pipeline
The pipeline should run on a local machine
Use open-source tools as much as possible

Source Data

Please use the XML file provided in the data directory. The XML file is a subset of the UniProt Knowledgebase.

The XML contains information about proteins, associated genes, and other biological entities. The root element of the XML is uniprot. The uniprot element contains an entry element, and each entry element contains various elements such as protein, gene, organism, and reference. Use this structure for the graph data model.

The full XML schema is available here.
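
The element layout described above can be explored with a short script before any database work starts. The following is a minimal sketch, not the project's actual parser: it uses the standard library's xml.etree.ElementTree in place of xmltodict so that it runs without extra dependencies, and it reads a hand-written sample instead of data/Q9Y261.xml. The function name summarize_entries, the SAMPLE_XML snippet, and its placeholder values are assumptions made for illustration; the accession Q9Y261 and the element names come from the description above.

```python
import xml.etree.ElementTree as ET

# UniProt XML elements live in this namespace.
NS = {"up": "http://uniprot.org/uniprot"}

# Hand-written, heavily trimmed sample; names and values are placeholders.
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>Q9Y261</accession>
    <protein>
      <recommendedName>
        <fullName>Example protein name</fullName>
      </recommendedName>
    </protein>
    <gene>
      <name type="primary">GENE1</name>
    </gene>
    <organism>
      <name type="scientific">Example organism</name>
    </organism>
    <reference key="1"/>
    <reference key="2"/>
  </entry>
</uniprot>
"""


def summarize_entries(xml_text):
    """Return one dict per entry element with a few of its child values."""
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall("up:entry", NS):
        entries.append({
            "accession": entry.findtext("up:accession", namespaces=NS),
            "full_name": entry.findtext(
                "up:protein/up:recommendedName/up:fullName", namespaces=NS
            ),
            "genes": [n.text for n in entry.findall("up:gene/up:name", NS)],
            "organisms": [n.text for n in entry.findall("up:organism/up:name", NS)],
            "reference_keys": [r.get("key") for r in entry.findall("up:reference", NS)],
        })
    return entries


for summary in summarize_entries(SAMPLE_XML):
    print(summary)
```

To try it on the real file, the same function could be called with the text of data/Q9Y261.xml; the real entry carries many more child elements than this sample.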

Neo4j Target Database

Please run a Neo4j database locally. You can download Neo4j from https://neo4j.com/download-center/ or run it in Docker:

docker run \
  --publish=7474:7474 --publish=7687:7687 \
  --volume=$HOME/neo4j/data:/data \
  neo4j:latest

Getting Started with Neo4j: https://neo4j.com/docs/getting-started/current/
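
Once the container from the docker run command above is up, it can be useful to confirm that the two published ports accept connections before starting the pipeline. The following is a small sketch using only the standard library; the helper name port_is_open is hypothetical, and the script assumes the database runs on localhost with the port mapping shown above (7474 for the HTTP interface, 7687 for Bolt).

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the docker run command: 7474 (HTTP) and 7687 (Bolt).
    for name, port in (("HTTP", 7474), ("Bolt", 7687)):
        state = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"Neo4j {name} port {port}: {state}")
```

An open port only shows that something is listening; it does not prove that Neo4j has finished starting or that the credentials are correct.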

Data Model

The data model should contain nodes for proteins, genes, organisms, references, and more. The graph should contain edges for the relationships between these nodes. The relationships should be based on the XML schema. For example, the protein element contains a recommendedName element. The recommendedName element contains a fullName element. The fullName element contains the full name of the protein. The graph should contain an edge between the protein node and the fullName node.
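
The protein-to-fullName edge described above can be written down as Cypher statements before anything is sent to the database. The following is a sketch under stated assumptions, not the project's loader: the node labels (Protein, FullName, Gene, Organism), the relationship types (HAS_FULL_NAME, FROM_GENE, IN_ORGANISM), the property names, the function build_statements, and the placeholder values in EXAMPLE_ENTRY are all hypothetical choices made for illustration. It only builds query strings and parameters, so it runs without a Neo4j connection.

```python
# Hypothetical parsed entry; only the accession comes from the project.
EXAMPLE_ENTRY = {
    "accession": "Q9Y261",
    "full_name": "Example protein name",
    "genes": ["GENE1"],
    "organisms": ["Example organism"],
}


def build_statements(entry):
    """Return (cypher, params) pairs that create nodes and edges for one entry."""
    protein_id = entry["accession"]
    # MERGE keeps the load idempotent: rerunning it does not duplicate nodes.
    statements = [("MERGE (p:Protein {id: $id})", {"id": protein_id})]

    if entry.get("full_name"):
        statements.append((
            "MATCH (p:Protein {id: $id}) "
            "MERGE (n:FullName {name: $name}) "
            "MERGE (p)-[:HAS_FULL_NAME]->(n)",
            {"id": protein_id, "name": entry["full_name"]},
        ))

    for gene in entry.get("genes", []):
        statements.append((
            "MATCH (p:Protein {id: $id}) "
            "MERGE (g:Gene {name: $name}) "
            "MERGE (p)-[:FROM_GENE]->(g)",
            {"id": protein_id, "name": gene},
        ))

    for organism in entry.get("organisms", []):
        statements.append((
            "MATCH (p:Protein {id: $id}) "
            "MERGE (o:Organism {name: $name}) "
            "MERGE (p)-[:IN_ORGANISM]->(o)",
            {"id": protein_id, "name": organism},
        ))

    return statements


for cypher, params in build_statements(EXAMPLE_ENTRY):
    print(cypher, params)
```

Each pair could then be handed to a driver call that accepts a query and a parameter dictionary, such as py2neo's Graph.run(cypher, params). Passing values as parameters rather than formatting them into the query string avoids quoting problems with names that contain special characters.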

Here is an example for the target data model:

Example Code

In the example_code directory, you will find some example Python code for loading data to Neo4j.
