a GP-GCN framework for genomics

This is a python package for genomics study with a GP-GCN (Gapped Pattern Graph Convolutional Networks) framework.

Getting started

Prerequisite

cython
numpy
Biopython
editdistance
pytorch 1.7.1
pytorch_geometric 1.7.0

Install

pip install GCNFrame

Or

git clone https://github.com/deepomicslab/GCNFrame.git
cd GCNFrame/GCNFrame
python setup.py build_ext --inplace
cd ../

Examples

The framework makes it easy to train your customized models with a few lines of codes. The example data can be downloaded from Google Drive.

# This is an example to train a two-classes model.
from GCNFrame import Biodata, GCNmodel
import torch
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

data = Biodata(fasta_file="example_data/nature_2017.fasta", 
        label_file="example_data/lifestyle_label.txt",
        feature_file="example_data/CDD_protein_feature.txt")
dataset = data.encode(thread=20)
model = GCNmodel.model(label_num=2, other_feature_dim=206).to(device)
GCNmodel.train(dataset, model, weighted_sampling=True)
GCNmodel.test(model_name="GCN_model.pt", fasta_file="example_data/nature_2017.fasta", feature_file="example_data/CDD_protein_feature.txt")

The output is shown bellow:

Encoding sequences...
Epoch 0| Loss: 0.6335| Train accuracy: 0.7480| Validation accuracy: 0.8839
Epoch 1| Loss: 0.5605| Train accuracy: 0.8165| Validation accuracy: 0.7032
Epoch 2| Loss: 0.5042| Train accuracy: 0.8469| Validation accuracy: 0.8065
Epoch 3| Loss: 0.4873| Train accuracy: 0.8344| Validation accuracy: 0.7677
Epoch 4| Loss: 0.4559| Train accuracy: 0.8703| Validation accuracy: 0.8194
Epoch 5| Loss: 0.4533| Train accuracy: 0.8763| Validation accuracy: 0.7806
Epoch 6| Loss: 0.4372| Train accuracy: 0.8931| Validation accuracy: 0.8387
Epoch 7| Loss: 0.4409| Train accuracy: 0.8842| Validation accuracy: 0.8581
Epoch 8| Loss: 0.4357| Train accuracy: 0.8858| Validation accuracy: 0.8516
Epoch 9| Loss: 0.4314| Train accuracy: 0.8987| Validation accuracy: 0.8387
Epoch 10| Loss: 0.4246| Train accuracy: 0.8992| Validation accuracy: 0.8581
Epoch 11| Loss: 0.4085| Train accuracy: 0.9180| Validation accuracy: 0.8839
Epoch 12| Loss: 0.4071| Train accuracy: 0.9290| Validation accuracy: 0.8903
Epoch 13| Loss: 0.4095| Train accuracy: 0.9170| Validation accuracy: 0.8839
Epoch 14| Loss: 0.4019| Train accuracy: 0.9241| Validation accuracy: 0.8839
Epoch 15| Loss: 0.3960| Train accuracy: 0.9342| Validation accuracy: 0.9161

The model with best validation accuracy will be saved as GCN_model.pt

Also, the package provides users with functions to mine gapped patterns or motifs of more significant influence in prediction tasks.

# the pattern_contribution_score function returns a score list to record the contribution scores for the 4,096 gapped patterns. 
score_list = pattern_contribution_score(fasta_file="example_data/nature_2017.fasta",
        label_file="example_data/lifestyle_label.txt",
        feature_file="example_data/CDD_protein_feature.txt")

The scores for the gapped-patterns will also be saved in a txt file.

# the pattern_group_contribution_score function groups similar gapped patterns and analyzes the occurrence & scores for each group.
pattern_group_contribution_score(fasta_file="example_data/nature_2017.fasta", label_file="example_data/lifestyle_label.txt", score_list=score_list)

The results are saved as figures.

# the motif_contribution_score calculate the contribution score for a given motif.
score = motif_contribution_score(fasta_file="example_data/nature_2017.fasta", label_file="example_data/lifestyle_label.txt", motif="AAAAAATTCG", feature_file="example_data/CDD_protein_feature.txt")
print("The contribution score for AAAAAATTCG is %s."%score)

Parameters

`class Biodata.Biodata`

fasta_file: The DNA sequences used for training and evaluation in fasta format.

label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).

feature_file: Other features (like gene density) for the DNA sequences for training and evaluation (should have the same order as fasta_file) (default=None).

K: The length of K-mer for encoding (default=3).

d: The number of spaced distance used for encoding (default=3).

seqtype: The type of input sequence, DNA or RNA (default=DNA).

thread: The number of thread used for encoding (default=10).

save_dataset: Save the encoded dataset and use it next time (default=True).

save_path: The path for saving the encoded dataset (default="./").

`class GCNmodel.model`

label_num: The number of labels.

other_feature_dim: The dimension for other features, 0 if not available.

K: The length of K-mer for encoding (default=3).

d: The number of spaced distance used for encoding (default=3).

node_hidden_dim: The size for kmer nodes after transformation(default=3).

gcn_dim: The size of output of SAGEConv (default=128).

gcn_layer_num: The number of SAGEConv layers (default=2).

cnn_dim: The size of output of convolutional layers (default=64).

cnn_layer_num: The number of convolutional layers (default=3).

cnn_kernel_size: The kernel size of convolutional layers (default=8).

fc_dim: The number of neurons for the fully connected layers (default=100).

dropout_rate: The dropout rate (default=0.2).

pnode_nn: Whether transform primary nodes (default=True).

fnode_nn: Whether transform target nodes (default=True).

`GCNmodel.train`

learning_rate: The learning rate for training (default=1e-4).

batch_size: The batch_size for training (default=64).

epoch_n: The number of training epoches (default=20).

random_seed: The random seed for train-validation split (default=111).

val_split: The validation size (default=0.1).

weighted_sampling: Whether use weighted sampling for training (default=True).

model_name: The saved model name (default="GCN_model.pt").

`GCNmodel.test`

fasta_file: The DNA sequences used for test in fasta format.

model_name: The saved model name (default="GCN_model.pt").

feature_file: Other features (like gene density) for the DNA sequences for test (should have the same order as fasta_file) (default=None).

output_file: The output file name (default="test_output.txt").

thread: The number of thread used for encoding (default=10).

K: The length of K-mer for encoding (default=3).

d: The number of spaced distance used for encoding (default=3).

`pattern_contribution_score`

fasta_file: The DNA sequences used for training and evaluation in fasta format.

label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).

target_label: The label of the class being analyzed (default=1).

model_name: The saved model name (default="GCN_model.pt").

feature_file: Other features (like gene density) for the DNA sequences for training and evaluation (should have the same order as fasta_file) (default=None).

output_file: The output file name (default="pattern_contribution_score.txt").

thread: The number of thread used for encoding (default=10).

K: The length of K-mer for encoding (default=3).

d: The number of spaced distance used for encoding (default=3).

`motif_contribution_score`

fasta_file: The DNA sequences used for training and evaluation in fasta format.

label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).

motif: The motif to be analyzed.

target_label: The label of the class being analyzed (default=1).

model_name: The saved model name (default="GCN_model.pt").

feature_file: Other features (like gene density) for the DNA sequences for training and evaluation (should have the same order as fasta_file) (default=None).

thread: The number of thread used for encoding (default=10).

K: The length of K-mer for encoding (default=3).

d: The number of spaced distance used for encoding (default=3).

`pattern_group_contribution_score`

fasta_file: The DNA sequences used for training and evaluation in fasta format.

label_file: The labels for the DNA sequences for training and evaluation (should have the same order as fasta_file).

score_list: The contribution scores of the 4,096 gapped patterns.

target_label: The label of the class being analyzed (default=1).

d: The number of spaced distance used for encoding (default=3).

Version history

v0.1.2: Extension to RNA sequences & Enabling saving encoded data.
v0.1.1: Add contribution score functions.
v0.0.1: Initial version.

Maintainer

WANG Ruohan ruohawang2-c@my.cityu.edu.hk

Reference

Wang, R. H., Ng, Y. K., Zhang, X., Wang, J., & Li, S. C. (2024). Coding genomes with gapped pattern graph convolutional network. Bioinformatics, 40(4), btae188.

Name		Name	Last commit message	Last commit date
Latest commit History 40 Commits
GCNFrame		GCNFrame
GCNframework.png		GCNframework.png
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
pos_neg_match_grouped.png		pos_neg_match_grouped.png
pos_neg_score_grouped.png		pos_neg_score_grouped.png
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

a GP-GCN framework for genomics

Getting started

Prerequisite

Install

Examples

Parameters

`class Biodata.Biodata`

`class GCNmodel.model`

`GCNmodel.train`

`GCNmodel.test`

`pattern_contribution_score`

`motif_contribution_score`

`pattern_group_contribution_score`

Version history

Maintainer

Reference

About

Releases

Packages

Languages

License

deepomicslab/GCNFrame

Folders and files

Latest commit

History

Repository files navigation

a GP-GCN framework for genomics

Getting started

Prerequisite

Install

Examples

Parameters

class Biodata.Biodata

class GCNmodel.model

GCNmodel.train

GCNmodel.test

pattern_contribution_score

motif_contribution_score

pattern_group_contribution_score

Version history

Maintainer

Reference

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

`class Biodata.Biodata`

`class GCNmodel.model`

`GCNmodel.train`

`GCNmodel.test`

`pattern_contribution_score`

`motif_contribution_score`

`pattern_group_contribution_score`

Packages