pyani anim loads sequences into database before checking class/label files #129

Open
widdowquinn opened this issue Mar 14, 2019 · 3 comments

@widdowquinn
Owner

Summary:

The pyani anim command should check that the class and label files are correctly formatted before committing sequences to the database. This would save time when one of those files is malformed.

Description:

Currently, all sequences are processed and loaded into the database, and only then are the label/class files checked. If one of those files contains an error, the time spent processing and loading the sequences is wasted. It would be better to check the formats first.
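
As a rough sketch of what checking the formats first could look like, the snippet below validates a class or label file before any database work starts. It assumes a tab-separated, three-column layout (hash, file stem, label); that layout and the name `validate_label_file` are assumptions for illustration, not pyani's actual API.

```python
from pathlib import Path
from typing import Dict


def validate_label_file(path: Path) -> Dict[str, str]:
    """Parse a class/label file, failing fast on the first malformed line.

    Hypothetical helper: assumes tab-separated lines of hash, file stem, label.
    """
    labels: Dict[str, str] = {}
    with path.open() as handle:
        for lineno, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and comment lines
            fields = line.split("\t")
            if len(fields) != 3:
                raise ValueError(
                    f"{path}:{lineno}: expected 3 tab-separated fields, "
                    f"got {len(fields)}"
                )
            genome_hash, _stem, label = fields
            if genome_hash in labels:
                raise ValueError(f"{path}:{lineno}: duplicate hash {genome_hash}")
            labels[genome_hash] = label
    return labels
```

Calling something like this on both files at the top of the subcommand would surface a formatting error within moments, before any genome is hashed or written.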

pyani Version:

v0.3.0dev

Python Version:

3.6

Operating System:

CentOS6

@widdowquinn widdowquinn self-assigned this Mar 14, 2019
@widdowquinn widdowquinn added the enhancement something we'd like pyani to do that it doesn't already label Mar 14, 2019
@widdowquinn widdowquinn added the performance the issue relates to making pyani more efficient label May 29, 2020
@widdowquinn widdowquinn added this to the 0.3.0 milestone May 29, 2020
@baileythegreen
Contributor

@widdowquinn, is the checking you mean here the `if inhash in label_dict:` bit? I'm trying to determine whether this issue is still relevant and, if so, what should be done about it. I believe the function below is the only thing called from `subcmd_anim` that uses labels and classes.

The only part that looks like it deals with the formatting of those labels and classes is the `load_classes_labels()` call near the top.

def add_run_genomes(
    session, run, indir: Path, classpath: Path, labelpath: Path, **kwargs
) -> List:
    """Add genomes for a run to the database.
    :param session:       live SQLAlchemy session of pyani database
    :param run:           Run object describing the parent pyani run
    :param indir:         path to the directory containing genomes
    :param classpath:     path to the file containing class information for each genome
    :param labelpath:     path to the file containing label information for each genome
    This function expects a single directory (indir) containing all FASTA files
    for a run, and optional paths to plain text files that contain information
    on class and label strings for each genome.
    If the genome already exists in the database, then a Genome object is recovered
    from the database. Otherwise, a new Genome object is created. All Genome objects
    will be associated with the passed Run object.
    The session changes are committed once all genomes and labels are added to the
    database without error, as a single transaction.
    """
    # Get list of genome files and paths to class and labels files
    infiles = get_fasta_and_hash_paths(indir)  # paired FASTA/hash files
    class_data = {}  # type: Dict[str,str]
    label_data = {}  # type: Dict[str,str]
    all_keys = []  # type: List[str]
    if classpath:
        class_data = load_classes_labels(classpath)
        all_keys += list(class_data.keys())
    if labelpath:
        label_data = load_classes_labels(labelpath)
        all_keys += list(label_data.keys())

    # Make dictionary of labels and/or classes
    new_keys = set(all_keys)
    label_dict = {}  # type: Dict
    for key in new_keys:
        # A key may appear in only one of the two files, so avoid a KeyError
        label_dict[key] = LabelTuple(
            label_data.get(key) or "", class_data.get(key) or ""
        )

    # Get hash and sequence description for each FASTA/hash pair, and add
    # to current session database
    genome_ids = []
    for fastafile, hashfile in infiles:
        try:
            inhash, _ = read_hash_string(hashfile)
            indesc = read_fasta_description(fastafile)
        except Exception:
            raise PyaniORMException("Could not read genome files for database import")
        abspath = fastafile.absolute()
        genome_len = get_genome_length(abspath)
        # If the genome is not already in the database, add it as a Genome object
        genome = session.query(Genome).filter(Genome.genome_hash == inhash).first()
        if not isinstance(genome, Genome):
            try:
                genome = Genome(
                    genome_hash=inhash,
                    path=str(abspath),
                    length=genome_len,
                    description=indesc,
                )
                session.add(genome)
            except Exception:
                raise PyaniORMException(f"Could not add genome {genome} to database")

        # Associate this genome with the current run
        try:
            genome.runs.append(run)
        except Exception:
            raise PyaniORMException(
                f"Could not associate genome {genome} with run {run}"
            )

        # If there's an associated class or label for the genome, add it
        if inhash in label_dict:
            try:
                session.add(
                    Label(
                        genome=genome,
                        run=run,
                        label=label_dict[inhash].label,
                        class_label=label_dict[inhash].class_label,
                    )
                )
            except Exception:
                raise PyaniORMException(
                    f"Could not add labels for {genome} to database."
                )
        genome_ids.append(genome.genome_id)

    try:
        session.commit()
    except Exception:
        raise PyaniORMException("Could not commit new genomes in database.")

    return genome_ids

@baileythegreen
Contributor

@widdowquinn Is this issue still relevant?

@widdowquinn
Owner Author

widdowquinn commented Apr 29, 2022

I think I was referring to the order of processing, which was, approximately:

(1)

  1. Parse genome file and process
  2. Add genome to the database
  3. Parse labels/classes and process
  4. Add labels/classes info to the database

This meant that it was possible to generate a new row in the database and then have the run fail because of a formatting or other error in the labels/classes files. This could probably be dealt with at the same time as #136.

A more sensible order of processing would be:

(2)

  1. Parse genome files and process
  2. Parse labels/classes and process
  3. Add genomes/labels/classes info to the database

It's still relevant if the order of operations looks like (1) and not like (2).
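
Ordering (2) could be sketched roughly as follows, using the standard-library `sqlite3` module in place of pyani's SQLAlchemy session so that the example runs on its own. The table layout, the simplified two-column label format, and the names `parse_labels` and `load_run` are all hypothetical, chosen only for illustration. All parsing happens before the first write, and the writes share a single transaction, so a malformed labels file leaves the database untouched.

```python
import sqlite3
from typing import Dict, List, Tuple


def parse_labels(lines: List[str]) -> Dict[str, str]:
    """Parse 'hash<TAB>label' lines; raise ValueError on a malformed line."""
    labels: Dict[str, str] = {}
    for lineno, line in enumerate(lines, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 fields, got {len(fields)}")
        labels[fields[0]] = fields[1]
    return labels


def load_run(
    conn: sqlite3.Connection,
    genomes: List[Tuple[str, int]],
    label_lines: List[str],
) -> None:
    """Validate all inputs first, then write everything in one transaction."""
    # Steps 1 and 2: genomes arrive already processed; labels are parsed
    # and validated here, before any write is attempted.
    labels = parse_labels(label_lines)
    # Step 3: the connection context manager commits on success and
    # rolls back if any statement raises.
    with conn:
        conn.executemany(
            "INSERT INTO genomes (hash, length) VALUES (?, ?)", genomes
        )
        conn.executemany(
            "INSERT INTO labels (hash, label) VALUES (?, ?)", list(labels.items())
        )


# Usage: a malformed label line is rejected before any genome row is written.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE genomes (hash TEXT PRIMARY KEY, length INTEGER)")
conn.execute("CREATE TABLE labels (hash TEXT PRIMARY KEY, label TEXT)")
try:
    load_run(conn, [("abc123", 5000000)], ["abc123 missing-tab"])
except ValueError as exc:
    print(f"Rejected before writing: {exc}")
```

In `add_run_genomes()` the two `load_classes_labels()` calls already come before the genome loop, so the equivalent check would be to confirm that nothing is added to the session before those calls return.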


2 participants