README

  
  MIRAGE: Multiple-sequence Isoform Alignment Tool Guided by Exon Boundaries
  --------------------------------------------------------------------------  


  CONTENTS
  --------
  1. Setup (Short)
  2. Setup (Detailed)
  3. Basic Usage
  4. More Usage
  5. About
  6. Source Files
  7. Included Directories
     - Using the BLAT binary instead of compiling from source


  SETUP (Short)
  -------------
  1.  Run the SETUP.pl script ('./SETUP.pl')
  2.  Add the Mirage directory to the PATH
  3.  Check setup by running './mirage --check'


  SETUP (Detailed)
  ----------------
  To setup the mirage software package, first run the SETUP.pl script as follows:

      $ perl SETUP.pl

  Once the setup script has reported successful completion, it is necessary to
  add the program 'spaln' to the PATH environment variable.  This can be done by 
  first running the command "pwd" to see the present working directory.  It 
  should look like this:

      $ pwd
      /Users/yourname/a/file/path/mirage

  Once you have the current directory, you can add it to your PATH variable.
  In BASH, you can add this line to the file ~/.bashrc:

      export PATH=(pwd_results):$PATH

  Refresh your path variable by setting the file that you edited as your source.
  Depending on whether PATH is set in .bash_profile or .bashrc, this will be done 
  with one of the following two commands:

      $ source ~/.bash_profile
            - OR -
      $ source ~/.bashrc

  To test whether everything is setup, you can run the following command from the
  directory containing mirage:

      $ mirage --check

  If setup was successful, this command will report success.  Otherwise, it will
  inform you of what is missing.


  BASIC USAGE
  -----------
  Mirage requires 2 arguments:  a FASTA-formatted protein database and a simple
  guide file directing the program to genomes and gtf indicies for each species
  being searched on.

      $ perl mirage2.pl  <Protein DB>  <Species Guide>

  It is expected that the names of the sequences in the protein database will
  consist of three fields, in order and separated by bar (|) characters:

       1. Species
       2. Gene family (or multiple families, separated by '/')
       3. A unique identifier within the species/family pair

  This can be followed by a '#', after which any additional comments can be
  written (and ignored by Mirage).

       For example: '>human|znf143/zn143|2 # Accession:P52747, zinc finger protein'

  The guide file is a simple text file where each line corresponds to a species
  and has the following fields (separated by whitespace):

       1. Species name, as spelled in the species fields of the protein database
       2. Path to genome, from the directory containing mirage.pl
       3. Path to gtf index file, from the directory containing mirage.pl

       For example: 'human  ../genomes/Human.DNA.fa  ../indices/human.gtf'

  The guide file can also include a line with a Newick-formatted tree to set the
  order in which inter-species alignment occurs.

       For example: '(dolphin,((mouse,rat),human))'

  For more information, use 'mirage --help'.


  ABOUT
  -----
  Mirage is a tool for generating exon-aware multiple sequence alignments (MSAs) 
  of protein isoforms within the same gene family and across species.  This is
  accomplished through 3 major phases: protein-to-genome alignment, intra-species
  protein alignment, and cross-species alignment of MSAs.

  During the first phase (protein-to-genome alignment), each protein in the database
  is searched against its species' genome, identifying all "hits" (places where some
  section of the protein is identical to one of its purported exons).  The program
  analyzes each protein's hit set and identifies an optimal way of "stitching" the
  hits together to achieve full coverage of the protein.  This is the work done
  by the 'Quilter2.pl' script, which relies on 'FastMap2' and SPALN (if installed)
  to determine the individual hits.

  The second phase consists of isolating each set of "stitched" hits for each 
  isoform within the same species and gene family and joining these hits together
  to form an intra-species multiple-sequence alignment.  Because the hits can be
  compared to one another by way of their positions in the genome, this alignment
  preserves the exon boundaries indicated by the hits identified in phase 1 and
  places special markers in the MSAs to mark where splice sites are observed.
  This work is done with the 'MapsToMSAs.pl' script.

  The third phase aligns the MSAs generated during the second phase across species
  (staying within gene families).  This is done with profile-to-profile alignment
  using an slightly modified version of the Needleman-Wunsch algorithm, using the
  'MultiSeqNW' program.


  SOURCE FILES (by order of progression through mirage pipeline)
  --------------------------------------------------------------
  + CleanMirageDB.pl

    A script used to verify that a FASTA-formatted database meets the formatting 
    requirements expected by mirage.  If a database does not meet the requirements,
    mirage will direct the user towards this program.

  + src/run_mirage2.sh

    A shell script used to run Mirage.

  + src/Mirage2.pl

    The top-level script used to generate MSAs.  This and CleanMirageDB.pl are 
    the only programs that the user is expected to engage with.
  
  + src/Quilter2.pl

    A script used to align proteins within a particular species to their
    genome (stitching together small hits for many proteins, hence the name).

  + src/FastMap2.c
   
    A program used to identify nearly-identical alignments of protein sequence
    to DNA (based on the Needleman-Wunsch dynamic programming algorithm, but only
    allowing match states).

  + src/BasicBio.c

    A collection of helper functions used by FastMap2 and MultiSeqNW.

  + src/MapsToMSAs.pl

    A script used to generate MSAs for all gene families within a species based
    on the results of Quilter.pl.

  + src/MultiSeqNW.c

    A program used to align two MSAs to one another by constructing profiles
    for each of the MSAs and then performing profile-to-profile alignment based
    on the Needleman-Wunsch dynamic programming algorithm.

  + src/FinalMSA.pl

    A script used to remove splice site markers and perform some common-sense
    MSA cleanup of the final MSAs.


  INCLUDED DIRECTORIES
  --------------------
  + inc/hsi
    
    A set of tools used mainly to extract sequences or sequence metadata from
    fasta-formatted files.

  + inc/spaln2.2.2

    A program used to generate intron-aware alignments of proteins to genomes.
    Slower than FastMap2 but allows for search without guidance from a GTF index.

  + inc/blat

    A program used to perform extremely fast translated local mapping.

    NOTE: BLAT's library requirements can be annoying to work around if
          you're in a compute environment where you don't have permission
	  to change /usr/include.  To get around BLAT make issues, you
	  can use the binary located at 'dependencies/x86_64-blat-binary'
	  Simply perform the following (from the 'mirage' directory):

               % mkdir build/blat
	       % mkdir build/blat/bin
	       % cp dependencies/x86_64-blat-binary build/blat/bin/blat