Merge branch 'dev' for 1.2.0 release.

pgxcentre · Jun 16, 2015 · cb14915
2 parents 5436fde + dd1955e
commit cb14915
Showing 24 changed files with 1,307 additions and 148 deletions.
diff --git a/README.mkd b/README.mkd
@@ -4,7 +4,7 @@
 
 # genipe - A Python module to perform genome-wide imputation analysis
 
-*Version 1.1.0*
+*Version 1.2.0*
 
 The `genipe` module (standing for **GEN**ome-wide **I**mputation
 **P**ipelin**E**) includes a script (named `genipe-launcher`) that
@@ -101,11 +101,11 @@ usage: genipe-launcher [-h] [-v] [--debug] [--thread THREAD] --bfile PREFIX
                        --legend-template TEMPLATE --map-template TEMPLATE
                        --sample-file FILE [--filtering-rules RULE [RULE ...]]
                        [--probability FLOAT] [--completion FLOAT]
-                       [--report-number NB] [--report-title TITLE]
-                       [--report-author AUTHOR]
+                       [--info FLOAT] [--report-number NB]
+                       [--report-title TITLE] [--report-author AUTHOR]
 
 Execute the genome-wide imputation pipeline. This script is part of the
-'genipe' package, version 1.1.0.
+'genipe' package, version 1.2.0.
 
 optional arguments:
   -h, --help            show this help message and exit
@@ -164,6 +164,9 @@ IMPUTE2 Merger Options:
   --probability FLOAT   The probability threshold for no calls. [<0.9]
   --completion FLOAT    The completion rate threshold for site exclusion.
                         [<0.98]
+  --info FLOAT          The measure of the observed statistical information
+                        associated with the allele frequency estimate
+                        threshold for site exclusion. [<0.00]
 
 Automatic Report Options:
   --report-number NB    The report number. [genipe automatic report]
@@ -209,7 +212,7 @@ usage: imputed-stats [-h] [-v] {cox,linear,logistic,mixedlm,skat} ...
 
 Performs statistical analysis on imputed data (either SKAT analysis, or
 linear, logistic or survival regression). This script is part of the 'genipe'
-package, version 1.1.0).
+package, version 1.2.0).
 
 optional arguments:
   -h, --help            show this help message and exit

diff --git a/docs/_static/tutorial/report.pdf b/docs/_static/tutorial/report.pdf
diff --git a/docs/index.rst b/docs/index.rst
@@ -14,10 +14,17 @@ The :py:mod:`genipe` (GENome-wide Imputation PipelinE) module provides an easy
 an efficient way of performing genome-wide imputation analysis using the three
 commonly used softwares `PLINK <http://pngu.mgh.harvard.edu/~purcell/plink/>`_,
 `SHAPEIT <https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html>`_ and
-`IMPUTE2 <https://mathgen.stats.ox.ac.uk/impute/impute_v2.html>`_. It also
-provides a useful standalone tool to perform statistical analysis on imputed
-(dosage) data (such as linear, logistic or survival regressions, or
-`SKAT <http://www.hsph.harvard.edu/skat/>`_ analysis of rare variants).
+`IMPUTE2 <https://mathgen.stats.ox.ac.uk/impute/impute_v2.html>`_.
+
+A quality metrics report is automatically generated at the end of the
+imputation process to easily assess the quality of the analysis. The report is
+compiled into a PDF. For information on how to compile the report, refer to the
+:ref:`genipe-tut-compile-report` section in the main :ref:`genipe-tut-page`.
+
+Finally, it also provides a useful standalone tool to perform statistical
+analysis on imputed (dosage) data (such as linear, logistic or survival
+regressions, or `SKAT <http://www.hsph.harvard.edu/skat/>`_ analysis of rare
+variants).
 
 .. toctree::
    :maxdepth: 2
@@ -46,11 +53,11 @@ Usage
                           --legend-template TEMPLATE --map-template TEMPLATE
                           --sample-file FILE [--filtering-rules RULE [RULE ...]]
                           [--probability FLOAT] [--completion FLOAT]
-                          [--report-number NB] [--report-title TITLE]
-                          [--report-author AUTHOR]
+                          [--info FLOAT] [--report-number NB]
+                          [--report-title TITLE] [--report-author AUTHOR]
 
    Execute the genome-wide imputation pipeline. This script is part of the
-   'genipe' package, version 1.1.0.
+   'genipe' package, version 1.2.0.
 
    optional arguments:
      -h, --help            show this help message and exit
@@ -109,6 +116,9 @@ Usage
      --probability FLOAT   The probability threshold for no calls. [<0.9]
      --completion FLOAT    The completion rate threshold for site exclusion.
                            [<0.98]
+     --info FLOAT          The measure of the observed statistical information
+                           associated with the allele frequency estimate
+                           threshold for site exclusion. [<0.00]
 
    Automatic Report Options:
      --report-number NB    The report number. [genipe automatic report]

diff --git a/docs/installation.rst b/docs/installation.rst
@@ -154,17 +154,12 @@ Testing the installation
 -------------------------
 
 To test the installation, make sure that the virtual environment is activated.
-Then, launch python and use the following commands:
+Then, launch Python and use the following python commands:
 
 .. code-block:: python
 
    >>> import genipe
    >>> genipe.test()
-   ......................ss.ss.......................ss...ss...s.s.........
-   ----------------------------------------------------------------------
-   Ran 72 tests in 107.268s
-   
-   OK (skipped=10)
 
 
 .. _install-update:

diff --git a/docs/module_content/genipe.tests.rst b/docs/module_content/genipe.tests.rst
@@ -42,6 +42,15 @@ genipe.tests.test_formats module
     :show-inheritance:
 
 
+genipe.tests.test_impute2_extractor module
+-------------------------------------------
+
+.. automodule:: genipe.tests.test_impute2_extractor
+    :members:
+    :undoc-members:
+    :show-inheritance:
+
+
 genipe.tests.test_impute2_merger module
 ----------------------------------------
 

diff --git a/docs/output_files.rst b/docs/output_files.rst
@@ -37,6 +37,7 @@ In summary, here is the structure of the output files. Again, refer to the
    │       ├── chr1.imputed.completion_rates
    │       ├── chr1.imputed.good_sites
    │       ├── chr1.imputed.impute2.gz
+   │       ├── chr1.imputed.impute2_info
    │       ├── chr1.imputed.imputed_sites
    │       ├── chr1.imputed.log
    │       ├── chr1.imputed.maf

diff --git a/docs/tutorials/tutorial_cox.rst b/docs/tutorials/tutorial_cox.rst
@@ -233,7 +233,7 @@ in the console:
                             NAME
 
    Performs a survival regression on imputed data using Cox's proportional hazard
-   model. This script is part of the 'genipe' package, version 1.1.0).
+   model. This script is part of the 'genipe' package, version 1.2.0).
 
    optional arguments:
      -h, --help            show this help message and exit

diff --git a/docs/tutorials/tutorial_extract.rst b/docs/tutorials/tutorial_extract.rst
@@ -16,11 +16,12 @@ Site extraction
 
 Genome-wide imputation dataset might be huge. Often, it is required to extract
 a subset of imputed sites (*e.g.* specific markers, genomic location, or
-markers with a specific minor allele frequency). Also, different format might
-be required, depending of the underlying analysis (*e.g.* hard calls or dosage
-values). We provide an easy tool to perform site extraction of multiple
-*impute2* files using either marker identification number, or genomic location
-and/or minor allele frequency and/or call rate.
+markers with a specific minor allele frequency, information value or completion
+rate). Also, different format might be required, depending of the underlying
+analysis (*e.g.* hard calls or dosage values). We provide an easy tool to
+perform site extraction of multiple *impute2* files using either marker
+identification number, or genomic location and/or minor allele frequency and/or
+call rate and/or information value.
 
 We suppose that you have followed the main :ref:`genipe-tut-page`. The
 following command will create the working directory for this tutorial.
@@ -42,7 +43,7 @@ extraction tools are automatically created in the ``final_impute2`` directories
 
 The files that are required in these directories depends of what kind of
 extraction is required (by name, or by genomic location and/or by minor allele
-frequency and/or by calling rate).
+frequency and/or by calling rate and/or by information value).
 
 Once the required *impute2* files are provided to the tool, the other required
 files will be automatically fetched (if required).
@@ -56,12 +57,13 @@ Executing the extraction
 The first time the tool is used on a set of *impute2* files, indexation will
 automatically occur (to speed of the analysis for future extraction). There are
 two ways to extract markers: using their identification number (``--extract``),
-or properties (``--genomic``, ``--maf`` or ``rate``).
+or using their properties (``--genomic``, ``--maf``, ``--rate`` and/or
+``--info``).
 
 .. note::
 
    It is possible to extract from multiple *impute2* files at the same time (by
-   specifying multiple input files.
+   specifying multiple input files).
 
 
 Extraction by ID
@@ -84,7 +86,7 @@ This ``marker_list.txt`` file will contain the following:
    rs76139713:51137523:C:T
    rs372879164:17037188:A:G
 
-Then, the following command (using the ``--extract`` option will extract those
+Then, the following command (using the ``--extract`` option) will extract those
 two markers from the *impute2* file.
 
 .. code-block:: bash
@@ -103,13 +105,14 @@ two markers from the *impute2* file.
 Extraction by characteristics
 """"""""""""""""""""""""""""""
 
-There are three ways to extract markers according to their characteristics. The
+There are four ways to extract markers according to their characteristics. The
 first way is to specify the genomic location of the markers to extract (*i.e.*
 the ``--genomic`` option). The second way is to specify a minor allele
-frequency threshold (*i.e.* the ``--maf`` option). The third and final way is
-to specify a call rate threshold (*i.e.* the ``--rate`` option). Those three
-ways can be used at the same time (*e.g.* to get markers in a specific genomic
-range and a specific call rate).
+frequency threshold (*i.e.* the ``--maf`` option). The third way is to specify
+a call rate threshold (*i.e.* the ``--rate`` option). The fourth and final way
+is to specify an information value threshold (*i.e.* the ``--info`` option).
+Those four ways can be used at the same time (*e.g.* to get markers in a
+specific genomic range and a specific call rate).
 
 For example, to extract markers with a MAF :math:`\geq` 0.05 located in the
 *CYP2D6* gene, perform the following command:
@@ -233,10 +236,10 @@ analysis in the console:
                             [--out PREFIX] [--format FORMAT [FORMAT ...]]
                             [--prob FLOAT] [--extract FILE]
                             [--genomic CHR:START-END] [--maf FLOAT]
-                            [--rate FLOAT]
+                            [--rate FLOAT] [--info FLOAT]
 
    Extract imputed markers located in a specific genomic region. This script is
-   part of the 'genipe' package, version 1.1.0).
+   part of the 'genipe' package, version 1.2.0).
 
    optional arguments:
      -h, --help            show this help message and exit
@@ -262,11 +265,15 @@ analysis in the console:
      --extract FILE        File containing marker names to extract.
      --genomic CHR:START-END
                            The range to extract (e.g. 22 1000000 1500000). Can be
-                           use in combination with '--rate' and '--maf'.
+                           use in combination with '--rate', '--maf' and '--
+                           info'.
      --maf FLOAT           Extract markers with a minor allele frequency equal or
                            higher than the specified threshold. Can be use in
-                           combination with '--rate' and '--genomic'.
+                           combination with '--rate', '--info' and '--genomic'.
      --rate FLOAT          Extract markers with a completion rate equal or higher
                            to the specified threshold. Can be use in combination
-                           with '--maf' and '--genomic'.
+                           with '--maf', '--info' and '--genomic'.
+     --info FLOAT          Extract markers with an information equal or higher to
+                           the specified threshold. Can be use in combination
+                           with '--maf', '--rate' and '--genomic'.
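
The help text above says that ``--maf``, ``--rate`` and ``--info`` each keep markers whose value is equal to or higher than the given threshold, and that they can be combined. The snippet below is only a minimal sketch of that combined rule, not the code used by genipe: the ``Marker`` container, the ``keep_marker`` helper and the sample values are all hypothetical and exist only for illustration.

```python
from typing import NamedTuple


class Marker(NamedTuple):
    """Per-marker statistics (hypothetical container, not part of genipe)."""
    name: str
    maf: float   # minor allele frequency
    rate: float  # completion rate
    info: float  # information value


def keep_marker(marker, maf=None, rate=None, info=None):
    """Return True if the marker meets every threshold that was supplied.

    A threshold left as None is ignored, mirroring an option that was not
    given on the command line (an assumption made for this sketch).
    """
    checks = (
        (marker.maf, maf),
        (marker.rate, rate),
        (marker.info, info),
    )
    return all(
        value >= threshold
        for value, threshold in checks
        if threshold is not None
    )


# Made-up values, for illustration only.
markers = [
    Marker("marker_1", maf=0.12, rate=0.99, info=0.95),
    Marker("marker_2", maf=0.01, rate=0.99, info=0.97),
    Marker("marker_3", maf=0.30, rate=0.99, info=0.20),
]

# Comparable in spirit to combining '--maf 0.05' with '--info 0.3'.
kept = [m.name for m in markers if keep_marker(m, maf=0.05, info=0.3)]
print(kept)
```

Here ``marker_2`` is dropped by the MAF threshold and ``marker_3`` by the information value threshold, so only ``marker_1`` remains.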
 
diff --git a/docs/tutorials/tutorial_genipe.rst b/docs/tutorials/tutorial_genipe.rst
@@ -10,7 +10,8 @@ Quick navigation
 1. :ref:`genipe-tut-softwares`
 2. :ref:`genipe-tut-input-files`
 3. :ref:`genipe-tut-execute`
-4. :ref:`genipe-tut-output-files`
+4. :ref:`genipe-tut-compile-report`
+5. :ref:`genipe-tut-output-files`
 
 Genome-wide imputation pipeline
 --------------------------------
@@ -410,6 +411,28 @@ previous command (see the :ref:`genipe-usage` section for a full list):
    subsequent steps).
 
 
+.. _genipe-tut-compile-report:
+
+Compiling the report
+^^^^^^^^^^^^^^^^^^^^^
+
+A report containing useful information (such as quality metrics and execution
+time, among others) is automatically generated once the imputation process is
+completed. To compile the report, perform the following commands:
+
+.. code-block:: bash
+
+   cd $HOME/genipe_tutorial/genipe/report
+
+   make && make clean
+
+This will generate the following
+`PDF report <http://pgxcentre.github.io/genipe/_static/tutorial/report.pdf>`_
+(which is named ``report.pdf``). It is always possible to modify the original
+``report.tex`` file to include analysis specific details (*e.g.* cohort
+description).
+
+
 .. _genipe-tut-output-files:
 
 Output files
@@ -448,6 +471,7 @@ files.
    │       ├── chr1.imputed.completion_rates
    │       ├── chr1.imputed.good_sites
    │       ├── chr1.imputed.impute2.gz
+   │       ├── chr1.imputed.impute2_info
    │       ├── chr1.imputed.imputed_sites
    │       ├── chr1.imputed.log
    │       ├── chr1.imputed.maf
@@ -552,12 +576,20 @@ autosomal chromosomes. They will contain the following files:
     |                               | the user, where the default is higher   |
     |                               | and equal to 0.9).                      |
     +-------------------------------+-----------------------------------------+
-    | ``.imputed.impute2``          | Imputation results (merged from the     |
-    |                               | individual segment files. This file     |
+    | ``.imputed.impute2`` or       | Imputation results (merged from the     |
+    | ``.imputed.impute2.gz``       | individual segment files. This file     |
     |                               | might be compress (with the ``.gz``     |
     |                               | extension) if the ``--bgzip`` option was|
     |                               | used when launching the pipeline.       |
     +-------------------------------+-----------------------------------------+
+    | ``.imputed.impute2_info``     | Marker-wise information file with one   |
+    |                               | line per marker and a single header line|
+    |                               | at the begening. It contains, among     |
+    |                               | others, the information value which is a|
+    |                               | measure of the observed statistical     |
+    |                               | information associated with the allele  |
+    |                               | frequency estimate.                     |
+    +-------------------------------+-----------------------------------------+
     | ``.imputed.imputed_sites``    | List of imputed sites (excluding sites  |
     |                               | that were previously genotyped in the   |
     |                               | study cohort).                          |

diff --git a/docs/tutorials/tutorial_linear.rst b/docs/tutorials/tutorial_linear.rst
@@ -245,7 +245,7 @@ analysis in the console:
                                --pheno-name NAME
 
    Performs a linear regression (ordinary least squares) on imputed data. This
-   script is part of the 'genipe' package, version 1.1.0).
+   script is part of the 'genipe' package, version 1.2.0).
 
    optional arguments:
      -h, --help            show this help message and exit

diff --git a/docs/tutorials/tutorial_logistic.rst b/docs/tutorials/tutorial_logistic.rst
@@ -236,7 +236,7 @@ regression analysis in the console:
                                  --pheno-name NAME
 
    Performs a logistic regression on imputed data using a GLM with a binomial
-   distribution. This script is part of the 'genipe' package, version 1.1.0).
+   distribution. This script is part of the 'genipe' package, version 1.2.0).
 
    optional arguments:
      -h, --help            show this help message and exit

diff --git a/docs/tutorials/tutorial_mixedlm.rst b/docs/tutorials/tutorial_mixedlm.rst
@@ -255,7 +255,7 @@ effects analysis in the console:
 
    Performs a linear mixed effects regression on imputed data using a random
    intercept for each group. This script is part of the 'genipe' package, version
-   1.1.0).
+   1.2.0).
 
    optional arguments:
      -h, --help            show this help message and exit