more release-related fiddling

pydata · Jul 10, 2012 · 240eb13
1 parent af24ea1
commit 240eb13
Showing 6 changed files with 74 additions and 49 deletions.
diff --git a/doc/Makefile b/doc/Makefile
@@ -1,8 +1,6 @@
 # Makefile for Sphinx documentation
 #
 
-PYTHONPATH := $(CURDIR)/..:$(PYTHONPATH)
-
 # You can set these variables from the command line.
 SPHINXOPTS    =
 SPHINXBUILD   = sphinx-build

diff --git a/doc/conf.py b/doc/conf.py
@@ -4,11 +4,6 @@
 project = u'patsy'
 copyright = u'2011-2012, Nathaniel J. Smith'
 
-# The version info for the project you're documenting, acts as replacement for
-# |version| and |release|, also used in various other places throughout the
-# built documents.
-#
-# The short X.Y version.
 try:
     import numpy
     print "numpy: %s, %s" % (numpy.__version__, numpy.__file__)
@@ -25,6 +20,11 @@
 except ImportError:
     print "no ipython"
 
+# The version info for the project you're documenting, acts as replacement for
+# |version| and |release|, also used in various other places throughout the
+# built documents.
+#
+# The short X.Y version.
 import sys, os
 sys.path.insert(0, os.getcwd() + "/..")
 import patsy

diff --git a/doc/formulas.rst b/doc/formulas.rst
@@ -45,7 +45,7 @@ convenient, just like it turns out to be convenient to define the
 <https://en.wikipedia.org/wiki/Empty_product>`_ to be ``np.prod([]) ==
 1``.)
 
-.. warning:: In the context of Patsy, the word **factor** does
+.. note:: In the context of Patsy, the word **factor** does
    *not* refer specifically to categorical data. What we call a
    "factor" can represent either categorical or numerical data. Think
    of factors like in multiplying factors together, not like in
@@ -326,7 +326,7 @@ Here's some code to try out at the Python prompt to get started::
                          "+ (x + {6: x3, 8 + 1: x4}[3 * i])", env)
 
 Sometimes it might be easier to read if you put the processed formula
-back into formula notation::
+back into formula notation using :meth:`ModelDesc.describe`::
 
   desc = ModelDesc.from_formula("y ~ (a + b + c + d) ** 2", env)
   desc.describe()
@@ -660,9 +660,9 @@ interactions, we can control how this overall variance is
 partitioned into distinct terms.
 
    Exercise: create the similar diagram for a formula that includes a
-   three-way interaction, like ``1 + a + a:b + a:b:c``. Hint: it's a
-   cube. Then, send us your diagram for inclusion in this documentation
-   [#shameless]_.
+   three-way interaction, like ``1 + a + a:b + a:b:c`` or ``1 +
+   a:b:c``. Hint: it's a cube. Then, send us your diagram for
+   inclusion in this documentation [#shameless]_.
 
 Finally, we've so far only discussed purely categorical
 interactions. Bringing numerical interactions into the mix doesn't
@@ -677,9 +677,11 @@ Technical details
 -----------------
 
 The actual algorithm Patsy uses to produce the above coding is very
-simple. Within each unique set of it breaks the categorical portion of
-each interaction down into minimal pieces, e.g. `a:b` is replaced by
-`1 + (a-) + (b-) + (a-):(b-)`:
+simple. Within the group of terms associated with each combination of
+numerical factors, it works from left to right. For each term it
+encounters, it breaks the categorical part of the interaction down
+into minimal pieces, e.g. `a:b` is replaced by `1 + (a-) + (b-) +
+(a-):(b-)`:
 
 .. container:: align-center
 
@@ -688,8 +690,10 @@ each interaction down into minimal pieces, e.g. `a:b` is replaced by
    .. |arrow| image:: figures/redundancy-arrow.png
    .. |1 a- b- a-:b-| image:: figures/redundancy-1-ar-br-arbr.png
 
-Then, any pieces which have previously been used within this formula
-are deleted:
+(Technically, these "minimal pieces" are the set of all subsets of the
+original interaction.) Then, any of the minimal pieces which were used
+by a previous term within this group are deleted, since they are
+redundant:
 
 .. container:: align-center
 
@@ -698,54 +702,65 @@ are deleted:
    .. |a- a-:b-| image:: figures/redundancy-ar-arbr.png
 
 and then we greedily recombine the pieces that are left
-by repeatedly merging adjacent pieces:
+by repeatedly merging adjacent pieces according to the rule `ANYTHING
++ ANYTHING : FACTOR- = ANYTHING : FACTOR`:
 
 .. container:: align-center
 
    |a- a-:b-| |arrow| |a-:b|
 
 ..
 
+  Exercise: Prove formally that the space spanned by `ANYTHING +
+  ANYTHING : FACTOR-` is identical to the space spanned by `ANYTHING :
+  FACTOR`.
+
  Exercise: Either show that the greedy algorithm here produces
   optimal encodings in some sense (e.g., smallest number of pieces
   used), or else find a better algorithm. (Extra credit: implement
   your algorithm and submit a pull request [#still-shameless]_.)
 
-This is justified by the following theorem:
+Is this algorithm correct? A full formal proof would be too tedious
+for this reference manual, but here's a sketch of the analysis.
+
+Recall that our goal is to maintain two invariants: the design matrix
+column space should include the space associated with each term, and
+should avoid "structural redundancy", i.e. it should be full rank on
+at least some data sets. It's easy to see that this algorithm will
+never "lose" columns, since the only time it eliminates a subspace is
+when it has previously processed that exact subspace within the same
+design. (So long as the subspace merging is correctly specified etc.;
+feel free to check if you doubt.) But will it always detect all the
+redundancies that are present?
+
+This is guaranteed by the following theorem:
 
 Theorem: Let two sets of factors, :math:`F = \{f_1, \dots, f_n\}` and
 :math:`G = \{g_1, \dots, g_m\}` be given, and let :math:`F =
 F_{\text{num}} \cup F_{\text{categ}}` be the numerical and categorical
 factors, respectively (and similarly for :math:`G = G_{\text{num}}
-\cup G_{\text{categ}}`. Then the interaction :math:`f_1 : \cdots :
-f_n` represents a subspace of the space represented by the interaction
-:math:`g_1 : \cdots : g_m` if:
+\cup G_{\text{categ}}`). Then the space represented by the interaction
+:math:`f_1 : \cdots : f_n` always has a non-trivial intersection
+with the space represented by the interaction :math:`g_1 : \cdots :
+g_m` whenever:
 
 * :math:`F_{\text{num}} = G_{\text{num}}`, and
-* :math:`F_{\text{categ}} \subset G_{\text{categ}}`
+* :math:`F_{\text{categ}} \cap G_{\text{categ}} \neq \emptyset`
 
-and furthermore, there is some assignment of values to the factors
-which makes this condition necessary as well as sufficient.
+and furthermore, there is an assignment of values to the factors which
+makes this condition necessary as well as sufficient.
 
   Exercise: Prove it.
 
-Corollary 1: Patsy's strategy of dividing terms into groups based
-on the numerical factors they contain and coding them separately will
-never cause it to either ignore or introduce any structural
-redundancies.
+  Exercise: Show that given a sufficient number of rows, the set of
+  factor assignments on which :math:`f_1 : \cdots : f_n` represents a
+  subspace of :math:`g_1 : \cdots : g_m` without the above conditions
+  being satisfied is actually a zero set.
 
-Corollary 2: Patsy's handling of categorical interactions by
-considering each possible subset will never ignore or introduce any
-structural redundancy.
-
-Conclusion: Patsy satisfies the invariant described above, of
-always producing (structurally) full-rank design matrices whose column
-span includes the vector space represented by every included term.
-
-  Exercise: Show that in a sufficiently high-dimensional space, the
-  set of factor assignments on which :math:`f_1 : \cdots : f_n`
-  represents a subspace of :math:`g_1 : \cdots : g_n` without the
-  above conditions being satisfied is a zero set.
+Corollary: Patsy's strategy of dividing into groups by numerical
+factors, and then comparing all subsets of the remaining categorical
+factors, allows it to precisely identify and avoid structural
+redundancies.
 
 Footnotes
 ---------
@@ -754,6 +769,6 @@ Footnotes
    which produces incorrect output in this case (see
    :ref:`R-comparison`).
 
-.. [#shameless] Yes, I'm shameless.
+.. [#shameless] Yes, I'm lazy. And shameless.
 
 .. [#still-shameless] Yes, still shameless.
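
Review note on the "Technical details" rewrite in doc/formulas.rst: the three steps described there (split each term into minimal pieces, delete pieces a previous term already used, greedily merge with `ANYTHING + ANYTHING : FACTOR- = ANYTHING : FACTOR`) can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions, not Patsy's actual implementation: it handles a single group of purely categorical terms, and the names `minimal_pieces`, `merge_once`, `code_terms` and `show` are hypothetical helpers invented for this sketch.

```python
from itertools import combinations


def minimal_pieces(factors):
    """Return every subset of `factors`, smallest first.

    Each subset stands for an interaction of reduced-rank ("f-") codings;
    the empty subset is the intercept, written "1".
    """
    names = sorted(factors)
    return [frozenset(c)
            for size in range(len(names) + 1)
            for c in combinations(names, size)]


def merge_once(pieces):
    """Apply ANYTHING + ANYTHING:FACTOR- = ANYTHING:FACTOR once, if possible.

    A piece is a frozenset of (factor_name, is_full_rank) pairs.
    Returns the new list of pieces, or None when nothing can be merged.
    """
    for short in pieces:
        for long in pieces:
            extra = long - short
            if short < long and len(extra) == 1:
                (name, is_full_rank), = extra
                if not is_full_rank:
                    merged = short | {(name, True)}
                    rest = [p for p in pieces if p not in (short, long)]
                    return rest + [merged]
    return None


def code_terms(terms):
    """Process terms left to right, dropping pieces an earlier term used."""
    used = set()
    coded = []
    for term in terms:
        fresh = [p for p in minimal_pieces(term) if p not in used]
        used.update(fresh)
        # Every surviving piece starts out with all factors reduced-rank.
        pieces = [frozenset((name, False) for name in p) for p in fresh]
        while True:
            merged = merge_once(pieces)
            if merged is None:
                break
            pieces = merged
        coded.append(pieces)
    return coded


def show(piece):
    """Render a piece in the notation used in the docs, e.g. "a-:b"."""
    if not piece:
        return "1"
    return ":".join(name + ("" if full else "-")
                    for name, full in sorted(piece))


# The formula 1 + b + a:b, written as tuples of categorical factor names.
result = [[show(p) for p in pieces]
          for pieces in code_terms([(), ("b",), ("a", "b")])]
print(result)  # [['1'], ['b-'], ['a-:b']]
```

For `1 + b + a:b` the last term keeps only `a-` and `a-:b-` after deletion, which merge into `a-:b`, matching the figure sequence in the documentation. The merge loop here simply takes the first applicable pair, so it says nothing about whether a greedy order is optimal; that question is still the exercise.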
diff --git a/doc/quickstart.rst b/doc/quickstart.rst
@@ -87,7 +87,7 @@ We can transform variables using arbitrary Python code:
    dmatrix("x1 + np.log(x2 + 10)", data)
 
 Notice that `np.log` is being pulled out of the environment where
-:func:`dmatrix` was called -- if `np.log` is accessible because we did
+:func:`dmatrix` was called -- `np.log` is accessible because we did
 ``import numpy as np`` up above. Any functions or variables that you
 could reference when calling :func:`dmatrix` can also be used inside
 the formula passed to :func:`dmatrix`. For example:
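
Review note on the doc/quickstart.rst wording fix: the corrected sentence says that names such as `np.log` are looked up in the environment of the caller. A rough way to picture that lookup, using only the standard library, is sketched here. It is an assumption-laden illustration rather than how Patsy is implemented: `evaluate_in_caller` and `demo` are hypothetical names, `math.log` stands in for `np.log`, and `sys._getframe` is a CPython implementation detail.

```python
import math
import sys


def evaluate_in_caller(expr, data):
    """Evaluate `expr`, resolving names in `data` first, then the caller's scope."""
    frame = sys._getframe(1)              # the frame that called this function
    namespace = dict(frame.f_globals)     # caller's module-level names
    namespace.update(frame.f_locals)      # caller's local names shadow globals
    namespace.update(data)                # data columns shadow everything else
    return eval(expr, namespace)


def demo():
    log = math.log          # a local name, visible to the expression below
    offset = 10             # another local the expression can reference
    data = {"x2": 5.0}
    return evaluate_in_caller("log(x2 + offset)", data)


value = demo()
print(value)
```

The point of the sketch is only the lookup order: anything the caller could reference at the call site is also available inside the expression string, with the supplied data taking priority.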

diff --git a/doc/sphinxext/requirements.txt b/doc/sphinxext/requirements.txt
diff --git a/setup.py b/setup.py
@@ -3,8 +3,12 @@
 import sys
 from setuptools import setup
 
-DESC = """A Python library for describing statistical models and for
-building model matrices."""
+DESC = """A Python package for describing statistical models and for
+building design matrices."""
+
+LONG_DESC = DESC + """ It is closely inspired by and compatible with the
+  'formula' mini-language used in `R <http://www.r-project.org/>`_ and `S
+  <https://secure.wikimedia.org/wikipedia/en/wiki/S_programming_language>`_."""
 
 # Compatibility code for handling both setuptools and distribute on Python 3,
 # as suggested here: http://packages.python.org/distribute/python3.html
@@ -16,10 +20,21 @@
     name="patsy",
     version="0.1.0",
     description=DESC,
+    long_description=LONG_DESC,
     author="Nathaniel J. Smith",
     author_email="njs@pobox.com",
     license="2-clause BSD",
     packages=["patsy"],
-    url="https://github.com/patsy/patsy",
+    url="https://github.com/pydata/patsy",
     install_requires=["numpy"],
+    classifiers =
+      [ "Development Status :: 4 - Beta",
+        "Intended Audience :: Developers",
+        "Intended Audience :: Science/Research",
+        "Intended Audience :: Financial and Insurance Industry",
+        "License :: OSI Approved :: BSD License",
+        "Programming Language :: Python :: 2",
+        "Programming Language :: Python :: 3",
+        "Topic :: Scientific/Engineering",
+        ],
     **extra)