Initial release of Generalised Brown source code.
sean-chester committed Nov 14, 2015
0 parents commit f3ee0ad
Showing 54 changed files with 4,773 additions and 0 deletions.
165 changes: 165 additions & 0 deletions README.md
@@ -0,0 +1,165 @@
## generalised-brown
version 1.0
© 2015 Sean Chester and Leon Derczynski

-------------------------------------------
### Table of Contents

* [Introduction](#introduction)
* [Requirements](#requirements)
* [Installation](#installation)
* [Usage](#usage)
* [License](#license)
* [Contact](#contact)


------------------------------------
### Introduction
<a name="introduction" ></a>

The *generalised-brown* software suite clusters word types by
distributional similarity in two phases. It first generates a list
of merges based on the well-known Brown clustering algorithm and
then recalls historical states to vary the granularity of the
clusters. For example, given the following corpus:

> Alice likes dogs and Bob likes cats while Alice hates snakes and Bob hates spiders

Greedily clustering word types based on *average mutual information*
(i.e., running the *C++ merge generator*) produces the following
merge list (assuming _a_ = _|V|_ = 10):

> snakes spiders 8
> dogs cats 7
> Alice Bob 6
> and while 5
> likes hates 4
> dogs snakes 3
> dogs and 2
> dogs Alice 1
> dogs likes 0

One can then recall any historical state of the computation in order to
produce a set of clusters (i.e., run the *python cluster generator*).
For example, with _c_ = 5, we recall the state _c_ - 1 = 4 to produce
the following clusters:

> {snakes, spiders}
> {dogs, cats}
> {Alice, Bob}
> {likes, hates}
> {and, while}

We refer to this approach (setting separate values of _a_ and _c_) as
*roll-up feature generation*. By contrast, traditional Brown clustering
would produce the following five clusters (equivalent to running the
*C++ merge generator* with _a_ = 5 **and** the *python cluster generator*
with _c_ = 5):

> {likes, hates}
> {snakes, spiders, cats, dogs}
> {and, while}
> {Alice}
> {Bob}

For details about the concepts implemented in this software, please
read our recent AAAI paper:

> L. Derczynski and S. Chester. 2016. "Generalised Brown Clustering
> and Roll-up Feature Generation." In: Proceedings of the
> Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16).
> 7 pages. To appear.

For details about traditional Brown clustering, consult the article
in which it was introduced:

> P. F. Brown et al. 1992. "Class-based n-gram models of natural language."
> Computational Linguistics 18(4): 467--479.

or the implementation that our *C++ merge generator* forked:

> [wcluster](https://github.com/percyliang/brown-cluster).
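
The recall step from the example above can be sketched in a few lines of Python. This is a minimal illustration, not the code shipped in *cluster_generator/*: the names `MERGES` and `recall_clusters` are hypothetical, the merge list is hard-coded from the example, and the sketch assumes (as holds for that list) that every merge names the current representative of both clusters.

```python
# Merge list from the example: (merge_into, merge_from, state).
MERGES = [
    ("snakes", "spiders", 8),
    ("dogs", "cats", 7),
    ("Alice", "Bob", 6),
    ("and", "while", 5),
    ("likes", "hates", 4),
    ("dogs", "snakes", 3),
    ("dogs", "and", 2),
    ("dogs", "Alice", 1),
    ("dogs", "likes", 0),
]


def recall_clusters(merges, c):
    """Replay merges down to state c - 1 and return the resulting clusters."""
    # Start with every word type in its own singleton cluster.
    clusters = {}
    for merge_into, merge_from, _ in merges:
        clusters.setdefault(merge_into, {merge_into})
        clusters.setdefault(merge_from, {merge_from})

    # Apply merges in order, stopping once state c - 1 has been applied.
    for merge_into, merge_from, state in merges:
        if state < c - 1:
            break
        clusters[merge_into] |= clusters.pop(merge_from)

    # Sort for a stable, readable result.
    return sorted(sorted(members) for members in clusters.values())


five_clusters = recall_clusters(MERGES, 5)
```

Here `five_clusters` holds the five two-word clusters listed for _c_ = 5; smaller values of `c` replay more of the merge list.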

------------------------------------
### Requirements
<a name="requirements" ></a>

*generalised-brown* relies on the following applications:

+ For compiling the *C++ merge generator*: a C++ compiler that
supports C++11 and OpenMP (e.g., a recent
[GNU compiler](https://gcc.gnu.org/)) and the *make* program

+ For running the *python cluster generator*: a *python*
interpreter

------------------------------------
### Installation
<a name="installation" ></a>

The *python cluster generator* does not need to be compiled.
To compile the *C++ merge generator*, navigate to the
*merge_generator/* subdirectory of the project and type:

>make

------------------------------------
### Usage
<a name="usage" ></a>

To produce a set of features for a corpus, first use Generalised Brown
(i.e., the *C++ merge generator*) to create a merge list. Then create
_c_ clusters by running the *python cluster generator* on the merge list.
This second step can be repeated for as many values of _c_ as you like,
but we recommend that each value of _c_ be no larger than the value of
_a_ used to generate the merge list.

To run the *C++ merge generator*, type:

>./merge_generator/wcluster --text [input_file] --a [active_set_size]

The resultant merges will be recorded in:

>./[input_file]-c[active_set_size]-p1.out/merges

To run the *python cluster generator* (here with _c_ = 3), type:

>python ./cluster_generator/cluster.py -in ./[input_file]-c[active_set_size]-p1.out/merges -c 3

Each word type will be printed to *stdout* on its own line, preceded by its
cluster id (a bit-string path) and a tab.
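
To consume that output in another program, the lines can be grouped by cluster id. The sketch below assumes the format printed by *cluster.py* (path, tab, word type); the function name `group_by_path` and the sample lines are illustrative only.

```python
from collections import defaultdict


def group_by_path(lines):
    """Group word types by the path (cluster id) printed before each one."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        path, word = line.split("\t", 1)
        groups[path].append(word)
    return dict(groups)


# Hypothetical sample of cluster generator output.
sample = ["0000\tdogs\n", "0000\tcats\n", "1\tlikes\n", "1\thates\n"]
clusters = group_by_path(sample)
```

In practice the lines would typically come from a file or from `sys.stdin` when the cluster generator's output is piped in.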

The *C++ merge generator* runs in _O(|V| a^2)_ time, where _|V|_ is the number
of distinct word types in the corpus (i.e., the size of the vocabulary) and
_a_ is a bound on the algorithm's search space. The *python cluster generator*
runs in _O(|V|)_ time.
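
The linear running time of the cluster generator is consistent with its design: one reverse pass over the merge list, with one dictionary update per merge. The sketch below mirrors that pass as a pure function so it can be tried without input files. The name `assign_paths` and its parameters are hypothetical, and the merge list is the one from the Introduction (which has _|V|_ - 1 = 9 merges).

```python
def assign_paths(merges, num_classes, depth=None):
    """Map each word type to a bit-string path by reading merges in reverse."""
    if depth is None:
        depth = num_classes  # same default as the script
    remaining = num_classes  # leaves still available to split off
    paths = {}
    for merge_into, merge_from in reversed(merges):
        if merge_into not in paths:
            # Last merge in the list: the root splits into two leaves.
            paths[merge_into] = "0"
            paths[merge_from] = "1"
            remaining -= 2
        elif remaining > 0:
            parent = paths[merge_into]
            if len(parent) < depth:
                # Undo the merge: split the leaf into two children.
                paths[merge_from] = parent + "1"
                paths[merge_into] = parent + "0"
            else:
                # Depth limit reached: keep both words on the same leaf.
                paths[merge_from] = parent
            remaining -= 1
        else:
            # Leaf budget spent: the merged word stays in the same cluster.
            paths[merge_from] = paths[merge_into]
    return paths


merges = [
    ("snakes", "spiders"), ("dogs", "cats"), ("Alice", "Bob"),
    ("and", "while"), ("likes", "hates"), ("dogs", "snakes"),
    ("dogs", "and"), ("dogs", "Alice"), ("dogs", "likes"),
]
paths = assign_paths(merges, 5)
```

With `num_classes` set to 5, word types that share a path form the same five clusters as in the Introduction.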


------------------------------------
### License
<a name="license"></a>

This software consists of two sub-modules, each released under a
different license:

+ The *python cluster generator* is subject to the terms of
[The MIT License](http://opensource.org/licenses/MIT)

+ The *C++ merge generator* follows the original licensing terms
of [wcluster](https://github.com/percyliang/brown-cluster).

See the relevant sub-directories of this repository for the
specific details of each license.



------------------------------------
### Contact
<a name="contact"></a>

This software suite will undergo a major revision, so you are encouraged
to check that this is still the latest version. Please do not hesitate to
contact the authors if you have comments, questions, or bugs to report.

>[generalised-brown on GitHub](https://github.com/sean-chester/generalised-brown)

------------------------------------
19 changes: 19 additions & 0 deletions cluster_generator/LICENSE.md
@@ -0,0 +1,19 @@
Copyright (c) 2015 Sean Chester and Leon Derczynski

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
70 changes: 70 additions & 0 deletions cluster_generator/cluster.py
@@ -0,0 +1,70 @@
#!/usr/bin/env python
# cluster.py
# &copy; Sean Chester (sean.chester@idi.ntnu.no)
# 22 July 2015

import csv
import argparse

# Input parsing
parser = argparse.ArgumentParser(
    description='Prints out a tree with a specified number of leaves, given an ' + \
    'input file with an ordered list of merges. Each unique path identifies ' + \
    'one leaf. All word types that have the same path as each other belong to the ' + \
    'same leaf (and correspond to one Brown cluster).', \
    epilog='If the output is to be read by humans, consider piping results to ' + \
    'the sort command to print the leaves in depth-first order. (Then ' + \
    'similar leaves/clusters will appear nearer each other in the output.)')
parser.add_argument(
    '-in', '--input-file', \
    help="Input file containing ordered merges", \
    required=True, \
    dest='input', \
    metavar='INPUT_FILE')
parser.add_argument(
    '-c', '--num-classes', \
    type=int, \
    help="Number of leaves/classes/clusters to produce", \
    required=True, \
    dest='leaves', \
    metavar='NUM_CLASSES')
parser.add_argument(
    '-d', '--depth', \
    type=int, \
    help="Truncation depth for paths (i.e., no leaf appears farther than d-1 hops from the " + \
    "root). Note: setting this parameter likely results in fewer than NUM_CLASSES leaves, " + \
    "because the --num-classes filter is (logically) applied first.", \
    required=False, \
    dest='depth')
args = parser.parse_args()

# If depth wasn't passed as a parameter, give it a default value of being
# equal to --num-classes.
if args.depth is None:
    args.depth = args.leaves

# Actual processing -- read merge list in reverse and map each encountered
# word type onto a tree path in a dictionary.
tree = {}
with open( args.input ) as tsv:
    for line in reversed(list(csv.reader(tsv, delimiter="\t", quotechar=None))):
        merge_into = line[0]
        merge_from = line[1]
        if merge_into not in tree:
            # Last merge in the file: the root splits into two leaves.
            tree[merge_into] = "0"
            tree[merge_from] = "1"
            args.leaves = args.leaves - 2
        elif args.leaves > 0:
            # Leaves remain: undo the merge unless the depth limit is reached.
            parent = tree[merge_into]
            if len( parent ) < args.depth:
                tree[merge_from] = parent + "1"
                tree[merge_into] = parent + "0"
            else:
                tree[merge_from] = parent
            args.leaves = args.leaves - 1
        else:
            # No leaves remain: the merged word type shares its partner's path.
            tree[merge_from] = tree[merge_into]

# Print one line per word type: its path (cluster id), a tab, then the word.
for (word, path) in tree.items():
    print( path + "\t" + word )

22 changes: 22 additions & 0 deletions merge_generator/CHANGE_LOG.md
@@ -0,0 +1,22 @@
# Change Log

--------------------

## 1.3.1: [Sean Chester](https://github.com/sean-chester)
+ Added conceptual generalisation whereby every merge is logged so that
historical states can be recalled with ../cluster_generator/cluster.py.
+ Added more parallelism (courtesy of
[Kenneth S Bøgh](https://dk.linkedin.com/in/kenneth-sejdenfaden-bøgh-58915524)).
+ Aliased the input parameter _c_ as _a_ to fit the conceptual generalisation
(while maintaining backwards compatibility).

## 1.3: [Percy Liang](https://github.com/percyliang)
+ Compatibility updates for newer versions of g++ (courtesy of Chris Dyer).

## 1.2: [Percy Liang](https://github.com/percyliang)
+ Made compatible with MacOS (replaced timespec with timeval and changed order of linking).

## 1.1: [Percy Liang](https://github.com/percyliang)
+ Removed deprecated operators so it works with GCC 4.3.

--------------------
15 changes: 15 additions & 0 deletions merge_generator/LICENSE.md
@@ -0,0 +1,15 @@
(C) Copyright 2015 [Sean Chester](https://github.com/sean-chester)
and [Leon Derczynski](http://derczynski.com/)
(C) Copyright 2007-2012, Percy Liang

http://cs.stanford.edu/~pliang

Permission is granted for anyone to copy, use, or modify these programs and
accompanying documents for purposes of research or education, provided this
copyright notice is retained, and note is made of any changes that have been
made.

These programs and documents are distributed without any warranty, express or
implied. As the programs were written for research purposes only, they have
not been tested to the degree that would be advisable in any important
application. All use of these programs is entirely at the user's own risk.
16 changes: 16 additions & 0 deletions merge_generator/Makefile
@@ -0,0 +1,16 @@
# 1.2: need to make sure opt.o goes in the right order to get the right scope on the command-line arguments
# Use this for Linux
ifeq ($(shell uname),Linux)
files=$(subst .cc,.o,basic/logging.cc $(shell /bin/ls *.cc) $(shell /bin/ls basic/*.cc | grep -v logging.cc))
else
files=$(subst .cc,.o,basic/opt.cc $(shell /bin/ls *.cc) $(shell /bin/ls basic/*.cc | grep -v opt.cc))
endif

wcluster: $(files)
	g++ -Wall -g -std=c++0x -O3 -fopenmp -o wcluster $(files) -lpthread

%.o: %.cc
	g++ -Wall -g -O3 -fopenmp -std=c++0x -o $@ -c $<

clean:
	rm wcluster basic/*.o *.o