Initial release of Generalised Brown source code.
0 parents
commit f3ee0ad
Showing 54 changed files with 4,773 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,165 @@ | ||
## generalised-brown | ||
version 1.0 | ||
© 2015 Sean Chester and Leon Derczynski | ||
|
||
------------------------------------------- | ||
### Table of Contents | ||
|
||
* [Introduction](#introduction) | ||
* [Requirements](#requirements) | ||
* [Installation](#installation) | ||
* [Usage](#usage) | ||
* [License](#license) | ||
* [Contact](#contact) | ||
|
||
|
||
------------------------------------ | ||
### Introduction | ||
<a name="introduction" ></a> | ||
|
||
The *generalised-brown* software suite clusters word types by | ||
distributional similarity in two phases. It first generates a list | ||
of merges based on the well-known Brown clustering algorithm and | ||
then recalls historical states to vary the granularity of the | ||
clusters. For example, given the following corpus: | ||
|
||
> Alice likes dogs and Bob likes cats while Alice hates snakes and Bob hates spiders | ||
Greedily clustering word types based on *average mutual information* | ||
(i.e., running the *C++ merge generator*) produces the following | ||
merge list (assuming _a_ = _|V|_ = 10): | ||
|
||
> snakes spiders 8 | ||
> dogs cats 7 | ||
> Alice Bob 6 | ||
> and while 5 | ||
> likes hates 4 | ||
> dogs snakes 3 | ||
> dogs and 2 | ||
> dogs Alice 1 | ||
> dogs likes 0 | ||
One can then recall any historical state of the computation in order to | ||
produce a set of clusters (i.e., run the *python cluster generator*). | ||
For example, with _c_ = 5, we recall the state _c_ - 1 = 4 to produce | ||
the following clusters: | ||
|
||
> {snakes, spiders} | ||
> {dogs, cats} | ||
> {Alice, Bob} | ||
> {likes, hates} | ||
> {and, while} | ||
We refer to this approach (setting separate values of _a_ and _c_) as | ||
*Roll-up feature generation*. By contrast, traditional Brown clustering | ||
would produce the following five clusters (equivalent to running the | ||
*C++ merge generator* with _a_ = 5 **and** the *python cluster generator* | ||
with _c_ = 5): | ||
|
||
> {likes, hates} | ||
> {snakes, spiders, cats, dogs} | ||
> {and, while} | ||
> {Alice} | ||
> {Bob} | ||
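As a rough illustration of the recall step in the roll-up example above, the sketch below replays the example merge list until _c_ clusters remain. It assumes _a_ = _|V|_ (so every word type starts in its own cluster and each merge removes exactly one cluster) and treats the third column as a state index. The helper `recall_clusters` is hypothetical and not part of this suite; the *python cluster generator* itself assigns tree paths instead of replaying merges.

```python
# Merge list from the example above: (merge_into, merge_from, state_index).
MERGES = [
    ("snakes", "spiders", 8),
    ("dogs", "cats", 7),
    ("Alice", "Bob", 6),
    ("and", "while", 5),
    ("likes", "hates", 4),
    ("dogs", "snakes", 3),
    ("dogs", "and", 2),
    ("dogs", "Alice", 1),
    ("dogs", "likes", 0),
]


def recall_clusters(merges, c):
    """Replay merges in order, stopping once c clusters remain."""
    # Assumption: every word type starts as its own singleton cluster,
    # keyed by the word that represents the cluster.
    clusters = {}
    for merge_into, merge_from, _state in merges:
        clusters.setdefault(merge_into, {merge_into})
        clusters.setdefault(merge_from, {merge_from})

    for merge_into, merge_from, _state in merges:
        if len(clusters) <= c:
            break  # reached the requested historical state
        # Fold the absorbed cluster into the surviving representative.
        clusters[merge_into] |= clusters.pop(merge_from)

    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    for members in recall_clusters(MERGES, c=5):
        print("{" + ", ".join(members) + "}")
```

With `c=5` the loop stops after the merge labelled 4, which corresponds to the five two-word clusters listed for the roll-up example.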
For details about the concepts implemented in this software, please | ||
read our recent AAAI paper: | ||
|
||
> L. Derczynski and S. Chester. 2016. "Generalised Brown Clustering | ||
> and Roll-up Feature Generation." In: Proceedings of the | ||
> Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16). | ||
> 7 pages. To appear. | ||
For details about traditional Brown clustering, consult the article | ||
in which it was introduced: | ||
|
||
> PF Brown et al. 1992. "Class-based n-gram models of natural language." | ||
> Computational Linguistics 18(4): 467--479. | ||
or the implementation that our *C++ merge generator* forked: | ||
|
||
> [wcluster](https://github.com/percyliang/brown-cluster). | ||
|
||
------------------------------------ | ||
### Requirements | ||
<a name="requirements" ></a> | ||
|
||
*generalised-brown* relies on the following applications: | ||
|
||
+ For compiling the *C++ merge generator*: A C++ compiler that | ||
is compatible with C++11 and OpenMP (e.g., a recent | ||
[GNU compiler](https://gcc.gnu.org/)) and the *make* program | ||
|
||
+ For running the *python cluster generator*: A *python* | ||
interpreter | ||
|
||
------------------------------------ | ||
### Installation | ||
<a name="installation" ></a> | ||
|
||
The *python cluster generator* does not need to be compiled. | ||
To compile the *C++ merge generator*, navigate to the | ||
*merge_generator/* subdirectory of the project and type: | ||
|
||
>make | ||
------------------------------------ | ||
### Usage | ||
<a name="usage" ></a> | ||
|
||
To produce a set of features for a corpus, first use | ||
Generalised Brown (i.e., the *C++ merge generator*) to create a merge list. | ||
Then create _c_ clusters by running the *python cluster generator* | ||
on the merge list. The second step can be repeated for as many values of _c_ | ||
as you like, but we recommend that each value of _c_ be no larger than the | ||
value of _a_ used to generate the merge list. | ||
|
||
To run the *C++ merge generator*, type: | ||
|
||
>./merge_generator/wcluster --text [input_file] --a [active_set_size] | ||
The resultant merges will be recorded in: | ||
|
||
>./[input_file]-c[active_set_size]-p1.out/merges | ||
To run the *python cluster generator*, type: | ||
|
||
>python ./cluster_generator/cluster.py -in ./[input_file]-c[active_set_size]-p1.out/merges -c 3 | ||
Each word type is printed to *stdout* on its own line, preceded by its cluster id (a tree path) and a tab. | ||
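Because `cluster.py` prints one tab-separated tree path and word type per line, the output can be grouped back into clusters with a few lines of Python. This is a minimal sketch; `group_by_path` and the sample lines are illustrative and not part of the software suite.

```python
from collections import defaultdict


def group_by_path(lines):
    """Map each tree path (cluster id) to the sorted word types under it."""
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # cluster.py writes "<path>\t<word type>" on each line.
        path, word = line.split("\t", 1)
        groups[path].append(word)
    return {path: sorted(words) for path, words in groups.items()}


if __name__ == "__main__":
    # Hypothetical output lines, for illustration only.
    sample = ["0000\tdogs", "0000\tcats", "1\tlikes", "1\thates"]
    for path, words in sorted(group_by_path(sample).items()):
        print(path, words)
```

In practice the `lines` argument could be `sys.stdin`, with the cluster generator's output piped into the script.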
|
||
The *C++ merge generator* runs in _O(|V| a^2)_ time, where _|V|_ is the number | ||
of distinct word types in the corpus (i.e., the size of the vocabulary) and | ||
_a_ is a bound on the algorithm's search space. The *python cluster generator* | ||
runs in _O(|V|)_ time. | ||
|
||
|
||
------------------------------------ | ||
### License | ||
<a name="license"></a> | ||
|
||
This software consists of two sub-modules, each released under a | ||
different license: | ||
|
||
+ The *python cluster generator* is subject to the terms of | ||
[The MIT License](http://opensource.org/licenses/MIT) | ||
|
||
+ The *C++ merge generator* follows the original licensing terms | ||
of [wcluster](https://github.com/percyliang/brown-cluster). | ||
|
||
See the relevant sub-directories of this repository for the | ||
specific details of each license. | ||
|
||
|
||
|
||
------------------------------------ | ||
### Contact | ||
<a name="contact"></a> | ||
|
||
This software suite will undergo a major revision, so please check | ||
that this is still the latest version. Please do not hesitate to | ||
contact the authors if you have comments, questions, or bugs to report. | ||
>[generalised-brown on GitHub](https://github.com/sean-chester/generalised-brown) | ||
------------------------------------ |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
Copyright (c) 2015 Sean Chester and Leon Derczynski | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,70 @@ | ||
#!/usr/bin/env python | ||
# cluster.py | ||
# © Sean Chester (sean.chester@idi.ntnu.no) | ||
# 22 July 2015 | ||
|
||
import csv | ||
import argparse | ||
|
||
# Input parsing | ||
parser = argparse.ArgumentParser( | ||
description='Prints out a tree with a specified number of leaves, given an ' + \ | ||
'input file with an ordered list of merges. Each unique path identifies ' + \ | ||
'one leaf. All word types that have the same path as each other belong to the ' + \ | ||
'same leaf (and correspond to one Brown cluster).', \ | ||
epilog='If the output is to be read by humans, consider piping results to ' + \ | ||
'the sort command to print the leaves in depth-first order. (Then ' + \ | ||
'similar leaves/clusters will appear nearer each other in the output.)') | ||
parser.add_argument( | ||
'-in', '--input-file', \ | ||
help="Input file containing ordered merges", \ | ||
required=True, \ | ||
dest='input', \ | ||
metavar='INPUT_FILE') | ||
parser.add_argument( | ||
'-c', '--num-classes', \ | ||
type=int, \ | ||
help="Number of leaves/classes/clusters to produce", \ | ||
required=True, \ | ||
dest='leaves', \ | ||
metavar='NUM_CLASSES') | ||
parser.add_argument( | ||
'-d', '--depth', \ | ||
type=int, \ | ||
help="Truncation depth for paths (i.e., no leaf appears farther than d-1 hops from the " + \ | ||
"root). Note: setting this parameter likely results in fewer than NUM_CLASSES leaves, " + \ | ||
"because the --num-classes filter is (logically) applied first.", \ | ||
required=False, \ | ||
dest='depth') | ||
args = parser.parse_args() | ||
|
||
# If depth wasn't passed as a parameter, give it a default value of being | ||
# equal to --num-classes. | ||
if args.depth is None: | ||
    args.depth = args.leaves | ||
|
||
# Actual processing -- read merge list in reverse and map each encountered | ||
# word type onto a tree path in a dictionary. | ||
tree = {} | ||
with open( args.input ) as tsv: | ||
    for line in reversed(list(csv.reader(tsv, delimiter="\t", quotechar=None))): | ||
        merge_into = line[0] | ||
        merge_from = line[1] | ||
        if merge_into not in tree: | ||
            # Last merge in the list: the root splits into its two children. | ||
            tree[merge_into] = "0" | ||
            tree[merge_from] = "1" | ||
            args.leaves = args.leaves - 2 | ||
        elif args.leaves > 0: | ||
            # Still have leaves to spend: split this node unless depth is exhausted. | ||
            parent = tree[merge_into] | ||
            if len( parent ) < args.depth: | ||
                tree[merge_from] = parent + "1" | ||
                tree[merge_into] = parent + "0" | ||
            else: | ||
                tree[merge_from] = parent | ||
            args.leaves = args.leaves - 1 | ||
        else: | ||
            # No leaves left: the merged word type shares its parent's path. | ||
            tree[merge_from] = tree[merge_into] | ||
|
||
for (cluster, path) in tree.items(): | ||
    print( path + "\t" + cluster ) | ||
|
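To make the behaviour of the script above easier to follow, here is a sketch that wraps the same path-assignment logic in a function and runs it on the merge list from the README example. The function name `assign_paths` and the in-memory `MERGES` list are assumptions for illustration; the script itself reads the merges from a tab-separated file.

```python
# README example merges as (merge_into, merge_from), in the order generated.
MERGES = [
    ("snakes", "spiders"),
    ("dogs", "cats"),
    ("Alice", "Bob"),
    ("and", "while"),
    ("likes", "hates"),
    ("dogs", "snakes"),
    ("dogs", "and"),
    ("dogs", "Alice"),
    ("dogs", "likes"),
]


def assign_paths(merges, leaves, depth=None):
    """Walk the merges in reverse and map each word type to a tree path."""
    if depth is None:
        depth = leaves  # same default as the script
    tree = {}
    for merge_into, merge_from in reversed(merges):
        if merge_into not in tree:
            # Final merge: the root splits into two children.
            tree[merge_into] = "0"
            tree[merge_from] = "1"
            leaves -= 2
        elif leaves > 0:
            parent = tree[merge_into]
            if len(parent) < depth:
                # Undo the merge by splitting the node into two children.
                tree[merge_from] = parent + "1"
                tree[merge_into] = parent + "0"
            else:
                # Depth limit reached: keep both under the same path.
                tree[merge_from] = parent
            leaves -= 1
        else:
            # Leaf budget spent: merged word types share one path.
            tree[merge_from] = tree[merge_into]
    return tree


if __name__ == "__main__":
    for word, path in sorted(assign_paths(MERGES, leaves=5).items(),
                             key=lambda item: item[1]):
        print(path + "\t" + word)
```

With `leaves=5` this yields five distinct paths, one per cluster in the README's roll-up example; passing `depth=2` as well truncates the paths and leaves only three distinct ones, which illustrates why setting the depth can produce fewer than NUM_CLASSES leaves.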
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,22 @@ | ||
# Change Log | ||
|
||
-------------------- | ||
|
||
## 1.3.1: [Sean Chester](https://github.com/sean-chester) | ||
+ Added conceptual generalisation whereby every merge is logged so that | ||
historical states can be recalled with ../cluster_generator/cluster.py. | ||
+ Added more parallelism (courtesy of | ||
[Kenneth S Bøgh](https://dk.linkedin.com/in/kenneth-sejdenfaden-bøgh-58915524)). | ||
+ Aliased the input parameter _c_ as _a_ to fit the conceptual generalisation | ||
(while maintaining backwards compatibility). | ||
|
||
## 1.3: [Percy Liang](https://github.com/percyliang) | ||
+ compatibility updates for newer versions of g++ (courtesy of Chris Dyer). | ||
|
||
## 1.2: [Percy Liang](https://github.com/percyliang) | ||
+ make compatible with MacOS (replaced timespec with timeval and changed order of linking). | ||
|
||
## 1.1: [Percy Liang](https://github.com/percyliang) | ||
+ Removed deprecated operators so it works with GCC 4.3. | ||
|
||
-------------------- |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
(C) Copyright 2015 [Sean Chester](https://github.com/sean-chester) | ||
and [Leon Derczynski](http://derczynski.com/) | ||
(C) Copyright 2007-2012, Percy Liang | ||
|
||
http://cs.stanford.edu/~pliang | ||
|
||
Permission is granted for anyone to copy, use, or modify these programs and | ||
accompanying documents for purposes of research or education, provided this | ||
copyright notice is retained, and note is made of any changes that have been | ||
made. | ||
|
||
These programs and documents are distributed without any warranty, express or | ||
implied. As the programs were written for research purposes only, they have | ||
not been tested to the degree that would be advisable in any important | ||
application. All use of these programs is entirely at the user's own risk. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
# 1.2: need to make sure opt.o goes in the right order to get the right scope on the command-line arguments | ||
# Use this for Linux | ||
ifeq ($(shell uname),Linux) | ||
files=$(subst .cc,.o,basic/logging.cc $(shell /bin/ls *.cc) $(shell /bin/ls basic/*.cc | grep -v logging.cc)) | ||
else | ||
files=$(subst .cc,.o,basic/opt.cc $(shell /bin/ls *.cc) $(shell /bin/ls basic/*.cc | grep -v opt.cc)) | ||
endif | ||
|
||
wcluster: $(files) | ||
	g++ -Wall -g -std=c++0x -O3 -fopenmp -o wcluster $(files) -lpthread | ||
|
||
%.o: %.cc | ||
	g++ -Wall -g -O3 -fopenmp -std=c++0x -o $@ -c $< | ||
|
||
clean: | ||
	rm wcluster basic/*.o *.o |