Problem I want to solve
I've found it easy to generate millions of labels with label functions, but loading them into Snorkel is hard.
The problem is the conversion to augmented format and (for training) the calculation of the O matrix.
Describe the solution you'd like
In addition to letting the user load the full label matrix (n_docs, n_funcs), we could let the user load the indicator matrix (n_docs, n_funcs * n_labels) in sparse format.
e.g. the user would input a list of tuples (doc_id, func_id * num_labels + label_id) and populate a sparse matrix from them.
This makes the L.T @ L calculation cheap, and saves a lot of time and memory building the indicator matrix.
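For illustration, here's a minimal sketch of the idea with scipy.sparse. The tuple values are made up, and the variable names are my own, not Snorkel's:

```python
import numpy as np
from scipy import sparse

n_docs, n_funcs, n_labels = 5, 3, 2

# hypothetical (doc_id, func_id, label_id) triples, e.g. pulled from SQL
triples = [(0, 0, 1), (0, 2, 0), (1, 1, 1), (3, 0, 0), (4, 2, 1)]

# flatten (func_id, label_id) into a single indicator-column index
rows = [doc for doc, func, lab in triples]
cols = [func * n_labels + lab for doc, func, lab in triples]
data = np.ones(len(triples))

L_ind = sparse.csr_matrix((data, (rows, cols)),
                          shape=(n_docs, n_funcs * n_labels))

# the O matrix (L_ind.T @ L_ind / n) never materializes a dense L
O = (L_ind.T @ L_ind) / n_docs
```

The key point is that `L_ind.T @ L_ind` stays sparse-times-sparse, so the O matrix never requires building the dense indicator matrix first.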
PyTorch supports sparse tensors, so we could even do training and inference without the memory overhead of the dense L matrix.
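A sketch of what that could look like with `torch.sparse_coo_tensor` (the `mu` here is just a placeholder parameter for illustration, not Snorkel's actual model parameter):

```python
import torch

n_docs, n_funcs, n_labels = 5, 3, 2

# hypothetical flattened tuples (doc_id, func_id * n_labels + label_id)
entries = [(0, 1), (0, 4), (1, 3), (3, 0), (4, 5)]

indices = torch.tensor(entries).T            # shape (2, nnz)
values = torch.ones(indices.shape[1])
L_ind = torch.sparse_coo_tensor(indices, values,
                                size=(n_docs, n_funcs * n_labels))

# sparse @ dense multiply is differentiable, so training can stay sparse
mu = torch.randn(n_funcs * n_labels, n_labels, requires_grad=True)
out = torch.sparse.mm(L_ind, mu)             # (n_docs, n_labels)
out.sum().backward()                         # gradients flow back to mu
```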
Example:
I calculate and store the label functions in SQL, so it's easy to generate that list of tuples.
Caveat
This would make modelling dependencies between LFs harder, but since _create_tree is degenerate that doesn't seem to be an issue in practice.
Describe alternatives you've considered
The other alternative is some "big-data" solution, but that's a lot of friction for something I can do so simply.
Additional context
I'm implementing this anyway for my own fun; happy to contribute it back if there's interest.
Thanks for suggesting this, @talolard! Updating this operation to accept sparse matrix inputs is something we've had in our backlog, so a PR here is certainly welcome.
Awesome.
API design question:
The way things work now, the fit and predict methods call set_constants, and that doesn't work cleanly with a sparse format.
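One possible direction, just a sketch (set_constants is internal to Snorkel, so the real hook point and signature may differ): derive the constants from the sparse indicator matrix's shape instead of from a dense L.

```python
from scipy import sparse

def constants_from_sparse(L_ind, cardinality):
    """Hypothetical helper: read n and m off the indicator matrix
    instead of scanning a dense (n_docs, n_funcs) label matrix."""
    n, width = L_ind.shape
    if width % cardinality != 0:
        raise ValueError("indicator width must be n_funcs * cardinality")
    m = width // cardinality  # number of label functions
    return n, m

# usage on a toy 5-doc, 3-LF, 2-label indicator matrix
L_ind = sparse.csr_matrix((5, 6))
n, m = constants_from_sparse(L_ind, cardinality=2)
```

The caveat is that the user then has to pass cardinality explicitly, since it can no longer be inferred from the label values themselves.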