Problem I want to solve
I've found it easy to generate millions of labels with label functions, but loading them into Snorkel is hard.
The problem is the conversion to augmented format and (for training) the calculation of the O matrix.
Describe the solution you'd like
In addition to letting the user load the full label matrix (n_docs, n_funcs), we could let the user load the indicator matrix (n_docs, n_funcs * n_labels) in sparse format.
e.g. the user would input a list of tuples (doc_id, func_id * num_labels + label_id) and populate a sparse matrix from them.
This makes the L.T @ L calculation cheap, and saves a lot of time and memory building the indicator matrix.
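For illustration, here's a minimal sketch of the idea with scipy.sparse. The tuple values are made up, and the variable names are my own, not Snorkel's:

```python
import numpy as np
from scipy import sparse

n_docs, n_funcs, n_labels = 5, 3, 2

# hypothetical (doc_id, func_id, label_id) triples, e.g. pulled from SQL
triples = [(0, 0, 1), (0, 2, 0), (1, 1, 1), (3, 0, 0), (4, 2, 1)]

# flatten (func_id, label_id) into a single indicator-column index
rows = [doc for doc, func, lab in triples]
cols = [func * n_labels + lab for doc, func, lab in triples]
data = np.ones(len(triples))

L_ind = sparse.csr_matrix((data, (rows, cols)),
                          shape=(n_docs, n_funcs * n_labels))

# the O matrix (L_ind.T @ L_ind / n) never materializes a dense L
O = (L_ind.T @ L_ind) / n_docs
```

The key point is that `L_ind.T @ L_ind` stays sparse-times-sparse, so the O matrix never requires building the dense indicator matrix first.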
PyTorch supports sparse tensors, so we could even do training and inference without the memory overhead of the dense L matrix.
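A sketch of what that could look like with `torch.sparse_coo_tensor` (the `mu` here is just a placeholder parameter for illustration, not Snorkel's actual model parameter):

```python
import torch

n_docs, n_funcs, n_labels = 5, 3, 2

# hypothetical flattened tuples (doc_id, func_id * n_labels + label_id)
entries = [(0, 1), (0, 4), (1, 3), (3, 0), (4, 5)]

indices = torch.tensor(entries).T            # shape (2, nnz)
values = torch.ones(indices.shape[1])
L_ind = torch.sparse_coo_tensor(indices, values,
                                size=(n_docs, n_funcs * n_labels))

# sparse @ dense multiply is differentiable, so training can stay sparse
mu = torch.randn(n_funcs * n_labels, n_labels, requires_grad=True)
out = torch.sparse.mm(L_ind, mu)             # (n_docs, n_labels)
out.sum().backward()                         # gradients flow back to mu
```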
Example:
I calculate and store the label functions in SQL, so it's easy to generate that list of tuples.
Caveat
This would make modelling dependencies between LFs harder, but since _create_tree is degenerate that doesn't seem to be an issue in practice.
Describe alternatives you've considered
The other alternative is some "big-data" solution, but that's a lot of friction for something I can do so simply.
Additional context
I'm implementing this anyway for my own fun; happy to contribute it back if there's interest.
Thanks for suggesting this, @talolard! Updating this operation to accept sparse matrix inputs is something we've had in our backlog, so a PR here is certainly welcome.
Awesome.
API design question:
The way things work now, the fit and predict methods call set_constants, and that doesn't work cleanly with a sparse format.
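One possible direction, just a sketch (set_constants is internal to Snorkel, so the real hook point and signature may differ): derive the constants from the sparse indicator matrix's shape instead of from a dense L.

```python
from scipy import sparse

def constants_from_sparse(L_ind, cardinality):
    """Hypothetical helper: read n and m off the indicator matrix
    instead of scanning a dense (n_docs, n_funcs) label matrix."""
    n, width = L_ind.shape
    if width % cardinality != 0:
        raise ValueError("indicator width must be n_funcs * cardinality")
    m = width // cardinality  # number of label functions
    return n, m

# usage on a toy 5-doc, 3-LF, 2-label indicator matrix
L_ind = sparse.csr_matrix((5, 6))
n, m = constants_from_sparse(L_ind, cardinality=2)
```

The caveat is that the user then has to pass cardinality explicitly, since it can no longer be inferred from the label values themselves.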