Make Zingg More Usable - Part 1. Blocking #902
Comments
This is a new phase. In
If there is more than one source, we need to group the hashes by source.
See also #893.
What will the run command look like? zingg.sh --phase debugBlocking --conf config.json --zinggDir /location
--zinggDir is optional.
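The per-source grouping mentioned in the comments above could look roughly like the sketch below. It is plain Python rather than Zingg or Spark code, and the record shape (a source name paired with a block hash), the function names, and the sample values are all hypothetical, chosen only to illustrate counting block sizes per source.

```python
from collections import Counter


def block_counts_per_source(records):
    """Count records per (source, block hash) pair.

    `records` is an iterable of (source_name, block_hash) tuples.
    Both fields are hypothetical stand-ins for whatever the blocking
    step actually emits.
    """
    return Counter((source, block_hash) for source, block_hash in records)


def largest_blocks(counts, top_n=3):
    """Return the top_n largest (source, hash) blocks, biggest first."""
    return counts.most_common(top_n)


# Made-up sample: two sources that share the block hash 101.
sample = [
    ("customers", 101), ("customers", 101), ("customers", 202),
    ("vendors", 101), ("vendors", 303), ("vendors", 303), ("vendors", 303),
]
counts = block_counts_per_source(sample)
```

Keying the counts by both source and hash keeps an oversized block in one source from being hidden by the same hash being small in another.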
Sometimes Zingg jobs are slow or fail because of a poorly learnt blocking tree. This can happen for a variety of reasons.
For example, a user may add trainingSamples that are significantly larger than the data labelled through Zingg, or the data may have a natural bias, with many nulls in the columns used for matching. Understanding how blocking is working may be a good step before deciding to run a match or link job.
Let us add a new phase, debugBlocking, which will block the incoming data and write out the results.
We can save results in zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples.
The timestamp is the same for both paths.
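As a minimal sketch of the proposed layout, the two output locations could be derived from a single timestamp so that they always line up. The function name, the timestamp format, and the example model id below are assumptions made for illustration, not anything Zingg defines.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def debug_blocking_paths(zingg_dir, model_id, timestamp=None):
    """Build the counts and blockSamples paths under one shared timestamp.

    The timestamp format (YYYYMMDDHHMMSS, UTC) is an assumption; the
    issue only says the timestamp is the same for both outputs.
    """
    if timestamp is None:
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    base = PurePosixPath(zingg_dir) / str(model_id) / "blocks" / timestamp
    return str(base / "counts"), str(base / "blockSamples")


# Hypothetical model id and a fixed timestamp for a reproducible example.
counts_path, samples_path = debug_blocking_paths(
    "/location", "100", "20240101120000"
)
```

Computing the timestamp once and reusing it for both paths means a run's counts and block samples always sit side by side under the same directory.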