Make Zingg More Usable - Part 1. Blocking #902

Closed
sonalgoyal opened this issue Oct 2, 2024 · 5 comments

@sonalgoyal
Member

sonalgoyal commented Oct 2, 2024

Sometimes Zingg jobs are slow or fail because of a poorly learnt blocking tree. This can happen for a variety of reasons.
For example, when a user adds significantly more trainingSamples than the labels learnt through Zingg, or when the data has a natural bias, such as many nulls in the columns used for matching. Understanding how blocking is working may be a good step before deciding to run a match or link job.

Let us add a new phase, debugBlocking, which will block the incoming data and output:

  • Counts per block, for example: getPipeUtil().write(blocked.select(ColName.HASH_COL).groupByCount(ColName.HASH_COL, ColName.HASH_COL + "_count"), getPipeForDebugBlockingLocation(timestamp));
  • 10% of the records from the top 3 blocks by count, so that people can see which records are contributing to the issue and add appropriate training
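
A rough sketch of the two outputs, written with plain Java collections rather than Zingg's dataset API, just to pin down the intent. Everything here (BlockingDebugSketch, BlockedRow, countsPerBlock, topBlocks, sampleTopBlocks) is a made-up name for illustration, and the seeded random sample is only an assumption about how the 10% might be drawn.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical plain-Java stand-in; the real phase would work on Zingg's dataset abstraction.
class BlockingDebugSketch {

    // A blocked record: an id plus the hash assigned by the blocking tree.
    static final class BlockedRow {
        final String id;
        final int hash;
        BlockedRow(String id, int hash) { this.id = id; this.hash = hash; }
    }

    // Counts per block: how many records share each hash.
    static Map<Integer, Long> countsPerBlock(List<BlockedRow> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(row -> row.hash, Collectors.counting()));
    }

    // Hashes of the n largest blocks, largest first; ties are broken by hash value.
    static List<Integer> topBlocks(Map<Integer, Long> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<Integer, Long>comparingByValue(Comparator.reverseOrder())
                        .thenComparing(Map.Entry.<Integer, Long>comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    // Keeps each record of the n largest blocks with probability `fraction`.
    static List<BlockedRow> sampleTopBlocks(List<BlockedRow> rows, int n, double fraction, long seed) {
        Set<Integer> top = new HashSet<>(topBlocks(countsPerBlock(rows), n));
        Random random = new Random(seed);
        List<BlockedRow> sample = new ArrayList<>();
        for (BlockedRow row : rows) {
            if (top.contains(row.hash) && random.nextDouble() < fraction) {
                sample.add(row);
            }
        }
        return sample;
    }

    public static void main(String[] args) {
        List<BlockedRow> rows = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            // Skewed toy data: hash 0 receives half of the records.
            rows.add(new BlockedRow("r" + i, i % 2 == 0 ? 0 : i % 7));
        }
        System.out.println("counts per block: " + countsPerBlock(rows));
        System.out.println("sampled records: " + sampleTopBlocks(rows, 3, 0.1, 42L).size());
    }
}
```

A random sample will not return exactly 10% of the records, which is probably fine for eyeballing which records land in the oversized blocks.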

We can save the results in zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples, using the same timestamp for both.
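
To make the layout concrete, here is a small sketch of how the two locations could be derived from one shared timestamp. DebugBlockingPaths and its methods are hypothetical helpers, and using epoch milliseconds for the timestamp is an assumption; the issue does not say which format is intended.

```java
// Hypothetical helper that builds the two output locations from a shared timestamp.
class DebugBlockingPaths {

    // zinggDir/modelId/blocks/timestamp
    static String blocksDir(String zinggDir, String modelId, long timestamp) {
        return String.join("/", zinggDir, modelId, "blocks", Long.toString(timestamp));
    }

    static String countsPath(String zinggDir, String modelId, long timestamp) {
        return blocksDir(zinggDir, modelId, timestamp) + "/counts";
    }

    static String blockSamplesPath(String zinggDir, String modelId, long timestamp) {
        return blocksDir(zinggDir, modelId, timestamp) + "/blockSamples";
    }

    public static void main(String[] args) {
        // Taken once so that both outputs of a run land under the same directory.
        long timestamp = System.currentTimeMillis();
        System.out.println(countsPath("/tmp/zingg", "100", timestamp));
        System.out.println(blockSamplesPath("/tmp/zingg", "100", timestamp));
    }
}
```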

@sonalgoyal
Member Author

This is a new phase.
Define a new class Blocker which has the logic for blocking copied from Matcher. It will take the blocking tree and return blocks.
In Matcher.getBlocked, call new Blocker<S,D,R,C,T>().getBlocked(getBlockingTreeUtil())

In BlockingTreeDebugger, call the same.
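
The shape of that refactoring, sketched with heavily simplified types. The real Blocker would be generic over <S,D,R,C,T> and take Zingg's blocking tree; here the tree is replaced by a plain function from record to hash, and all class names carry a Sketch suffix to mark them as hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToIntFunction;

// Hypothetical: holds the blocking logic in one place so both phases can reuse it.
class BlockerSketch<R> {
    // Stand-in for the learnt blocking tree: maps a record to its block hash.
    private final ToIntFunction<R> blockingTree;

    BlockerSketch(ToIntFunction<R> blockingTree) {
        this.blockingTree = blockingTree;
    }

    // Returns one hash per input record, in input order.
    List<Integer> getBlocked(List<R> records) {
        List<Integer> hashes = new ArrayList<>();
        for (R record : records) {
            hashes.add(blockingTree.applyAsInt(record));
        }
        return hashes;
    }
}

// Hypothetical matcher: delegates blocking instead of owning the logic.
class MatcherSketch<R> {
    private final ToIntFunction<R> blockingTree;

    MatcherSketch(ToIntFunction<R> blockingTree) {
        this.blockingTree = blockingTree;
    }

    List<Integer> getBlocked(List<R> records) {
        return new BlockerSketch<>(blockingTree).getBlocked(records);
    }
}

// Hypothetical debugger phase: calls the same Blocker, so it sees what the matcher sees.
class BlockingTreeDebuggerSketch<R> {
    private final ToIntFunction<R> blockingTree;

    BlockingTreeDebuggerSketch(ToIntFunction<R> blockingTree) {
        this.blockingTree = blockingTree;
    }

    List<Integer> getBlocked(List<R> records) {
        return new BlockerSketch<>(blockingTree).getBlocked(records);
    }
}
```

The point of the shared class is that the debug phase cannot drift from the matcher: both produce blocks through the same code path.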

@sonalgoyal
Member Author

If there is more than one source, we need to group the hashes by source.
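
A minimal sketch of the per-source grouping, again in plain Java rather than the Zingg dataset API. PerSourceBlockCounts, SourcedRow and countsPerSource are hypothetical names, and tagging each record with its source as a string is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical: block counts kept separately for each input source.
class PerSourceBlockCounts {

    // A blocked record tagged with the source it came from.
    static final class SourcedRow {
        final String source;
        final int hash;
        SourcedRow(String source, int hash) { this.source = source; this.hash = hash; }
    }

    // source -> (hash -> number of records from that source in that block)
    static Map<String, Map<Integer, Long>> countsPerSource(List<SourcedRow> rows) {
        Map<String, Map<Integer, Long>> counts = new TreeMap<>();
        for (SourcedRow row : rows) {
            counts.computeIfAbsent(row.source, s -> new TreeMap<>())
                  .merge(row.hash, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<SourcedRow> rows = new ArrayList<>();
        rows.add(new SourcedRow("crm", 1));
        rows.add(new SourcedRow("crm", 1));
        rows.add(new SourcedRow("erp", 1));
        System.out.println(countsPerSource(rows));
    }
}
```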

@sonalgoyal
Member Author

see also #893

@sania-16
Contributor

sania-16 commented Oct 6, 2024

zingg.sh --phase debugBlocking --conf config.json --zinggDir /location

what will the run command look like?

@sonalgoyal
Member Author

--zinggDir is optional

sania-16 added a commit to sania-16/zingg that referenced this issue Oct 7, 2024
sania-16 added a commit to sania-16/zingg that referenced this issue Oct 7, 2024
@sonalgoyal sonalgoyal added this to the 0.5.0 milestone Oct 16, 2024
sania-16 added a commit to sania-16/zingg that referenced this issue Oct 21, 2024
@sonalgoyal sonalgoyal added this to 0.5.0 Oct 23, 2024
@sonalgoyal sonalgoyal moved this to Todo in 0.5.0 Oct 23, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in 0.5.0 Nov 13, 2024