Make Zingg More Usable - Part 1. Blocking #902

sonalgoyal · 2024-10-02T08:50:25Z

Sometimes Zingg jobs are slow or fail because of a poorly learnt blocking tree. This can happen for a variety of reasons.
For example, when a user adds significantly more trainingSamples than the labeled data learnt through Zingg, or when the data has a natural bias, with lots of nulls in the columns used for matching. Understanding how blocking is working may be a good step before deciding to run a match or link job.

Let us add a new phase, debugBlocking, which will block the incoming data and output:

1. Counts per block:
   getPipeUtil().write(blocked.select(ColName.HASH_COL).groupByCount(ColName.HASH_COL, ColName.HASH_COL + "_count"), getPipeForDebugBlockingLocation(timestamp));
2. 10% of the records from each of the top 3 blocks by count, so that people can see which records are contributing to the issue and add appropriate training data.

We can save results in zinggDir/modelId/blocks/timestamp/counts and zinggDir/modelId/blocks/timestamp/blockSamples

The timestamp is the same for both locations.
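The two outputs described above can be sketched in plain Java collections, without Zingg's or Spark's APIs. Everything here is an assumption made for illustration: the class `BlockDebugSketch`, the `HashedRecord` shape, the method names, and taking the first 10% of each block rather than a random sample. It is not the code that went into Zingg.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BlockDebugSketch {

    /** Hypothetical record shape: an id plus the block hash assigned to it. */
    public static final class HashedRecord {
        public final String id;
        public final int hash;

        public HashedRecord(String id, int hash) {
            this.id = id;
            this.hash = hash;
        }
    }

    /** Output 1: number of records per block hash, in first-seen order. */
    public static Map<Integer, Long> countsPerBlock(List<HashedRecord> records) {
        Map<Integer, Long> counts = new LinkedHashMap<>();
        for (HashedRecord r : records) {
            counts.merge(r.hash, 1L, Long::sum);
        }
        return counts;
    }

    /** Hashes of the n largest blocks, largest first. */
    public static List<Integer> topBlocks(Map<Integer, Long> counts, int n) {
        List<Map.Entry<Integer, Long>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<Integer, Long>comparingByValue().reversed());
        List<Integer> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }

    /** Output 2: the first `percent`% of each listed block, at least one record per block. */
    public static List<HashedRecord> sampleBlocks(List<HashedRecord> records, List<Integer> hashes, int percent) {
        List<HashedRecord> sample = new ArrayList<>();
        for (int hash : hashes) {
            List<HashedRecord> block = new ArrayList<>();
            for (HashedRecord r : records) {
                if (r.hash == hash) {
                    block.add(r);
                }
            }
            int take = Math.max(1, block.size() * percent / 100);
            sample.addAll(block.subList(0, take));
        }
        return sample;
    }
}
```

Calling `topBlocks(countsPerBlock(records), 3)` and then `sampleBlocks(records, top, 10)` gives the block samples; in a real job the counts and the samples would be written to the two locations listed above.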


sonalgoyal · 2024-10-03T07:36:51Z

This is a new phase.
Define a new class Blocker which holds the blocking logic copied from Matcher. It will take the blocking tree and return blocks.
In Matcher.getBlocked, call new Blocker<S,D,R,C,T>.getBlocked(getBlockingTreeUtil).

In BlockingTreeDebugger, call the same.
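The refactoring suggested here, one Blocker shared by the matcher and the debugger, might look roughly like the sketch below. It is a simplification under stated assumptions: the five type parameters `<S,D,R,C,T>` are collapsed to a single record type `R`, the blocking tree is reduced to a plain hash function, and `MatcherSketch` and `BlockingTreeDebuggerSketch` are hypothetical stand-ins, not Zingg's actual classes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

public class SharedBlockerSketch {

    /** Single home for the blocking logic; the "blocking tree" is just a hash function here. */
    public static class Blocker<R> {
        private final ToIntFunction<R> blockingTree;

        public Blocker(ToIntFunction<R> blockingTree) {
            this.blockingTree = blockingTree;
        }

        /** Groups records by the hash the blocking tree assigns to them. */
        public Map<Integer, List<R>> getBlocked(List<R> records) {
            Map<Integer, List<R>> blocks = new LinkedHashMap<>();
            for (R r : records) {
                blocks.computeIfAbsent(blockingTree.applyAsInt(r), k -> new ArrayList<>()).add(r);
            }
            return blocks;
        }
    }

    /** Stand-in for Matcher: delegates blocking instead of owning the logic. */
    public static class MatcherSketch<R> {
        private final ToIntFunction<R> blockingTree;

        public MatcherSketch(ToIntFunction<R> blockingTree) {
            this.blockingTree = blockingTree;
        }

        public Map<Integer, List<R>> getBlocked(List<R> records) {
            return new Blocker<>(blockingTree).getBlocked(records);
        }
    }

    /** Stand-in for BlockingTreeDebugger: same delegation, but reports block sizes. */
    public static class BlockingTreeDebuggerSketch<R> {
        private final ToIntFunction<R> blockingTree;

        public BlockingTreeDebuggerSketch(ToIntFunction<R> blockingTree) {
            this.blockingTree = blockingTree;
        }

        public Map<Integer, Integer> blockSizes(List<R> records) {
            Map<Integer, Integer> sizes = new LinkedHashMap<>();
            new Blocker<>(blockingTree).getBlocked(records)
                    .forEach((hash, block) -> sizes.put(hash, block.size()));
            return sizes;
        }
    }
}
```

Because both callers go through the same `Blocker`, the blocks the debugger reports are the blocks a match run would actually use.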

sonalgoyal · 2024-10-03T07:37:22Z

If there is more than one source, we need to group the hashes by source.
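As a rough illustration of the per-source grouping, the sketch below counts records per (source, hash) pair using plain Java maps. The `SourcedHash` shape, the class name, and the source labels are assumptions for the example only; how Zingg actually tags records with their source is not shown in this thread.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PerSourceBlockCounts {

    /** Hypothetical record shape: the source a record came from and its block hash. */
    public static final class SourcedHash {
        public final String source;
        public final int hash;

        public SourcedHash(String source, int hash) {
            this.source = source;
            this.hash = hash;
        }
    }

    /** Counts records per block hash, kept separately for each source. */
    public static Map<String, Map<Integer, Long>> countsPerSource(List<SourcedHash> records) {
        Map<String, Map<Integer, Long>> counts = new TreeMap<>();
        for (SourcedHash r : records) {
            counts.computeIfAbsent(r.source, k -> new TreeMap<>())
                  .merge(r.hash, 1L, Long::sum);
        }
        return counts;
    }
}
```

Keeping the counts separate per source makes it visible when a block is large only because one input dominates it.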

sonalgoyal · 2024-10-03T11:39:38Z

see also #893

sania-16 · 2024-10-06T13:50:36Z

zingg.sh --phase debugBlocking --conf config.json --zinggDir /location

what will the run command look like?

sonalgoyal · 2024-10-06T16:50:18Z

--zinggDir is optional

sonalgoyal assigned sania-16 Oct 2, 2024

sania-16 added a commit to sania-16/zingg that referenced this issue Oct 7, 2024

first draft issue zinggAI#902

9064316

sania-16 added a commit to sania-16/zingg that referenced this issue Oct 7, 2024

review changes issue zinggAI#902

4d3fc48

sonalgoyal added this to the 0.5.0 milestone Oct 16, 2024

sania-16 added a commit to sania-16/zingg that referenced this issue Oct 21, 2024

working verifyBlocking issue zinggAI#902

cc60303

sonalgoyal added this to 0.5.0 Oct 23, 2024

sonalgoyal moved this to Todo in 0.5.0 Oct 23, 2024

sania-16 closed this as completed Nov 13, 2024

github-project-automation bot moved this from In Progress to Done in 0.5.0 Nov 13, 2024
