Streaming data deduplication #265
Comments
Yes, this is possible if you index the records as you process them. The most efficient approach is to take a batch of records (say 10,000), index them all, commit the index, and then search for duplicates. The API has methods for this.
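For anyone landing here later, here is a rough sketch of that batch approach using Duke as a library. The class and method names (`Database.index`, `Database.commit`, `Database.findCandidateMatches`, `Record.getValue`) are my reading of Duke's API and may differ between versions; the `Database` would come from your Duke `Configuration`, and the `"ID"` property and the comparison step are placeholders to adapt to your own setup:

```java
import java.util.Collection;

import no.priv.garshol.duke.Database;
import no.priv.garshol.duke.Record;

public class BatchDedup {

  // Process one batch: index everything, commit so the new records become
  // searchable, then look up candidate duplicates for each record.
  public static void processBatch(Database database, Collection<Record> batch) {
    for (Record record : batch)
      database.index(record);

    database.commit();

    for (Record record : batch) {
      Collection<Record> candidates = database.findCandidateMatches(record);
      for (Record candidate : candidates) {
        // skip the record itself ("ID" stands in for whatever identity
        // property your configuration defines)
        if (candidate.getValue("ID").equals(record.getValue("ID")))
          continue;
        // compare record and candidate here (e.g. with Duke's comparators or
        // a Processor) and emit a duplicate if the score clears your threshold
      }
    }
  }
}
```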
I have a similar scenario where I have to dedupe records arriving as a stream against Couchbase data, as quickly as possible.
I had the same issue with a data flow DB -> NiFi -> Logstash -> Elastic. Good luck!

@larsga I have a question concerning the batch size. If you have no idea how many records you are going to receive, what value do you assign? How much does it matter if the batch size is too high? Thanks
Hi,
Is it possible to check for duplicates within an unbounded streaming data set, not against another static data source but against the data that has streamed in so far?
The flow is as follows.
Source Database -> CDC -> Kafka -> Stream Processing (invoke Duke for duplicate check) -> Target Database
I would like to build the index as data streams in from the CDC, keep adding new data to the index, and search the index at the same time for each incoming message (see the sketch below). What is the way to do this? Or do we always need at least two static data sets to find duplicates?
Thank you.
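Here is a minimal sketch of the single-index streaming pattern asked about in this question, under the same assumptions about Duke's `Database` API as the batch example above (`index`/`commit`/`findCandidateMatches`; exact names may differ by version). Each incoming record is first matched against everything indexed so far and then added to the same index, so no second static data set is needed; the Kafka side is abstracted as an `Iterable` of already-built `Record`s:

```java
import java.util.Collection;

import no.priv.garshol.duke.Database;
import no.priv.garshol.duke.Record;

public class StreamingDedup {

  // Commit in micro-batches: with Duke's default Lucene-backed index, new
  // records typically become searchable only after commit(), and committing
  // on every message would be slow.
  private static final int COMMIT_EVERY = 10_000;

  public static void run(Database database, Iterable<Record> stream) {
    int sinceCommit = 0;

    for (Record incoming : stream) {   // e.g. Records built from Kafka messages
      // 1) search against everything committed so far
      Collection<Record> candidates = database.findCandidateMatches(incoming);
      // ... compare incoming against each candidate and route duplicates ...

      // 2) add the new record so later messages can match against it
      database.index(incoming);

      if (++sinceCommit >= COMMIT_EVERY) {
        database.commit();
        sinceCommit = 0;
      }
    }
    database.commit();                 // flush whatever is left at the tail
  }
}
```

One caveat with this per-message variant: a duplicate arriving within the same uncommitted micro-batch will not be found by the candidate search, which is why the batch approach above searches only after committing the whole batch. The commit interval is a trade-off between matching lag and indexing overhead.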