Skip to content

Latest commit

 

History

History
439 lines (332 loc) · 14.9 KB

entity-resolution.adoc

File metadata and controls

439 lines (332 loc) · 14.9 KB

Entity Resolution

Entity Resolution, Record Linkage and Similarity wise recommendation with Neo4j

What is Entity Resolution?

Entity Resolution (ER) is the process of disambiguating data to determine if multiple digital records represent the same real-world entity such as a person, organization, place, or other type of object.

For example, say you have information on persons coming from different e-commerce platforms, or say same users with different profiles in a system.

They may have slightly different contact information, with addresses formatted differently, using different forms/abbreviations of names, etc.

A human may be able to tell if the records actually belong to the same underlying entity but given the number of possible combinations and matching that can be had, there is a need for an intelligent automated approach to doing so, which is where ER systems come into play.

There can be numerous use cases for Entity Resolution across industries like patient history linkage in Healthcare, customers' preferences and history in Insurance and Financial Services, identifying similar entities or groups for recommendation in E-commerce and Digital Marketing etc.

Demo Use Case

This demo guide covers a similar use case of performing Entity Resolution.

We have taken an example of a dummy online movie streaming platform. For ease of understanding, we have taken only movies and users datasets.

Users can have one or more accounts on a movie streaming platform.

We are performing Entity Resolution over users’ data to identify similar/same users. We are also performing linking for users which are from same account (or group/family). Later, we are leveraging this linking to provide effective recommendations to individual users.

Data Model

model

Steps in this Guide

In this guide, we will perform below steps:

  • Load: Load nodes and relationship information from external CSV files and create entities

  • Relate: Establish more connections (relationships) between entities

  • Explore: Perform basic querying with Cypher on loaded data

  • ER: Perform Entity Resolution based on similarity and do record linkage

  • Recommend: Generate recommendation based on user similarities / preferences

  • Additional: Try couple of preference based similarities and recommendation examples

Notes

In this demonstration, we have used Neo4j APOC (Awesome Procedures on Cypher) and Neo4j GDS (Graph Data Science) libraries few Cypher queries. To execute the Cypher queries with APOC or GDS functions, you will need to add these libraries as plugins to your Neo4j database instance.

For more details on APOC and GDS, please refer below links.

Load Users, Ip Addresses and connect Users with IP Addresses

Load nodes and relationship information from external CSV files and create entities

// Constraints
CREATE CONSTRAINT user_id IF NOT EXISTS FOR (u:User) REQUIRE u.userId IS UNIQUE;
CREATE CONSTRAINT ip_address IF NOT EXISTS FOR (i:IpAddress) REQUIRE i.address IS UNIQUE;

// indexes
CREATE INDEX user_state IF NOT EXISTS FOR (u:User) ON (u.state);
CREATE INDEX user_firstName IF NOT EXISTS FOR (u:User) ON (u.firstName);
CREATE INDEX user_lastName IF NOT EXISTS FOR (u:User) ON (u.firstName);

// Data load
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-graph-examples/entity_resolution/main/data/csv/Users.csv" AS row

// Conditionally create User, set properties on first create
MERGE (u:User { userId: toInteger(row.userId) })
ON CREATE SET
u.firstName= row.firstName,
u.lastName= row.lastName,
u.gender= row.gender,
u.email= row.email,
u.phone= row.phone,
u.state= row.state,
u.country= row.country

WITH u, row
// create IpAddress if not exists
MERGE (ip:IpAddress { address: row.ipAddress })
// create unique relationship
MERGE (u)-[:USES]->(ip);
// Constraints
CREATE CONSTRAINT genre_name IF NOT EXISTS FOR (g:Genre) REQUIRE g.name IS UNIQUE;
CREATE CONSTRAINT movie_id IF NOT EXISTS FOR (m:Movie) REQUIRE m.movieId IS UNIQUE;
// Index
CREATE INDEX movie_title IF NOT EXISTS FOR (m:Movie) ON (m.title);

//Load Data
LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/neo4j-graph-examples/entity_resolution/main/data/csv/Movies.csv" AS row

// conditionally create movie and set properties on first creation
MERGE ( m:Movie { movieId: toInteger(row.movieId) })
ON CREATE SET
m.title = row.name,
m.year = toInteger(row.year)

WITH m, row
// create Genre if not exists
MERGE (g:Genre { name: row.genre } )
// create relationship if not exists
MERGE (m)-[:HAS]->(g)
RETURN m, g;

Establish more connections (relationships) between entities

Load data and create "WATCHED" relationships between Users who have watched whatever Movies

LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-graph-examples/entity_resolution/main/data/csv/WatchEvent.csv" AS row

// find user and movie
MATCH (u:User {userId: toInteger(row.userId)})
MATCH (m:Movie {movieId: toInteger(row.movieId)})

// create relationship if not exists
MERGE (u)-[w:WATCHED]->(m)
// always update watchCount
SET w.watchCount = toInteger(row.watchCount);

Perform basic querying with Cypher on loaded data

Query users who have watched movie "The Boss Baby: Family Business"

MATCH (u:User)-[w:WATCHED]->(m:Movie {title: "The Boss Baby: Family Business"})
RETURN u, w, m LIMIT 5

Show users from "New York" and movies watched by them

MATCH (u:User {state: "New York"} )-[w:WATCHED]->(m)  RETURN u, w, m LIMIT 50

Show trending genres in Texas

MATCH (u:User {state: "Texas"} )-[:WATCHED]->(m)-[:HAS]->(g:Genre)
// group by genre, order by frequency
RETURN g.name as genre, count(g) as freq
ORDER BY freq DESC

Perform Entity Resolution based on similarity and perform record linkage

Users who have similar names

These are users who have same/similar names but different (redundant) profiles due to typos or abbreviations used for some instances.

We are using the Jaro Winkler Distance algorithm from the Neo4j APOC library.

References

MATCH (a:User)
MATCH (b:User)
// not the same user
WHERE a <> b

// users with full-names
WITH a, b, a.firstName + ' ' + a.lastName AS name1, b.firstName + ' ' + b.lastName AS name2

// compute different similiarities
WITH *,
toInteger(apoc.text.sorensenDiceSimilarity(name1, name2) * 100) AS nameSimilarity,
toInteger(apoc.text.sorensenDiceSimilarity(a.email, b.email) * 100) AS emailSimilarity,
toInteger(apoc.text.sorensenDiceSimilarity(a.phone, b.phone) * 100) AS phoneSimilarity

// compute a total similarity score
WITH a, b, name1, name2, toInteger((nameSimilarity + emailSimilarity + phoneSimilarity)/3) as similarity

// filter
WHERE similarity >= 90

RETURN name1, name2, a.email, b.email,  similarity

ORDER BY similarity DESC

Users belonging to same family

Users who have similar last names and live in same state, and use same IP address, that means they are either same users with redundant profile or belong to the same family

// shared IP address
MATCH (a:User)-->(ip:IpAddress)<--(b:User)
// same lastname and state
WHERE a.lastName = b.lastName
AND a.state = b.state AND a.country = b.country

// group by joint attributes, collect all member-names
WITH ip, a.country as country, a.state as state,
     a.lastName as familyName,
    collect(distinct b.firstName + ' '  + b.lastName) as members,
    count(distinct b) as memberCount

RETURN state, familyName, memberCount, members
ORDER BY memberCount DESC

Record Linkage: Create Family Nodes for each family and connect members. This is how we link the similar users and family members using a common Family node

// shared IP address
MATCH (a:User)-->(ip:IpAddress)<--(b:User)
// same lastname and state
WHERE a.lastName = b.lastName
AND a.state = b.state AND a.country = b.country

// group by joint attributes, collect all members
WITH ip, a.country as country, a.state as state,
     a.lastName as familyName,
     collect(distinct b) as familyMembers,
     count(distinct b) as totalMembers
WITH familyName, head(familyMembers) as first, tail(familyMembers) as rest
// not global family but within first member
MERGE (first)-[:BELONGS_TO]->(f:Family {name: familyName})
WITH f,rest

UNWIND rest as member

MERGE (member)-[r:BELONGS_TO]->(f)
RETURN count(*);

Check how may families are created

MATCH (f:Family)<-[b:BELONGS_TO]-(u:User)

RETURN f, b, u LIMIT 200

Generate recommendation based on user family or group similarities / preferences

Providing recommendation to the member based on his/her account/family members history. Get preferred genres by other account members and suggest top 5 movies from most watched genres.

MATCH (user:User {firstName: "Vilma", lastName: "De Mars"})
// other family members
MATCH (user)-[:BELONGS_TO]->(f)<-[:BELONGS_TO]-(otherMember)

// what have they watched and transitive via genre
MATCH (otherMember)-[:WATCHED]->(m1)-[:HAS]->(g:Genre)<-[:HAS]-(m2)

// aggregate by genre, sort by watch count
WITH g, count(*) as watched, m2
ORDER BY watched DESC

// count totals per genre, top-5 watched per genre
WITH g, count(distinct m2) as totalMovies, collect(m2.title)[0..5] as movies

// return 5 per genre
RETURN g.name as genre, totalMovies, movies as topFiveMovies
ORDER BY totalMovies DESC LIMIT 10

Using Neo4j Node Similarity Algorigthm to find similar users and get recommendations

Find users based on their movie watching preferences using Node Similarity algorithm

Step 1: For this, we will first create an in-memory graph with node and relationship specification to perform matching

CALL gds.graph.project(
    'similarityGraph',
    // labels
    ['User', 'Movie'],
    {
        // relationships
        WATCHED: {
            type: 'WATCHED',
            properties: {
                strength: {
                    property: 'watchCount',
                    defaultValue: 1
                }
            }
        }
    }
);

Step 2: Perform memory estimate for the matching to execute

CALL gds.nodeSimilarity.write.estimate('similarityGraph', {
  writeRelationshipType: 'SIMILAR',
  writeProperty: 'score'
})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory

Step 3: Execute algorithm and show results

CALL gds.nodeSimilarity.stream('similarityGraph')
// return ids and similarity
YIELD node1, node2, similarity
WITH * ORDER BY similarity DESC LIMIT 50
// fetch nodes by id
WITH gds.util.asNode(node1) AS person1, gds.util.asNode(node2) AS person2, similarity
RETURN
person1.firstName + ' ' +  person1.lastName as p1,
person2.firstName  + ' ' +   person2.lastName as p2, similarity;

Step 4: Get recommendations for a user based on similarity. For a user, fetch recommendations based on other similar users' preferences

MATCH (person1:User)
WHERE person1.firstName = 'Paulie' AND person1.lastName = 'Imesson'

CALL gds.nodeSimilarity.stream('similarityGraph')
YIELD node1, node2, similarity
// limit to our user
WHERE node1 = id(person1)

WITH person1, gds.util.asNode(node2) AS person2, similarity

// what did the other people watch
MATCH (person2)-[w:WATCHED]->(m)
// that our user hasn't seen
WHERE NOT exists { (person1)-[:WATCHED]->(m) }

RETURN m.title as movie, SUM(w.watchCount) as watchCount
ORDER BY watchCount DESC LIMIT 10