Entity Resolution (ER) is the process of disambiguating data to determine if multiple digital records represent the same real-world entity such as a person, organization, place, or other type of object.
For example, say you have information on persons coming from different e-commerce platforms, or say same users with different profiles in a system.
They may have slightly different contact information, with addresses formatted differently, using different forms/abbreviations of names, etc.
A human may be able to tell if the records actually belong to the same underlying entity but given the number of possible combinations and matching that can be had, there is a need for an intelligent automated approach to doing so, which is where ER systems come into play.
There can be numerous use cases for Entity Resolution across industries like patient history linkage in Healthcare, customers' preferences and history in Insurance and Financial Services, identifying similar entities or groups for recommendation in E-commerce and Digital Marketing etc.
This demo guide covers a similar use case of performing Entity Resolution.
We have taken an example of a dummy online movie streaming platform. For ease of understanding, we have taken only movies and users datasets.
Users can have one or more accounts on a movie streaming platform.
We are performing Entity Resolution over users’ data to identify similar/same users. We are also performing linking for users which are from same account (or group/family). Later, we are leveraging this linking to provide effective recommendations to individual users.
In this guide, we will perform below steps:
-
Load: Load nodes and relationship information from external CSV files and create entities
-
Relate: Establish more connections (relationships) between entities
-
Explore: Perform basic querying with Cypher on loaded data
-
ER: Perform Entity Resolution based on similarity and do record linkage
-
Recommend: Generate recommendation based on user similarities / preferences
-
Additional: Try couple of preference based similarities and recommendation examples
In this demonstration, we have used Neo4j APOC (Awesome Procedures on Cypher) and Neo4j GDS (Graph Data Science) libraries few Cypher queries. To execute the Cypher queries with APOC or GDS functions, you will need to add these libraries as plugins to your Neo4j database instance.
For more details on APOC and GDS, please refer below links.
Load nodes and relationship information from external CSV files and create entities
// Constraints
CREATE CONSTRAINT user_id IF NOT EXISTS FOR (u:User) REQUIRE u.userId IS UNIQUE;
CREATE CONSTRAINT ip_address IF NOT EXISTS FOR (i:IpAddress) REQUIRE i.address IS UNIQUE;
// indexes
CREATE INDEX user_state IF NOT EXISTS FOR (u:User) ON (u.state);
CREATE INDEX user_firstName IF NOT EXISTS FOR (u:User) ON (u.firstName);
CREATE INDEX user_lastName IF NOT EXISTS FOR (u:User) ON (u.firstName);
// Data load
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-graph-examples/entity_resolution/main/data/csv/Users.csv" AS row
// Conditionally create User, set properties on first create
MERGE (u:User { userId: toInteger(row.userId) })
ON CREATE SET
u.firstName= row.firstName,
u.lastName= row.lastName,
u.gender= row.gender,
u.email= row.email,
u.phone= row.phone,
u.state= row.state,
u.country= row.country
WITH u, row
// create IpAddress if not exists
MERGE (ip:IpAddress { address: row.ipAddress })
// create unique relationship
MERGE (u)-[:USES]->(ip);
// Constraints
CREATE CONSTRAINT genre_name IF NOT EXISTS FOR (g:Genre) REQUIRE g.name IS UNIQUE;
CREATE CONSTRAINT movie_id IF NOT EXISTS FOR (m:Movie) REQUIRE m.movieId IS UNIQUE;
// Index
CREATE INDEX movie_title IF NOT EXISTS FOR (m:Movie) ON (m.title);
//Load Data
LOAD CSV WITH HEADERS FROM
"https://raw.githubusercontent.com/neo4j-graph-examples/entity_resolution/main/data/csv/Movies.csv" AS row
// conditionally create movie and set properties on first creation
MERGE ( m:Movie { movieId: toInteger(row.movieId) })
ON CREATE SET
m.title = row.name,
m.year = toInteger(row.year)
WITH m, row
// create Genre if not exists
MERGE (g:Genre { name: row.genre } )
// create relationship if not exists
MERGE (m)-[:HAS]->(g)
RETURN m, g;
Load data and create "WATCHED" relationships between Users who have watched whatever Movies
LOAD CSV WITH HEADERS FROM "https://raw.githubusercontent.com/neo4j-graph-examples/entity_resolution/main/data/csv/WatchEvent.csv" AS row
// find user and movie
MATCH (u:User {userId: toInteger(row.userId)})
MATCH (m:Movie {movieId: toInteger(row.movieId)})
// create relationship if not exists
MERGE (u)-[w:WATCHED]->(m)
// always update watchCount
SET w.watchCount = toInteger(row.watchCount);
Query users who have watched movie "The Boss Baby: Family Business"
MATCH (u:User)-[w:WATCHED]->(m:Movie {title: "The Boss Baby: Family Business"})
RETURN u, w, m LIMIT 5
Show users from "New York" and movies watched by them
MATCH (u:User {state: "New York"} )-[w:WATCHED]->(m) RETURN u, w, m LIMIT 50
Show trending genres in Texas
MATCH (u:User {state: "Texas"} )-[:WATCHED]->(m)-[:HAS]->(g:Genre)
// group by genre, order by frequency
RETURN g.name as genre, count(g) as freq
ORDER BY freq DESC
These are users who have same/similar names but different (redundant) profiles due to typos or abbreviations used for some instances.
We are using the Jaro Winkler Distance algorithm from the Neo4j APOC library.
References
MATCH (a:User)
MATCH (b:User)
// not the same user
WHERE a <> b
// users with full-names
WITH a, b, a.firstName + ' ' + a.lastName AS name1, b.firstName + ' ' + b.lastName AS name2
// compute different similiarities
WITH *,
toInteger(apoc.text.sorensenDiceSimilarity(name1, name2) * 100) AS nameSimilarity,
toInteger(apoc.text.sorensenDiceSimilarity(a.email, b.email) * 100) AS emailSimilarity,
toInteger(apoc.text.sorensenDiceSimilarity(a.phone, b.phone) * 100) AS phoneSimilarity
// compute a total similarity score
WITH a, b, name1, name2, toInteger((nameSimilarity + emailSimilarity + phoneSimilarity)/3) as similarity
// filter
WHERE similarity >= 90
RETURN name1, name2, a.email, b.email, similarity
ORDER BY similarity DESC
Users who have similar last names and live in same state, and use same IP address, that means they are either same users with redundant profile or belong to the same family
// shared IP address
MATCH (a:User)-->(ip:IpAddress)<--(b:User)
// same lastname and state
WHERE a.lastName = b.lastName
AND a.state = b.state AND a.country = b.country
// group by joint attributes, collect all member-names
WITH ip, a.country as country, a.state as state,
a.lastName as familyName,
collect(distinct b.firstName + ' ' + b.lastName) as members,
count(distinct b) as memberCount
RETURN state, familyName, memberCount, members
ORDER BY memberCount DESC
Record Linkage: Create Family Nodes for each family and connect members. This is how we link the similar users and family members using a common Family node
// shared IP address
MATCH (a:User)-->(ip:IpAddress)<--(b:User)
// same lastname and state
WHERE a.lastName = b.lastName
AND a.state = b.state AND a.country = b.country
// group by joint attributes, collect all members
WITH ip, a.country as country, a.state as state,
a.lastName as familyName,
collect(distinct b) as familyMembers,
count(distinct b) as totalMembers
WITH familyName, head(familyMembers) as first, tail(familyMembers) as rest
// not global family but within first member
MERGE (first)-[:BELONGS_TO]->(f:Family {name: familyName})
WITH f,rest
UNWIND rest as member
MERGE (member)-[r:BELONGS_TO]->(f)
RETURN count(*);
Providing recommendation to the member based on his/her account/family members history. Get preferred genres by other account members and suggest top 5 movies from most watched genres.
MATCH (user:User {firstName: "Vilma", lastName: "De Mars"})
// other family members
MATCH (user)-[:BELONGS_TO]->(f)<-[:BELONGS_TO]-(otherMember)
// what have they watched and transitive via genre
MATCH (otherMember)-[:WATCHED]->(m1)-[:HAS]->(g:Genre)<-[:HAS]-(m2)
// aggregate by genre, sort by watch count
WITH g, count(*) as watched, m2
ORDER BY watched DESC
// count totals per genre, top-5 watched per genre
WITH g, count(distinct m2) as totalMovies, collect(m2.title)[0..5] as movies
// return 5 per genre
RETURN g.name as genre, totalMovies, movies as topFiveMovies
ORDER BY totalMovies DESC LIMIT 10
Find users based on their movie watching preferences using Node Similarity algorithm
Step 1: For this, we will first create an in-memory graph with node and relationship specification to perform matching
CALL gds.graph.project(
'similarityGraph',
// labels
['User', 'Movie'],
{
// relationships
WATCHED: {
type: 'WATCHED',
properties: {
strength: {
property: 'watchCount',
defaultValue: 1
}
}
}
}
);
Step 2: Perform memory estimate for the matching to execute
CALL gds.nodeSimilarity.write.estimate('similarityGraph', {
writeRelationshipType: 'SIMILAR',
writeProperty: 'score'
})
YIELD nodeCount, relationshipCount, bytesMin, bytesMax, requiredMemory
Step 3: Execute algorithm and show results
CALL gds.nodeSimilarity.stream('similarityGraph')
// return ids and similarity
YIELD node1, node2, similarity
WITH * ORDER BY similarity DESC LIMIT 50
// fetch nodes by id
WITH gds.util.asNode(node1) AS person1, gds.util.asNode(node2) AS person2, similarity
RETURN
person1.firstName + ' ' + person1.lastName as p1,
person2.firstName + ' ' + person2.lastName as p2, similarity;
Step 4: Get recommendations for a user based on similarity. For a user, fetch recommendations based on other similar users' preferences
MATCH (person1:User)
WHERE person1.firstName = 'Paulie' AND person1.lastName = 'Imesson'
CALL gds.nodeSimilarity.stream('similarityGraph')
YIELD node1, node2, similarity
// limit to our user
WHERE node1 = id(person1)
WITH person1, gds.util.asNode(node2) AS person2, similarity
// what did the other people watch
MATCH (person2)-[w:WATCHED]->(m)
// that our user hasn't seen
WHERE NOT exists { (person1)-[:WATCHED]->(m) }
RETURN m.title as movie, SUM(w.watchCount) as watchCount
ORDER BY watchCount DESC LIMIT 10