Graph Analyzing Algorithms

Gradoop provides a collection of classes for analyzing graphs by either wrapping Flink™ Gelly library methods or implementing own algorithms. Thereby it is possible to apply those methods to Gradoops extended property graph model. This section gives an overview of the implemented classes and their application.

Graph Analyzing Algorithms
Clustering Coefficient
Hyperlink-Induced Topic Search
Label Propagation
Page Rank
Random Jump
Single Source Shortest Path
Triangle Counting
Vertex Degrees
Weakly Connected Components

Clustering Coefficient

This algorithm computes the clustering coefficient as a measure of the degree to which vertices in a graph tend to build highly connected cluster. It is provided for both, directed and undirected graphs and 3 values will be computed, which are:

clustering coefficient	description
local	Measures the connectedness of a single vertex regarding the connections of its neighborhood. The value will be between 0.0 and 1.0, where 0.0 is assigned when there are no edges between the neighbors and 1.0 if all neighbors are fully connected with each other. Therefor the local clustering coefficient is the number of edges between neighbors divided by the number of all potential edges between neighbors. - use `GellyLocalClusteringCoefficientDirected` resp. `GellyLocalClusteringCoefficientUndirected`
average	This is the mean over all local values. - use `AverageClusteringCoefficient`
global	Measures the connectedness of the graph itself by computing the ratio from closed triplets (triangles) to all triplets, open and closed ones. Its value is between 0.0 and 1.0, where 0.0 is assigned if there aren't any closed triplets and 1.0 if all triplets are closed. - use `GellyGlobalClusteringCoefficientDirected` resp. `GellyGlobalClusteringCoefficientUndirected`

The values for the local coefficient are written to the corresponding vertices as property. The values for the average and global coefficient are written to the graph head as property.

The application for a directed graph and the local clustering coefficient works as followed:

// graph with 3 fully connected vertices
String graphString = "graph[" +
      "/* fully connected clique */" +
      "(v0 {id:0, value:\"A\"})" +
      "(v1 {id:1, value:\"B\"})" +
      "(v2 {id:2, value:\"C\"})" +
      "(v0)-[e0]->(v1)" +
      "(v1)-[e1]->(v0)" +
      "(v0)-[e2]->(v2)" +
      "(v2)-[e3]->(v0)" +
      "(v1)-[e4]->(v2)" +
      "(v2)-[e5]->(v1)" +
      "]";

// apply the algorithm
LogicalGraph graph = getLoaderFromString(graphString).getLogicalGraphByVariable("graph");
LogicalGraph resultGraph = graph.callForGraph(new GellyLocalClusteringCoefficientDirected());

// read the coefficient values
List<Vertex> vertices = resultGraph.getVertices().collect();
for (Vertex v : vertices) {
    double local = v.getPropertyValue(ClusteringCoefficientBase.PROPERTY_KEY_LOCAL).getDouble();
    System.out.println(v.getPropertyValue("id").toString() + ": " + local);
}
    
/* this will print
0: 1.0
1: 1.0
2: 1.0
*/

Hyperlink-Induced Topic Search (HITS)

HITS, like PageRank, is a link analysis algorithm primary used to rate web pages. HITS differs between hubs and authorities, where in the graph analyzing domain a hub represents a vertex with outgoing edges to many other vertices and an authority represents a vertex with incoming edges from many different vertices. Therefor, two interdependent scores are computed and assigned to the vertices, the hub-score and the authority-score, both normalized by taking their square root. The algorithm terminates after either the given number of maximum iterations, when the total change in hub- and authority-scores over all vertices falls to or below a given convergence threshold value, or both.

HITS takes up to 4 arguments, which are:

a property key to store and access the authority-score
a property key to store and access the hub-score
a maximum of iterations (default: Integer.MAX_VALUE)
a convergence threshold (default: Double.MAX_VALUE)

The application of HITS works as followed:

String graphString = "graph[" +
    "(v0 {id:0})" +
    "(v1 {id:1})" +
    "(v2 {id:2})" +
    "(v0)-[e0]->(v1)" +
    "(v2)-[e1]->(v1)" +

// read graph
LogicalGraph graph = getLoaderFromString(graphString).getLogicalGraphByVariable("graph");

// apply algorithm
String propertyKeyAuthorityScore = "aScore";
String propertyKeyHubScore = "hScore";
int maxIterations = 1;

LogicalGraph resultGraph = graph.callForGraph(new HITS(
    propertyKeyAuthorityScore, propertyKeyHubScore, maxIterations));

/*
 annotated vertices in result graph will be:
 (v0 {id:0, aScore: 0.0d, hScore: 0.7071067811865475d}) with hScore is normalized sqrt(0.5)
 (v1 {id:1, aScore: 1.0d, hScore: 0.0d})
 (v2 {id:2, aScore: 0.0d, hScore: 0.7071067811865475d}) with hScore is normalized sqrt(0.5)
*/

Label Propagation

Label Propagation is used to discover communities in a graph by iteratively propagating labels between adjacent vertices. At each step each vertex adopts the value sent by the majority of its neighbors. If multiple values occur with the same frequency, the smaller resp. greater value will be selected as new label, depending on the implementations listed below. If a vertex changes its value, this value will be propagated to all neighbors again. The algorithm converges when no vertex changes its value or the maximum number of iterations has been reached.

Following wrapper-classes are provided, both of them are implemented as a Scatter-Gather Iteration and return the initial Logical Graph with the labeled vertices.

wrapper-class	description
`GellyLabelPropagation`	Wraps the original Gelly algorithm. If multiple values occur with the same frequency, the greater value will be selected. Arguments are: the maximum number of iterations and the property key to access the label value.
`GradoopLabelPropagation`	Wraps an own implementation of a Gelly Scatter-Gather Iteration. If multiple values occur with the same frequency, the smaller value will be selected. Arguments are: the maximum number of iterations and the property key to access the label value.

The application of GradoopLabelPropagation works as followed:

String graphString = "graph[" +
      "/* first community */" +
      "(v0 {id:0, value:\"A\"})" +
      "(v1 {id:1, value:\"A\"})" +
      "(v2 {id:2, value:\"B\"})" +
      "(v0)-[e0]->(v1)" +
      "(v1)-[e1]->(v2)" +
      "(v2)-[e2]->(v0)" +
      "/* second community */" +
      "(v3 {id:3, value:\"C\"})" +
      "(v4 {id:4, value:\"D\"})" +
      "(v5 {id:5, value:\"E\"})" +
      "(v6 {id:6, value:\"F\"})" +
      "(v3)-[e3]->(v1)" +
      "(v3)-[e4]->(v4)" +
      "(v3)-[e5]->(v5)" +
      "(v3)-[e6]->(v6)" +
      "(v4)-[e7]->(v3)" +
      "(v4)-[e8]->(v5)" +
      "(v4)-[e9]->(v6)" +
      "(v5)-[e10]->(v3)" +
      "(v5)-[e11]->(v4)" +
      "(v5)-[e12]->(v6)" +
      "(v6)-[e13]->(v3)" +
      "(v6)-[e14]->(v4)" +
      "(v6)-[e15]->(v5)" +
      "]";

// read graph
LogicalGraph graph = getLoaderFromString(graphString).getLogicalGraphByVariable("graph")

// apply algorithm
int maxIteration = 10;
String propertyKeyLabel = "value";
LogicalGraph resultGraph = graph.callForGraph(new GradoopLabelPropagation(
    maxIteration, propertyKeyLabel));

/*
 labeled vertices in resultGraph will be:
 (v0 {id:0, value:\"A\"})
 (v1 {id:1, value:\"A\"})
 (v2 {id:2, value:\"A\"})
 (v3 {id:3, value:\"C\"})
 (v4 {id:4, value:\"C\"})
 (v5 {id:5, value:\"C\"})
 (v6 {id:6, value:\"C\"})
*/

PageRank

PageRank is actually an algorithm primary used to rank web search engine results by a calculated score, but it is also used in graph application domains for years now. The idea is that important vertices tend to link to other important vertices, with importance determined by a score. The algorithm operates in iterations, where vertices distribute their scores to their neighbors on outgoing edges, with the score divided evenly among those edges. Subsequently, the vertices update their scores based on the sum of values they receive.

The initial score is set as the inverse of the number of vertices, the result scores will sum up to 1.0, therefor all scores range between 0.0 and 1.0. The algorithm terminates after the given number of maximum iterations. PageRank returns the initial LogicalGraph with the PageRank scores written to the corresponding vertices as property. It takes 3 arguments, which are:

a property key to store and access the PageRank score
a dampening factor (double between 0.0 and 1.0), as the probability of following an out-link, otherwise jump to a random vertex
a maximum of iterations

The application of PageRank works as followed:

String graphString = "graph:[" +
  "(eve {name:\"Eve\", city :\"Dresden\"})" +
  "(alice {name:\"Alice\", city:\"Leipzig\"})" +
  "(bob {name:\"Bob\", city:\"Leipzig\"})" +
  "(eve)-[e0 {since:2013}]->(alice)" +
  "(eve)-[e2 {since:2015}]->(bob)" +
  "(alice)-[e1 {since:2014}]->(bob)" +
  "(bob)-[e3 {since:2014}]->(alice)" +
  "]";

// read graph
LogicalGraph graph = getLoaderFromString(graphString).getLogicalGraphByVariable("graph");

// apply algorithm
String propertyKeyScore = "pageRankScore";
double dampeningFactor = 0.85;
int maxIterations = 40;

LogicalGraph resultGraph = graph.callForGraph(new PageRank(
    propertyKeyScore, dampeningFactor, maxIterations));

/*
 PageRank scores in result graph will converge at 
 eve: 0.05 (only outgoing edges -> "unimportant")
 alice: 0.475
 bob: 0.475
*/

Random Jump

following soon

Single Source Shortest Path

This class wrapps Gellys Single Source Shortest Path algorithm for directed weighted graphs, where the edge weight value is of type Double. It computes the shortest path from a given source vertex to all other vertices in the graph. The Gelly algorithm is implemented as a Scatter-Gather Iteration, where at each iteration each vertex sends the sum of its current distance and the weight of an outgoing edge to the neighbor connected by this edge. If there is a new minimum distance afterwards, a vertex updates its value. The computation terminates after the specified maximum number of iterations or when there are no value updates.

SingleSourceShortestPaths returns the initial graph with the shortest path from the source vertex to each other vertex written as property to the corresponding vertices. Arguments are:

source vertex id
property key to access the edge weight value
maximum number of iterations
property key to access the computed and stored shortest path in a vertex

The application of SingleSourceShortestPaths works as followed:

    // Load a graph
    TemporalGraph graph = new TemporalCSVDataSource("path", config).getTemporalGraph();

    // Add a weight to edges via transformation
    graph = graph.transformEdges((e1, e2) -> {
      e1.setProperty("weight", 1);
      return e1;
    });

    // Determine the Gradoop-Id of a SOURCE vertex for SSSP 
    TemporalVertex v =
      graph.getVertices()
        .filter( ver -> ver.hasProperty("id"))
        .filter( ver -> ver.getPropertyValue("id").getString().equals("470")).collect()
        .get(0);

    // Apply SSSP, the result is again a temporal graph
    graph = graph
      .callForGraph(
        new SingleSourceShortestPaths<>(
          // The source vertex id
          v.getId(),
          // The name of the weiht property of the edges
          "weight", 
          // The number of iterations
          20, 
          // The property value, where the distance is stored in the result
          "distance"));

    // Print the vertices
    graph.getVertices().print();
  }

Triangle Counting

following soon

Vertex Degrees

The gradoop wrapper-class DistinctVertexDegrees calls Flink Gellys VertexDegrees for directed graphs, which annotates all vertices of the graph with their degree, in-degree and out-degree. The degree value here is the number of adjacent vertices. It returns the initial LogicalGraph with the distinct degrees written to the vertices as properties. Arguments are:

property key to store and access the degree value
property key to store and access the in-degree value
property key to store and access the out-degree value
boolean to determine, if vertices with an in-degree of zero are included (optional, default: true)

The application for DistinctVertexDegrees works as followed:

String graphString = "graph[" +
    "(v0 {id:0})" +
    "(v1 {id:1})" +
    "(v2 {id:2})" +
    "(v3 {id:3})" +
    "(v0)-[e0]->(v1)" +
    "(v0)-[e1]->(v2)" +
    "(v2)-[e2]->(v3)" +
    "(v2)-[e3]->(v1)" +
    "(v3)-[e4]->(v2)" +
    "]";

// read graph
LogicalGraph graph = getLoaderFromString(graphString).getLogicalGraphByVariable("graph");

// apply algorithm
String propertyKeyDegree = "degree";
String propertyKeyInDegree = "inDegree";
String propertyKeyOutDegree = "outDegree";
boolean includeZeroDegreeVertices = true;

LogicalGraph resultGraph = graph.callForGraph(new DistinctVertexDegrees(
    propertyKeyDegree, propertyKeyInDegree, propertyKeyOutDegree, includeZeroDegreeVertices));

/*
 vertex annotations in result graph are
 (v0 {id:0, degree:2L, inDegree:0L, outDegree:2L})
 (v1 {id:1, degree:2L, inDegree:2L, outDegree:0L})
 (v2 {id:2, degree:3L, inDegree:2L, outDegree:2L})
 (v3 {id:3, degree:1L, inDegree:1L, outDegree:1L})
*/

Weakly Connected Components

This algorithm detects the components of a graph. Two vertices belong to the same component, if they are connected by an edge, without taking the edge direction into account. The Gelly algorithm is implemented as a Scatter-Gather Iteration. The algorithm converges when no vertex changes the value for its component or the maximum number of iterations has been reached.

Following Gradoop wrapper-classes are provided:

wrapper-class	description
`AnnotateWeaklyConnectedComponents`	Calls the Gelly algorithm to annotate the vertices (and edges if intended) with the corresponding component id. Arguments are: the maximum number of iterations, the property key to store the annotation and a boolean to determine if the edges are annotated, too.
`WeaklyConnectedComponentsAsCollection`	Calls AnnotateWeaklyConnectedComponents and splits the Logical Graph into a Graph Collection by the given annotation property key, where each graph in this collection represents a weakly connected component.
`ConnectedComponentsDistribution`	Calls AnnotateWeaklyConnectedComponents and aggregates the number of vertices and edges for each component. Returns a `DataSet<Tuple3<String,Long,Long>>` where each entry contains the components id, the corresponding number of vertices and edges. If edges are not annotated, the edge value for each entry will be `-1`.

The application for WeaklyConnectedComponentsAsCollection and ConnectedComponentsDistribution works as followed:

// graph with 2 components
String graphString = "graph[" +
    // First component
    "(v0 {id:0, component:1})" +
    // Second component
    "(v1 {id:1, component:2})-[e0]->(v2 {id:2, component:2})" +
    "(v1)-[e1]->(v3 {id:3, component:2})" +
    "(v2)-[e2]->(v3)" +
    "(v3)-[e3]->(v4 {id:4, component:2})" +
    "(v4)-[e4]->(v5 {id:5, component:2})" +
    "]";

// read graph
LogicalGraph graph = getLoaderFromString(graphString).getLogicalGraphByVariable("graph");

// apply the algorithm
String propertyKeyWcc = "wcc_id";
boolean annotateEdges = true;
int maxIterations = 20;

GraphCollection result = graph.callForCollection(
      new WeaklyConnectedComponentsAsCollection(propertyKeyWcc, maxIterations));

/*
 result contains two logical graphs:
 first graph with vertex v0
 second graph with vertices v1 to v5 and corresponding edges
*/

DataSet<Tuple3<String,Long,Long>> componentDist = new ConnectedComponentsDistribution(
      propertyKey, maxIterations, annotateEdges).execute(graph);

/*
 componentDist contains two entries with <component-id, #vertices, #edges>:
 first entry with <v0, 1, 0>
 second entry with <v1, 5, 5>
*/