Apache Giraph

From Bauman National Library
This page was last modified on 24 December 2016, at 11:23.
</td></tr>
Apache Giraph
Developer(s) Apache Software Foundation
Stable release
1.1.0 / 2014/11/19
Repository {{#property:P1324}}
Development status Active
Written in Java
Operating system Cross-platform
Type graph processing
License Apache 2.0 Licence
Website giraph.apache.org

Apache Giraph is an iterative graph processing system built on top of Apache Hadoop for high scalability. Giraph originated as the open-source counterpart to Pregel, the graph processing architecture developed at Google and described in a 2010 paper. Both systems are inspired by the Bulk Synchronous Parallel model of distributed computation introduced by Leslie Valiant. Giraph adds several features beyond the basic Pregel model, including master computation, sharded aggregators, edge-oriented input, out-of-core computation, and more. With a steady development cycle and a growing community of users worldwide, Giraph is a natural choice for unleashing the potential of structured datasets at a massive scale.

Description

The input to a Giraph computation is a graph composed of vertices and directed edges, see Figure 1. For example vertices can represent people, and edges friend requests. Each vertex stores a value, so does each edge. The input, thus, not only determines the graph topology, but also the initial values of vertices and edges.

As an example, consider a computation that finds the distance from a predetermined source person s to any person in the social graph. In this computation, the value of an edge E is a floating point number denoting distance between adjacent people. The value of a vertex V is also a floating point number, representing an upper bound on the distance along a shortest path from the predetermined vertex s to v. The initial value of the predetermined source vertex s is 0, and the initial value for any other vertex is infinity.

Figure 1: An illustration of an execution of a single source shortest paths algorithm in Giraph. The input is a chain graph with three vertices (black) and two edges (green). The values of edges are 1 and 3 respectively. The algorithm computes distances from the leftmost vertex. The initial values of vertices are 0, ∞ and ∞ (top row). Distance upper bounds are sent as messages (blue), resulting in updates to vertex values (successive rows going down). The execution lasts three supersteps (separated by red lines).

Computation proceeds as a sequence of iterations, called supersteps in BSP. Initially, every vertex is active. In each superstep each active vertex invokes the Compute method provided by the user. The method implements the graph algorithm that will be executed on the input graph. Intuitively, you should think like a vertex when designing a Giraph algorithm, it is vertex oriented. A graph oriented approach is discussded in GIRAPH-818.

The Compute method:

  • receives messages sent to the vertex in the previous superstep,
  • computes using the messages, and the vertex and outgoing edge values, which may result in modifications to the values, and
  • may send messages to other vertices.

The Compute method does not have direct access to the values of other vertices and their outgoing edges. Inter-vertex communication occurs by sending messages.

External links