PARALLEL (DISTRIBUTED) SPLASH BELIEF PROPAGATION

This week I read two papers: Parallel Splash Belief Propagation by J. Gonzalez et al. and Understanding Belief Propagation and its Generalizations by J. S. Yedidia et al. The second was written eight years before the first, and I found it to be a very good introduction to the topic and a very well written paper (I greatly appreciate when a paper is easy to read).

Before starting with Belief Propagation, we need to understand the concept of inference and review graphical models (Bayesian Networks, Pairwise Markov Random Fields, Factor Graphs, etc.). I would recommend taking a look at Chapter 8 of Christopher M. Bishop’s book “Pattern Recognition and Machine Learning” (which can be downloaded for free from his website).

Back to the first paper, I highlight two basic concepts:

  • The Belief Propagation (BP) algorithm is a way to approximately solve inference problems (whose goal is to calculate probability distributions) based on statistical models represented using graphs. Many scientific domains have separately addressed different real-world problems that can be solved using inference algorithms such as BP (see the message-update sketch after this list).
  • Parallelisation is a programming technique that consists of performing more than one operation at the same time using the available hardware resources. As in everyday life, if independent tasks (operations) are divided among different people (processing units), the overall work is finished much more quickly. Problems arise when tasks are not completely independent; some process may need to wait for another to finish, or two operations may try to modify the same object (data in memory) simultaneously. To deal with this, many scheduling algorithms exist.

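To make the first point concrete, here is a minimal sketch (not from the paper) of the standard sum-product message update on a pairwise Markov Random Field; the data structures `node_potential`, `edge_potential`, `messages` and `neighbours` are my own placeholders:

```python
import numpy as np

def send_message(i, j, node_potential, edge_potential, messages, neighbours):
    """Compute the BP message from vertex i to vertex j on a pairwise MRF:

    m_{i->j}(x_j) = sum_{x_i} phi_i(x_i) * psi_{ij}(x_i, x_j)
                    * prod_{k in N(i) \ {j}} m_{k->i}(x_i)
    """
    # Product of the node potential and all incoming messages except j's.
    belief = node_potential[i].copy()
    for k in neighbours[i]:
        if k != j:
            belief *= messages[(k, i)]
    # Marginalise over x_i through the pairwise potential psi[x_i, x_j].
    msg = edge_potential[(i, j)].T @ belief
    return msg / msg.sum()  # normalise for numerical stability
```
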
Like many other computationally heavy tasks, Belief Propagation can benefit from parallelisation. This is useful to improve the overall performance of a machine learning system that uses an inference procedure. Parallel Splash Belief Propagation is an algorithm for approximate inference on high performance shared (multi-core) and distributed (cluster) memory systems. The proposed algorithm generalises the forward-backward scheduling to arbitrary cyclic graphical models and outperforms other techniques (in terms of running time before convergence and of total number of message computations).

To parallelise the inference, the update operations have to be divided between more than one processor. In the following animation we can see how the ChainSplash Belief Propagation algorithm, a simpler version of the Parallel Splash BP algorithm, partitions a chain graphical model into p regions to apply BP in each subgraph:

Figure 1. Animated illustration of the steps in the ChainSplash Belief Propagation algorithm. First, the chain (the simplest graphical model) of length n is divided evenly among p processors. Then, in the forward phase, the messages towards the centre vertex are updated, while the messages in the opposite direction are updated in the backward sweep. When the sequential forward-backward BP algorithm finishes, the messages that cross the boundaries between processors are exchanged. This process is repeated until it converges.
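
A minimal sketch of the per-processor sweep, under my own simplification of the animation above (`send_message` is the update sketched earlier, wrapped so that it stores the result):

```python
def chain_splash_sweep(vertices, send_message):
    """One forward-backward sweep over a processor's slice of the chain.

    `vertices` is the slice ordered left to right; `send_message(a, b)`
    recomputes and stores the message from vertex a to vertex b.
    """
    # Forward sweep: propagate messages from left to right.
    for left, right in zip(vertices, vertices[1:]):
        send_message(left, right)
    # Backward sweep: propagate messages from right to left.
    for left, right in reversed(list(zip(vertices, vertices[1:]))):
        send_message(right, left)
    # Boundary messages are then exchanged with the neighbouring
    # processors, and the sweep repeats until convergence.
```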

A similar approach is used in the Parallel Splash BP algorithm to deal with arbitrary cyclic graphs. Of course, when dealing with more complex graphs, problems may arise: size of partitions, update order, variable dependence, etc.

The Splash operation, given a root vertex v and a maximum size W (to upper bound the work necessary to run it), constructs a tree. On this subset of vertices, known as a Splash, we can run the forward-backward message scheduling we saw in the previous animation: from the leaves to the root and from the root to the leaves. The Splashes (the sets of vertices in each tree) are not necessarily disjoint. The Splash operation, performed on every vertex of the graph at least once, has the following structure (a code sketch follows the list):

  1. Add the root vertex, v, to a FIFO queue.
  2. While the accumulated work is less than W and the queue is not empty, do:
    1. Take the first element, w, of the FIFO queue.
    2. Add the (not yet visited) neighbours of w to the queue, in breadth first search (BFS) order.
    3. Mark w as visited.
  3. Starting with the last vertex added to the FIFO queue (a leaf of the Splash), invoke the SendMessages operation to update the inbound messages.
  4. In the order of the FIFO queue, invoke the SendMessages operation to update the outbound messages.
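
Below is a minimal Python sketch of this construction and scheduling, following my reading of the steps above; `work(v)` and `send_messages(v)` are assumed callbacks, not names from the paper:

```python
from collections import deque

def splash(root, neighbours, work, send_messages, max_work):
    """Run one Splash rooted at `root` with work bound `max_work` (W).

    `neighbours[v]` lists the neighbours of v, `work(v)` estimates the
    cost of updating v, and `send_messages(v)` recomputes v's messages.
    """
    # Phase 1: grow a BFS tree around the root, bounded by max_work.
    queue = deque([root])
    visited = {root}
    ordering = []
    accumulated = 0
    while queue and accumulated < max_work:
        w = queue.popleft()
        ordering.append(w)
        accumulated += work(w)
        for u in neighbours[w]:
            if u not in visited:
                visited.add(u)
                queue.append(u)
    # Phase 2: leaves-to-root sweep (updates the inbound messages).
    for v in reversed(ordering):
        send_messages(v)
    # Phase 3: root-to-leaves sweep (updates the outbound messages).
    for v in ordering:
        send_messages(v)
```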

If the Splashes are too large, the regions overlap and messages are unnecessarily recomputed multiple times; on the other hand, if the Splashes are too small, optimality is lost. Instead of having to tune the Splash size, the authors introduce Dynamic Splashes in Subsection 7.6 of the paper. The idea is to create large, evenly distributed divisions and iteratively prune the trees to reduce the size of the Splashes. After some phases, the Splashes become smaller and more irregular, and focus on the challenging regions of the graph (see Figure 2.c).
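
A heavily hedged sketch of the pruning idea, assuming a per-vertex residual measure of how far that vertex is from convergence (the `residual` function and threshold `beta` are my placeholders, not the paper's notation):

```python
def prune_splash(ordering, residual, beta):
    """Drop vertices whose residual is below the threshold `beta`, so
    that later Splashes shrink and concentrate on unconverged regions."""
    return [v for v in ordering if residual(v) >= beta]
```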

The idea behind the Parallel (Distributed) Splash Belief Propagation algorithm (multiple processors, distributed memory setting) is to divide the graph and assign the partitions to the processors. Each processor then runs the Splash algorithm on its assigned subgraph, exchanging messages with other processors when necessary. From here arises the question of how to divide the graph; let us see an example.

A grid graph could represent a noisy image (for a denoising task) where the left half requires more message updates than the right half. We can see an example in the following illustration:

Figure 2. (a) Original image that will have an irregular update pattern. (b) The image with Gaussian noise added. (c) Brighter regions represent higher cumulative vertex updates. Splashes focus on vertices more frequently updated.

If we evenly (uninformedly) partition the graph (see Figure 3.a), some processors (the ones that were assigned the leftmost regions) may deal with more work than others (the ones with the pixels on the right side); this is known as work imbalance. The authors propose a way to deal with this problem at the expense of communication cost: over-partitioning. If, for a small value of k, the graph is divided into k × p partitions and these are assigned randomly to the processors, the nonuniform update patterns are evenly distributed, improving the work imbalance. Of course, if there are more partitions, the network communication increases and with it the communication cost. In the Experiments section, the authors state that k = 5 is typically sufficient (see Figure 3.c).
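
A minimal sketch of the over-partitioning idea (the graph partitioner itself is out of scope; `partition_graph` is an assumed external helper that returns k × p blocks):

```python
import random

def over_partition(graph, p, k, partition_graph):
    """Split `graph` into k*p blocks and randomly assign them to p CPUs.

    Randomly mixing many small blocks spreads the hot (frequently
    updated) regions across processors, reducing work imbalance at the
    cost of extra boundary communication.
    """
    blocks = partition_graph(graph, k * p)  # assumed external partitioner
    random.shuffle(blocks)
    # Deal the shuffled blocks round-robin to the p processors.
    return [blocks[i::p] for i in range(p)]
```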

processors-2
Figure 3. (a) Graph divided among two processors, causing a work imbalance, since in the leftmost half many vertices are updated frequently while in the other half not so many operations are needed. (b) Over-partitioning with k = 3. If each subgraph is assigned randomly to a CPU (e.g., the first, fourth and fifth divisions to CPU 1 and the rest to CPU 2), the work imbalance decreases. (c) The authors empirically found k = 5 to be a good balance between work imbalance and communication cost.

More detailed information about the algorithms and the experimental results is available in the paper and in this presentation.
