# Sampling triples from restricted networks using MCMC strategy

## Abstract :

In large networks, the connected triples are useful for solving various tasks including link prediction, community detection, and spam filtering. Existing works in this direction concern mostly with the exact or approximate counting of connected triples that are closed (aka, triangles). Evidently, the task of triple sampling has not been explored in depth, although sampling is a more fundamental task than counting, and the former is useful for solving various other tasks, including counting. In recent years, some works on triple sampling have been proposed that are based on direct sampling, solely for the purpose of triangle count approximation. They sample only from a uniform distribution, and are not effective for sampling triples from an arbitrary user-defined distribution. In this work we present two indirect triple sampling methods that are based on Markov Chain Monte Carlo (MCMC) sampling strategy. Both of the above methods are highly efficient compared to a direct sampling-based method, specifically for the task of sampling from a non-uniform probability distribution. Another significant advantage of the proposed methods is that they can sample triples from networks that have restricted access, on which a direct sampling based method is simply not applicable.

## Objective and Motivation :

Our objective is to obtain an efficient method for sampling triples from an arbitrary (user defined) probability distribution (say, *f*) defined over the set of triples in a network. The distribution *f* can be defined implicitly; for instance, one can only define a weighting function *w(·**)* over the set of triples, and *f* is simply the probability vector obtained from the weights of each of the triples. Any locally computable weight function should be admissible. Such a function can be formed by the topological properties of the vertices in the triples, or in case, the graph contains vertex or edge labels, the weight function can be designed based on the label composition of the vertices or edges of the triples.

To obtain a direct method for the sampling task that is defined in the above paragraph, we first need to compute *f* (probability mass function) from the weight function, and then obtain the *cmf* (cumulative mass function) of *f*. Obviously, this requires the knowledge of the entire sampling space (total number of vertices, edges, and triples). For a restricted graph, which can only be crawled by following the edges of the input network, such information is not available, so direct sampling is infeasible for solving the above sampling task on a restricted network.

The motivation for considering a restricted network comes from real-life consideration. Say, an analyst is using a crawler for crawling a Web graph, and he does not have the resources to store the entire graph in memory/disk. Under this setting, he may want to sample a set of triples (from a uniform distribution) alongside crawling so that he can approximate the transitivity of the Web graph. Clearly, without storing the entire network, he has no knowledge of the number of vertices, or edges in this network, let alone the number of triples. Also, for a hidden network, a user may not have access to an arbitrary node in the network for security reason, rather the desired node can only be accessed from another node which is one-hop away from it; such scenarios are common in real-life and are considered in some of the recent works that compute various network properties (degree distribution, average degree) by random walk over real-life networks, such as, Facebook.

In this work, we propose two methods for indirect triple sampling using Markov Chain Monte Carlo (MCMC) strategy. MCMC performs a random walk over the sample space such that the desired probability distribution (in this case, *f*) aligns with the stationary distribution of the random walk. Since MCMC computes the transition probability matrix of the random walk locally (on demand), it does not compute *f* explicitly; consequently, it does not need any information regarding the size of the sample space. As long as a state of the random walk can be visited from one of the neighboring states, an MCMC-based sampling works, which makes it an ideal candidate for sampling from a restricted network. Also, an MCMC-based method computes the transition probability matrix on-line, so it can accommodate addition or deletion of vertices (or edges) in a dynamic network, even when the sampling process is running.

## Method :

### Vertex-MCMC for Triple sampling :

Our first indirect method to address the above limitations is to use an MCMC sampling method that does not construct *cmf*, explicitly. Vertex-MCMC sampling uses a two-step sampling process. The first step samples a vertex *v* with appropriate distribution and the second step samples a triple from the set of triples at vertex *v*. For any MCMC algorithm, we need to define the states, the state transition process, the transition probability matrix, and the desired probability distribution. For vertex-MCMC, the set of states are the vertex-set *V* and the transition over the states happens along the edges (E). So the MCMC process is simply a random walk on the graph G. However, the stationary distribution of this walk is ζ— identical to the desired distribution of vertices for a two-step sampling. To achieve the desired sampling distribution we use Metropolis-Hastings (MH) algorithm.

### MCMC walk over Triples :

Our second indirect sampling method is named triple-MCMC, which performs MCMC walk over the triples. Triple-MCMC avoids computing *cmf* for both the steps. In fact, triple-MCMC is completely oblivious about the total number of triples in the graph. The set of states for this sampling algorithm is the set of all the triples, Π. Thus the random walk proceeds over the set of triples along a neighborhood graph which is defined below.

The neighbor of a triple is another triple with two common vertices. Thus, the triple-MCMC sampling obtains a sequence of dependent samples, where a sampled triple shares two vertices with the previous sampled triple. To compute the neighbor-set of a triple *t*, we need to find the other triples that can be obtained by replacing exactly one of the vertices of *t*.

## Dataset:

All the graphs listed in Table 2, are undirected, unweighted, simple and connected. We pre-process them to ensure these properties. The specification of the graphs (ver-tex count and edge count) may not match with the source, as in source, for some networks an undirected edge is represented by two directed edges in opposite directions; in our representation, for such edges we discard one edge of the edge-pairs. Additionally, we ensure that the graph is connected as MCMC algorithms perform a random walk over the graph. However, for all the graphs that we use, the largest connected component of a graph retains more than 90% of the edges.

# Result:

The performance of approximate triangle counting is measured by two metrics: execution time and accuracy. Typically, the method that wins in accuracy loses in running time. So, to make the comparison easy, we compare the accuracy (plotted on y axis) of different methods against different running time (plotted on x-axis). However, do note that for a given value for time, the number of triples that the methods sample differ. We show the results in Figure 5(a- c). We fitted the data with Bezier curve to show the trend of the algorithms. All the accuracy and execution times are average value that are computed from 10 runs of the algorithms. We can see that the direct sampling method performs the best, even for a small number of samples, and its performance improvement remains almost flat as the sample count increases; on the other hand, vertex-MCMC improves sharply as the number of samples increases, and for some graphs its performance even surpasses the performance of direct sampling. So, vertex-MCMC is particularly suitable for large graphs, where a sampling method can afford to take many sample, and yet can be competitive with an exact algorithm. For instance, in wiki20060925 graph, both direct sampling and vertex-MCMC obtain 90% counting accuracy for a 6 seconds execution time, but the exact method that uses an efficient edge iterator algorithm takes 67 seconds to execute. The charts in this figure also confirm that Buriol et el.’s method is not competitive with either of these methods.

## Code and Data:

https://github.com/DMGroup-IUPUI/TripleSamplingMCMC-Source

## Contact:

1. Mohammad Al Hasan (alhasan@cs.iupui.edu)

2. Mahmudur Rahman (prime059@gmail.com)

## Grant: NSF CAREER Award (IIS-1149851)

## Paper:

- Mahmudur Rahman, Mohammad Al Hasan. Sampling triples from restricted networks using MCMC strategy. CIKM ’14
- http://dl.acm.org/citation.cfm?id=2662075

## Citation:

@inproceedings{Rahman:2014, author = {Rahman, Mahmudur and Hasan, Mohammad Al}, title = {Sampling Triples from Restricted Networks Using MCMC Strategy}, booktitle = {Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management}, series = {CIKM '14}, year = {2014}, isbn = {978-1-4503-2598-1}, location = {Shanghai, China}, pages = {1519--1528}, numpages = {10}, url = {http://doi.acm.org/10.1145/2661829.2662075}, doi = {10.1145/2661829.2662075}, acmid = {2662075}, publisher = {ACM}, address = {New York, NY, USA}, keywords = {approximate triangle counting, markov chain monte carlo sampling, triple sampling}, }