# GUISE: A Uniform Sampler for Constructing Frequency Histogram of Graphlets

## Abstract:

Graphlet frequency distribution (GFD) has recently become popular for characterizing large networks. However, the computation of GFD for a network requires the exact count of embedded graphlets in that network, which is a computationally expensive task. As a result, it is practically infeasible to compute the GFD for even a moderately large network. In this paper, we propose Guise, which uses a Markov Chain Monte Carlo (MCMC) sampling method for constructing the approximate GFD of a large network. Our experiments on networks with millions of nodes show that Guise obtains the GFD with very low rate of error within few minutes, whereas the exhaustive counting based approach takes several days.

## Motivation:

The motivation of this work comes from the objective of using graphlet frequencies for analyzing large graphs, specifically those that appear in the social science domain. However, huge computation cost is typically the deterrent which prevents the popular adoption of graphlet frequencies for analyzing large graphs in this domain. In the few exceptions that we have, the high computation cost is dealt with a compromise that considers graphlets that have up to four vertices. N. Przulj’s team built a software called GraphCrunch2, which is the only existing tool that counts the frequencies of all graphlets with three, four, and five vertices. We attempt to use GraphCrunch2 for obtaining graphlet frequencies on some moderate- sized social network graphs; for the majority of these graphs, the attempt was unsuccessful. For instance, we ran GraphCrunch2 on roadNet-PA data set (1,088,092 vertices, 1,541,898 edges) and soc-sign-Slashdot081106 data set (77,357 vertices, 468,554 edges) for nearly 80 hours on all cores 2 of a quad-core, 2.1 GHz machine; neither of the processes finishes more than 40% of the computation as reported on the software’s status bar. Typical graphs in the social network domain have millions of vertices and edges; the counting of various graphlet frequencies in such graphs will take months, if not years.

Fortunately, exact count of the occurrences of each of the graphlets is required only for a few scenarios; one such example is for designing graph kernels; for all the remaining usages of graphlets, specifically for building the GFD fingerprint of a network the relative frequencies among various graphlets suffices. Also note, if a frequency distribution among various graphlets is available, we can easily reconstruct the count of all the graphlets from the count of only one of the graphlets. For instance, if a GFD considers all three, four, and five size graphlets, from the information of triangle count 3, the count of all other graphlets can be recovered very easily. In reality, the exact count is only a reflection of the size of the input graph, so an analysis should only use relative frequencies among various graphlets so that the metric can be used across different graphs. in this paper, we show a sampling-based method that approximates (almost identically) the GFD of a network with millions of nodes in a minute, for which the exact counting of graphlet frequencies takes more than three days!

## Contribution:

We propose GUISE (GUISE is an anagram of the bold letters in **U**n**I**form **S**ampling of **G**aphl**E**ts ), an algorithm for constructing graphlet frequency distribution (GFD) that samples graphlets without enumerating the occurrences of each of the graphlets. GUISE uses a Markov Chain Monte Carlo (MCMC) sampling strategy to sample the embedding occurrences of each of the graphlets, such that each such embeddings is sampled with a uniform probability— this enables GUISE to build a frequency histogram of various graphlets by accumulating the respective counts of each of the graphlets from the obtained samples. We prove both theoretically and empirically that GUISE can build the GFD of a network which is all but identical to the one that we can achieve by enumerating (thus counting) the occurrences of all the graphlets. The key difference is that on a graph with millions of vertices and edges, Guise takes minutes, whereas a counting based approach may takes months, or even years.

## Method:

GUISE takes three parameters, an input graph *G*, the total number of samples (*SCount*) and mixing period (*STrial*). GUISE starts by picking (Line 1) any initial graphlet (*g_x*). In Line 2 it populates the neighborhood of *g_x* and saves the set of neighbors in a graphlet data structure (*d_gx* ). Then in an iterative way, it chooses a graphlet *g_y* from *g_x*’s neighbors with uniform distribution. To compute whether the move to the neighbor *g_y* is accepted or not, it also computes the neighbor count of *g_y* (Line 6). After computing the acceptance probability on Line 7, if the move is accepted, Guise replaces the current graphlet *g_x* by *g_y* (line 9 and 10), otherwise, *g_x* is kept unchanged. It also increments the number of embedding sampled (sampled) and visit count of the current graphlet type by one (line 14 and 15) after the end of mixing period of the walk; in this way, the sampling statistics of the mixing period is ignored. Guise terminates when the sampled count exceeds *SCount+STrial*.

## Dateset:

We use graphs of different sizes from different domains. The above table, lists the graphs/networks we use for our experiments, with vertex and edge counts. The graphs were collected from the websites http://snap.stanford.edu/data/index.html and http://www-personal.umich.edu/~mejn/netdata. All input graphs used in our experiments are made undirected (if necessary) during the preprocessing step. In the same table, we also report the total count of all 3-,4-, and 5-Graphlets in the third column as was obtained from GraphCrunch2; for the first four graphs, the process finishes within a reasonable amount of time. For ca-AstroPh dataset, GraphCrunch2 returns the exact count after 3 whole days of running. For the remaining graphs (marked with *), GraphCrunch2 fails to return the graphlet frequencies even after 3 days of running.

## Result:

We show that through uniform sampling, we can obtain a GFD, that converges to the true GFD. To show this convergence we use L1-norm distance measure. We use GraphCrunch2 to construct the true GFD and run Guise for 300 thousand iterations. Next, we compute the L1- Norm distance between the actual GFD vector and the Guise GFD vector and plot the measure against the iterations count (SCount). Figure 9 exhibits the results for four of the datasets. We can see that for each of the datasets, after 200K iterations L1-distance remains unchanged, which is the indication of the convergence of Guise. We also exhibit the visual similarity (GFD using line plot) between the actual GFD and the approximate-GFD for all these four graphs and also for an additional graph (ca-AstroPh) in Figure 10(a-e). The SCount values that we use to get the Guise GFDs are mentioned in the second column of Table 4 within parentheses. Note that, in Figure 10(f-h) we show the GFD that we obtain from Guise for an additional three large graphs. For these graphs, actual GFD was not available so no comparison is shown. But, the theoretical and empirical proof of Guise makes us confident to claim that these are the actual GFD signature of those graphs.

## Code and Data:

https://github.com/DMGroup-IUPUI/GUISE-Source

## Contact:

1. Mohammad Al Hasan (alhasan@cs.iupui.edu)

2. Mahmudur Rahman (prime059@gmail.com)

3. Mansurul Bhuiyan (mansurul1985@gmail.com)

4. Mahmuda Rahman (mrahma01@syr.edu)

## Grant: NSF CAREER Award (IIS-1149851)

## Papers:

- Mahmudur Rahman, Mansurul Alam Bhuiyan, Mahmuda Rahman, Mohammad Al Hasan: GUISE: a uniform sampler for constructing frequency histogram of graphlets. KAIS ’14
- http://link.springer.com/article/10.1007/s10115-013-0673-3
- Mansurul Bhuiyan, Mahmudur Rahman, Mahmuda Rahman, Mohammad Al Hasan: GUISE: Uniform Sampling of Graphlets for Large Graph Analysis. ICDM ’12
- http://dl.acm.org/citation.cfm?id=2471881.2472651

## Citations:

@Article{Rahman2014,

author="Rahman, Mahmudur and Bhuiyan, Mansurul Alam and Rahman, Mahmuda and Hasan, Mohammad Al",

title="GUISE: a uniform sampler for constructing frequency histogram of graphlets",

journal="Knowledge and Information Systems",

year="2014",

volume="38",

number="3",

pages="511--536",

issn="0219-3116",

doi="10.1007/s10115-013-0673-3",

url="http://dx.doi.org/10.1007/s10115-013-0673-3"

}

@inproceedings{Bhuiyan:2012,

author = {Bhuiyan, Mansurul A. and Rahman, Mahmudur and Rahman, Mahmuda and Al Hasan, Mohammad},

title = {GUISE: Uniform Sampling of Graphlets for Large Graph Analysis},

booktitle = {Proceedings of the 2012 IEEE 12th International Conference on Data Mining},

series = {ICDM '12},

year = {2012},

isbn = {978-0-7695-4905-7},

pages = {91--100},

numpages = {10},

url = {http://dx.doi.org/10.1109/ICDM.2012.87},

doi = {10.1109/ICDM.2012.87},

acmid = {2472651},

publisher = {IEEE Computer Society},

address = {Washington, DC, USA},

}