# GRAFT: an Approximate Graphlet Counting Algorithm for Large Graph Analysis

## Abstract:

Over the years, various graph generation models have been discovered, which reveal important properties of real-life networks. These discoveries enhance our understanding of real-life networks, or enable us to generate synthetic graphs with real-life graph properties. However, these models mostly reveal the global statistics, and are oblivious of the local structural details that are instrumental for solving real-life problems on social and information networks. In this work, we use graphlet frequency distribution (GFD) as an analysis tool for understanding the variance of local structure in a graph; we also show that it can help in comparing, and characterizing real-life networks. However, the main bottleneck to obtain GFD is the excessive computation cost for obtaining the frequency of each of the graphlets in a large network. To overcome this, we propose a simple, yet powerful algorithm, called GRAFT, that obtains the approximate graphlet frequency for all graphlets that have upto 5 vertices. Comparing to an exact counting algorithm, our algorithm achieves a speedup factor between 10 and 100 for a negligible counting error, which is, on average, less than 5%; For example, exact graphlet counting for ca-AstroPh takes approximately 3 days; but, GRAFT runs for 45 minutes to obtain a counting accuracy of 95.6%.

## Challenge and Contribution:

Computing graphlet frequencies of a large graph is a computationally expensive task. However, limiting the size of the graphlets makes the counting task polynomial; in fact, all induced embeddings of size-k graphlets in a graph G(V,E) can be enumerated (and counted) in O(|V |k) time by a brute-force search, which is prohibitively expensive for real-life networks having thousands of vertices, specifically for k = 4 or 5. However, in real-life large graphs are sparse, which helps designing graphlet counting algorithms that are significantly more efficient than a O(|V|5) algorithm. The idea is to iterate over the vertices of a graph, and along that process enumerate the graphlet embeddings that are associated to each of the vertices. In recent year, a software named GraphCrunch [8] has been released, which follows this approach for exact counting of graphlets in biological networks. However, exact counting though feasible for small graphs that arise in the domain of bioinformatics, such an approach may not be feasible for large graphs arising in the domains of social and information networks. For example, we ran GraphCrunch on Enronemail data set (38,692 vertices, 367,664 edges) and slashdot data set (77,357 vertices, 516,675 edges); neither of the counting processes finish after 5 days of running on a typical desktop computer.

In this research, we propose a method called GRAFT (GRAFT is an anagram of the bold letters in **A**pproximate **GR**aphlet **F**requency coun**T**ing) to perform the task of approximate counting of graphlets that have upto five vertices; in above figure, we show all such graphlets (modulo isomorphism). GRAFT samples a small number of edges uniformly and for each of the sampled edges it obtains a partial count for a graphlet such that the graphlet uses those sampled edges as its (induced) embedding; GRAFT then uses the partial count associated with each of the edges for approximating the total count of that graphlet. Experiments show that by sampling between 5% and 10% of the edges, we can easily achieve more than 95% of accuracy in counting the graphlet with a speedup factor between 20 and 10; for larger graphs, the sampling factor can be reduced to 1% (or less) to achieve similar accuracy and even higher speedup.

## Method:

GRAFT works as an EDGEITERATOR algorithm; in such an algorithm, the counting process iterates over the edges of the input graph, *G(V,E)*. For an edge *e ∈ E_G*, it finds the count of all induced embeddings of a graphlet *g* with the constraint that the edge *e* is part of the embeddings; we call this count a partial count of the graphlet with respect to the edge *e*. The partial count can be summed over all the edges to obtain a total count of the graphlet *g* in the input graph. However, in the above process, a distinct graphlet will be counted multiple times, by being accounted in different partial counts, so the above count needs to be corrected by dividing it with an appropriate normalization factor. Such a method yields an exact graphlet counting algorithm. GRAFT obtains an approximate graphlet counting algorithm by iterating over a random subset of edges instead of all the edges of the input graph. The fraction of edges in the random subset with respect to all the edges is called the sampling factor of GRAFT. The lower the sampling factor, the faster GRAFT runs. On the other hand, the higher the sampling factor, the better the accuracy of GRAFT. For a sampling factor of *1*, GRAFT returns an exact count.

## Dataset:

We perform several experiments to observe the performance of GRAFT. These experiments are performed on real life graphs obtained from the two websites http://snap.stanford.edu/data/index.html and http: //www-personal.umich.edu/~mejn/netdata . The name and statistics of these graphs are available from Table 1.

## Result:

We compare the speedup and percentage counting error of five real-world networks from collaboration, blog, and web domains that are shown in Table 1. For this, we use p = 0.1, which gives us about 1/p = 10 times speedup (except the case of ca-AstroPh, for which we use p = 0.01). The percentage error values are between 2.81% and 5.93%. Our method generally performs better as the graph becomes larger and denser. In the case of network ca-AstroPh (which is much bigger than other networks), we use p = 0.01 which gives a much higher speed-up while maintaining the same level of accuracy.

### Comparing the counting errors among different graphlets:

While counting with Graft, the counting error of various graphlets varies based on the complexity and the size of the graphlet. To compare the relative counting error among different graphlets, we use the ca-CondMat graph dataset with the edge selection probability of 0.1. We repeat the counting process for 10 times and report the average counting errors in Figure 7. Each column in this graph represents a distinct graphlet (which is shown as labels on the x-axis). The y-axis shows the percentage errors in counting. To show the variance of percentage error among different iterations, we show the results in a box-plot. From Figure 7 we observe that the counting error increases with the number of vertices in the graphlets. Also, complex graphlets that have more cycles are more error-prone than tree graphlets.

### GFD as a property for graph-clustering:

We demonstrate the usability of GFD in clustering graphs. Graphs can be generated from many sources e.g., citation graphs, collaboration graphs etc. Our objective is to do graph clustering on some of the graphs using GFD as a *29* dimensional metric. We use agglomerative hierarchical clustering with Euclidean distance as the distance metric (details available in paper).

## Code and Data:

https://github.com/DMGroup-IUPUI/GRAFT-Source

## Contact:

1. Mohammad Al Hasan (alhasan@cs.iupui.edu)

2. Mahmudur Rahman (prime059@gmail.com)

3. Mansurul Bhuiyan (mansurul1985@gmail.com)

## Grant: NSF CAREER Award (IIS-1149851)

## Papers:

- Mahmudur Rahman, Mansurul Alam Bhuiyan, Mohammad Al Hasan: Graft: An Efficient Graphlet Counting Method for Large Graph Analysis. TKDE ’14
- http://ieeexplore.ieee.org/document/6702435/?reload=true&arnumber=6702435
- Mahmudur Rahman, Mansurul Bhuiyan, Mohammad Al Hasan: GRAFT: an approximate graphlet counting algorithm for large graph analysis. CIKM ’12
- http://dl.acm.org/citation.cfm?id=2398454

## Citations:

@ARTICLE{RahmanBhuiyan2014,

author={M. Rahman and M. A. Bhuiyan and M. Al Hasan},

journal={IEEE Transactions on Knowledge and Data Engineering},

title={Graft: An Efficient Graphlet Counting Method for Large Graph Analysis},

year={2014},

volume={26},

number={10},

pages={2466-2478},

keywords={graph theory;network theory (graphs);GFD;GRAFT algorithm;diameter;exact counting algorithm;excessive computation cost;global network topology;graph Laplacian spectra;graphlet counting method;graphlet frequency distribution;large graph analysis;local topological structures;network analysis study;power-law exponent;real-life graph properties;synthetic graphs;Accuracy;Algorithm design and analysis;Approximation algorithms;Approximation methods;Biology;Software algorithms;Topology;GFD;Graph mining;graphlet counting},

doi={10.1109/TKDE.2013.2297929},

ISSN={1041-4347},

month={Oct},

}

@inproceedings{RahmanBhuiyan2012,

author = {Rahman, Mahmudur and Bhuiyan, Mansurul and Hasan, Mohammad Al},

title = {GRAFT: An Approximate Graphlet Counting Algorithm for Large Graph Analysis},

booktitle = {Proceedings of the 21st ACM International Conference on Information and Knowledge Management},

series = {CIKM '12},

year = {2012},

isbn = {978-1-4503-1156-4},

location = {Maui, Hawaii, USA},

pages = {1467--1471},

numpages = {5},

url = {http://doi.acm.org/10.1145/2396761.2398454},

doi = {10.1145/2396761.2398454},

acmid = {2398454},

publisher = {ACM},

address = {New York, NY, USA},

keywords = {approximate graphlet counting, graph analysis, graphlet frequency distribution},

}