# An Iterative MapReduce based Frequent Subgraph Mining Algorithm

## Abstract:

Frequent subgraph mining (FSM) is an important task for exploratory data analysis on graph data. Over the years, many algorithms have been proposed to solve this task. These algorithms assume that the data structure of the mining task is small enough to fit in the main memory of a computer. However, as the real-world graph data grows, both in size and quantity, such an assumption does not hold any longer. To overcome this, some graph database-centric methods have been proposed in recent years for solving FSM; however, a distributed solution using MapReduce paradigm has not been explored extensively. Since MapReduce is becoming the de-facto paradigm for computation on massive data, an efficient FSM algorithm on this paradigm is of huge demand. In this work, we propose a frequent subgraph mining algorithm called FSM-H which uses an iterative MapReduce-based framework. FSM-H is complete as it returns all the frequent subgraphs for a given user-defined support, and it is efficient as it applies all the optimizations that the latest FSM algorithms adopt. Our experiments with real life and large synthetic datasets validate the effectiveness of FSM-H for mining frequent subgraphs from large graph datasets.

## Contribution and motivation:

Frequent subgraph mining (FSM) is an important task for exploratory data analysis on graph data. Over the years, many algorithms have been proposed to solve this task. These algorithms assume that the data structure of the mining task is small enough to fit in the main memory of a computer. However, as the real-world graph data grows, both in size and quantity, such an assumption does not hold any longer. To overcome this, some graph database-centric methods have been proposed in recent years for solving FSM; however, a distributed solution using MapReduce paradigm has not been explored extensively. Since, MapReduce is becoming the de-facto paradigm for computation on massive data, an efficient FSM algorithm on this paradigm is of huge demand.

In this paper, we propose, FSM-H, a distributed frequent subgraph mining method over MapReduce. Given a graph database, and a minimum support threshold, FSM-H generates a complete set of frequent subgraphs. To ensure completeness, it constructs and retains all patterns in a partition that have a non-zero support in the map phase of the mining, and then in the reduce phase, it decides whether a pattern is frequent by aggregating its support computed in other partitions from different computing nodes. To overcome the dependency among the states of a mining process, FSM-H runs in an iterative fashion, where the output from the reducers of iteration i−1 is used as an input for the mappers in the iteration i. The mappers of iteration i generate candidate subgraphs of size i (number of edge), and also compute the local support of the candidate pattern. The reducers of iteration i then find the true frequent subgraphs (of size i) by aggregating their local supports. They also write the data in disk that are processed in subsequent iterations.

## Framework:

The execution starts from the mappers as they read the key-value pair of size k patterns in partition p from the HDFS. As shown in the Figure, the mappers generate all possible k + 1-size candidate patterns and perform the isomorphism checking within partition p. For a pattern of size k+1 that passes the isomorphism test and has a non-zero occurrence, the mapper builds its key-value pair and emits that for the reducers. These key-value pairs are shuffled and sorted by the key field and each reducer receives a list of values with the same key field. The reducers then compute the support of the candidate pattern by aggregating the support value computed in the partitions where the respective pattern is successfully extended. If a pattern is frequent, the reducer writes appropriate key-value pairs in the HDFS for the mappers of the next iteration. If the number of frequent k + 1 size pattern is zero, execution of FSM-H halts.

## Pseudo Code:

**Mapper_FSG**: The mapper then generates all possible candidates of size k + 1 (Line 1) by extending each of the patterns in F^{p}_{k} (say, c), the mapper performs isomorphism checking to confirm whether c is generated from a valid generation path; in other words, it tests whether c passes the min-dfs-code based isomorphism test (Line 3). For successful candidates, the mapper populates their occurrence list (Line 4) over the database graphs in the partition p. If the occurrence list of a candidate pattern is non-empty, the mapper constructs a key-value pair, such as, (c.min-dfs-code, c.obj) and emits the constructed pair for the reducers to receive (Line 6).

**Reducer_FSG**: The reducer receives a set of key-value pairs, where the key is the min-dfs-code of a pattern namely c.min-dfs-code and the value is a list of c.obj’s constructed from all partitions where the pattern c has a non-zero support. Reducer then iterates (Line 1) over every c.obj and from the length of the occurrence list of each c.obj it computes the aggregated support of c. If the aggregated support is equal or higher than the minimum support threshold (Line 3), the reducer writes each element in the list paired with the min-dfs-code of c in HDFS for the mappers of the next iteration.

## Data:

We use six real-world graph datasets which are taken from http://www.cs.ucsb.edu/xyan/dataset.htm that contains graphs extracted from the PubChem. PubChem provides information on biological activities of small molecules and the graph datasets from PubChem represent atomic structure of different molecules. We also build a graph dataset from DBLP Computer Science Bibliography data 6 considering publications in the year range 1992-2003. Each graph in this dataset is the ego-network of a distinct author (say, u); vertices in this graph is u and his collaborators, and an edge between a pair of vertices represents one or more co-authorship events between the corresponding authors. We also create four synthetic datasets using a tool called “Graphgen”. The number of graphs in these datasets range from 100K to 1000K and each graph contains on average 25 − 30 edges.

## Results:

### Performance evaluation:

we demonstrate how FSM-H’s runtime varies with the number of active data nodes (slaves). We use the Yeast dataset and 20% minimum support threshold and keep 100 database graphs in one partition. We vary the count of data nodes among 3, 5, 7 and 9 and record the execution time for each of the configurations. As shown in Figure (a) the runtime reduces significantly with an increasing number of data nodes. In Figure (b) we plot the speedup that FSM-H achieves with an increasing

### Comparison:

we compare the execution time of FSM-H with that of Hill et al.’s implementation that we obtain from the authors. We run both the algorithms on three real world biological dataset for 40% minimum support. Table 2 compares the execution time between the two algorithms. As we can see FSM-H performs significantly better that the other algorithm. The reason behind the better performance of FSM-H is that, Hill et al’s mining strategy is naive and it generates duplicate patterns. Like FSM-H , Hill et al’’s algorithm mine patterns in map phase and count in reduce phase. Due to the duplicate patterns, the load over the network increases and subsequently, the reducers will have to wait for more time for the mappers to complete.

## Code:

## Code and Data: https://github.com/DMGroup-IUPUI/FSMH

## Contact:

1. Mohammad Al Hasan (alhasan@cs.iupui.edu)

2. Mansurul Bhuiyan (rabbi_buet@yahoo.com)

Grant: NSF CAREER Award (IIS-1149851)

## Citation:

@article{bhuiyan2015iterative,

title={An iterative MapReduce based frequent subgraph mining algorithm},

author={Bhuiyan, Mansurul A and Al Hasan, Mohammad},

journal={IEEE Transactions on Knowledge and Data Engineering},

volume={27},

number={3},

pages={608--620},

year={2015},

publisher={IEEE}

}