Representing Graphs as Bag of Vertices and Partitions for Graph Classification

Abstract:

Graph classification is a challenging task because finding a good feature representation for graphs is difficult. Existing methods use topological metrics or local subgraphs as features. Compiling an exhaustive set of features that can discriminate graphs of one class from those of another is difficult. Moreover, exact computation of some of the metrics, e.g., diameter and shortest path, is computationally expensive. Another issue with graph classification is that in real life, even though each individual graph can be very large, the number of distinct graphs available for training a classification model is limited. Such scarcity of graph data yields models that have far fewer training instances than model parameters, which leads to poor classification performance. In this work, we propose a novel approach to graph classification using two alternative graph representations: the bag of vertices and the bag of partitions. For the first representation we use deep-learning-based node features, and for the second we use traditional metric-based features. Both representations are examples of inflating training data with distorted data samples, a technique that has become popular in recent years, and they overcome the limitations of current graph classification approaches. Our experiments with 43 real-life graphs from seven different domains show that the bag representation of a graph improves graph classification performance significantly. We demonstrate a 4% to 75% improvement with the vertex-based approach and a 4% to 51% improvement with the partition-based approach over the existing best methods. Moreover, our vertex and partition multi-instance methods are on average 75 and 11 times faster in feature construction time than the current best, respectively.

Method:

Consider a graph database G = {Gi}, 1 ≤ i ≤ n, where each graph Gi is associated with a category label L(Gi). For a graph Gi, we use Gi.V and Gi.Pk to denote the vertex set and the k partition-induced subgraphs of that graph, respectively. The task of supervised graph classification is to learn a classification model from a set of training graph instances. The main challenge of graph classification is to obtain a good feature representation of the graphs in G for solving this classification task.
Our solution to the feature-representation problem is to map a graph to a bag of vertices or a bag of partition-induced subgraphs, such that each instance in a bag becomes a distinct row in the classification training data. The instances in a bag inherit the label of the parent graph they represent. Thus, if v ∈ Gi.V is used as a bag instance for the graph Gi, the label of v is L(Gi). If, for a graph Gi, all the vertices are used in the bag, one row of a traditional graph classification dataset becomes |Gi.V| rows in our data representation, each sharing the same label L(Gi). Instead of vertices, we can also use partition-induced subgraphs as the bag instances; in this case, one row in a traditional graph classification dataset becomes |Gi.Pk| rows in our data representation, each sharing the same label L(Gi). For a large input graph, we do not need to fill the bag with all the vertices of that graph; we can instead include only a random subset of the vertices, as the sketch below illustrates.
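
To make the bag construction concrete, here is a minimal Python sketch of the idea. It is not from the paper; the function name graph_to_bag and its parameters are our own illustration of how one labeled graph row expands into many labeled instance rows.

    import random

    def graph_to_bag(vertices, label, sample_size=None):
        """Expand one labeled graph into a bag of labeled vertex instances.

        vertices    -- iterable of vertex ids of graph Gi (Gi.V)
        label       -- the graph's category label L(Gi)
        sample_size -- if set, use only a random subset of vertices (for large graphs)
        """
        vertices = list(vertices)
        if sample_size is not None and sample_size < len(vertices):
            vertices = random.sample(vertices, sample_size)
        # One graph row becomes |Gi.V| (or sample_size) rows, all labeled L(Gi).
        return [(v, label) for v in vertices]

    # Example: a toy graph with 5 vertices labeled "citation" yields 5 rows.
    bag = graph_to_bag(range(5), "citation")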

Vertex Feature Representation using Random Walk:

Feature engineering:

In Figure 1, we illustrate how we compute the feature representation of the vertices of an input graph G(V, E). The toy graph in this example has 12 vertices and 12 edges. First, we uniformly select a set of target vertices v ⊂ V, and for each of them we perform t random walks of length l. Suppose we pick vertices 1, 8, and 12, with l = 5 and t = 2. The figure shows the two random walks for vertex 1 only; we use different colors (red, blue, violet, and orange) to illustrate different random walks. Once we have the random walks, we treat them as sentences in a document (step 2 in Figure 1). In the third step, we pass the document to the text modeler (shown as a black box in Figure 1). As output (step 4 in Figure 1), the text modeler produces a feature representation of a given length (d dimensions) for every word, i.e., every vertex (1, 2, ..., 12) in this case.
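
As an illustration of this pipeline, the following sketch generates truncated random walks with networkx and feeds them as sentences to gensim's Word2Vec (the text modeler). It is a minimal sketch, not the paper's exact implementation; parameter values and the karate-club toy graph are placeholders.

    import random
    import networkx as nx
    from gensim.models import Word2Vec  # gensim >= 4.0

    def random_walks(G, targets, walk_len=5, num_walks=2):
        """Perform num_walks random walks of length walk_len from each target vertex."""
        walks = []
        for v in targets:
            for _ in range(num_walks):
                walk = [v]
                for _ in range(walk_len - 1):
                    nbrs = list(G.neighbors(walk[-1]))
                    if not nbrs:
                        break
                    walk.append(random.choice(nbrs))
                # Word2Vec expects sentences as lists of string tokens.
                walks.append([str(u) for u in walk])
        return walks

    G = nx.karate_club_graph()               # stand-in for the toy graph
    # Here every vertex is a walk target; for a large graph, a uniform
    # random subset of vertices can be used instead.
    walks = random_walks(G, targets=list(G.nodes()))
    model = Word2Vec(walks, vector_size=16, window=3, min_count=1, sg=1)
    vec = model.wv["0"]                      # d-dimensional feature of vertex 0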

Pseudo Code:

[Algorithm 1: vertex multi-instance graph classification (pseudocode)]

The input of the algorithm is a graph database G in which each graph Gi is associated with a category label L(Gi). The algorithm iterates over each graph Gi. For each graph, after generating the random walks, it runs Word2Vec over the collection of walks to obtain the feature representation of the graph's vertices and stores them in Bag(Gi.V) (Lines 2-4). The algorithm then labels each instance in Bag(Gi.V) with the graph Gi's category label L(Gi), and the labeled instances of Bag(Gi.V) are appended to a list called Data. When the iteration finishes, the algorithm applies k-fold cross-validation, splitting Data at the bag level to generate the training (Data_train) and test (Data_test) folds. The algorithm then uses Data_train to train a model Hθ (Line 6). Finally, it predicts the labels of the data points in the test fold (Data_test) using Hθ and outputs the label of each test graph, represented by its bag of vertices, via majority voting (Lines 7-8).
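
The sketch below mirrors the described bag-level train/test flow with scikit-learn. It is our own simplification: logistic regression stands in for the model Hθ, and GroupKFold keeps all instances of a bag in the same fold, which is one way to realize the bag-level split.

    import numpy as np
    from collections import Counter
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GroupKFold

    def classify_bags(X, y, bag_ids, n_splits=5):
        """X, y, bag_ids are NumPy arrays: instance features, inherited
        graph labels, and the parent graph id of each row, respectively."""
        for train_idx, test_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=bag_ids):
            model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
            pred = model.predict(X[test_idx])
            # Majority voting: a bag's label is the most common instance prediction.
            for bag in np.unique(bag_ids[test_idx]):
                votes = pred[bag_ids[test_idx] == bag]
                print(bag, Counter(votes).most_common(1)[0][0])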

[Algorithm 2: partition multi-instance graph classification (pseudocode)]

The steps of the partition multi-instance technique are similar to those of the vertex multi-instance technique, except that we use a graph partitioning algorithm to partition each graph Gi (Line 3), say into k partitions. Then, for each partition-induced subgraph in Gi.Pk, the partition multi-instance algorithm computes seven topology-based features and stores them in Bag(Gi.Pk) (Lines 4-5). Once the algorithm has labeled each instance in Bag(Gi.Pk) with the corresponding graph category label and stored it in a global list Data, it performs the k-fold train-test scheme with softmax classification (Lines 6-9) and outputs the labels of the test graphs using majority voting (Line 10).
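
A minimal sketch of the partition step in Python follows. Since the seven features are not listed here, the features below (size, density, clustering, etc.) are illustrative stand-ins, and greedy modularity communities stand in for whichever graph partitioning algorithm is used.

    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    def partition_features(G):
        """Compute a topology-based feature vector per partition-induced subgraph."""
        bag = []
        for part in greedy_modularity_communities(G):   # stand-in partitioner
            S = G.subgraph(part).copy()                 # partition-induced subgraph
            bag.append([
                S.number_of_nodes(),
                S.number_of_edges(),
                nx.density(S),
                nx.average_clustering(S),
                nx.transitivity(S),
                sum(nx.triangles(S).values()) / 3,      # triangle count
                sum(d for _, d in S.degree()) / max(S.number_of_nodes(), 1),  # avg degree
            ])
        return bag  # each row inherits the parent graph's label L(Gi)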

Data:

[Table I: the 43 real-life graphs from seven domains used in our experiments]

Results:

[Table II: classification accuracy and percentage improvement over the best existing method]

We perform 14 classification tasks overall between different domains, as shown in Column 1 of Table II, for example, Citation vs. Coauthorship and Animal vs. Human Social. The vertex multi-instance classification performs excellently, as Column 2 of Table II shows; for example, in the coauthorship-communication task, this method achieves 97.0% accuracy. In Column 3 of the same table, we show the percentage improvement in accuracy over the current best approach (either NetSimile or Li, because we are unable to run RgMiner over all graphs from Table I) for the vertex multi-instance method. As we can see, the vertex multi-instance approach achieves a 4% to 75% improvement over the current best method. In Columns 4 and 5, we report the percentage accuracy and percentage improvement for the partition multi-instance method. This method also delivers excellent classification performance; for example, in the Coauthorship-Infrastructure task, it achieves 97.5% accuracy. The partition multi-instance approach reaches a 4% to 51.8% improvement over the current best method. In all 14 classification tasks, the vertex multi-instance and partition multi-instance approaches show superior or equal performance over the existing methods. In the last column of Table II, we list only the best existing method, either NetSimile or Li.

Code:

https://github.iu.edu/DMGroup-IUPUI/Graph_Classification

Contact:

1. Mohammad Al Hasan (alhasan@cs.iupui.edu)

2. Mansurul Bhuiyan (mansurul1985@gmail.com)