DyLink2Vec: Effective Feature Representation for Link Prediction in Dynamic Networks 

Abstract:

The temporal dynamics of a complex system such as a social network or a communication network can be studied by understanding the patterns of link appearance and disappearance over time. A critical task along this understanding is to predict the link state of the network at a future time given a collection of link states at earlier time points. In existing literature, this task is known as link prediction in dynamic networks. Solving this task is more difficult than its counterpart in static networks because an effective feature representation of node-pair instances for the case of dynamic network is hard to obtain. To overcome this problem, we propose a novel method for metric embedding of node-pair instances of a dynamic network. The proposed method models the metric embedding task as an optimal coding problem where the objective is to minimize the reconstruction error, and it solves this optimization task using a gradient descent method. We validate the effectiveness of the learned feature representation by utilizing it for link prediction in various real-life dynamic networks. Specifically, we show that our proposed link prediction model, which uses the extracted feature representation for the training instances, outperforms several existing methods that use well-known link prediction features.

Contribution:

In this work, we propose DyLink2Vec (Link to Vector in a Dynamic network.), a novel learning method for obtaining a feature representation of node-pair instances, which is specifically suitable for the task of link prediction in a dynamic network. DyLink2Vec considers the feature learning task as an optimal coding problem, such that the optimal code of a node-pair is the desired feature representation. The learning process can be considered as a two-step compression-reconstruction step, where the first step compresses the input representation of a node-pair into a code by a non-linear transformation, and the second step reconstructs the input representation from the code by a reverse process and the optimal code is the one which yields the least amount of reconstruction error. The input representation of a node-pair is constructed using the connection history and the neighborhood information of the corresponding nodes. After obtaining an appropriate feature representation of the node-pairs, a standard supervised learning technique can be used (we use AdaBoost) for predicting link states at future times in the given dynamic network.

Method:

Figure1

A key challenge for dynamic link prediction is choosing an effective metric embedding for a given node-pair. Earlier works construct feature vector by adapting various topological similarity metrics for static link prediction or by considering the feature values of different snapshots as a time-series. DyLink2Vec, on the other hand, learns the feature embedding of the node-pairs by using an optimization framework, considering both network topology and link history. Assume a node-pair (u,v) for which we are computing the metric embedding α^(uv) ∈ R_d. Since we want α^(uv) to facilitate link prediction on dynamic graphs, the vector α^(uv) must capture two aspects that influence the possibility of a link between u and v in G_(t+1). The first aspect is the similarity between u and v in terms of graph topology across different timestamps, and the second aspect is the history of collaboration between u and v—both in the graph snapshots G_1, · · · , G_t.

Figure2

Link Prediction using Proposed Metric Eembedding:

For link prediction task in a dynamic network, G = {G_1 , G_2 ,...,G_t}; we split the snapshots into two overlapping time windows, [1, t − 1] and [2, t]. Training dataset􏰃 is feature representation for time snapshots [1, t − 1], the ground truth is constructed from G_t. DyLink2Vec learns optimal embedding h(·) using training dataset􏰃. After training a supervised classification model using α􏰃 = h (·) and y􏰃, prediction dataset is used to predict links at G_(t+1). For this supervised prediction task, we experiment with several classification algorithms. Among them SVM (support vector machine) and AdaBoost perform the best.

Figure3

Dataset Descriptions:

Figure4

Here we discuss the construction and characteristics of the datasets used for experiments. Enron email corpus consists of email exchanges between Enron employees. The Enron dataset has 11 time stamps and 16,836 possible node- pairs; the task is to use first 10 snapshots for predicting links in the 11th snapshot. We aggregate data into time steps of 1 week. We use the data from weeks 147 to 157 of the data trace for the experiments. The reason for choosing that window is that the snapshot of the graph at week 157 has the highest number of edges.

Collaboration dataset has 10 time stamps with author collaboration information about 49,455 author-pairs. The Collaboration dataset is constructed from citation data containing 1.4 million papers. We process the data to construct a network of authors with edges between them if they coauthor a paper. Considering each year as a time stamp, the data of years 2000-2009 (10 time stamps) is used for this experiment, where the data from the first nine time stamps is used for training and the last for prediction. Since this data is very sparse, we preprocess the data to retain only the active authors, who have last published papers on or after year 2010; moreover, the selected authors participate in at least two edges in seven or more time stamps.

Facebook1 and Facebook2 are network of Facebook wall posts. Each vertex is a Facebook user account and an edge represents the event that one user posts a message on the wall of another user. Both Facebook1 and Facebook2 has 9 time stamps. Facebook1 has 219,453 node-pairs. Facebook2 is an extended version of Facebook1 dataset with 883,785 node-pairs. Wall posts of 90 days are aggregated in one time step. We filter out all people who are active for less than 6 of the 9 time steps, along with the people who have degree less than 30. Facebook2 is created using a similar method, but a larger sample of Facebook wall posts is used for this dataset.

Result:

Figure5

We present the performance comparison results of DyLink2Vec based link prediction method with the four kinds of competing methods that we have discussed earlier. The figure have eight bar charts. The bar charts from the top to the bottom rows display the results for Enron, Collaboration, Facebook1 and Facebook2 datasets, respectively. The bar charts in a row show comparison results using PRAUC (left), and NDCG50 (right) metrics. Each chart has twelve bars, each representing a link prediction method, where the height of a bar is indicative of the performance metric value of the corresponding method. In each chart, from left to right, the first five bars (blue) correspond to the topological feature based methods, the next four (green) represent time series based methods, the tenth bar (black) is for DeepWalk, the eleventh bar (brown) represents tensor factorization based method CP, and the final bar (purple) represents the proposed method DyLink2Vec.

Code and Data:

To be updated after publication

Contact:

1. Mohammad Al Hasan (alhasan@cs.iupui.edu)

2. Mahmudur Rahman (prime059@gmail.com)

3. Kevin S. Xu (kevin.xu@utoledo.edu )

4. Chandan K. Reddy (reddy@cs.vt.edu)

Grant: NSF CAREER Award (IIS-1149851)

Citation:

To be updated after publication