PRIIME: A Generic Framework for Interactive Personalized Interesting Pattern Discovery

Abstract:

In this work, we propose an interactive pattern discovery framework named PRIIME which identifies a set of interesting patterns for a specific user without requiring any prior input on the interestingness measure of patterns from the user. The proposed framework is generic to support discovery of the interesting set, sequence and graph type patterns. We develop a softmax classification based iterative learning algorithm that uses a limited number of interactive feedback from the user to learn her interestingness profile, and use this profile for pattern recommendation. To handle sequence and graph type patterns PRIIME adopts a neural net (NN) based unsupervised feature construction approach. We also develop a strategy that combines exploration and exploitation to select patterns for feedback. We show experimental results on several real-life datasets to validate the performance of the proposed method. We also compare with the existing methods of interactive pattern discovery to show that our method is substantially superior in performance. To portray the applicability of the framework, we present a case study from the real-estate domain.

Contribution and motivation:

The traditional frequent pattern mining algorithms generate an exponentially large number of patterns of which a substantial proportion are not much significant for many data analysis endeavors. Discovery of a small number of personalized interesting patterns from the large output set according to a particular user’s interest is an important as well as challenging task. Existing works on pattern summarization do not solve this problem from the personalization viewpoint. In this work, we propose an interactive pattern discovery framework named PRIIME which identifies a set of interesting patterns for a specific user without requiring any prior input on the interestingness measure of patterns from the user. The proposed framework is generic to support discovery of the interesting set, sequence and graph type patterns. We develop a softmax classification based iterative learning algorithm that uses a limited number of interactive feedback from the user to learn her interestingness profile, and use this profile for pattern recommendation. To handle sequence and graph type patterns PRIIME adopts a neural net (NN) based unsupervised feature construction approach. We also develop a strategy that combines exploration and exploitation to select patterns for feedback.

System Architecture:

Priime Architecture

The IPIPD framework of PRIIME is partitioned into two main blocks: Learning and Discovery (Figure 1). Learning learns a model for a user; it contains following five modules: Frequent Pattern Miner (FPM), Learner, Feature Generator (FG) and Feedback-Collector (FC), and Model. The FPM works as a bridge between the Learner and the data. FPM module can be any off-the-shelf pattern mining algorithm that mines frequent patterns given a normalized minimum support threshold. Any of the existing off the shelf pattern mining algorithms ranging from sequential to randomized, can be used. The Learner learns u’s interestingness function (f ) over the pattern-set O in multiple stages while utilizing user’s feedback on a small number of patterns selected from a partition of O in each stage. The Feature Generator (FG) module is responsible for generating efficient and appropriate feature vector representation of the incoming patterns for the Feedback Controller and the Model to use. FG uses the bag of words method for set patterns and NN-based unsupervised language model for sequence and graph patterns. The Feedback-Collector (FC)’s responsibility is to identify the patterns using a exploitation- exploitation based strategy that are sent to the user for feedback. Note that, patterns are sent to the user for feedback in its original form, not in the feature vector representation. Discovery delivers personalized interesting patterns to the user using the learned model.

Pseudo Code:

code

PRIIME starts by training the unsupervised feature generation model Mfeat using the transaction data D and initializes the learning model w. Afterward in each iteration, it uses Mfeat to infer the feature representation of each pattern in the incoming batch B. Then the algorithm selects patterns for feedback and sends them to the user. Once it receives the feedback, it updates the current learning model. The algorithm continues until the improvement of the learning model falls below a user-defined threshold.

Neural Net based feature representation of graph patterns:

feature engineering

In Step 1, we have a graph (transaction) with 6 nodes and 6 edges, where each of the nodes has a label. We assume all the edges share the same label 1 (not shown in the figure). Step 2 obtains six dfs-walk edge sequences, on which two are shown in the figure—one dfs-walk originated from vertex 1 (red ink) and the other originated from vertex 2 (blue ink and dashed). Note that, it is possible to have dfs-walk with different ordering from a vertex but in this work, we allow one walk per vertex. Once we have such representation of the entire data D, we feed the text corpus to the language model (Step 3) and finally for a given feature length d, it trains the model (Mfeat ) and produces feature vectors of all M dfs-walks in the text corpus. During the classification stage for a pattern graph, we create only one dfs-walk from a randomly selected vertex and generate feature vector representation by using Mfeat .

Data:

Data

We use four set datasets, two sequence datasets, and two graph datasets to validate the performance of PRIIME. Two of the set datasets, Chess and Mushroom, are real-life datasets collected from FIMI repository (http://fimi.ua.ac.be/data/). The third one is an artificially generated electronic health record~(EHR) dataset created by Uri Kartoun from Harvard University (http://bit.ly/1Db1yHo ), which has health record of 10,000 patients. To generate a class label for each patient, we identify whether the patient is diagnosed with arthritis, type-I/II diabetics or not. The last set dataset is a drug side effects dataset collected from CIDER repository (http://sideeffects.embl.de/download/). This data contains 996 drugs and associated side effects. We consider each drug as a transaction and side effects as the items. Each drug also has a class label based on the primary disease which it cures: cancer, heart diseases, brain diseases, pain, and others.

The two sequence dataset Reuters1 and Reuters2 is collected according to the procedure mention in a sequence classification work “Itemset based sequence classification.”. These datasets are prepared by extracting specific keywords from the entries in the Reuters-21578 corpus (http://web.ist.utl.pt/~acardoso/datasets/r8-train-stemmed.txt). Reuters1 consists of word sequences labeled by either “earn” or “acq”. Reuters2 consists of sequences labeled by “interest”, “money-fx”, “crude” and “trade”. Out of the two graph datasets, one is called “pdb”, which is collected from predictive toxicology challenge (http://www.predictive-toxicology.org/ptc/). The other graph dataset, Mutagenicity-II is obtained from the authors of “Don’t be afraid of simpler pattern”.

Results:

Overall performance:

Result1

We show how the performance of the learning model improves with the iterations of feedback. As discussed earlier in each iteration, the learner selects ten patterns for feedback and iteratively updates the current model. After each iteration, we measure weighted F-Score of our currently learned model with the test fold. Then we plot F-score of the learner across the number of iterations executed so far.

Effectiness of representation learning:

Result2

In this experiment, we evaluate the effectiveness of the proposed unsupervised feature construction based approach with alternative feature representation for sequence and graph patterns. For sequence pattern, we use n-gram based technique, where we extract all 2-grams and 3-grams from the data and use those as features. For graph patterns, we obtain 20 topological metrics i.e. centralities, diameter, radius, clustering coefficient and consider these metrics as features for graph patterns. We run PRIIME for 10 iterative sessions (100 feedback) and use the weighted F-score for comparison. In figure 4, we show the findings using bar chart. First two groups of the bar chart are for sequence datasets and the remaining two groups are for graph datasets. As we can see, PRIIME’s unsupervised feature learning performs better than the alternative feature representation of both sequence and graph patterns. Specifically, PRIIME’s feature embedding for graph datasets is significantly better than that of the competitors’.

Code and Data:

https://github.iu.edu/DMGroup-IUPUI/PRIIME

Contact:

1. Mohammad Al Hasan (alhasan@cs.iupui.edu)

2. Mansurul Bhuiyan (mansurul1985@gmail.com)


Grant: NSF CAREER Award (IIS-1149851)

Citation: 

@article{bhuiyan2016Priime,
         title={PRIIME: A Generic Framework for Interactive Personalized Interesting Pattern Discovery},
         author={Bhuiyan, Mansurul A and Al Hasan, Mohammad},
         journal={IEEE International Conference on Big Data},
         year={2016},
         publisher={IEEE}
    }