Interactive Pattern Mining on Hidden Data: A Sampling-based Solution

Abstract:

Mining frequent patterns from a hidden dataset is an important task with various real-life applications. In this research, we propose a solution to this problem that is based on Markov Chain Monte Carlo (MCMC) sampling of frequent patterns. Instead of returning all the frequent patterns, the proposed paradigm returns a small set of randomly selected patterns so that the clandestinity of the dataset can be maintained. Our solution also allows interactive sampling, so that the sampled patterns can fulfill the user's requirement effectively. We show experimental results from several real life datasets to validate the capability and usefulness of our solution; in particular, we show examples that by using our proposed solution, an eCommerce marketplace can allow pattern mining on user session data without disclosing the data to the public; such a mining paradigm helps the sellers of the marketplace, which eventually boost the marketplace's own revenue.

Overview and Motivation:

In this work, we propose an interactive pattern mining system that is based on sampling of frequent patterns. Instead of enumerating all the frequent patterns in the dataset, the above system uses a sampler for returning a small set of randomly selected frequent patterns to the analyst. Through an interactive console, it (the system) enables the analyst to provide feedbacks on the quality of the sampled patterns. Based on these feedbacks, the mining engine updates its sampling criteria so that in subsequent iterations it samples patterns that are of better quality.

The interactive pattern mining (IPM) system enables mining of frequent patterns from hidden data. It is evident that the sampler only relies on user’s feedback to drive the pattern selection process, so a user does not need access to the actual dataset on which the mining is being accomplished. Rather, the user exploits the interactive pattern mining system to obtain the desired patterns. Our sampling-based pattern mining paradigm also overcomes the information overload problem that is caused by the large number of frequent patterns in a traditional pattern mining task.

Method:

IPD_Sampling_Framework

The IPM framework that we devise to mine patterns from a hidden dataset. In the Figure 1, we show a set of k analysts, u1,u2,. . . , uk , that are simultaneously accessing frequent patterns from a hidden dataset. For a specific user, ui , a dedicated sampler, s i is assigned to facilitate the interactive mining session. Based on ui’s feedback on the sampled patterns, si updates the sampling distribution so that the analyst ui is served in the most fruitful way. It is easy to see that the overall framework is generic and is easily applicable for sampling itemsets, trees, sequences and graphs.

We use Metropolis Hastings  (MH) as the underlying sampling algorithm to mine patterns from the hidden database, D. Each sampler si in Figure 1 runs an instance of MH sampling with the objective that MH’s target sampling distribution (γ) matches with the desired sampling distribution of the user ui . Achieving the above objective is challenging because the sampler has no knowledge of the desired sampling distribution. To overcome this difficulty, we sample patterns using MH with a default sampling distribution, and update the default distribution using each of the user feedbacks so that the effective distribution converges towards γ with the updates.

Data:

Data

We use six real-life datasets shown in Table 1. The first four among them are itemset datasets and the remaining two are graph datasets. We obtain Mushroom, chess and connect dataset from the FIMI repository (http://fimi.ua.ac.be/data/) . eBay Query is the remaining itemset dataset that we generate by scrapping queries from the “Related Searches” feature of eBay. We build a query-graph where related queries are neighbors and populate the row of an itemset dataset by performing short random walks on this graph; in this way, the list of the queries that such a walk visits makes a transaction in the eBay Query dataset. Since related search suggestion is built by using user session data, this dataset is a metaphorical user session data, where a transaction can be regarded as the set of queries that a user executes in a user session. The remaining two datasets, Mutagenicity-I and Mutagenicity-II are chemical (graph) datasets that we obtained from the authors of “Don’t be afraid of simpler pattern”. All formatted datasets used in this work can be found in this github location.

Result:

Result

We first show that with an increasing number of feedbacks, the sampler converges to a distribution, where the interesting patterns are more likely to be sampled. For this, we run the sampler for a large number of iterations and compute the percentage of iterations in which the sampler visits an interesting pattern; we call this percentage samplers precision (SP). We compute SP from many independent runs, each using a different value for the feedback count. Then we find the correlation between the number of feedbacks and SP. Since the process is randomized, SP value for each feedback count is also computed by averaging over 10 runs. We show our results in Figure 2. As we can see, the number of feedback and the samplers precision are highly correlated—SP increases along with the feedback count. Also note that,the strength of correlation varies across various datasets—specifically, the correlations are poor for graph dataset (Mutagenicity-II), and good for itemset dataset (Mushroom).

Code and Data:

Interactive Itemset Sampling: https://github.iu.edu/DMGroup-IUPUI/Interactive_Itemset_Sampling

Interactive Graph Sampling: https://github.iu.edu/DMGroup-IUPUI/Interactive_graphPattern_Sampling

Contact:

1. Mohammad Al Hasan (alhasan@cs.iupui.edu)

2. Mansurul Bhuiyan (mansurul1985@gmail.com)

Grant: NSF CAREER Award (IIS-1149851)

Citation:

@inproceedings{Bhuiyanhasan:12,

         title={Interactive pattern mining on hidden data: a sampling-based solution},

         author={Bhuiyan, Mansurul and Mukhopadhyay, Snehasis and Hasan, Mohammad Al},

         booktitle={Proceedings of the International Conference on Information and Knowledge Management},

         pages={95--104},

         year={2012},

        organization={ACM},

}