Waiting to be Sold: Prediction of Time-Dependent House Selling Probability

Abstract:

Buying or selling a house is one of the important decisions in a person’s life. Online listing websites such as “zillow.com”, “trulia.com”, and “realtor.com” provide significant and effective assistance during the buy/sell process. However, they fail to supply one important piece of information about a house: approximately how long it will take to be sold after it first appears in the listing. This information is equally important for both a potential buyer and the seller. With it, the seller can understand what she can do to expedite the sale, e.g. reduce the asking price or renovate/remodel some home features. A potential buyer, on the other hand, gets an idea of how much time she has to react, i.e. to place an offer. In this work, we propose a supervised regression (Cox regression) model inspired by survival analysis to predict the sale probability of a house given historical home sale information within an observation time window. We use real-life housing data collected from “trulia.com” to validate the proposed prediction algorithm and show its superior performance over traditional regression methods. We also show how the sale probability of a house is influenced by the values of basic house features such as price, size, number of bedrooms, number of bathrooms, and school quality.

Method:

For a house H on a real-estate listing site, F is the set of features describing different aspects of H, such as price, size, age, neighborhood amenities, school quality, exterior, and interior. Suppose H appears on the site at time t_appear and is sold at time t_sold. The interval between the appearance and sale events is defined as Int = t_sold − t_appear; in survival analysis terminology, Int is the time period for which a house is available to be sold, i.e. it survives before getting sold. In this work, our objective is to predict the probability that a house H, represented by the feature vector F, will be sold within a given time interval. We use a supervised learning algorithm from the survival analysis domain to solve this prediction task.
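To make the prediction target concrete, here is a minimal sketch of how a Cox proportional hazards model turns a feature vector F into a sale probability, using the standard relation S(t | F) = S0(t)^exp(β·F). The function name, the coefficient values, and the baseline survival value are illustrative assumptions, not the fitted values of this project.

```python
import math

def sale_probability(features, coefficients, baseline_survival_at_t):
    """P(sold within t | F) under a Cox proportional hazards model.

    S(t | F) = S0(t) ** exp(beta . F); the sale probability is 1 - S(t | F).
    """
    # Linear risk score beta . F
    risk_score = sum(b * x for b, x in zip(coefficients, features))
    survival = baseline_survival_at_t ** math.exp(risk_score)
    return 1.0 - survival

# Hypothetical standardized features: [price, size, school quality]
beta = [-0.4, 0.1, 0.3]   # illustrative coefficients, not fitted values
s0_at_60_days = 0.7       # illustrative baseline: 70% still unsold at 60 days

average_house = [0.0, 0.0, 0.0]
pricey_house = [1.0, 0.0, 0.0]

p_avg = sale_probability(average_house, beta, s0_at_60_days)
p_pricey = sale_probability(pricey_house, beta, s0_at_60_days)
print(p_avg, p_pricey)
```

With a negative coefficient on price, the more expensive house gets a lower probability of being sold within the same interval than the average house.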

Data:

We collect housing data from “trulia.com”. First, we pick five major cities in the central Indiana region (Fishers, Carmel, Indianapolis, Zionsville, and Noblesville) and crawl information on all listed houses in those cities. For each house, we crawl the raw text description, school and crime information, and the structured, bulleted basic house features, i.e. price, year built, number of rooms, type of house, days in Trulia, etc. To make the dataset suitable for survival analysis, we also crawl the current house status, i.e. “for sale”, “pending” (offer accepted), “active contingent” (offer placed), or “public record” (sold). In this work, we consider the “public record”, “pending”, and “active contingent” statuses as sold.

To observe the interval between the listing and sale date of a house, we crawl trulia.com in multiple phases, one week apart. In the first crawl (1st week of November 2015), we get the “days in Trulia” feature along with the status. We ignore houses that are already sold in the first crawl. One week later, we crawl the houses again and check their status; we do this five more times. In the end, each house in the data has the number of days it has been listed on trulia.com along with a status of sold or not sold. Note that trulia.com reports the listing period as 25+ weeks or 180+ days if it exceeds 25 weeks or 180 days. In total, we crawl 7,216 houses.
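The crawling procedure above can be sketched as a small function that turns one house's weekly statuses into a survival record, i.e. a duration plus a sold/censored flag. The function and constant names are hypothetical, and the handling of the 180-day cap is an assumption about how the data could be encoded.

```python
SOLD_STATUSES = {"public record", "pending", "active contingent"}
CRAWL_GAP_DAYS = 7
MAX_DAYS = 180  # trulia.com reports longer listings only as 180+ days

def survival_record(days_at_first_crawl, weekly_statuses):
    """Return (duration_days, sold) for one house, or None if it is skipped.

    days_at_first_crawl: "days in Trulia" value seen in the first crawl.
    weekly_statuses: status at each weekly crawl, first crawl first.
    """
    # Houses already sold in the first crawl are ignored.
    if not weekly_statuses or weekly_statuses[0] in SOLD_STATUSES:
        return None
    for week, status in enumerate(weekly_statuses):
        if status in SOLD_STATUSES:
            duration = days_at_first_crawl + week * CRAWL_GAP_DAYS
            return min(duration, MAX_DAYS), True
    # Never observed as sold: right-censored at the last crawl.
    last_week = len(weekly_statuses) - 1
    duration = days_at_first_crawl + last_week * CRAWL_GAP_DAYS
    return min(duration, MAX_DAYS), False

sold_house = survival_record(30, ["for sale", "for sale", "pending"])
unsold_house = survival_record(10, ["for sale"] * 7)
long_listing = survival_record(170, ["for sale"] * 7)
print(sold_house, unsold_house, long_listing)
```

Because the site is crawled weekly, the sale date in this sketch is only known to within one week; houses that are never seen as sold stay in the data as censored observations.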

Among the information that we crawl, the basic house features and the school and crime information are already formatted and clean. Cleaning is required for the house details text data. To clean the text, we use a standard data cleaning approach, i.e. stop word removal, stemming, and lemmatization. We also normalize keywords that are written in unstructured abbreviated forms. For example, the keyword fireplace is written as frplc or firplc in many house details.
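As an illustration of the text cleaning step, the sketch below lowercases a description, expands abbreviated keywords, and drops stop words. The two lookup tables are tiny hypothetical stand-ins, and stemming and lemmatization are left out to keep the example dependency-free.

```python
import re

# Hypothetical lookup tables; a real pipeline would use much larger ones.
ABBREVIATIONS = {"frplc": "fireplace", "firplc": "fireplace"}
STOP_WORDS = {"a", "an", "and", "the", "with", "in", "of", "is", "has"}

def clean_description(text):
    """Tokenize, expand known abbreviations, and remove stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in tokens]
    return [tok for tok in tokens if tok not in STOP_WORDS]

cleaned = clean_description("Cozy den with a Frplc, and the kitchen has an island.")
print(cleaned)
```

Mapping every spelling variant to a single keyword keeps features learned from the descriptions from being split across several tokens that mean the same thing.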

Data Distributions:

[Figure: datadist1]

[Figure: datadist2]

Results:

Overall performance:

[Table 1: Performance using C-Index]

C-Index: We use the concordance index (C-Index), also called the concordance probability, which is widely used to measure the effectiveness of prediction models in survival analysis. By definition, the C-Index has the same scale as the area under the ROC curve. Consider a pair of observations (y_i, y_i′) and (y_j, y_j′), where y_i is the actual observation and y_i′ is the predicted one. The pair is concordant when the predictions are ordered the same way as the actual observations (y_i > y_j and y_i′ > y_j′), and the C-Index is the fraction of comparable pairs that are concordant.
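A minimal sketch of the C-Index computation follows. The treatment of censored houses is an assumption based on the common Harrell-style definition: a pair is comparable only when the house with the shorter observed duration was actually sold, and tied predictions count as half. The function name is hypothetical.

```python
def concordance_index(durations, predicted, sold):
    """C-Index for predicted durations with right-censored observations."""
    concordant = 0.0
    comparable = 0
    n = len(durations)
    for i in range(n):
        for j in range(n):
            # Comparable pair: house i has the shorter duration and was sold.
            if durations[i] < durations[j] and sold[i]:
                comparable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0
                elif predicted[i] == predicted[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

days = [10, 20, 30]
all_sold = [True, True, True]
perfect = concordance_index(days, [12, 25, 40], all_sold)
reversed_order = concordance_index(days, [40, 25, 12], all_sold)
print(perfect, reversed_order)
```

When a model outputs a risk score rather than a duration (higher risk meaning a faster sale), the score can be negated before it is passed in as `predicted`.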

We show the performance of the developed algorithm. To compare and contrast, we design eight different versions of the regression model. Below we briefly discuss these versions.

Version 1: No Features (CoxPH): This is the most straightforward version of the Cox regression model, which does not use covariates/features.

Version 2: Basic House Features (CoxPH): This version of the Cox regression model uses only the basic house features.

Version 3: Basic House Features + Topic Modeling (CoxPH + LDA): In this version, we extend the basic house features with features learned from the house description data. This version leverages a topic modeling technique known as Latent Dirichlet Allocation (LDA) to find the topic distribution of the documents in the dataset. In this case, each document is the description of a house.

Version 4: Basic House Features + Deep Learning (CoxPH + Doc2Vec): This version differs slightly from the previous one. Here we use an unsupervised deep feature learning algorithm called Doc2Vec to learn features from the house descriptions.

Version 5: Basic House Features + Topic Modeling (FastCox + LDA): In this version, we keep the feature setting of Version 3 but use an elastic-net-regularized version of the Cox regression model known as FastCox.

Version 6, Version 7, and Version 8: In these versions, we keep the feature setting of Version 3 but use traditional regression algorithms, i.e. SVR (Support Vector Regression) and linear regression with elastic net and lasso regularization. Note that all these versions are trained on the non-censored data instances only.
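To give a feel for the simplest model in the list, Version 1, the sketch below estimates a single survival curve shared by all houses. It assumes that a Cox model without covariates reduces to its baseline hazard, estimated here with the Nelson-Aalen estimator (which is what the Breslow baseline estimator becomes when there are no features). The function name is hypothetical, and this is not necessarily how the project's code computes it.

```python
import math

def baseline_survival(durations, sold):
    """Nelson-Aalen cumulative hazard turned into a survival curve.

    Returns a list of (time, S(time)) pairs, one per distinct sale time.
    Censored houses count as at risk up to their last observed day.
    """
    event_times = sorted({t for t, s in zip(durations, sold) if s})
    cumulative_hazard = 0.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, s in zip(durations, sold) if d == t and s)
        cumulative_hazard += events / at_risk
        curve.append((t, math.exp(-cumulative_hazard)))
    return curve

# Four hypothetical houses: two sold, two still listed (censored).
curve = baseline_survival([10, 20, 20, 30], [True, True, False, False])
print(curve)
```

Unlike Versions 6 to 8, which are trained on the non-censored instances only, this estimate also uses the censored houses through the at-risk counts.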

Influence of Different Basic Features on Survival Probability of a House:

[Figure: Result1]

[Figure: Result2]

Publications

Waiting to be Sold: Prediction of Time-Dependent House Selling Probability. In IEEE DSAA: Third International Conference on Data Science and Advanced Analytics.

Code and Data:  

https://github.iu.edu/DMGroup-IUPUI/Home_Surviva

Credit link 

https://www.trulia.com/for_rent/Indianapolis,IN/ 

Contact:

1. Mohammad Al Hasan (alhasan@cs.iupui.edu)

2. Mansurul Bhuiyan (mansurul1985@gmail.com)