Access Restriction Open

Author Shu, Liangcai ♦ Lin, Can ♦ Meng, Weiyi ♦ Han, Yue ♦ Yu, Clement T. ♦ Smalheiser, Neil R.
Source CiteSeerX
Content type Text
File Format PDF
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Entity Resolution ♦ Bayesian Network ♦ Attribute Value ♦ Record Matcher ♦ Data Set ♦ Attribute Matcher ♦ State-of-the-art Blocking Algorithm ♦ Different Blocking ♦ Markov Blanket ♦ Book Domain ♦ Data Object ♦ Good Performance ♦ Spectral Neighborhood ♦ Decision Tree ♦ Web Data Integration ♦ Real World ♦ Naive Bayes Classifier ♦ Generic Framework ♦ Relational Data Set ♦ Different Data Source ♦ Support Vector Machine ♦ Context Sensitive Value Matching Library ♦ Experimental Result
Abstract In Web data integration applications, we frequently need to determine whether data objects in different data sources represent the same real-world entity. This problem is known as entity resolution. In this paper, we propose BARM, a generic framework for entity resolution on relational data sets, consisting of a Blocker, Attribute Matchers, and a Record Matcher. BARM readily accommodates different blocking and matching algorithms. For the blocker, we apply SPectrAl Neighborhood (SPAN), a state-of-the-art blocking algorithm, to our data sets and show that it is effective and efficient. For the attribute matchers, we propose the Context Sensitive Value Matching Library (CSVML) for matching attribute values, along with an approach for evaluating the goodness of matching functions. Because CSVML takes the meaning and context of attribute values into account, it performs well, as the experimental results show. We adopt a Bayesian network as the record matcher in the framework and propose a method of inference based on the Markov blanket of the network. For comparison, we also apply three other classifiers (Decision Tree, Support Vector Machines, and the Naive Bayes classifier) to our data sets. Experiments show that the Bayesian network is advantageous in the book domain.
Educational Role Student ♦ Teacher
Age Range Above 22 years
Educational Use Research
Education Level UG and PG ♦ Career/Technical Study