Access Restriction Open

Author Skobeltsyn, Gleb ♦ Luu, Toan ♦ Žarko, Ivana Podnar ♦ Rajman, Martin ♦ Aberer, Karl
Source CiteSeerX
Content type Text
File Format PDF
Language English
Subject Domain (in DDC) Computer science, information & general works ♦ Data processing & computer science
Subject Keyword Large Scale P2p Text Retrieval ♦ Single Term Indexing ♦ Query Distribution ♦ Substantial Reduction ♦ Theoretical Analysis ♦ Websize Document Collection ♦ Distributed Indexing ♦ Bandwidth Consumption ♦ Specific Activation Mechanism ♦ Marginal Loss ♦ Retrieval Performance ♦ Rare Query ♦ Standard P2p Approach ♦ Major Problem ♦ State-of-the-art Centralized Query Engine ♦ Query-driven Algorithm ♦ Query-driven Indexing Structure ♦ Distributed Index ♦ Large Document Collection ♦ Query-driven Indexing Strategy ♦ Scalable Peer-to-peer Text Retrieval ♦ Indexing Information ♦ Possible Term Combination ♦ Top-k Document Reference ♦ Document Collection ♦ Experimental Result ♦ Query Statistic ♦ Generated Indexing Retrieval Traffic ♦ P2p Network ♦ Query-driven Indexing ♦ Indexing Term Combination
Description We present a query-driven algorithm for the distributed indexing of large document collections within structured P2P networks. To cope with the bandwidth consumption that has been identified as the major problem for the standard P2P approach with single-term indexing, we use a distributed index that stores at most the top-k document references, and only for carefully chosen indexing term combinations. In addition, since the number of possible term combinations extracted from a document collection can be very large, we propose to use query statistics to index only those combinations that users request frequently. Thus, by avoiding the maintenance of superfluous indexing information, we achieve a substantial reduction in bandwidth and storage. A specific activation mechanism continuously updates the indexing information according to changes in the query distribution, resulting in an efficient, constantly evolving query-driven indexing structure. We show that the size of the index and the generated indexing/retrieval traffic remain manageable even for web-size document collections, at the price of a marginal loss in precision for rare queries. Our theoretical analysis and experimental results provide convincing evidence of the feasibility of the query-driven indexing strategy for large-scale P2P text retrieval. Moreover, our experiments confirm that the retrieval performance is only slightly lower than that obtained with state-of-the-art centralized query engines.
Educational Role Student ♦ Teacher
Age Range above 22 years
Educational Use Research
Education Level UG and PG ♦ Career/Technical Study
Learning Resource Type Article
Publisher Date 2007-01-01
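
Illustrative sketch: the Description outlines an index that keeps only top-k document references per key and adds term-combination keys once query statistics show they are requested often. The Python below is a minimal single-process sketch of that idea and not the authors' implementation. The class name QueryDrivenIndex, the parameters top_k and activation_threshold, the term-frequency scoring, and the fallback of intersecting truncated single-term lists are all assumptions made for illustration; a real deployment would spread the keys over a structured P2P overlay rather than one dictionary.

```python
from collections import Counter


class QueryDrivenIndex:
    """Toy, single-process stand-in for a distributed query-driven index.

    All names and the scoring rule are hypothetical; they are not taken
    from the paper described in this record.
    """

    def __init__(self, documents, top_k=2, activation_threshold=2):
        # documents: dict mapping doc_id -> raw text
        self.documents = {
            doc_id: text.lower().split() for doc_id, text in documents.items()
        }
        self.top_k = top_k
        self.activation_threshold = activation_threshold
        self.query_counts = Counter()  # query statistics per term set
        self.index = {}                # frozenset of terms -> top-k doc ids

        # Single-term keys are indexed up front, truncated to top-k.
        vocabulary = {t for tokens in self.documents.values() for t in tokens}
        for term in vocabulary:
            key = frozenset([term])
            self.index[key] = self._top_k(key)

    def _top_k(self, key):
        """Rank documents containing every term of `key` by summed term frequency."""
        scored = []
        for doc_id, tokens in self.documents.items():
            counts = Counter(tokens)
            if all(counts[t] > 0 for t in key):
                scored.append((sum(counts[t] for t in key), doc_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return [doc_id for _, doc_id in scored[: self.top_k]]

    def query(self, terms):
        key = frozenset(t.lower() for t in terms)
        if not key:
            return []
        self.query_counts[key] += 1

        # Activation: index a term combination only once it is popular enough.
        if (key not in self.index
                and self.query_counts[key] >= self.activation_threshold):
            self.index[key] = self._top_k(key)

        if key in self.index:
            return self.index[key]

        # Rare query: intersect the truncated single-term lists. Documents
        # outside those lists are missed, which costs some precision.
        lists = [self.index.get(frozenset([t]), []) for t in key]
        common = set(lists[0]).intersection(*lists[1:])
        return sorted(common)[: self.top_k]


if __name__ == "__main__":
    docs = {
        "d1": "peer peer peer network",
        "d2": "peer peer index",
        "d3": "index index index retrieval",
        "d4": "peer index",
    }
    idx = QueryDrivenIndex(docs, top_k=2, activation_threshold=2)
    print(idx.query(["peer", "index"]))  # ['d2'] (fallback, d4 is missed)
    print(idx.query(["peer", "index"]))  # ['d2', 'd4'] (key now activated)
```

In this toy run the first request for the pair is answered from the truncated single-term lists and misses one matching document, while the second request crosses the assumed threshold, activates the combined key, and returns the full top-k. This mirrors, under the stated assumptions, the trade-off the Description mentions between index size and precision for rare queries.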