A METHODOLOGY FOR DATABASE AND DOCUMENT SELECTION

Raj Gaurang Tiwari, Mohd. Husain, Anil Agrawal

Abstract


The rapid growth in both the amount of information on the Web and the number of users has left Web users facing information overload. Providing users with precisely the information they need is therefore becoming a critical issue in Web-based information retrieval and Web applications. In this work, we aim to improve the performance of Web information retrieval and Web presentation by developing and employing Web data mining paradigms. Every search engine has a corresponding database that defines the set of documents it can search. Generally, an index for all documents in the database is created and stored in the search engine. Text data on the Internet can be partitioned naturally into several databases. The desired data can be retrieved efficiently if we can accurately predict the usefulness of each database, because with such information we only need to retrieve potentially useful documents from useful databases. For a given query q, the usefulness of a text database is defined as the number of documents in the database that are sufficiently relevant to q. In this paper, we propose new approaches for database selection and document selection. We also implement these algorithms using the .NET Framework. Our experimental results indicate that these methods can yield substantial improvements over existing techniques.
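The usefulness definition above can be illustrated with a minimal sketch. It is not the method proposed in the paper: the abstract does not say how relevance is measured or what counts as "sufficiently relevant", so the cosine similarity over raw term counts, the `threshold` parameter, and all function names and sample data below are assumptions made purely for illustration. The sketch also computes usefulness exactly by scanning every document, which is the quantity a database selection method would try to estimate without such a scan.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors (assumed relevance measure)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def usefulness(query: str, documents: list[str], threshold: float) -> int:
    """Count the documents whose similarity to the query reaches the threshold."""
    q = Counter(query.lower().split())
    return sum(
        1
        for doc in documents
        if cosine_similarity(q, Counter(doc.lower().split())) >= threshold
    )


def rank_databases(
    query: str, databases: dict[str, list[str]], threshold: float
) -> list[tuple[str, int]]:
    """Order databases from most to least useful for the query."""
    scores = {
        name: usefulness(query, docs, threshold)
        for name, docs in databases.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample data, not taken from the paper.
databases = {
    "news": [
        "web search engine ranking",
        "football match results",
    ],
    "ir": [
        "metasearch engine database selection",
        "database selection for search engine queries",
        "cooking recipes",
    ],
}

ranking = rank_databases("database selection engine", databases, threshold=0.5)
```

With a ranking of this kind, a metasearch engine would forward the query only to the top-ranked databases and skip the ones whose usefulness is zero.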

Keywords


Metasearch Engine; Distributed query processing; Document selection.
