NJIT ETD: "Information filtering by multiple examples" by Zhu, Mingzhu

This record is available at https://digitalcommons.njit.edu/dissertations/128

The New Jersey Institute of Technology's
Electronic Theses & Dissertations Project

Title: Information filtering by multiple examples

Author: Zhu, Mingzhu

View Online: njit-etd2015-062
(xvi, 127 pages ~ 3.2 MB pdf)

Department: Department of Information Systems

Degree: Doctor of Philosophy

Program: Information Systems

Document Type: Dissertation

Advisory Committee: Wu, Yi-Fang Brook (Committee chair)
Duan, Lian (Committee member)
Oria, Vincent (Committee member)
Xu, Songhua (Committee member)
Zhao, Yihong (Committee member)

Date: 2015-05

Keywords: Information retrieval
Text mining
Query by example
Semi-supervised learning
Machine learning
Topic model

Availability: Unrestricted

Abstract:
A key to successfully satisfying an information need lies in how users express it using keywords as queries. However, for many users, expressing their information needs using keywords is difficult, especially when the information need is complex. Search By Multiple Examples (SBME), a promising method for overcoming this problem, allows users to specify their information needs as a set of relevant documents rather than as a set of keywords.

Most studies on SBME adopt Positive Unlabeled learning (PU learning) techniques, treating the user's provided examples (denoted as query examples) as the positive set and the entire data collection in the database as the unlabeled set. The user's information need is then represented as a query vector, which is obtained from the query examples alone or further augmented with unlabeled data used as negative examples, and the documents are ranked according to their degree of similarity to the query vector. To build the query vector, the query examples are treated as being relevant to a single topic, but they often belong to multiple topics. New methods are needed to deal with this topic diversity issue.
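
As a rough illustration of the query-vector idea described above, the following minimal sketch averages the term vectors of the query examples into one query vector and ranks documents by cosine similarity to it. The centroid construction, the whitespace tokenizer, and all function names are assumptions made for this illustration; the abstract does not say how the dissertation builds or weights its vectors.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (naive whitespace tokenizer, an assumption)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average the term vectors of the query examples into a single query vector."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_examples(query_examples, collection):
    """Return (score, document) pairs, most similar to the query vector first."""
    query_vector = centroid([term_vector(doc) for doc in query_examples])
    scored = [(cosine(query_vector, term_vector(doc)), doc) for doc in collection]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy data, not taken from the dissertation.
query_examples = ["topic models for text mining", "text mining with topic models"]
collection = ["topic models in text retrieval", "cooking pasta at home"]
ranked = rank_by_examples(query_examples, collection)
```

Because a single centroid merges all query examples into one vector, this sketch shows the single-topic assumption that the abstract identifies as a limitation when the examples span several topics.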

Furthermore, many PU learning algorithms are available, but it is still unknown which perform most effectively for SBME, because the experiments in previous studies have not taken into account the user search situation, in which the number of query examples varies and is much smaller than the size of the unlabeled data. When the query examples are far fewer than the unlabeled data, system effectiveness may degrade dramatically because of the class imbalance problem. Thus, it is important to identify the most effective PU learning algorithms for SBME and to explore how to improve system effectiveness further.
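
To give a sense of the scale of the imbalance described above, the sketch below computes inverse-frequency class weights, one common heuristic for offsetting a small positive set against a large unlabeled set. The abstract does not name the technique used in the dissertation, so this weighting scheme, the function name, and the example counts are all assumptions for illustration only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Hypothetical search situation: 5 query examples against 995 unlabeled documents.
labels = ["positive"] * 5 + ["unlabeled"] * 995
weights = balanced_class_weights(labels)
```

With these hypothetical counts, each query example receives a weight of 100.0 while each unlabeled document receives a weight of roughly 0.5, so the two classes contribute equally to a weighted training loss.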

In previous studies on SBME, a document is usually treated as a vector whose features are the terms in the collection. Such a term-vector-based document representation causes high dimensionality problems when the collection is large; worse, noisy features can seriously degrade the performance of the learning algorithms. Feature selection is necessary for solving the high dimensionality problem.

This research proposes a framework named Information Filtering by Multiple Examples (IFME) to explore how to improve SBME by: (1) solving the topic diversity issue by adopting probabilistic topic models to predict the user's information need from the query examples; (2) tackling the class imbalance problem by adopting machine learning techniques; (3) identifying the most effective PU learning algorithms for SBME; (4) adopting ensemble learning techniques to further improve the effectiveness of the PU learning algorithms for SBME; and (5) adopting topic models for feature dimension reduction. The experimental results show that the proposed framework addressed the research questions successfully.
