This record is moving to https://digitalcommons.njit.edu/dissertations/128

The New Jersey Institute of Technology's
Electronic Theses & Dissertations Project

Title: Information filtering by multiple examples
Author: Zhu, Mingzhu
View Online: njit-etd2015-062
(xvi, 127 pages ~ 3.2 MB pdf)
Department: Department of Information Systems
Degree: Doctor of Philosophy
Program: Information Systems
Document Type: Dissertation
Advisory Committee: Wu, Yi-Fang Brook (Committee chair)
Duan, Lian (Committee member)
Oria, Vincent (Committee member)
Xu, Songhua (Committee member)
Zhao, Yihong (Committee member)
Date: 2015-05
Keywords: Information retrieval
Text mining
Query by example
Semi-supervised learning
Machine learning
Topic model
Availability: Unrestricted
Abstract:

A key to successfully satisfying an information need lies in how users express it as keyword queries. However, many users find it difficult to express their information needs using keywords, especially when the need is complex. Search By Multiple Examples (SBME), a promising method for overcoming this problem, allows users to specify their information needs as a set of relevant documents rather than as a set of keywords.

Most studies on SBME adopt Positive Unlabeled learning (PU learning) techniques, treating the user's provided examples (denoted as query examples) as the positive set and the entire data collection in the database as the unlabeled set. The user's information need is then represented as a query vector, which is obtained from the query examples alone or further augmented with unlabeled data used as negative examples, and the documents are ranked according to their degree of similarity to the query vector. The query examples are treated as being relevant to a single topic when building the query vector, but they often belong to multiple topics. New methods are needed to deal with this topic diversity issue.
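
The query-vector approach described above can be illustrated with a minimal sketch. It is not the dissertation's implementation: it assumes raw term-frequency vectors, a Rocchio-style combination (positive centroid minus a weighted centroid of the unlabeled data), and cosine similarity for ranking. The function names and the `beta` weight are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector for one document."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Average a list of sparse term vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    n = len(vectors)
    return {t: c / n for t, c in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_query_vector(query_examples, unlabeled, beta=0.25):
    """Positive centroid minus beta times the unlabeled centroid.

    Treating the unlabeled data as negative evidence is an assumption;
    beta is a hypothetical weight, not a value from the dissertation.
    """
    pos = centroid([tf_vector(d) for d in query_examples])
    neg = centroid([tf_vector(d) for d in unlabeled])
    q = {t: pos.get(t, 0.0) - beta * neg.get(t, 0.0) for t in set(pos) | set(neg)}
    # Keep only positively weighted terms.
    return {t: w for t, w in q.items() if w > 0}


def rank(query_examples, collection, beta=0.25):
    """Return (score, document) pairs sorted by similarity to the query vector."""
    q = build_query_vector(query_examples, collection, beta)
    scored = [(cosine(q, tf_vector(d)), d) for d in collection]
    return sorted(scored, key=lambda s: s[0], reverse=True)
```

Note that a single centroid averages all query examples together, which is exactly where the topic diversity issue arises: examples drawn from several topics blur into one vector.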

Furthermore, many PU learning algorithms are available, but it is still unknown which of them perform most effectively for SBME, because the experiments in previous studies have not taken into account the user search situation, where the number of query examples varies and is much smaller than the size of the unlabeled data. When the query examples are far fewer than the unlabeled documents, system effectiveness may degrade dramatically because of the class imbalance problem. Thus, it is important to identify the most effective PU learning algorithms for SBME and to explore how to improve system effectiveness further.
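
One common way to soften class imbalance, shown here only as a hedged sketch, is to train several small models, each on the positive examples plus an equally sized random sample of the unlabeled data, and then combine their votes. The abstract does not say which techniques the dissertation uses; the nearest-centroid model, the function names, and the parameters `n_bags` and `seed` below are all assumptions made for illustration.

```python
import random


def mean_vector(rows):
    """Component-wise mean of equal-length numeric vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_bag(positives, unlabeled, rng):
    """Train one nearest-centroid model on a balanced sample.

    The unlabeled set is undersampled to the size of the positive set
    and treated as negative (an assumption of this sketch).
    """
    k = min(len(positives), len(unlabeled))
    negatives = rng.sample(unlabeled, k)
    return mean_vector(positives), mean_vector(negatives)


def fit_ensemble(positives, unlabeled, n_bags=11, seed=0):
    """Train n_bags models, each on a different undersample."""
    rng = random.Random(seed)
    return [train_bag(positives, unlabeled, rng) for _ in range(n_bags)]


def bagged_score(doc, models):
    """Fraction of models that place doc closer to the positive centroid."""
    votes = sum(
        1 for pos_c, neg_c in models if sq_dist(doc, pos_c) < sq_dist(doc, neg_c)
    )
    return votes / len(models)
```

Because each model sees a balanced training set, no single model is dominated by the unlabeled majority, and averaging the votes reduces the variance introduced by random undersampling.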

In previous studies on SBME, a document is usually treated as a vector whose features are the terms in the collection. Such a term-vector-based document representation causes high dimensionality problems when the collection is large; worse, noisy features can seriously degrade the performance of the learning algorithms. Feature selection is necessary for solving the high dimensionality problem.
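
To make the dimensionality problem concrete, the sketch below prunes a term vocabulary by document frequency before building vectors. This is a deliberately simple stand-in for illustration, not the method proposed in the dissertation; the thresholds `min_df` and `max_df_ratio` and the function names are hypothetical.

```python
from collections import Counter


def document_frequencies(docs):
    """Count in how many documents each term appears."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def select_features(docs, min_df=2, max_df_ratio=0.8):
    """Keep terms that are neither too rare nor too common.

    Rare terms add dimensions with little evidence; terms found in
    almost every document carry little discriminative information.
    """
    df = document_frequencies(docs)
    n = len(docs)
    return sorted(
        t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio
    )


def to_vector(doc, vocabulary):
    """Term-frequency vector restricted to the selected vocabulary."""
    counts = Counter(doc.lower().split())
    return [counts[t] for t in vocabulary]
```

With the full vocabulary, each document vector has one dimension per distinct term in the collection; after selection, it has one dimension per retained term only.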

This research proposes a framework named Information Filtering by Multiple Examples (IFME) to explore how to improve SBME by: (1) solving the topic diversity issue by adopting probabilistic topic models to predict the user's information need from the query examples; (2) tackling the class imbalance problem by adopting machine learning techniques; (3) identifying the most effective PU learning algorithms for SBME; (4) adopting ensemble learning techniques to further improve the effectiveness of the PU learning algorithms for SBME; and (5) adopting topic models for feature dimension reduction. The experimental results show that the proposed framework addresses the research questions successfully.


If you have any questions please contact the ETD Team, libetd@njit.edu.

NJIT's ETD project was given an ACRL/NJ Technology Innovation Honorable Mention Award in spring 2003