Learning to Find Relevant Biological Articles Without Negative Training Examples
Keith Noto, Milton H. Saier Jr., and Charles Elkan. University of California, La …
Abstract. Classifiers are traditionally learned using sets of positive and negative training examples. Often, however, a classifier is needed when only an incomplete set of positive examples and a set of unlabeled examples are available for training. This is the situation, for example, with the Transport Classification Database (TCDB, www.tcdb.org), a repository of information about proteins involved in transmembrane transport. This paper presents and evaluates a method for learning to rank the likely relevance to TCDB of newly published scientific articles, using the articles currently referenced in TCDB as positive training examples. The new method has succeeded in identifying 964 new articles relevant to TCDB in fewer than six months, which is a major practical success. From a general data mining perspective, the contributions of this paper are (i) devising and evaluating two novel approaches that solve the positive-only problem effectively, (ii) applying support vector machines in a state-of-the-art way for recognizing and ranking relevance, and (iii) deploying a system to update a widely used, real-world biomedical database. Supplementary information, including all data sets, is publicly available at www.cs.ucsd.edu/users/knoto/pub/ajcai08.
Introduction
The Transport Classification Database, or TCDB (www.tcdb.org), is an online database that contains sequence, structural, and functional information about proteins that relate to transport across cell membranes in a variety of organisms, categorized into over 550 families of proteins [11]. TCDB is widely used, averaging over 50 different users per day from research institutions all over the world. TCDB defines and implements the transport classification system [10] for categorizing transport proteins, which was adopted by the International Union of Biochemistry and Molecular Biology as the international standard in 2002.
As of October 15, 2007, the start of the project described in this paper, the data contained in TCDB were compiled from 3,403 publications in over 200 different journals…