Classifying Biological Articles using Web Resources


Francisco M. Couto (fcouto@di.fc.ul.pt), Bruno Martins (bmartins@xldb.di.fc.ul.pt), Mário J. Silva (mjs@di.fc.ul…)

ABSTRACT

Text classification systems for biomedical literature aim to select, from large corpora, the articles relevant to a specific issue. Most systems with acceptable accuracy are based on domain knowledge, which is very expensive to obtain and does not provide a general solution. This paper presents a novel approach to text classification of biomedical literature, involving the use of information extracted from related web resources. We validated this approach by implementing the proposed method and testing it on the KDD2002 Cup challenge: bio-text task. Results show that our approach can effectively improve the efficiency of text classification systems for biomedical literature.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Bioinformatics (genome or protein) databases, Feature extraction or construction, Text mining, Web mining

Keywords
biomedical text classification

INTRODUCTION

The classification of biological literature is an important recent research topic, motivated by the large number of biological articles that curators have to read in order to update biological databases, or simply to keep up with progress in a specific area. Text classification applied to biological literature can reduce this effort by automatically selecting only the articles relevant to a given task [3].
Text classification systems are primarily designed to assign categories to documents, in order to support information retrieval or to aid human indexers in the assignment task. In its simplest form, binary classification, the system separates the relevant documents (or passages) from the irrelevant ones in large corpora [20]. Most approaches to text classification are based on statistical natural language processing [13]. They apply quantitative methods for automated language processing, using probabilistic modeling, information theory, and sometimes linear algebra. Statistical text classification systems need a training set of documents in order to build a model that is later used to classify other documents. This training set consists of a representation of each document together with its expected classification. After building the model, and given the representation of a new document, the system can then predict its class. Most of the time, when we want to evaluate a model, we create a test set. This set also contains the expected classification for each document, which is later compared to its predicted classification. The most common form of representation for documents is the bag-of-words. In this approach, the features are the set of all words mentioned in the documents, and each document is represented by the number of occurrences of each of these features in its text.
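
The bag-of-words representation described above can be illustrated with a minimal Python sketch. It is not the implementation used in this paper. The function names (tokenize, build_vocabulary, bag_of_words), the lowercase alphabetic tokenization, and the toy documents are all assumptions introduced only for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    Assumption: a simple regular-expression tokenizer; real systems
    may use a different tokenization.
    """
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(documents):
    """Return the sorted list of all distinct words in the documents."""
    vocabulary = set()
    for text in documents:
        vocabulary.update(tokenize(text))
    return sorted(vocabulary)


def bag_of_words(text, vocabulary):
    """Represent a document as the occurrence count of each vocabulary word.

    Words that are not in the vocabulary are ignored (an assumption
    of this sketch).
    """
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Toy documents, invented for illustration only.
training_docs = [
    "the gene encodes a protein",
    "the protein binds the gene promoter",
]
vocabulary = build_vocabulary(training_docs)
vectors = [bag_of_words(doc, vocabulary) for doc in training_docs]
```

Each document becomes a vector with one position per vocabulary word. In a statistical classifier, vectors like these, paired with the expected classification of each document, would form the training set and the test set.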
