The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.