Political texts on the Web, documenting laws and policies and the process leading to them, are of key importance to government, industry, and every individual citizen. Yet access to such texts is difficult due to the ever-increasing volume and complexity of the content, which creates a need to index or annotate them with a common controlled vocabulary or ontology.
We investigated the effectiveness of different sources of evidence, such as labeled training data, textual glosses of descriptor terms, and the thesaurus structure, for automatically indexing political texts.
The main findings are the following.
First, using a learning-to-rank approach that integrates all features, we observe significantly better performance than previous systems.
Second, the analysis of feature weights reveals the relative importance of the various sources of evidence and gives insight into the underlying classification problem. Interestingly, we found that the most important part of a political document is its title.
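To make the two findings more concrete, here is a minimal sketch of how scores from several sources of evidence can be combined by a learned ranker and how the learned weights can then be inspected. It is an illustration only: the feature names (`title_sim`, `body_sim`, `gloss_sim`), the toy data, and the simple pairwise perceptron-style update are all assumptions made for this example, not the features, data, or learning-to-rank algorithm used in the paper.

```python
# Toy sketch of pairwise learning to rank over per-source similarity features.
# Feature names, data, and the update rule are hypothetical, not from the paper.
FEATURES = ["title_sim", "body_sim", "gloss_sim"]

# One list per document: candidate descriptors as (feature vector, is_relevant).
# The toy data is constructed so that title similarity is the most discriminative.
TRAINING = [
    [([0.9, 0.4, 0.3], 1), ([0.2, 0.5, 0.4], 0), ([0.1, 0.3, 0.6], 0)],
    [([0.8, 0.2, 0.5], 1), ([0.3, 0.6, 0.2], 0)],
    [([0.7, 0.5, 0.1], 1), ([0.4, 0.4, 0.5], 0), ([0.2, 0.7, 0.3], 0)],
]


def score(weights, features):
    """Linear combination of the evidence scores for one candidate descriptor."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(data, epochs=50, lr=0.1, margin=1.0):
    """Push each relevant descriptor above each non-relevant one by a margin."""
    weights = [0.0] * len(FEATURES)
    for _ in range(epochs):
        for candidates in data:
            positives = [f for f, label in candidates if label == 1]
            negatives = [f for f, label in candidates if label == 0]
            for pos in positives:
                for neg in negatives:
                    if score(weights, pos) - score(weights, neg) < margin:
                        # Move the weights toward the (relevant - non-relevant) difference.
                        weights = [w + lr * (p - n)
                                   for w, p, n in zip(weights, pos, neg)]
    return weights


def rank(weights, candidates):
    """Order candidate descriptors by descending combined score."""
    return sorted(candidates, key=lambda c: score(weights, c[0]), reverse=True)


weights = train_pairwise(TRAINING)
for name, weight in zip(FEATURES, weights):
    print(f"{name}: {weight:+.2f}")
```

Reading the printed weights is a small-scale analogue of the feature-weight analysis described above: a feature whose weight is large relative to the others contributes most to the final ranking. With real data, the features would need to be normalized to comparable ranges before their weights can be compared in this way.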
The research was carried out by researchers at the University of Amsterdam: Mostafa Dehghani, Hosein Azarbonyad, Maarten Marx, and Jaap Kamps. The results were presented as a poster at the 37th European Conference on Information Retrieval, where they won the best poster award. The original paper is available here.
M. Dehghani, H. Azarbonyad, M. Marx, and J. Kamps. Sources of evidence for automatic indexing of political texts. In A. Hanbury, G. Kazai, A. Rauber, and N. Fuhr, editors, Advances in Information Retrieval, volume 9022 of Lecture Notes in Computer Science, pages 568–573. Springer International Publishing, 2015. ISBN 978-3-319-16353-6. doi: 10.1007/978-3-319-16354-3_63. URL http://dx.doi.org/10.1007/978-3-319-16354-3_63.