Performance of the Charniak-Lease parser on biological text using different training corpora
- Faculty of Information, University of Toronto
- Department of Biology, Carleton University
- Document Type: Manuscript
- Date: Received 18 September 2008 15:39 UTC; Posted 18 September 2008
- Subjects: Bioinformatics
- Abstract:
Part-of-speech (POS) tagging is the first step in many NLP workflows, yet the accuracy of tag assignment frequently goes unchecked. We hypothesize that changing the training corpora for a parser will affect its POS tagging of a target corpus. To test this, we train the Charniak-Lease parser on the WSJ corpus and on two biomedical corpora, and evaluate its output against MedPost, a POS tagger with a reported 97% accuracy on biomedical text. Our findings indicate that training on biomedical corpora significantly improves performance, and that even minor differences between the biomedical training corpora significantly affect the correctness of POS tagging. Specifically, the tagging of hyphenated words and verbs was affected. This work suggests that the choice of training corpora is crucial to domain-targeted NLP analysis.
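The comparison described in the abstract can be illustrated with a minimal sketch. It assumes that the parser output and the reference tagger output are already aligned token for token and mapped to the same tagset; the function name `tagging_agreement` and the toy tag sequences are hypothetical and are not taken from the manuscript.

```python
from collections import Counter


def tagging_agreement(parser_tags, reference_tags):
    """Return (agreement rate, Counter of (reference_tag, parser_tag) mismatches).

    Assumption: both sequences cover the same tokens in the same order
    and use the same tagset.
    """
    if len(parser_tags) != len(reference_tags):
        raise ValueError("tag sequences must be aligned token for token")
    if not reference_tags:
        raise ValueError("tag sequences must not be empty")
    # Count each disagreement, keyed by (reference tag, parser tag).
    mismatches = Counter(
        (ref, hyp)
        for hyp, ref in zip(parser_tags, reference_tags)
        if hyp != ref
    )
    agreement = 1 - sum(mismatches.values()) / len(reference_tags)
    return agreement, mismatches


# Hypothetical toy data: one verb mistagged as a plural noun.
reference = ["NN", "VBZ", "JJ", "NN", "."]
parser_out = ["NN", "NNS", "JJ", "NN", "."]
agreement, mismatches = tagging_agreement(parser_out, reference)
```

Grouping the mismatches by tag pair is one way to surface category-specific effects, such as the differences in verb tagging that the abstract reports.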
Additional information
- License: This document is licensed to the public under the Creative Commons Attribution 3.0 License
- How to cite this document: Callahan, Alison and Dumontier, Michel. Performance of the Charniak-Lease parser on biological text using different training corpora. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.2310.1> (2008)
- Version info: Other versions of this document in Nature Precedings: None. Other versions of this document elsewhere on the web: None known.