hdl:10101/npre.2008.2310.1

Performance of the Charniak-Lease parser on biological text using different training corpora

Alison V. Callahan (1) and Michel Dumontier (2)

  1. Faculty of Information, University of Toronto
  2. Department of Biology, Carleton University

Document Type: Manuscript
Date: Received 18 September 2008 15:39 UTC; Posted 18 September 2008
Subjects: Bioinformatics
Abstract:

Part-of-speech (POS) tagging is the first step in many NLP workflows, yet the accuracy of tag assignment frequently goes unchecked. We hypothesize that changing the training corpora for a parser will affect its POS tagging of a target corpus. To test this, we train the Charniak-Lease parser on the WSJ corpus and on two biomedical corpora, and evaluate its output against that of MedPost, a POS tagger with a reported 97% accuracy on biomedical text. Our findings indicate that using biomedical training corpora significantly improves performance, but that minor differences between the biomedical training corpora have a significant effect on the correctness of POS tagging. Specifically, the tagging of hyphenated words and verbs was affected. This work suggests that the choice of training corpora is crucial to domain-targeted NLP analysis.
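
The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes that both taggers' outputs cover the same tokenization and have already been mapped to a common tag set. The function name `tag_agreement` and the example sentence are hypothetical.

```python
from collections import Counter


def tag_agreement(reference, candidate):
    """Compare two taggings of the same tokens.

    Both arguments are lists of (token, tag) pairs. Returns the overall
    fraction of matching tags and a dict of agreement per reference tag.
    Assumes identical tokenization and a shared tag set (an assumption of
    this sketch, not something stated in the source).
    """
    if len(reference) != len(candidate):
        raise ValueError("taggings must cover the same tokens")

    total = Counter()    # occurrences of each reference tag
    matched = Counter()  # occurrences where the candidate tag agrees
    for (ref_tok, ref_tag), (cand_tok, cand_tag) in zip(reference, candidate):
        if ref_tok != cand_tok:
            raise ValueError(f"token mismatch: {ref_tok!r} vs {cand_tok!r}")
        total[ref_tag] += 1
        if ref_tag == cand_tag:
            matched[ref_tag] += 1

    overall = sum(matched.values()) / len(reference) if reference else 0.0
    per_tag = {tag: matched[tag] / count for tag, count in total.items()}
    return overall, per_tag


# Hypothetical example: the candidate tagger mislabels a hyphenated term.
reference = [("IL-2", "NN"), ("binds", "VBZ"), ("the", "DT"), ("receptor", "NN")]
candidate = [("IL-2", "JJ"), ("binds", "VBZ"), ("the", "DT"), ("receptor", "NN")]
overall, per_tag = tag_agreement(reference, candidate)
```

Breaking agreement down by reference tag makes it possible to see whether disagreements concentrate in particular categories, such as the verbs and hyphenated words mentioned in the abstract.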

Additional information

License: This document is licensed to the public under the Creative Commons Attribution 3.0 License.

How to cite this document:

Callahan, Alison and Dumontier, Michel. Performance of the Charniak-Lease parser on biological text using different training corpora. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.2310.1> (2008)

Version info:

Other versions of this document in Nature Precedings

None.

Other versions of this document elsewhere on the web

None known.
