<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:prism="http://prismstandard.org/namespaces/1.2/basic/" version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Nature Precedings - Tag feed for natural language processing</title>
    <link>http://precedings.nature.com/tags/natural%20language%20processing</link>
    <description>Recently posted documents tagged with 'natural language processing'</description>
    <dc:publisher>Nature Publishing Group</dc:publisher>
    <dc:language>en</dc:language>
    <prism:publicationName>Nature Precedings</prism:publicationName>
    <image>
      <title>Nature Precedings</title>
      <url>http://precedings.nature.com/images/header_logo.gif</url>
      <link>http://precedings.nature.com</link>
    </image>
    <atom:link type="application/rss+xml" rel="self" href="http://precedings.nature.com/tags/natural%20language%20processing/feed"/>
    <item>
      <title>Towards Context Driven Modularization of Large Biomedical Ontologies</title>
      <link>http://dx.doi.org/10.1038/npre.2009.3522.1</link>
      <description>Formal knowledge about human anatomy, radiology or diseases is necessary to support clinical applications such as medical image search. This machine-processable knowledge can be acquired from biomedical domain ontologies, which, however, are typically very large and complex models. Thus, their straightforward incorporation into software applications becomes difficult. In this paper we discuss first ideas on a statistical approach for modularizing large medical ontologies, prioritizing practical applicability. The underlying assumption is that the application-relevant ontology fragments, i.e. modules, can be identified by statistical analysis of the ontology concepts in the domain corpus. Accordingly, we argue that the most frequently occurring concepts in the domain corpus define the application context and can therefore potentially yield the relevant ontology modules. We illustrate our approach on an example case that involves a large ontology on human anatomy and report on our first manual experiments.</description>
      <guid>http://dx.doi.org/10.1038/npre.2009.3522.1</guid>
      <pubDate>Thu, 30 Jul 2009 19:39:18 UTC</pubDate>
      <dc:title>Towards Context Driven Modularization of Large Biomedical Ontologies</dc:title>
      <dc:identifier>doi:10.1038/npre.2009.3522.1</dc:identifier>
      <dc:date>2009-07-30</dc:date>
      <dc:creator>Pinar Wennerberg</dc:creator>
      <prism:publicationName>Nature Precedings</prism:publicationName>
      <prism:publicationDate>2009-07-30T19:39:18Z</prism:publicationDate>
      <prism:category>Manuscript</prism:category>
      <prism:section>Bioinformatics</prism:section>
      <media:thumbnail url="http://precedings.nature.com/documents/3522/version/1/files/npre20093522-1.pdf.thumb.png"/>
      <creativeCommons:license>http://creativecommons.org/licenses/by/3.0/</creativeCommons:license>
    </item>
    <item>
      <title>Towards Context Driven Modularization of Large Biomedical Ontologies</title>
      <link>http://dx.doi.org/10.1038/npre.2009.3523.1</link>
      <description>Formal knowledge about human anatomy, radiology or diseases is necessary to support clinical applications such as medical image search. This machine-processable knowledge can be acquired from biomedical domain ontologies, which, however, are typically very large and complex models. Thus, their straightforward incorporation into software applications becomes difficult. In this paper we discuss first ideas on a statistical approach for modularizing large medical ontologies, prioritizing practical applicability. The underlying assumption is that the application-relevant ontology fragments, i.e. modules, can be identified by statistical analysis of the ontology concepts in the domain corpus. Accordingly, we argue that the most frequently occurring concepts in the domain corpus define the application context and can therefore potentially yield the relevant ontology modules. We illustrate our approach on an example case that involves a large ontology on human anatomy and report on our first manual experiments.</description>
      <guid>http://dx.doi.org/10.1038/npre.2009.3523.1</guid>
      <pubDate>Thu, 30 Jul 2009 19:38:16 UTC</pubDate>
      <dc:title>Towards Context Driven Modularization of Large Biomedical Ontologies</dc:title>
      <dc:identifier>doi:10.1038/npre.2009.3523.1</dc:identifier>
      <dc:date>2009-07-30</dc:date>
      <dc:creator>Pinar Wennerberg</dc:creator>
      <prism:publicationName>Nature Precedings</prism:publicationName>
      <prism:publicationDate>2009-07-30T19:38:16Z</prism:publicationDate>
      <prism:category>Presentation</prism:category>
      <prism:section>Bioinformatics</prism:section>
      <media:thumbnail url="http://precedings.nature.com/documents/3523/version/1/files/npre20093523-1.pdf.thumb.png"/>
      <creativeCommons:license>http://creativecommons.org/licenses/by/3.0/</creativeCommons:license>
    </item>
    <item>
      <title>Using Textpresso for Information Retrieval, Fact Extraction</title>
      <link>http://dx.doi.org/10.1038/npre.2009.3302.1</link>
      <description>Ten years ago WormBase started as a repository for sequence data for the model organism Caenorhabditis elegans and has since striven to include the curation of all genetic and molecular data published for this nematode. With a publication rate in the C. elegans field of approximately 800 papers per year, WormBase (WB) has the opportunity to include information from every paper published. Currently there are ~11,000 full text research papers (mid-1970s to the present) downloaded into the WB curation database, from which over 27 data types (e.g. genetic interactions, transgene objects, gene expression patterns and mutant phenotypes) are extracted by curators. Textpresso is an open source text-mining tool capable of rapid searches for keywords, as well as concepts, from the full text of research papers. Curators at WB use Textpresso on a daily basis for many aspects of literature curation, from simple keyword searches to semi- or fully automated entity and fact extraction, which feed into curation pipelines or directly into the curation database itself. In addition, Textpresso greatly aids prioritization of literature curation by retrieving papers based on their full contents rather than solely on their abstracts. Such retrievable contents can range from the very particular (such as a gene simply being mentioned in the Materials and Methods section of a paper) to the complex (such as molecular functions that involve cellular components). As WB expands to incorporate the genomes of other nematodes, we will be working with Textpresso developers to set up a library of literature for related nematodes. We expect Textpresso to be crucial for most efficiently directing our efforts in literature curation, and for most quickly providing data to users searching the literature.
In this workshop we will show how we use Textpresso in our curation pipeline to help with literature queries, to prioritize our workflow, and to automate data and fact extraction.</description>
      <guid>http://dx.doi.org/10.1038/npre.2009.3302.1</guid>
      <pubDate>Tue, 02 Jun 2009 14:49:35 UTC</pubDate>
      <dc:title>Using Textpresso for Information Retrieval, Fact Extraction</dc:title>
      <dc:identifier>doi:10.1038/npre.2009.3302.1</dc:identifier>
      <dc:date>2009-06-02</dc:date>
      <dc:creator>Kimberly Van Auken</dc:creator>
      <prism:publicationName>Nature Precedings</prism:publicationName>
      <prism:publicationDate>2009-06-02T14:49:35Z</prism:publicationDate>
      <prism:category>Presentation</prism:category>
      <prism:section>Genetics &amp; Genomics</prism:section>
      <prism:section>Bioinformatics</prism:section>
      <media:thumbnail url="http://precedings.nature.com/documents/3302/version/1/files/npre20093302-1.pdf.thumb.png"/>
      <creativeCommons:license>http://creativecommons.org/licenses/by/3.0/</creativeCommons:license>
    </item>
    <item>
      <title>UIMA in the Biocuration Workflow: A coherent framework for cooperation between biologists and computational linguists</title>
      <link>http://dx.doi.org/10.1038/npre.2009.3171.1</link>
      <description>As collaborating partners, Barcelona Media Innovation Centre and GRIB (Universitat Pompeu Fabra) seek to combine strengths from Computational Linguistics and Biomedicine to produce a robust Text Mining system to generate data that will help biocurators in their daily work. The first version of this system will focus on the discovery of relationships between genes, SNPs (Single Nucleotide Polymorphisms) and diseases from the literature. A first challenge that we faced during the setup of this project is that most current tools that support the curation workflow are complex, ad hoc applications, which sometimes hinder interoperability and the sharing of results between research groups from different and unrelated fields of expertise. Often, biologists (even computer-savvy ones) are hard pressed to use and adapt sophisticated Natural Language Processing systems, and computational linguists are challenged by the intricacies of biology in applying their processing pipelines to elicit knowledge from texts. The flow of knowledge (needed to develop a usable, practical tool) to and from the parties involved in the development of such systems is not always easy or straightforward. The modular and versatile architecture of UIMA (Unstructured Information Management Architecture) provides a framework to address these challenges. UIMA is a component architecture and software framework implementation (including a UIMA SDK) to develop applications that analyse large volumes of unstructured information, and has been increasingly adopted by a significant part of the BioNLP community that needs industrial-grade and robust applications to exploit the whole bibliome. The use of UIMA to develop Text Mining applications useful for curation purposes allows the combination of diverse expertise that is beyond the individual know-how of biologists, computer scientists or linguists in isolation.
A good synergy and circulation of knowledge between these experts is fundamental to the development of a successful curation tool.</description>
      <guid>http://dx.doi.org/10.1038/npre.2009.3171.1</guid>
      <pubDate>Fri, 24 Apr 2009 22:20:13 UTC</pubDate>
      <dc:title>UIMA in the Biocuration Workflow: A coherent framework for cooperation between biologists and computational linguists</dc:title>
      <dc:identifier>doi:10.1038/npre.2009.3171.1</dc:identifier>
      <dc:date>2009-04-24</dc:date>
      <dc:creator>Laura Ines Furlong</dc:creator>
      <prism:publicationName>Nature Precedings</prism:publicationName>
      <prism:publicationDate>2009-04-24T22:20:13Z</prism:publicationDate>
      <prism:category>Poster</prism:category>
      <prism:section>Bioinformatics</prism:section>
      <media:thumbnail url="http://precedings.nature.com/documents/3171/version/1/files/npre20093171-1.pdf.thumb.png"/>
      <creativeCommons:license>http://creativecommons.org/licenses/by/3.0/</creativeCommons:license>
    </item>
    <item>
      <title>Performance of the Charniak-Lease parser on biological text using different training corpora</title>
      <link>http://precedings.nature.com/documents/2310/version/1</link>
      <description>POS tagging is used as the first step in many NLP workflows, although the accuracy of tag assignment frequently goes unchecked. We hypothesize that changing the training corpora for a parser will affect its POS tagging of a target corpus. To this end we train the Charniak-Lease parser on the WSJ corpus and two biomedical corpora and evaluate its output against that of MedPost, a POS tagger with a reported 97% accuracy on biomedical text. Our findings indicate that using biomedical training corpora significantly improves performance, but that minor differences in the biomedical training corpora have a significant effect on the correctness of POS tagging. Specifically, the tagging of hyphenated words and verbs was affected. This work suggests that the choice of training corpora is crucial to domain-targeted NLP analysis.</description>
      <guid>http://precedings.nature.com/documents/2310/version/1</guid>
      <pubDate>Thu, 18 Sep 2008 15:47:19 UTC</pubDate>
      <dc:title>Performance of the Charniak-Lease parser on biological text using different training corpora</dc:title>
      <dc:identifier>hdl:10101/npre.2008.2310.1</dc:identifier>
      <dc:date>2008-09-18</dc:date>
      <dc:creator>Alison V. Callahan</dc:creator>
      <prism:publicationName>Nature Precedings</prism:publicationName>
      <prism:publicationDate>2008-09-18T15:47:19Z</prism:publicationDate>
      <prism:category>Manuscript</prism:category>
      <prism:section>Bioinformatics</prism:section>
      <media:thumbnail url="http://precedings.nature.com/documents/2310/version/1/files/npre20082310-1.pdf.thumb.png"/>
      <creativeCommons:license>http://creativecommons.org/licenses/by/3.0/</creativeCommons:license>
    </item>
    <item>
      <title>Identifying Data Sharing in Biomedical Literature</title>
      <link>http://precedings.nature.com/documents/1721/version/2</link>
      <description>Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to find shared datasets: using NLP techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.</description>
      <guid>http://precedings.nature.com/documents/1721/version/2</guid>
      <pubDate>Mon, 04 Aug 2008 20:32:00 UTC</pubDate>
      <dc:title>Identifying Data Sharing in Biomedical Literature</dc:title>
      <dc:identifier>hdl:10101/npre.2008.1721.2</dc:identifier>
      <dc:date>2008-08-04</dc:date>
      <dc:creator>Heather Piwowar</dc:creator>
      <prism:publicationName>Nature Precedings</prism:publicationName>
      <prism:publicationDate>2008-08-04T20:32:00Z</prism:publicationDate>
      <prism:category>Manuscript</prism:category>
      <prism:section>Bioinformatics</prism:section>
      <media:thumbnail url="http://precedings.nature.com/documents/1721/version/2/files/npre20081721-2.pdf.thumb.png"/>
      <creativeCommons:license>http://creativecommons.org/licenses/by/3.0/</creativeCommons:license>
    </item>
    <item>
      <title>Identifying Data Sharing in Biomedical Literature</title>
      <link>http://precedings.nature.com/documents/1721/version/1</link>
      <description>Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared in such a variety of mechanisms and locations. We propose a novel approach to finding shared datasets: using NLP techniques to identify declarations of dataset sharing within the full text of primary research articles. Using regular expression patterns and machine learning algorithms on open access biomedical literature, our system was able to identify 61% of articles with shared datasets with 80% precision. A simpler version of our classifier achieved higher recall (86%), though lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope to inspire further study of dataset retrieval techniques and policy evaluation.</description>
      <guid>http://precedings.nature.com/documents/1721/version/1</guid>
      <pubDate>Tue, 25 Mar 2008 21:14:35 UTC</pubDate>
      <dc:title>Identifying Data Sharing in Biomedical Literature</dc:title>
      <dc:identifier>hdl:10101/npre.2008.1721.1</dc:identifier>
      <dc:date>2008-03-25</dc:date>
      <dc:creator>Heather Piwowar</dc:creator>
      <prism:publicationName>Nature Precedings</prism:publicationName>
      <prism:publicationDate>2008-03-25T21:14:35Z</prism:publicationDate>
      <prism:category>Manuscript</prism:category>
      <prism:section>Bioinformatics</prism:section>
      <media:thumbnail url="http://precedings.nature.com/documents/1721/version/1/files/npre20081721-1.pdf.thumb.png"/>
      <creativeCommons:license>http://creativecommons.org/licenses/by/3.0/</creativeCommons:license>
    </item>
  </channel>
</rss>
