The most recent version of this document (v2) was posted on 2008 August 04.

hdl:10101/npre.2008.1721.1

Identifying Data Sharing in Biomedical Literature

Heather Piwowar (1) & Wendy W. Chapman (1)

  1. University of Pittsburgh
Document Type: Manuscript
Date: Received 25 March 2008 16:22 UTC; Posted 25 March 2008
Subjects: Bioinformatics
Abstract:

Many policies and projects now encourage investigators to share their raw research data with other scientists. Unfortunately, it is difficult to measure the effectiveness of these initiatives because data can be shared through so many different mechanisms and in so many different locations. We propose a novel approach to finding shared datasets: using natural language processing (NLP) techniques to identify declarations of dataset sharing within the full text of primary research articles. Applying regular expression patterns and machine learning algorithms to open access biomedical literature, our system identified 61% of articles with shared datasets (recall) at 80% precision. A simpler version of our classifier achieved higher recall (86%) but lower precision (49%). We believe our results demonstrate the feasibility of this approach and hope they inspire further study of dataset retrieval techniques and policy evaluation.
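As a rough illustration of the approach described in the abstract, the sketch below flags sentences that appear to declare data sharing and then scores the predictions with precision and recall. The patterns, the function names `declares_sharing` and `precision_recall`, and the four example sentences are all hypothetical and chosen for illustration; they are not the authors' expressions or data, and the machine learning step is omitted.

```python
import re

# Hypothetical patterns for illustration only; not the authors' actual expressions.
SHARING_PATTERNS = [
    re.compile(r"\b(?:deposited|submitted)\s+(?:in|to|into|at)\b", re.IGNORECASE),
    re.compile(r"\baccession\s+(?:numbers?|codes?)\b", re.IGNORECASE),
    re.compile(r"\bdata\s+(?:are|is|were)\s+available\b", re.IGNORECASE),
]


def declares_sharing(text: str) -> bool:
    """Return True if any pattern matches the text."""
    return any(pattern.search(text) for pattern in SHARING_PATTERNS)


def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of booleans."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up sentences with made-up gold labels (True = dataset was shared).
labeled = [
    ("The microarray data were deposited in a public repository "
     "under accession number X123.", True),
    ("Raw data are available from the authors upon request.", False),
    ("We measured expression levels in twelve samples.", False),
    ("All datasets were submitted to an online archive.", True),
]

predicted = [declares_sharing(text) for text, _ in labeled]
actual = [label for _, label in labeled]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On this toy input the broad "data are available" pattern catches every true declaration but also flags the "upon request" sentence, so recall is high while precision drops. That is the same kind of trade-off the abstract reports between the two classifier versions, though the numbers here say nothing about the real system.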

Additional information

License:
This document is licensed to the public under the Creative Commons Attribution 3.0 License
How to cite this document:

Piwowar, Heather and Chapman, Wendy. Identifying Data Sharing in Biomedical Literature. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1721.1> (2008)

Version info:

Other versions of this document in Nature Precedings

Version number    Document title    Date
v2                                  Posted 04 August 2008

Other versions of this document elsewhere on the web

None known.
