Document information

doi:10.1038/npre.2007.425.3
Examining the uses of shared data

Heather A. Piwowar1 & Douglas B. Fridsma1

  1. University of Pittsburgh
Document Type: Poster
Date: Received 18 July 2007 15:15 UTC; Posted 18 July 2007
Subjects: Bioinformatics
Abstract:

Does your research area re-use shared datasets?
* Re-using data has many benefits, including research synergy and efficient resource use
* Some research areas have tools, communities, and practices which facilitate re-use
* Identifying these areas will let us learn from them and apply the lessons to areas that underuse the sharing and re-purposing of scientific data between investigators

Which datasets?
This preliminary analysis examines the re-use of microarray gene expression datasets. Thousands of microarray gene expression datasets have been deposited in publicly available databases.
Many studies re-use these data, but the purposes of that re-use are not well understood. Here, we examined all publications found in PubMed Central on April 1, 2007 whose full text contained both of the phrases “microarray” and “gene expression” to find studies that re-used microarray data.
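The full-text filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the article identifiers, and the sample texts are all hypothetical, and it assumes matching is case-insensitive and tolerant of line wrapping.

```python
def mentions_all_phrases(full_text, phrases=("microarray", "gene expression")):
    """Return True when every phrase occurs in the text.

    Assumption: matching ignores case and treats any run of whitespace
    (including line breaks) as a single space.
    """
    normalized = " ".join(full_text.lower().split())
    return all(phrase in normalized for phrase in phrases)


# Hypothetical articles keyed by made-up identifiers.
articles = {
    "PMC-A": "We profiled gene\nexpression with an Affymetrix microarray.",
    "PMC-B": "Gene expression was measured by RT-PCR.",
}

# Keep only the articles that contain both phrases.
candidates = [pmcid for pmcid, text in articles.items() if mentions_all_phrases(text)]
```

Only the first sample article survives the filter, because the second never mentions "microarray".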

How did we identify re-use?
We developed prototype machine-learning classifiers to identify (a) studies containing original microarray data (n=900) and (b) studies that instead re-used microarray data (n=250). Preprocessing (Python NLTK) extracted the frequencies of manually selected keywords from the full-text publications as features for a Support Vector Machine (SVMlight). The classifier was trained and tested on a manually labeled set of documents (PLoS articles prior to January 2007 containing the word “microarray,” n=200).
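As a rough sketch of the preprocessing step, the code below turns one document into keyword-frequency features and writes them in the sparse `label index:value` text format read by Joachims' SVM-light package, assuming that is the SVM tool meant here. The keyword list is hypothetical (the poster does not name the selected keywords), the tokenizer is a plain regular expression rather than NLTK, and treating +1 as "re-used data" is an arbitrary choice for illustration.

```python
import re
from collections import Counter

# Hypothetical keywords; the poster does not list the ones actually selected.
KEYWORDS = ["hybridized", "downloaded", "reanalyzed", "deposited"]


def keyword_frequencies(full_text, keywords=KEYWORDS):
    """Return each keyword's count divided by the total number of tokens."""
    tokens = re.findall(r"[a-z]+", full_text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty text
    return [counts[keyword] / total for keyword in keywords]


def to_svmlight_line(label, features):
    """Format one example as '<label> <index>:<value> ...'.

    Feature indices are 1-based and increasing; zero-valued features are omitted.
    """
    pairs = [f"{i}:{v:.6f}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:+d}"] + pairs)


# Assumption for illustration: +1 marks a study that re-used data.
text = "We downloaded the data and reanalyzed the downloaded arrays"
line = to_svmlight_line(+1, keyword_frequencies(text))
```

One such line per labeled document would make up a training file; training and evaluation themselves are left to the SVM tool.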

How did we identify patterns of re-use?
We compared the Medical Subject Headings (MeSH) terms assigned to the two classes. For each MeSH term, we estimated the odds that the term described a study with original microarray data and compared them to the odds that the same term described a study with re-used data. Terms were truncated to comparable levels in the MeSH hierarchy.
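The comparison can be illustrated with a short sketch. It assumes that terms are truncated by keeping the leading components of their dot-separated MeSH tree numbers, and that the ratio is taken as re-used over original, which matches how the results describe a ratio above 1 as a relatively high proportion among re-use studies. The function names, the tree number, and all counts are invented for illustration and are not the poster's data.

```python
def truncate_tree_number(tree_number, depth=2):
    """Keep the first `depth` dot-separated levels of a MeSH tree number."""
    return ".".join(tree_number.split(".")[:depth])


def odds_ratio(reused_with, reused_total, original_with, original_total):
    """Odds of a term among re-use studies divided by its odds among original-data studies.

    Raises ZeroDivisionError if a term appears in all or none of a class;
    a real analysis would need to handle such sparse counts explicitly.
    """
    reused_odds = reused_with / (reused_total - reused_with)
    original_odds = original_with / (original_total - original_with)
    return reused_odds / original_odds


# Illustrative tree number, truncated to two levels.
short_term = truncate_tree_number("B01.050.150.900", depth=2)

# Invented counts: the term describes 20 of 100 re-use studies
# and 40 of 400 original-data studies.
example_or = odds_ratio(20, 100, 40, 400)
```

With these made-up counts the term's odds are 0.25 among re-use studies and about 0.11 among original-data studies, so the ratio comes out above 1.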

Results
Publications with original vs. re-used microarray data have different distributions of MeSH terms (Figure 1), and occur in different proportions across various journals (Figure 2).
Microarray data source (original vs. re-used) did not affect the odds of a study focusing on humans, mice, or invertebrates. Compared to publications with original data, however, publications with re-used data included a relatively high proportion of studies involving fungi (odds ratio (OR)=2.4) and a relatively low proportion involving rats, bacteria, viruses, plants, or genetically-altered or inbred animals (OR<0.5).
Trends in odds ratios of MeSH terms for other attributes can be seen in Figure 3.

Hope
Although not all research topics can be addressed by re-using existing data, many can. Identifying areas with frequent re-use can highlight best practices to be used when developing research agendas, tools, standards, repositories, and communities in areas which have yet to receive major benefits from shared data.

Future Work
We plan to refine our tool for identifying studies that re-use data, and to continue studying and measuring re-use and reusability.

Presented at:
ISMB 2007, 22 July 2007

Additional information

License:
This document is licensed to the public under the Creative Commons Attribution 2.5 License
How to cite this document:

Piwowar, Heather and Fridsma, Douglas. Examining the uses of shared data. Available from Nature Precedings <http://dx.doi.org/10.1038/npre.2007.425.3> (2007)

Version info:

Other versions of this document in Nature Precedings

Version number Document title Date
v2 Posted 17 July 2007
v1 Posted 17 July 2007

Other versions of this document elsewhere on the web

None known.
