[Chart: Odds Ratios by Disease. Axis: Odds Ratio, 0.0 to 10.0. MeSH terms shown: Musculoskeletal Diseases; Nervous System Diseases; Nutritional and Metabolic Diseases; Prostate Cancer; Breast Cancer; Colon Cancer; Lung Cancer; Leukemia/Lymphoma; All Neoplasms; All Diseases.]
[Chart: Odds Ratios by Area of Information Science. Axis: Odds Ratio, 0.0 to 10.0. MeSH terms shown: Image Processing, Computer-Assisted; User-Computer Interface; Database Management Systems; Vocabulary, Controlled; Computer Graphics; Software; Databases, Genetic; Information Storage and Retrieval; Computer Simulation; Software Validation; Databases, Protein; Pattern Recognition, Automated; Artificial Intelligence.]
Examining the uses of shared data
Heather A. Piwowar and Douglas B. Fridsma
Does your research area re-use shared datasets?
· Re-using data has many benefits, including research synergy and efficient resource use
· Some research areas have tools, communities, and practices which facilitate re-use
· Identifying these areas will allow us to learn from them, and apply the lessons to areas which underutilize the sharing and re-purposing of scientific data between investigators
Hope
Although not all research topics can be addressed by
re-using existing data, many can. Identifying areas
with frequent re-use can highlight best practices to
be used when developing research agendas, tools,
standards, repositories, and communities in areas
which have yet to receive major benefits from shared
data.
Future Work
We plan to refine our tool for identifying studies
which re-use data, and continue studying and
measuring re-use and reusability.
Acknowledgements
We sincerely thank our funders:
· USA NLM for a training grant
· USA NSF for a travel grant
For Further Information
Please contact
hpiwowar@alumni.pitt.edu
Poster will be available at Nature Precedings.
FIGURE 1: Documents with re-used data have a different MeSH distribution than those with original data.
Which datasets?
This preliminary analysis examines the re-use of microarray gene expression datasets.
Thousands of microarray gene expression datasets have been deposited in publicly available databases.
Many studies re-use these data, but the purposes of that re-use are not well understood.
Here, we examined all publications found in PubMed Central on April 1, 2007 whose full text contained the
phrases "microarray" and "gene expression" to find studies which re-used microarray data.
How did we identify re-use?
We developed prototype machine-learning classifiers to identify a) studies containing original microarray data
(n=900) and b) studies which instead re-used microarray data (n=250). Preprocessing (Python NLTK)
extracted manually-selected keyword frequencies from the full-text publications as features for a Support
Vector Machine (SVMlite). The classifier was trained and tested on a manually-labeled set of documents
(PLoS articles prior to January 2007 containing the word "microarray," n=200).
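The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the keyword list is hypothetical (the poster does not list the keywords it used), tokenization uses a plain regular expression rather than NLTK, and the sparse `label index:value` output line is an assumption about the kind of input a Support Vector Machine tool would read.

```python
import re
from collections import Counter

# Hypothetical keyword list; the poster does not publish the keywords it used.
KEYWORDS = ["microarray", "hybridized", "downloaded", "reanalysis", "geo"]

def keyword_frequencies(full_text, keywords=KEYWORDS):
    """Return one relative frequency per keyword (occurrences / total tokens)."""
    tokens = re.findall(r"[a-z0-9]+", full_text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[k] / total for k in keywords]

def to_sparse_line(label, features):
    """Format a labeled vector as 'label index:value ...' (1-based, zeros omitted).

    The format is an assumption for illustration; label is +1 or -1,
    e.g. +1 for re-used data and -1 for original data.
    """
    pairs = [f"{i}:{v:.6f}" for i, v in enumerate(features, start=1) if v != 0]
    return " ".join([f"{label:+d}"] + pairs)

if __name__ == "__main__":
    text = "We downloaded microarray data. Microarray reanalysis."
    print(to_sparse_line(1, keyword_frequencies(text)))
```

Each manually-labeled document would contribute one such line to the training file; the classifier itself is left to whichever SVM package is used.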
How did we identify patterns of re-use?
We compared the Medical Subject Headings (MeSH) of the two classes. For each MeSH term, we estimated the odds
that the term describes a study with re-used microarray data, relative to the odds that the same term
describes a study with original microarray data. Terms were truncated to comparable levels in the MeSH hierarchy.
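As a worked illustration of the comparison, the sketch below computes an odds ratio for one MeSH term, using the direction given in the figure legend (odds given re-used data over odds given original data). The class sizes (250 and 900) come from the poster; the per-term counts and the tree number are hypothetical, and truncating dot-separated MeSH tree numbers to a fixed depth is only one plausible reading of "comparable levels".

```python
def truncate_tree_number(tree_number, depth):
    """Keep the first `depth` dot-separated levels of a MeSH tree number."""
    return ".".join(tree_number.split(".")[:depth])

def odds(with_term, total):
    """Odds that a publication in a class carries the term.

    Undefined (division by zero) when every publication carries the term.
    """
    return with_term / (total - with_term)

def odds_ratio(reused_with_term, reused_total, original_with_term, original_total):
    """Odds of the term given re-used data over odds given original data."""
    return odds(reused_with_term, reused_total) / odds(original_with_term, original_total)

if __name__ == "__main__":
    # Illustrative tree number, truncated to two levels.
    print(truncate_tree_number("C04.588.180", 2))
    # Hypothetical counts: 50 of 250 re-use studies and 90 of 900
    # original-data studies carry the term.
    print(odds_ratio(50, 250, 90, 900))
```

An odds ratio above 1 means the term is relatively more common among publications with re-used data; below 1, among publications with original data.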
Results
· Publications with original vs. re-used microarray data have different distributions of MeSH terms
(Figure 1), and occur in different proportions across various journals (Figure 2).
· Microarray data source (original vs. re-used) did not affect the odds of a study focusing on humans, mice, or
invertebrates. Compared to publications with original data, however, publications with re-used data included a
relatively high proportion of studies involving fungi (odds ratio (OR)=2.4), and a relatively low proportion
involving rats, bacteria, viruses, plants, or genetically-altered or inbred animals (OR<0.5).
· Trends in odds ratios of MeSH terms for other attributes can be seen in Figure 3.
FIGURE 2: As expected, journals with a bioinformatics focus published the highest proportion of studies with re-used microarray data.
[Chart: Percent of microarray studies with re-used data by Journal. Axis: (Number of microarray studies with re-used data)/(All microarray studies), 0% to 40%. Journals shown: Mol Cancer; Environ Health Perspect; Arthritis Res Ther; Reprod Biol Endocrinol; BMC Microbiol; BMC Dev Biol; BMC Biotechnol; BMC Mol Biol; BMC Neurosci; Respir Res; BMC Immunol; Retrovirology; Breast Cancer Res; BMC Cancer; BMC Genomics; PLoS Genet; Nucleic Acids Res; BMC Evol Biol; Genome Biol; PLoS Biol; PLoS Comput Biol; BMC Bioinformatics.]
[Chart: Publication MeSH vectors in PCA space. X-axis: First Principal Component. Y-axis: Second Principal Component. Legend: Publications with original data; Publications which re-used data.]
[Chart: Odds Ratios of MeSH term occurrence given studies with re-used vs. original data, by Organism. Axis: Odds Ratio, 0.0 to 10.0 (odds of MeSH term given all publications with re-used microarray data over odds of MeSH term given all publications with original microarray data). MeSH terms shown: Plants; Fungi; Viruses; Bacteria; Invertebrates; Animal Population Groups; Rats; Mice; Humans; All Chordata.]
[Chart: Odds Ratios by Biological Phenomenon. Axis: Odds Ratio, 0.0 to 10.0. MeSH terms shown: Cell Proliferation; Apoptosis; Reproduction; Up-Regulation; Cell Differentiation; Pharmacology; Protein Biosynthesis; Transcription, Genetic; Down-Regulation; Proteomics; Metabolism; Signal Transduction; Growth/Development; Mutagenesis; Species Specificity; Organ Specificity; Phylogeny; Genomics; Promoter Regions; Cell Cycle; Environmental Sciences; Computational Biology; Binding Sites; Evolution.]
FIGURE 3: The fold-change in odds that a specific MeSH term will describe a publication with re-used microarray data as compared to a publication with original data.
[Chart: Odds Ratios by Biological Methods. Axis: Odds Ratio, 0.0 to 10.0. MeSH terms shown: Reverse Transcriptase Polymerase Chain Reaction; Polymerase Chain Reaction; Nucleic Acid Hybridization; Chromosome Mapping; Sequence Analysis, DNA; Sequence Alignment; All Genetic Techniques; Protein Interaction Mapping; Clinical Lab Techniques; All Non-Statistical Investigative Techniques.]
[Chart: Odds Ratios by Statistical Methods. Axis: Odds Ratio, 0.0 to 10.0. MeSH terms shown: Reproducibility of Results; Survival Analysis; Analysis of Variance; Systems Integration; Principal Component Analysis; Cluster Analysis; Bayes Theorem; Sensitivity and Specificity; Models; Multivariate Analysis; Algorithms; Predictive Value of Tests; Sample Size; All Statistical Methods.]
University of Pittsburgh
Department of Biomedical Informatics
Nature Precedings : doi:10.1038/npre.2007.425.3 : Posted 18 Jul 2007