Dr. John R. Skoyles, June 7, 1999
E-Biomed is a challenge to rewrite history -- imagine that scientific
publishing had started off not with print but with the internet: what would
it look like? Are there traditions about how we publish articles on paper
that are compromises linked to the practical problems of using the print
medium but that are not part of good scientific communication -- traditions
that have become so ingrained in the practice of science that we fail
to see how they impair science, and that ideally should be abandoned in the
move to the electronic media?
I suggest that one such tradition exists: if science had started around the
internet, all papers would as a matter of course contain links to archived
raw and semi-analysed data. However, in spite of mentioning the advantages
of data sharing and data archiving, the E-Biomed proposal fails
adequately to challenge this tradition -- whether raw and semi-analysed data
are posted with a paper is still left to its authors' discretion. I shall argue
below that it should be both expected and compulsory -- we should not mistake
our familiarity with the omission of data for part of the scientific process
-- the traditional omission lies not in science but in the limits of print
as a medium.
Compulsory data archiving in summary would (1) make fraud easier to detect,
(2) encourage scientific criticism and (3) aid the scientific process in
general.
Experimental data only rarely need to be read -- usually we are happy with
their authors' own statistical treatment. But not always, and in such cases
access is important. Researchers do not always fully analyse their data;
sometimes editors restrict their publication space; and sometimes we have
an idea we would like to try out on those data. It would be nice if the
experimental data we read about were easy to access. Though there are
several potential problems with compulsory archiving of published data, the
benefits would, I believe, vastly outweigh them.
Here follows a case for the compulsory archiving of data. I also raise a
few objections.
First, electronic data archiving should be easy to implement and will
become increasingly so. Most researchers would have little trouble
archiving their data upon publication. Most results sections are based upon
computer-analysed ASCII data files or other standard data formats. Most
researchers should thus have their raw data stored in a form (i.e. file and
subdirectory names) which makes it easy for other researchers to use. The
commands and procedures for transferring it to a central data archive will
be familiar to most and involve no more effort than archiving papers.
Second, the scientific ethic is to make error correction as easy as
possible. Scientists are not always entirely competent or honest. Numerous
cases of fraud and intellectual dishonesty have occurred in all areas of
science. Researchers are subject to enormous pressures to publish but
unfortunately this normally requires positive findings. This puts pressure
on researchers to rerun analyses (changing criteria for categorizing data,
excluding subjects, treating missing data, etc.) when only negative
findings turn up. It is not clear how many researchers resist these
pressures on the integrity of data analysis. At present, it is difficult to
check. In a case reported in Science, two psychologists were only able to
check the data analysis of another psychologist through the intervention of
lawyers (Palca 1991).
There has been public disquiet in the US Congress (notably, on the part of
Congressman John Dingell) concerning fraud and intellectual dishonesty in
science. Research on published fraudulent papers has revealed many defects
(Stewart & Feder 1987). It is likely that any archived data would contain
even more accessible and noticeable defects (in their data distributions,
treatment and analysis). Archiving data would thus make it easier to detect
both fraud and intellectual dishonesty.
Third, much honestly obtained and analysed data is incompetently handled,
yet many legitimate criticisms never arise because of difficulties
accessing data. At present, a scientist who suspects that another
researcher's analysis gives only part of the story or is misleading faces
an involved process of contacting them for the original data (something
inconvenient to all concerned). Archiving data would increase the
opportunities for legitimate criticism of published work.
Fourth, researchers ask different questions. Sometimes a researcher may
wish to reanalyse data to answer questions the original authors ignored.
People carrying out meta-analyses will often want to check the quality of
the work they are using. At present this is not possible.
Fifth, students could gain much by examining real research papers and then
"playing around" with their data, seeing the effects of different
data-analytic strategies. They might even find things overlooked by their
authors.
Sixth, much data is accidentally lost (despite the requirement of most
journals that authors retain their data for a number of years). An archive
would make a convenient data backup.
Seventh, scientific papers are printed on paper -- this, not the nature of
science, is the reason data are not normally made accessible at this time.
Science is about open communication that maximally exposes ideas and
arguments to criticism (one legitimate criticism of an idea is the way its
data are handled). Printed paper is a convenient means for opening written
ideas to criticism, but it is unsuitable for making data accessible to
criticism (it limits the quantity which can be published and communicates
in a form that is inconvenient for computer reanalysis). Print has until
recently been the only means for disseminating scientific ideas and data.
Hence the tradition has arisen of limiting the dissemination of data. We
should recognise the opportunity that electronic archives provide for
breaking with this.
There are some reasons against archiving:
Certain classes of data (e.g., clinical data) may have to be excluded to
preserve the confidentiality and privacy of those from whom it is
collected. This constraint does not apply to large portions of research,
such as studies on animals, reaction time studies on student
subjects, or computer simulations.
Researchers certainly have the right to the "first go" at their data.
However, the fact of publication, unless contrary notice is given, usually
signifies that the data have already been substantially analysed, and
frequently no further analysis is intended.
There is another objection that is, however, entirely invalid. Many
researchers will be uncomfortable with their data being archived because
none of us are perfect. If our data can be reanalysed we may be shown to
have carried out, quite unintentionally, inappropriate or misleading
analysis. To some extent the present state of affairs is quite convenient
for hiding the fact that many researchers could be better statisticians and
could keep better records.
Data archiving as a standard part of electronic paper publication of course
would involve some cost and effort, perhaps even some inconvenience.
However, with the public and congressional concern about whether scientists
are maximally ensuring the integrity of their data, an archive would show a
commitment from the scientific community to ensuring honesty in published
research.
REFERENCES
Palca, J. (1991). News and Comment: Get-the-lead-out guru challenged.
Science 253: 842-844.
Stewart, W. W. & Feder, N. (1987). The integrity of the scientific
literature. Nature 325: 207-214.
Based upon FTP INTERNET DATA ARCHIVING: A Cousin for PSYCOLOQUY
psycoloquy.92.3.29.data-archive.1.skoyles
ftp://ftp.princeton.edu/pub/harnad/Psycoloquy/1992.volume.3/psyc.92.3.29.data-archive.1.skoyles
alternatively:
www.users.globalnet.co.uk/~skoyles/pdata.html
Dr. John R. Skoyles
London, UK
http://www.bigfoot.com/~skoyles