The R Series
Statistics

In computational science, reproducibility requires that researchers make code and data available to others so that the data can be analyzed in a similar manner as in the original publication. Code must be available to be distributed, data must be accessible in a readable format, and a platform must be available for widely distributing the data and code. In addition, both data and code need to be licensed permissively enough so that others can reproduce the work without a substantial legal burden.

Implementing Reproducible Research covers many of the elements necessary for conducting and distributing reproducible research. It explains how to accurately reproduce a scientific result.

Divided into three parts, the book discusses the tools, practices, and dissemination platforms for ensuring reproducibility in computational science. It describes:
• Computational tools, such as Sweave, knitr, VisTrails, Sumatra, CDE, and the Declaratron system
• Open source practices, good programming practices, trends in open science, and the role of cloud computing in reproducible research
• Software and methodological platforms, including open source software packages, the RunMyCode platform, and open access journals
Each part presents contributions from leaders who have developed software and other products that have advanced the field. These innovators explore the use of reproducible research in bioinformatics and large-scale data analyses and offer guidelines on best practices and legal issues, including recommendations of the Reproducible Research Standard.

Edited by
Victoria Stodden
Friedrich Leisch
Roger D. Peng
The R Series
Implementing
Reproducible
Research
Edited by
Victoria Stodden
Columbia University
New York, New York, USA
Friedrich Leisch
University of Natural Resources and Life Sciences
Institute of Applied Statistics and Computing
Vienna, Austria
Roger D. Peng
Johns Hopkins University
Baltimore, Maryland, USA
Chapman & Hall/CRC
The R Series
Series Editors
John M. Chambers, Department of Statistics, Stanford University, Stanford, California, USA
Torsten Hothorn, Division of Biostatistics, University of Zurich, Switzerland
Duncan Temple Lang, Department of Statistics, University of California, Davis, Davis, California, USA
Hadley Wickham, Department of Statistics, Rice University, Houston, Texas, USA
Aims and Scope
This book series reflects the recent rapid growth in the development and application
of R, the programming language and software environment for statistical computing
and graphics. R is now widely used in academic research, education, and industry.
It is constantly growing, with new versions of the core software released regularly
and more than 4,000 packages available. It is difficult for the documentation to
keep pace with the expansion of the software, and this vital book series provides a
forum for the publication of books covering many aspects of the development and
application of R.
The scope of the series is wide, covering three main threads:
• Applications of R to specific disciplines such as biology, epidemiology,
genetics, engineering, finance, and the social sciences.
• Using R for the study of topics of statistical methodology, such as linear and
mixed modeling, time series, Bayesian methods, and missing data.
• The development of R, including programming, building packages, and
graphics.
The books will appeal to programmers and developers of R software, as well as
applied statisticians and data analysts in many fields. The books will feature
detailed worked examples and R code fully integrated into the text, ensuring their
usefulness to researchers, practitioners and students.
Published Titles
Analyzing Baseball Data with R, Max Marchi and Jim Albert
Customer and Business Analytics: Applied Data Mining for Business Decision
Making Using R, Daniel S. Putler and Robert E. Krider
Dynamic Documents with R and knitr, Yihui Xie
Event History Analysis with R, Göran Broström
Implementing Reproducible Research, Victoria Stodden,
Friedrich Leisch, and Roger D. Peng
Programming Graphical User Interfaces with R, Michael F. Lawrence and
John Verzani
R Graphics, Second Edition, Paul Murrell
Reproducible Research with R and RStudio, Christopher Gandrud
Statistical Computing in C++ and R, Randall L. Eubank and Ana Kupresanin
MATLAB® is a trademark of The MathWorks, Inc. and is used with permission. The MathWorks does not warrant the accuracy of the text or exercises in this book. This book’s use or discussion of MATLAB® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® software.
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2014 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20131025
International Standard Book Number-13: 978-1-4665-6160-1 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Preface................................................................................ vii
Acknowledgment ................................................................. xiii
Editors.................................................................................xv
Contributors........................................................................xvii
Part I Tools

1. knitr: A Comprehensive Tool for Reproducible Research in R .......... 3
   Yihui Xie
2. Reproducibility Using VisTrails .......... 33
   Juliana Freire, David Koop, Fernando Chirigati, and Cláudio T. Silva
3. Sumatra: A Toolkit for Reproducible Research .......... 57
   Andrew P. Davison, Michele Mattioni, Dmitry Samarkanov, and Bartosz Teleńczuk
4. CDE: Automatically Package and Reproduce Computational Experiments .......... 79
   Philip J. Guo
5. Reproducible Physical Science and the Declaratron .......... 113
   Peter Murray-Rust and Dave Murray-Rust

Part II Practices and Guidelines

6. Developing Open-Source Scientific Practice .......... 149
   K. Jarrod Millman and Fernando Pérez
7. Reproducible Bioinformatics Research for Biologists .......... 185
   Likit Preeyanon, Alexis Black Pyrkosz, and C. Titus Brown
8. Reproducible Research for Large-Scale Data Analysis .......... 219
   Holger Hoefling and Anthony Rossini
9. Practicing Open Science .......... 241
   Luis Ibanez, William J. Schroeder, and Marcus D. Hanwell
10. Reproducibility, Virtual Appliances, and Cloud Computing .......... 281
    Bill Howe
11. The Reproducibility Project: A Model of Large-Scale Collaboration for Empirical Research on Reproducibility .......... 299
    Open Science Collaboration
12. What Computational Scientists Need to Know about Intellectual Property Law: A Primer .......... 325
    Victoria Stodden

Part III Platforms

13. Open Science in Machine Learning .......... 343
    Mikio L. Braun and Cheng Soon Ong
14. RunMyCode.org: A Research-Reproducibility Tool for Computational Sciences .......... 367
    Christophe Hurlin, Christophe Pérignon, and Victoria Stodden
15. Open Science and the Role of Publishers in Reproducible Research .......... 383
    Iain Hrynaszkiewicz, Peter Li, and Scott Edmunds

Index .......... 419
Preface
Science moves forward when discoveries are replicated and reproduced. In general, the more frequently a given relationship is observed by independent scientists, the more trust we have that such a relationship truly exists in nature. Replication, the practice of independently implementing scientific experiments to validate specific findings, is the cornerstone of discovering scientific truth. Related to replication is reproducibility, which is the calculation of quantitative scientific results by independent scientists using the original datasets and methods. Reproducibility can be thought of as a different standard of validity from replication because it forgoes independent data collection and uses the methods and data collected by the original investigator (Peng et al., 2006). Reproducibility has become an important issue for more recent research due to advances in technology and the rapid spread of computational methods across the research landscape.

Much has been written about the rise of computational science and the complications computing brings to the traditional practice of science (Bailey et al. 2013; Birney et al. 2009; Donoho et al. 2009; Peng 2011; Stodden 2012; Stodden et al. 2013; Yale Roundtable 2010). Large datasets, fast computers, and sophisticated statistical modeling make a powerful combination for scientific discovery. However, they can also lead to a lack of reproducibility in computational science findings when inappropriately applied to the discovery process. Recent examples show that improper use of computational tools and software can lead to spectacularly incorrect results (e.g., Coombes et al. 2007). Making computational research reproducible does not guarantee correctness of all results, but it allows for quickly building on sound results and for rapidly rooting out unsound ones.

The sharing of analytic data and the computer codes used to map those data into computational results is central to any comprehensive definition of reproducibility. Except for the simplest of analyses, the computer code used to analyze a dataset is the only record that permits others to fully understand what a researcher has done. The traditional materials and methods sections in most journal publications are simply too short to allow for the inclusion of critical details that make up an analysis. Often, seemingly innocuous details can have profound impacts on the results, particularly when the relationships being examined are inherently weak. Some concerns have been raised over the sharing of code and data. For example, the sharing of data may allow other competing scientists to analyze the data and scoop the scientists who originally published the data, or the sharing of code may lead to the inability to monetize software through proprietary versions of the code. While these concerns are real and have not been fully resolved by the scientific community, we do not dwell on them in this book.
This book is focused on a simple question. Assuming one agrees that reproducibility of a scientific result is a good thing, how do we do it? In computational science, reproducibility requires that one make code and data available to others so that they may analyze the original data in a similar manner as in the original publication. This task requires that the analysis be done in a way that preserves the code and data and permits their distribution in a generally readable format, and that a platform be available to the author on which the data and code can be distributed widely. Both data and code need to be licensed permissively enough so that others can reproduce the work without a substantial legal burden.

In this book, we cover many of the ingredients necessary for conducting and distributing reproducible research. The book is divided into three parts that cover the three principal areas: tools, practices, and platforms. Each part contains contributions from leaders in the area of reproducible research who have materially contributed to the area with software or other products.
Tools
Literate statistical programming is a concept introduced by Rossini, which builds on the idea of literate programming as described by Donald Knuth. With literate statistical programming, one combines the description of a statistical analysis and the code for doing the statistical analysis into a single document. Subsequently, one can take the combined document and produce either a human-readable document (e.g., PDF) or a machine-readable code file. An early implementation of this concept was the Sweave system of Leisch, which uses R as its programming language and LaTeX as its documentation language. Yihui Xie describes his knitr package, which builds substantially on Sweave and incorporates many new ideas developed since the initial development of Sweave. Along these lines, Tanu Malik and colleagues describe the Science Object Linking and Embedding framework for creating interactive publications that allow authors to embed various aspects of computational research in a document, creating a complete research compendium.
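
To make this concrete, here is a minimal sketch of what a knitr source document might look like; the file name example.Rnw and the chunk contents are illustrative, not drawn from any chapter. LaTeX carries the narrative, and an embedded R chunk carries the computation:

    \documentclass{article}
    \begin{document}
    Stopping distance increases with speed in the built-in cars data.
    <<cars-fit, fig.height=4>>=
    # Illustrative analysis: fit and plot a simple linear model.
    # The figure and printed output are woven into the final PDF.
    fit <- lm(dist ~ speed, data = cars)
    plot(dist ~ speed, data = cars)
    abline(fit)
    coef(summary(fit))
    @
    \end{document}

Assuming knitr and a LaTeX installation are available, knitr::knit("example.Rnw") weaves this into a .tex file for compilation to PDF (the human-readable output), while knitr::purl("example.Rnw") extracts only the R code (the machine-readable output).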
There have been a number of systems developed recently that are designed to track the provenance of data analysis outputs and to manage a researcher's workflow. Juliana Freire and colleagues describe the VisTrails system for open source provenance management for scientific workflow creation. VisTrails interfaces with existing scientific software and captures the inputs, outputs, and code that produced a particular result, even presenting this workflow in flowchart form. Andrew Davison and colleagues describe the Sumatra toolkit for reproducible research. Their goal is to introduce a tool for reproducible research that minimizes the disruption to scientists' existing workflows, therefore maximizing the uptake by current scientists. Their tool serves as a kind of "backend" to keep track of the code, data, and dependencies as a researcher works. This allows for easily reproducing specific analyses and for sharing with colleagues.
Philip Guo takes the "backend tracking" idea one step further and describes his Code, Data, Environment (CDE) package, which is a minimal "virtual machine" for reproducing the environment as well as the analysis. This package keeps track of all files used by a given program (i.e., a statistical analysis program) and bundles everything, including dependencies, into a single package. This approach guarantees that all requirements are included and that a given analysis can be reproduced on another computer.
Peter Murray-Rust and Dave Murray-Rust introduce the Declaratron, a tool for the precise mapping of mathematical expressions to computational implementations. They present an example from materials science, defining what reproducibility means in this field, in particular for unstable dynamical systems.
Practices and Guidelines
Conducting reproducible research requires more than the existence of good tools. Ensuring reproducibility requires the integration of useful tools into a larger workflow that is rigorous in keeping track of research activities. One metaphor is that of the lab notebook, now extended to computational experiments. Jarrod Millman and Fernando Pérez raise important points about how computational scientists should be trained, noting that many are not formally trained in computing, but rather pick up skills "on the go." They detail skills and tools that may be useful to computational scientists and describe a web-based notebook system developed in IPython that can be used to combine text, mathematics, computation, and results into a reproducible analysis. Titus Brown discusses tools that can be useful in the area of bioinformatics as well as good programming practices that can apply to a broad range of areas.
Holger Hoefling and Anthony Rossini present a case study in how to produce reproducible research in a commercial environment for large-scale data analyses involving teams of investigators, analysts, and stakeholders/clients. All scientific practice, whether in academia or industry, can be informed by the authors' experiences and the discussion of tools they used to organize their work.
Closely coupled with the idea of reproducibility is the notion of "open science," whereby results are made available to the widest audience possible through journal publications or other means. Luis Ibanez and colleagues give some thoughts on open science and reproducibility and trends that are