The Huck Institutes of the Life Sciences

Biological Data Analysis: The Right Way

How Penn State’s annual data reproducibility boot camp has shaped data collection in the biological sciences

The week of July 10th marked the second annual data reproducibility boot camp, ‘Biological Data Analysis: The Right Way’. The boot camp was aimed at training participants in conducting research in a systematic, reproducible manner and addressing the ‘reproducibility crisis’ that is plaguing biological research.


This five-day event was well attended, with over forty participants, including graduate students, postdoctoral researchers and others, selected from a pool of over eighty applicants. The boot camp covered a broad range of topics, including perspectives on the definition of reproducibility, issues faced when producing and recording data, field-specific problems in data collection, statistics, and software useful for tracking and analyzing data.


The first day of the boot camp brought together speakers from different disciplines, who addressed the central question: ‘Is there a reproducibility crisis?’ The topics covered included how we make scientific inference and its relation to reproducibility from an evolutionary perspective (Dr. Ken Weiss), sources and proximity of lack of reproducibility (Dr. Ross Hardison), the reproducibility crisis in psychology and neuroscience (Dr. Rick Gilmore), computational reproducibility and data sharing (Dr. Vasant Honavar), statistical issues that are important from a publication perspective (Dr. Jim Rosenberger), and a wet-lab perspective on metadata (Dr. Cheryl Keller). Many speakers offered alternative viewpoints, which were essential to a well-rounded and balanced presentation of ideas related to data analysis.

Dr. Shaun Mahony led the second day on ‘Software Carpentry’, the basics of computational reproducibility, including version control, documentation and automation. The participants were introduced to shell scripting, Markdown and Git. Dr. Qunhua Li led the third day on statistical aspects of reproducible research. Dr. Anton Nekrutenko led the fourth day on the use of Galaxy in reproducible research. These sessions included short lectures, videos, and hands-on exercises.

The final half-day session was devoted to presentations from last year’s attendees on lessons learned in reproducibility. These presentations were made by Matthew Jensen (BG), Lila Rieber (BG) and Di (Bruce) Chen (Genetics). The boot camp concluded with an open discussion where participants shared what they learned from the boot camp.

The participants found the talks, hands-on activities and discussions very useful. Feedback from attending students included: “It was really helpful to be connected with faculty, staff, and other students with helpful experience with reproducible data analysis tools. I also hadn't heard of most of the online tutorials we used during boot camp and found them to be excellent resources I plan to revisit” and “I liked that the Boot Camp was focused on a broader topic such as Reproducibility. Therefore it was applicable and useful to all of the faculties and graduate students of different departments.”

All course materials are publicly available here.

The boot camp was organized by Dr. Cooduvalli Shashikant, Program Chair, Bioinformatics and Genomics, with the help of Melissa Bailey, Event Coordinator, Huck Institutes of the Life Sciences.

This boot camp series was started with a competitive administrative supplement from the NIH to the Computation, Bioinformatics and Statistics Predoctoral Training Program, obtained in 2015. The annual boot camp continues to be offered thanks to contributions from participating colleges and the Huck.