The Huck Institutes of the Life Sciences

Announcing new Galaxy-based Science Gateways at Penn State

The Institute for CyberScience is now offering consulting services for Penn State researchers interested in building Galaxy-based Science Gateways integrating advanced cyber-infrastructure components (e.g., data collections, instruments, supercomputers, and analytical tools) behind user-friendly interfaces for high-performance computing (HPC) resources.

Starting in November 2015, the Institute for CyberScience (ICS) participated in a yearlong joint venture with the labs of Dr. Frank Pugh and Dr. Shaun Mahony to define a new ICS service offering for Penn State researchers: consulting for labs interested in building Science Gateways based on the Galaxy platform.

Science Gateways integrate various components of advanced cyber-infrastructure (e.g., data collections, instruments, supercomputers, and analytical tools) behind user-friendly interfaces. These environments eliminate the complexities inherent in research that requires complex programming for high-performance computing (HPC) resources.

Galaxy is an open, web-based platform for data-intensive research. It has been developed within the Center for Comparative Genomics and Bioinformatics at Penn State University and the Department of Biology at Johns Hopkins University, with additional support from the National Human Genome Research Institute, the National Science Foundation, the Huck Institutes of the Life Sciences, and the Institute for CyberScience at Penn State.

Work on the Galaxy platform began in 2005 with a small team under Dr. Anton Nekrutenko at Penn State and continues today. In 2010 the project expanded into the open-source bioinformatics community and now has over 120 contributors worldwide.

Although developed by the bioinformatics community, Galaxy is a flexible platform that can be used for data-intensive research in virtually any scientific discipline. This flexibility comes from the ability to easily wrap the tools used within a field of research into a Galaxy environment, resulting in a Galaxy instance for that area of research. These Galaxy instances are called “flavors,” and the flavors currently available include next-gen sequencing, ChIP-exo, proteomics, computational chemistry, imaging and constructive solid geometry. Additional flavors are continually being built, and each flavor provides the foundation for building a Science Gateway that can be used by the Penn State research community.

The first ICS Galaxy-based Science Gateway, developed by Greg Von Kuster, one of the original members of the Galaxy development team and currently an R&D engineer in ICS, is named Galaxy-cegr (see Figure 1). It is hosted within a virtual machine (VM) on ICS HPC resources and has access to cluster nodes for job execution. This environment enables researchers in the Center for Eukaryotic Gene Regulation (CEGR) to perform ChIP-exo analyses on very large datasets produced by the Center’s sequencer.

Figure 1: Galaxy-cegr Science Gateway


The Science Gateway consists of four primary components: an Illumina NextSeq 500 sequencer owned by the Pugh/Mahony labs, the CEGR pre-processing pipeline, the Galaxy ChIP-exo environment and the Platform for Eukaryotic Gene Regulation (PEGR), a web application developed by the Pugh/Mahony labs.

Within the Science Gateway environment, the sequencer produces raw datasets that are converted into the fastqsanger data format and imported into Galaxy data libraries (Galaxy’s hierarchical container for datasets). The Galaxy ChIP-exo workflow is automatically executed for each set of samples, with each tool in the workflow submitting jobs to the HPC cluster nodes, producing datasets that are then used as inputs to the next tool in the workflow chain.

The Galaxy ChIP-exo workflow includes tools that generate metadata (fine-grained statistics) about datasets produced by specified tools within the workflow. These statistics are sent to PEGR for review.

The CEGR Pre-Processing Pipeline

The CEGR pre-processing pipeline consists of four custom programs, developed to automate the processes used by the labs and ultimately prepare the data for analysis within the Galaxy ChIP-exo instance (the green boxes in Figure 2 below depict these custom programs). Each of these programs includes quality assurance components that automatically halt processing if errors occur, logging the details for review and correction. Each program can be executed independently (assuming that the previous program in the pipeline has completed successfully), so a given step can be re-executed after corrections are made.

Figure 2: Galaxy-cegr Pre-processing Pipeline


  • copy_raw_data – Polls the lab’s server that contains the raw datasets produced by the sequencer to determine when the sequencing run is complete. The raw datasets are then copied from the server to the Science Gateway’s file store.

  • bcl2fastq – Converts the raw datasets into the fastqsanger data format required by the initial tools within the Galaxy ChIP-exo workflow.

  • send_data_to_galaxy – Creates a Galaxy data library for the run’s sample datasets. The sample datasets are imported into appropriate folders from which they can be retrieved for analysis. The program submits a PBS job to the ICS HPC cluster to import each sample dataset, storing the imported datasets on the Science Gateway’s file store.

  • start_workflows – Retrieves sample datasets from the run’s Galaxy data library folders and provides them as input to the Galaxy ChIP-exo workflow. Each tool in the workflow performs its function by submitting a PBS job to the HPC cluster, storing the resulting datasets on the Science Gateway’s file store.
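
The halt-on-error and re-execution behavior described for these four programs can be sketched as a small driver. This is only an illustration under stated assumptions: the stage names come from the list above, while `run_pipeline`, `StageError`, and the callables passed in are hypothetical and are not the labs' actual code.

```python
import logging
from typing import Callable, Dict, List

logger = logging.getLogger("cegr_pipeline_sketch")

# Stage names are taken from the article; their order is the pipeline order.
STAGE_ORDER = ["copy_raw_data", "bcl2fastq", "send_data_to_galaxy", "start_workflows"]


class StageError(Exception):
    """Hypothetical error raised by a stage when a quality check fails."""


def run_pipeline(
    stages: Dict[str, Callable[[], None]],
    completed: List[str],
    start_at: str = STAGE_ORDER[0],
) -> List[str]:
    """Run stages in order from start_at and halt at the first failure.

    `completed` lists stages that already finished successfully. A stage may
    only be started if every earlier stage appears in that list.
    Returns the updated list of completed stages.
    """
    done = list(completed)
    start = STAGE_ORDER.index(start_at)
    missing = [name for name in STAGE_ORDER[:start] if name not in done]
    if missing:
        raise ValueError(f"cannot start at {start_at}: {missing} not completed")
    for name in STAGE_ORDER[start:]:
        try:
            stages[name]()
        except StageError as exc:
            # Halt processing and log the details for review and correction.
            logger.error("stage %s halted: %s", name, exc)
            break
        if name not in done:
            done.append(name)
    return done
```

After the logged problem is corrected, calling `run_pipeline` again with `start_at` set to the failed stage resumes from that step without repeating the earlier ones.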

Galaxy includes a feature-rich REST API, which the pre-processing pipeline uses for all direct interaction with Galaxy.
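
As a rough illustration of what such an interaction might look like, the sketch below builds (but does not send) a request to create a data library for a sequencing run. It assumes the `/api/libraries` endpoint and the `x-api-key` header of the Galaxy API; the server URL, key, and library name are placeholders, and the helper name `build_create_library_request` is hypothetical. The API documentation for the deployed Galaxy version remains the authority on exact parameters.

```python
import json
import urllib.request


def build_create_library_request(
    galaxy_url: str, api_key: str, name: str, description: str = ""
) -> urllib.request.Request:
    """Build a POST request that would create a Galaxy data library.

    The request is only constructed here; pass it to
    urllib.request.urlopen() to actually send it.
    """
    payload = json.dumps({"name": name, "description": description}).encode("utf-8")
    return urllib.request.Request(
        url=galaxy_url.rstrip("/") + "/api/libraries",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the API key is supplied in the x-api-key header.
            "x-api-key": api_key,
        },
    )


# Placeholder values for illustration only.
request = build_create_library_request(
    "https://galaxy.example.org/", "PLACEHOLDER_KEY", "run_0001", "Samples for run 0001"
)
```

Building the request separately from sending it keeps the sketch runnable without a Galaxy server and makes the payload easy to inspect.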

The Galaxy ChIP-exo Environment

The Galaxy environment is configured with 8 front-end web server processes. It contains the ChIP-exo workflows for both single and paired reads. These workflows consist of the chain of tools that perform the ChIP-exo analysis and send the statistics to PEGR (see Figure 3).

Figure 3: Galaxy-cegr ChIP-exo Workflow


The Galaxy environment is also configured with 8 job handler processes that submit PBS jobs to the ICS HPC cluster nodes, producing datasets that are stored on the Science Gateway’s file store. This configuration allows for load balancing on the web front-end and the many simultaneous PBS jobs submitted to the HPC cluster.

The Platform for Eukaryotic Gene Regulation

The Platform for Eukaryotic Gene Regulation (PEGR) is a web-based sample-tracking application developed by the Pugh/Mahony labs. Lab technicians enter sample information into PEGR, and the CEGR pre-processing pipeline uses that information to prepare the data for analysis within Galaxy. PEGR interacts with Galaxy via the Galaxy REST API.

Galaxy-Based Science Gateways

The Galaxy platform can be configured to map specified tools to users. In this way, a user logging into a Galaxy instance will be presented with a set of tools pertaining to their specific field of research, while another user logging into the same Galaxy instance will be presented with a different set of tools. These tool sets could intersect or be completely disparate. This feature allows multiple researchers spanning various fields of science to share the same Galaxy-based Science Gateway, so those with smaller grants need not take full responsibility for a complete environment.
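
The idea of per-user tool sets can be illustrated with a small standalone sketch. It does not use Galaxy's own configuration mechanism; the user addresses, tool identifiers, and the `visible_tools` helper are all hypothetical and chosen only to show two tool sets that overlap on one shared tool.

```python
from typing import Dict, FrozenSet, Iterable, List

# Hypothetical mapping from a user's login to the tool ids that user should see.
TOOLS_BY_USER: Dict[str, FrozenSet[str]] = {
    "chipexo_user@example.edu": frozenset({"read_mapper", "peak_caller", "quality_report"}),
    "proteomics_user@example.edu": frozenset({"peptide_search", "quality_report"}),
}


def visible_tools(user: str, installed_tools: Iterable[str]) -> List[str]:
    """Return the installed tools mapped to this user, sorted by id.

    Users without a mapping see no tools in this sketch.
    """
    allowed = TOOLS_BY_USER.get(user, frozenset())
    return sorted(tool for tool in installed_tools if tool in allowed)


# All tools installed in the shared (hypothetical) Galaxy instance.
INSTALLED = ["read_mapper", "peak_caller", "peptide_search", "quality_report"]

chipexo_view = visible_tools("chipexo_user@example.edu", INSTALLED)
proteomics_view = visible_tools("proteomics_user@example.edu", INSTALLED)
shared = sorted(set(chipexo_view) & set(proteomics_view))
```

Here both users work in the same instance, each sees only the tools for their field, and `shared` holds the one tool that appears in both sets.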

Those interested in building Galaxy-based Science Gateways for their research should send an email to Greg Von Kuster. Greg’s consulting services include building all custom components needed within a Galaxy-based Science Gateway environment for labs or individual researchers. These services include developing wrappers for research tools not currently available within the Galaxy platform, potentially creating a new Galaxy flavor for a field of research. These environments can be hosted on ICS HPC or other compute resources, depending upon the researcher’s needs.

The Institute for CyberScience is an interdisciplinary research institute under the Office of the Vice President for Research, and is dedicated to supporting cyber-enabled research across science disciplines. ICS builds an active community of researchers using computational methods in a wide range of fields through co-hiring, provides seed funding for ambitious computational research projects, and offers access to high-performance computing resources through its Advanced Cyber Infrastructure. With the support of ICS, Penn State researchers harness the power of Big Data, Big Simulation, and Big Compute to solve the world’s problems. For more information, please visit https://ics.psu.edu or email ics@psu.edu.