HII Research Computing Infrastructure

In collaboration with USF Research Computing (RC), HII has access to more than 9,000 cores, 32 TB of memory, and 3 PB of storage for large-scale bioinformatics and genomics workloads.

Network connectivity comprises both 32 Gbps QDR InfiniBand and cluster-to-cluster 100 Gbps links over the USF "Research Rail" for large-scale data transfers between computing centers.
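For a rough sense of scale, the sketch below estimates how long a hypothetical 1 TB sequencing dataset would take to move over each link at full line rate; both the dataset size and the assumption of sustained line-rate throughput are illustrative, not measured figures.

```python
# Rough transfer-time estimate for a hypothetical 1 TB dataset over the
# two links described above. Assumes the full line rate is achievable,
# which real transfers rarely sustain; figures are illustrative only.

DATASET_BYTES = 1 * 10**12            # hypothetical 1 TB sequencing dataset
LINKS_GBPS = {
    "QDR InfiniBand (32 Gbps)": 32,
    "Research Rail (100 Gbps)": 100,
}

for name, gbps in LINKS_GBPS.items():
    bytes_per_second = gbps * 10**9 / 8   # convert link rate to bytes/s
    seconds = DATASET_BYTES / bytes_per_second
    print(f"{name}: ~{seconds / 60:.1f} minutes at line rate")
```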

The infrastructure is continually updated; recent additions, tallied in the short sketch after this list, include:

  • DDN GPFS cluster providing 2.5 PB of scalable high-performance storage
  • DDN GPFS metadata server for increased transactional, high-throughput computing
  • 2 high-memory compute nodes each with 28 cores and 1024 GB of memory
  • 40 high-performance compute nodes each with 24 cores and 128 GB of memory
  • 40 high-performance compute nodes each with 16 cores and 128 GB of memory
  • 100 Gbps network technology implementing the USF "Research Rail"
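
As a quick sanity check of the figures above, the sketch below tallies the cores and memory contributed by these recent additions alone; it covers only the listed nodes, not the full shared environment.

```python
# Tally of the compute-node additions listed above. The totals cover only
# the recent additions, not the full ~9,000-core, 32 TB environment.

node_groups = [
    # (count, cores per node, memory per node in GB)
    (2, 28, 1024),    # high-memory nodes
    (40, 24, 128),    # high-performance nodes, 24-core
    (40, 16, 128),    # high-performance nodes, 16-core
]

total_cores = sum(count * cores for count, cores, _ in node_groups)
total_mem_gb = sum(count * mem for count, _, mem in node_groups)

print(f"Added cores: {total_cores}")          # 1,656 cores
print(f"Added memory: {total_mem_gb} GB")     # 12,288 GB (~12 TB)
```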

HII has positioned itself to accommodate extremely large datasets through the development of a comprehensive Big Data infrastructure.

Examples of Big Data in the life sciences include biomarker, single-nucleotide polymorphism (SNP), gene expression, microbiome, viral metagenomics, metabolomics, whole-genome sequencing, and proteomics data, as well as electronic medical records and DICOM medical images.

With the unprecedented amount of data collected daily through advances in data generation and capture techniques, datasets are increasingly so large and complex that they outstrip even the most advanced conventional approaches to data management, processing, and analysis.

This Big Data problem represents a new epoch in data science and brings significant challenges in the transfer, storage, curation, allocation, and analysis of data. These challenges demand integrating the most advanced technologies available with innovative approaches to high-performance computing in order to truly leverage the potential of Big Data. With advances in computer and data science, questions that were previously impractical to investigate can now be answered from the data. Big Data has the potential to revolutionize research, especially in biomedical science.

To provide researchers, analysts, and developers with meaningful datasets, the data acquired by HII must be aggregated. Data aggregation applies a variety of techniques to combine data from disparate sources in a way that facilitates analysis. The data aggregation component comprises a data warehouse, online analytical processing (OLAP), reporting services, and the SAS Grid.

HII has developed a large-scale data warehouse infrastructure to provide enhanced extraction, transformation, and loading of data for analytical consumption. The data warehouse provides historical snapshots of aggregate data used for ad hoc queries and reporting. It follows a hub-and-spoke architecture that allows business users to keep existing data marts suited to their needs. Aggregated datasets are replicated and stored within the data warehouse, both to protect the integrity of research data from the real-time churn of transactional system data and to keep datasets specific to one researcher or manuscript in a separate, pristine state.

OLAP cubes provide SAS developers and biostatisticians with aggregate datasets. An OLAP cube is a multidimensional dataset that offers quick access to pre-summarized data generated from large conglomerate datasets. Pre-aggregating the cubes significantly speeds data retrieval, allowing analysis to proceed in a timely fashion. The aggregation process derives summary data from data stored in the data warehouse; these summaries are then stored hierarchically within the cube. Data is backed up nightly to a data de-duplication appliance and offloaded on an as-needed basis to a high-capacity sixth-generation Linear Tape-Open (LTO-6) tape library, using a Linear Tape File System (LTFS) backup appliance, to free high-performance data storage.
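For readers unfamiliar with OLAP-style pre-aggregation, the minimal sketch below rolls a toy fact table up along a few dimensions so that later queries hit a small pre-summarized table; the column names and values are hypothetical, and the production warehouse uses dedicated OLAP and SAS tooling rather than this illustrative pandas approach.

```python
# Minimal sketch of OLAP-style pre-aggregation: summary values are computed
# once along the cube's dimensions so later queries read pre-summarized data
# instead of scanning the detailed fact table. Column names are hypothetical.
import pandas as pd

# Toy "fact table" of lab results (stand-in for warehouse detail data).
fact = pd.DataFrame({
    "study_site": ["A", "A", "B", "B", "B"],
    "visit_year": [2014, 2015, 2014, 2015, 2015],
    "analyte":    ["glucose", "glucose", "insulin", "glucose", "insulin"],
    "value":      [5.1, 5.6, 12.0, 5.3, 11.4],
})

# Pre-aggregate along the site/year/analyte dimensions.
cube = (
    fact.groupby(["study_site", "visit_year", "analyte"])["value"]
        .agg(["count", "mean"])
        .reset_index()
)

# A downstream query now hits the small pre-summarized table.
print(cube[(cube["study_site"] == "B") & (cube["visit_year"] == 2015)])
```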

HII operates a strict backup schedule. Transactional data is backed up nightly to a data de-duplication appliance and is copied weekly to a tape library with high-capacity fifth-generation Linear Tape-Open (LTO-5) drives. For our HPC storage needs, we use IBM's General Parallel File System (GPFS), a high-performance clustered file system. GPFS provides concurrent high-speed file access to massively parallel sequencing applications executing on multiple nodes in our HPC Linux clusters.
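
The concurrent-access pattern GPFS supports can be illustrated with a toy sketch: several worker processes read disjoint byte ranges of the same file, much as multiple cluster nodes read a shared GPFS path. The temporary file below stands in for data on the shared mount; this shows the access pattern only, not how sequencing applications are actually written.

```python
# Toy illustration of concurrent access to a file on a shared parallel file
# system such as a GPFS mount: several worker processes read disjoint byte
# ranges of the same file, much as multiple cluster nodes would. The
# temporary file below is a stand-in for a shared GPFS path.
import os
import tempfile
from multiprocessing import Pool

NUM_WORKERS = 4

def read_chunk(args):
    """Read one byte range of the shared file and return how many bytes came back."""
    path, offset, length = args
    with open(path, "rb") as fh:
        fh.seek(offset)
        return len(fh.read(length))

if __name__ == "__main__":
    # Stand-in for a file on shared storage (e.g., a /gpfs/... path on the cluster).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(1 << 20))          # 1 MiB of sample data
        shared_path = tmp.name

    size = os.path.getsize(shared_path)
    chunk = size // NUM_WORKERS
    ranges = [(shared_path, i * chunk,
               chunk if i < NUM_WORKERS - 1 else size - i * chunk)
              for i in range(NUM_WORKERS)]

    with Pool(NUM_WORKERS) as pool:
        print(sum(pool.map(read_chunk, ranges)))  # total bytes read == file size
    os.unlink(shared_path)
```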

Storage: Short-term

HII currently manages 1.2 petabytes (PB) of raw storage capacity. The Institute is in the process of acquiring an additional 480 terabytes (TB) of high-performance storage with transfer rates of up to 36 GB/s and growth capacity of up to 6 PB. Our storage infrastructure relies on a multi-protocol, multi-tier set of technologies targeted at storing and archiving information in the most cost-effective, reliable, efficient, and compliant way.
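
For a sense of what the quoted peak rate means at that capacity, the back-of-the-envelope sketch below works out how long a full scan of the planned 480 TB tier would take at 36 GB/s; sustained peak throughput is assumed purely for illustration.

```python
# Back-of-the-envelope time to read the planned 480 TB tier at the quoted
# 36 GB/s peak. Sustained peak throughput is assumed for illustration only.

capacity_tb = 480
peak_gb_per_s = 36

seconds = (capacity_tb * 1000) / peak_gb_per_s   # TB -> GB, then divide by GB/s
print(f"Full scan at peak rate: ~{seconds / 3600:.1f} hours")   # ~3.7 hours
```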

Storage: Long-term

Monthly tape backups are retained indefinitely, while all other tapes are on a three-month rotation. Tapes with full backup images and on-demand archives are sent off-site weekly to an Iron Mountain storage facility.
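
The retention rule above reduces to a simple check: monthly tapes are kept indefinitely, everything else ages out after roughly three months. The hypothetical helper below sketches that rule; it is not the Institute's actual tape-management tooling.

```python
# Hypothetical sketch of the retention rule described above: monthly backup
# tapes are kept indefinitely, while all other tapes rotate out after roughly
# three months. This illustrates the policy, not the actual tape tooling.
from datetime import date, timedelta
from typing import Optional

ROTATION = timedelta(days=90)   # approximate three-month rotation window

def retain_tape(written: date, is_monthly: bool, today: Optional[date] = None) -> bool:
    """Return True if a tape should still be kept under the stated policy."""
    today = today or date.today()
    if is_monthly:
        return True                      # monthly backups retained indefinitely
    return today - written <= ROTATION   # other tapes kept ~three months

# A weekly tape written 120 days ago has aged out of the rotation.
print(retain_tape(date.today() - timedelta(days=120), is_monthly=False))  # False
```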