
Research


An informal blog is maintained by the members of the PCDS lab to disseminate research findings and activities of the lab and its members. The intended audience for this blog is the general public with an interest in computational science, K-12 students, and undergraduate students who might be interested in pursuing STEM education and careers. The blog can be accessed here.

External Funding: 

Our research, so far, has been supported by the National Science Foundation (NSF), the National Institutes of Health (NIH), Nvidia, Altera Corporation, and Western Michigan University.


Research Interests:

The concept of data in the omics field is not new. However, with the deluge of data from various sources, the computational challenges have increased manyfold. Under the direction of Prof. Fahad Saeed at Western Michigan University, we have been focusing on solving big data genomics and proteomics problems. Such innovative computational solutions will bring us one step closer to personalized and precision medicine.

  

The research focus of the lab is at the intersection of high performance computing and real-world applications, especially in computational biology. We are particularly interested in High Performance Computing (HPC) solutions to Big Data problems in high-throughput proteomics and genomics, using a variety of high-performance architectures and algorithms. We design and develop novel ways of dealing with Big Data computational biology problems using application-specific domain knowledge and high performance techniques. Other research interests include large-scale data analytics & visualization, parallel analysis solutions for massive graphs, and information retrieval from large, complex data sets. Here at the Parallel Computing and Data Science (PCDS) lab, we are striving to solve big data problems emanating from these high-throughput technologies. Our techniques include novel reductive analytics algorithms, high performance algorithms for compressive analytics, high performance computing solutions to general big data problems, and efficient protocols for sharing and transferring big genomic and proteomic data sets.

 

Reductive Analytics 

Large volume is one of the defining characteristics of big data sets. Peta-scale datasets from healthcare, genomics/proteomics, banking, and advertising lead to storage, analysis, and visualization challenges. Our efforts to deal with this deluge of data have focused on developing data-reductive techniques for big data. The basic idea is to reduce the data to a significantly smaller size while still being able to draw the same conclusions as when the whole data set is analyzed.

As an application, we have been focusing on mass spectrometry (MS) data sets. Modern proteomics studies utilize high-throughput mass spectrometers which can produce data at an astonishing rate. These big MS datasets can easily reach peta-scale, creating storage and analytic problems for large-scale systems biology studies. MS data are large in size, highly fragmented with irregular data points, and extremely noisy. We have made substantial advances in dealing with these kinds of data sets by formulating highly efficient, domain-specific quantification and classification algorithms. Our experiments show peptide deduction accuracy of up to 95% with a reduction in data size of up to 70%. Our results also indicate that we are able to process 1 million spectra in under 1 hour on a sequential machine, making our approach highly efficient for big datasets; comparable reduction tools took over 3 days for the same dataset on a similar machine. A toy sketch of the reduction idea appears below.
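To make the reduction idea concrete, here is a minimal sketch in Python that keeps only the highest-intensity peaks in each spectrum. This is a deliberately simplified stand-in, not the MS-REDUCE algorithm; the function name, the intensity-based criterion, and the synthetic spectrum are all illustrative assumptions.

```python
import heapq
import random

def reduce_spectrum(peaks, keep_fraction=0.3):
    """Keep the top keep_fraction of peaks by intensity.

    peaks is a list of (m/z, intensity) pairs; keep_fraction=0.3
    corresponds to roughly a 70% reduction in data size.
    """
    k = max(1, int(len(peaks) * keep_fraction))
    top = heapq.nlargest(k, peaks, key=lambda p: p[1])
    return sorted(top)  # restore ascending m/z order

# Synthetic spectrum: 500 random (m/z, intensity) peaks.
random.seed(0)
spectrum = [(random.uniform(100.0, 2000.0), random.expovariate(1.0))
            for _ in range(500)]

reduced = reduce_spectrum(spectrum)
print(f"{len(spectrum)} peaks -> {len(reduced)} peaks")
```

The key property of any such scheme is that downstream analysis (here, peptide identification) reaches the same conclusions on the reduced data as on the full data; the real work lies in choosing a reduction criterion for which that holds.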


Representative Publications:

  1. Muaaz Awan and Fahad Saeed, "MS-REDUCE: An ultrafast technique for reduction of Big Mass Spectrometry Data for high-throughput processing", accepted in Oxford Bioinformatics, Jan 2016 Tech Report | PubMed | Oxford
  2. Muaaz Awan and Fahad Saeed*, "On the sampling of Big Mass Spectrometry Data", Proceedings of Bioinformatics and Computational Biology (BICoB) Conference, Honolulu, Hawaii, March 2015 Tech Report
  3. Fahad Saeed*, Jason Hoffert and Mark Knepper, "CAMS-RS: Clustering Algorithm for Large-Scale Mass Spectrometry Data using Restricted Search Space and Intelligent Random-Sampling", IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 11, No. 1, pp. 128-141, Jan. 2014 Tech Report | PubMed | IEEE Xplore
  4. Fahad Saeed*, Jason Hoffert, Trairak Pisitkun and Mark Knepper, "Exploiting thread-level and instruction-level parallelism to cluster mass spectrometry data using multicore architectures", Network Modeling Analysis in Health Informatics and Bioinformatics, Vol. 3, No. 1, pp. 1-19, Feb. 2014 Springer | PubMed

Compressive Analytics 

Another way to deal with the large volume of big data is to compress it. Compression is used in general-purpose settings (e.g. zip archives) as well as in domain-specific ones (e.g. JPEG for images). Compressing big data is a useful way to store it and allows practitioners to organize data in a manageable way. However, there are two bottlenecks in current practice. 1) Compression algorithms (especially for domain-specific applications such as genomics) exhibit poor scalability with increasing data size. 2) Compressed data has to be decompressed before it can be used again, and our experiments show that decompression is at least as time-consuming as compression for large data sets. A simple way to gauge both costs on a given data set is sketched below.
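The round-trip cost is easy to measure. The following is a minimal sketch using Python's standard zlib module on synthetic genomic-style text; the data size, compression level, and resulting timings are illustrative assumptions and will vary with hardware and input.

```python
import random
import time
import zlib

# ~10 MB of synthetic A/C/G/T text as a stand-in for genomic data.
random.seed(0)
data = bytes(random.choices(b"ACGT", k=10_000_000))

t0 = time.perf_counter()
compressed = zlib.compress(data, 6)   # general-purpose compression
t1 = time.perf_counter()
restored = zlib.decompress(compressed)
t2 = time.perf_counter()

assert restored == data
print(f"original:   {len(data) / 1e6:.1f} MB")
print(f"compressed: {len(compressed) / 1e6:.1f} MB")
print(f"compress:   {t1 - t0:.2f} s, decompress: {t2 - t1:.2f} s")
```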

We have been making progress in this field by developing high performance techniques that can compress big genomic data sets in a reasonable amount of time. To this end, we have presented an HPC technique that runs on memory-distributed clusters. The technique allows us to compress next generation sequencing data at a much faster pace than is currently possible. Other focus areas include developing HPC techniques that allow analysis directly on the compressed form of the data. The core chunk-and-compress-in-parallel idea is sketched below.
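Here is a minimal shared-memory sketch of the chunk-and-compress-in-parallel idea, in Python with a process pool. This is an illustrative analogue only: the published algorithm targets memory-distributed clusters and NGS-specific data, neither of which this sketch reproduces, and the chunk size and worker count are arbitrary assumptions.

```python
import random
import zlib
from multiprocessing import Pool

CHUNK = 4_000_000  # bytes per independently compressed block

def compress_chunk(chunk: bytes) -> bytes:
    return zlib.compress(chunk, 6)

def parallel_compress(data: bytes, workers: int = 4):
    # Split into fixed-size chunks and compress them in parallel;
    # on a cluster, each chunk would live on a different node.
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    with Pool(workers) as pool:
        return pool.map(compress_chunk, chunks)

if __name__ == "__main__":
    random.seed(0)
    data = bytes(random.choices(b"ACGT", k=20_000_000))
    blocks = parallel_compress(data)
    restored = b"".join(zlib.decompress(b) for b in blocks)
    assert restored == data
    print(f"{len(data)} -> {sum(map(len, blocks))} bytes "
          f"in {len(blocks)} blocks")
```

Compressing each block independently trades a little compression ratio for near-linear scalability, and it also makes random access to individual blocks possible without decompressing the whole file.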


Representative Publications: 

  1. Sandino N. V. Perez and Fahad Saeed*, "A Parallel Algorithm for Compression of Big Next-Generation Sequencing (NGS) Datasets", Proceedings of Parallel and Distributed Processing with Applications (IEEE ISPA-15), Vol. 3, pp. 196-201, Helsinki, Finland, Aug 2015 Tech Report | IEEE Xplore
  2. Mohammed Aledhari and Fahad Saeed*, "Design and Implementation of Network Transfer Protocol for Big Genomic Data", Proceedings of IEEE International Congress on Big Data (IEEE BigData Congress), pp. 281-288, New York City, USA, June 2015 (18% acceptance rate) Tech Report | IEEE Xplore

High Performance Computing Solutions to Big Data Analytics using Heterogeneous Architectures 

Heterogeneous architectures that are "portable" are particularly appealing for efficient, low-cost Big Data management, processing and analysis for two reasons. First, processing units (e.g. CPUs) and accelerators such as GPUs now co-exist, which allows exploitation of the parallelism needed for varying workloads and algorithms. Second, off-loading compute-intensive processes to hardware accelerators gives enormous performance improvements. However, the heterogeneity of the processing elements introduces multiple challenges: GPUs operate on a different processing model than CPUs, have limited memory, and suffer long data-latency delays between the CPU and the accelerators. Nevertheless, emerging high-bandwidth, low-latency interconnect technologies and innovative algorithmic procedures provide abundant opportunities for the development of highly innovative techniques for extreme acceleration of core Big Data problems.


Both the volume and the velocity of big data from different sources require high performance computing solutions. In order to make the analysis of these big data sets accessible to domain scientists, the HPC solutions should be able to use "portable" HPC architectures such as FPGAs or GPUs. HPC techniques of this kind, which do not require extensive infrastructure, will be very useful in precision and personalized medicine. To this end, we have been designing and developing GPU- and FPGA-based algorithms that can process big data sets.

Our current focus is on developing high performance solutions to fundamental operations such as searching, sorting and graph traversal. Once we have formulated HPC solutions to these operations using FPGAs and GPUs, they can be used to develop computational biology and bioinformatics solutions. Recently, we introduced a GPU-based algorithm that allows sorting of a very large number of arrays; a CPU sketch of this batched-sort pattern follows.
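To illustrate the batched-sort pattern (many independent arrays, one sort each), here is a minimal CPU analogue in Python using a process pool. GPU-ArraySort itself is an in-place, on-device GPU algorithm; this sketch only mirrors its one-worker-per-array parallel structure, and the array counts and sizes are arbitrary assumptions.

```python
import random
from multiprocessing import Pool

def sort_one(arr):
    # Each array is sorted independently; on a GPU this maps to
    # one thread block per array rather than one process.
    return sorted(arr)

if __name__ == "__main__":
    random.seed(0)
    # Many small-to-medium arrays, as in the batched-sort setting.
    arrays = [[random.randrange(1_000_000)
               for _ in range(random.randrange(50, 500))]
              for _ in range(10_000)]
    with Pool() as pool:
        sorted_arrays = pool.map(sort_one, arrays)
    assert all(a == sorted(b) for a, b in zip(sorted_arrays, arrays))
    print(f"sorted {len(sorted_arrays)} arrays")
```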


Representative Publications: 

  1. Muaaz Gul Awan and Fahad Saeed*, "GPU-ArraySort: A parallel, in-place algorithm for sorting large number of arrays", Proceedings of Workshop on High Performance Computing for Big Data, International Conference on Parallel Processing (ICPP-2016), Philadelphia PA, August 2016

Efficient Protocols for sharing Big Data Sets 

Transporting big data is also one of the major challenges. Current network protocols and infrastructure are heavily geared towards transporting data without considering its content or context, which makes them rather inefficient for big data. For example, transmitting 10 TB of big genomic data to a university in Europe takes more time than transporting the hard drive using a courier service. One reason transmission of big data using existing networking protocols is so slow is the inability of these protocols to consider the data being transmitted, e.g. static fixed-length encoding of text as well as images. However, not every kind of data needs the same encoding space. To this end, we have been developing protocols that consider the content of the data being transmitted; a toy example of such content-aware encoding follows. Our current focus is on developing these protocols for big genomic data, but more effort will be invested in transmitting other kinds of big data.
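As a toy illustration of content-aware encoding (a hedged sketch, not the published protocol): DNA sequences over the four-letter alphabet A/C/G/T need only 2 bits per base instead of the 8 bits of plain ASCII, so a sender that knows its payload is genomic can cut it to a quarter of the size before transmission. The packing scheme below is an illustrative assumption and ignores real-world complications such as N bases and quality scores.

```python
# Hypothetical 2-bit packing for pure A/C/G/T payloads.
ENC = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DEC = {v: k for k, v in ENC.items()}

def pack(seq: str) -> bytes:
    out = bytearray()
    for i in range(0, len(seq), 4):
        group = seq[i:i + 4]
        byte = 0
        for base in group:
            byte = (byte << 2) | ENC[base]
        # Left-align a trailing partial group so unpacking is uniform.
        byte <<= 2 * (4 - len(group))
        out.append(byte)
    return bytes(out)

def unpack(data: bytes, n_bases: int) -> str:
    bases = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            bases.append(DEC[(byte >> shift) & 0b11])
    return "".join(bases[:n_bases])

seq = "GATTACAGATTACA"
packed = pack(seq)
assert unpack(packed, len(seq)) == seq
print(f"{len(seq)} bases: {len(seq)} ASCII bytes -> {len(packed)} packed bytes")
```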


Future Directions 

The future directions of the lab include developing high performance computing solutions to big data problems using heterogeneous architectures, integration strategies for various structured/unstructured data sources, and novel sampling, indexing & sketching techniques for big data sets. We are essentially a high performance computing lab, but we actively collaborate with experimental systems biology labs around the world, including at the National Institutes of Health (NIH), Bethesda, MD. Our research, so far, has been supported by the National Science Foundation (NSF), the National Institutes of Health (NIH), Nvidia, Altera Corporation and Western Michigan University.