
imArray: An Automated High-Performance Microarray Scanner Software Package for Microarray Image Analysis, Data Management and Knowledge Mining



Project Summary: The microarray is a powerful tool for genomic research and has great potential for clinical diagnosis. Although this technique can perform many biological assays in parallel, it requires intelligent software to process its data. In this proposal, we propose to develop automatic high-performance microarray scanner software, imArray, which is intended to provide a comprehensive data and information management environment for microarray image analysis and multi-modality microarray data mining. The imArray software consists of three main components: an image analysis engine, a data analysis engine, and a knowledge discovery and data mining engine. These components are integrated with IBM's Unstructured Information Management Architecture (UIMA), a software architecture for converting unstructured information into structured knowledge. With this architecture, it becomes easier to relate information from multi-modality sources, which include not only raw data but also experimental results, literature, and documents in other formats, to the relevant domain knowledge, and so to discover new knowledge that was previously hidden in the unstructured information.


In a collaboration among faculty members at the University of Alabama at Birmingham (UAB), researchers with strong expertise in computer science, biostatistics, and biochemistry and molecular genetics propose to develop the microarray scanner software imArray, which extends our previous work on fully automatic gridding and segmentation for cDNA microarray image analysis. The proposed imArray software takes advantage of UIMA to analyze and manage multi-modality unstructured information, and then uses the organized information to discover new knowledge hidden in the large volume of unstructured data. UIMA plays a key role throughout data handling, analysis, and data mining. We believe that microarray technology will be widely used for clinical purposes around the world, and that the proposed imArray software package will greatly enhance the performance of microarray scanner software and serve as an integrated data management system covering nearly every aspect of microarray data processing, indexing, and querying.


Introduction to microarrays and the imArray architecture <PowerPoint slide download>


Software description: The imArray system follows the Model-View-Controller (MVC) design pattern and is built on the UIMA framework. It offers a user-friendly GUI for specifying the locations of imported and exported data. Users can also plug in analysis components to perform different analysis tasks. The imArray system includes four major modules:

(1) Slide Information Module
(2) Slide Blocking Module
(3) Slide Gridding Module
(4) Slide Segmentation Module

Modules are implemented as either a primitive AE (Analysis Engine) or an aggregate AE. A primitive AE contains an annotator and a component descriptor. An annotator is the code that analyzes unstructured content. A component descriptor is an XML file describing the data structure and the input/output requirements of the annotator. An aggregate AE includes one or more primitive AEs, and it contains only a component descriptor with additional information about the data processing flow among its primitive AEs.

The Slide Information Module parses and retrieves slide information from XML documents and stores it in the CAS (Common Analysis Structure) for further analysis. This module works with an agent-based automatic information update module that retrieves the latest gene annotations from various public databases, so that researchers have up-to-date information.

The Slide Blocking Module includes three submodules: a signal/noise detector that identifies foreground and background pixels, a tilt detector that detects and corrects tilted slides, and a block boundary detector that separates blocks.

The Slide Gridding Module is realized as an aggregate AE with two primitive AEs (a bounding box generator and a grid line detector). This module generates grids for blocks and stores the grids as annotations in the CAS.

The Slide Segmentation Module segments each cell in the grids and extracts the real signal pixels from each spot. We apply Otsu's thresholding algorithm progressively to obtain a local threshold for each spot region by minimizing the intra-class variance (equivalently, maximizing the inter-class variance) of pixel intensities.
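As an illustration of the thresholding step used in slide segmentation, here is a minimal Python sketch of Otsu's method applied to the pixels of a single spot region. It is not the imArray implementation: the function names otsu_threshold and split_spot, the flat list of integer intensities, and the default of 256 intensity levels are all assumptions made for this example, and the progressive application described above is not shown. Otsu's method selects the threshold that maximizes the between-class variance, which is equivalent to minimizing the within-class variance.

```python
from typing import List, Sequence, Tuple


def otsu_threshold(pixels: Sequence[int], levels: int = 256) -> int:
    """Return the threshold t that maximizes between-class variance.

    Pixels with intensity <= t form the background class; the rest form
    the signal class. `levels` is the number of possible intensity values
    (an assumption; pass 65536 for 16-bit data).
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    total_sum = sum(i * count for i, count in enumerate(hist))

    best_t, best_score = 0, -1.0
    n_bg = 0    # pixels in the background class for the current t
    sum_bg = 0  # sum of their intensities
    for t in range(levels):
        n_bg += hist[t]
        sum_bg += t * hist[t]
        n_fg = total - n_bg
        if n_bg == 0:
            continue  # background class still empty
        if n_fg == 0:
            break     # signal class empty from here on
        mean_bg = sum_bg / n_bg
        mean_fg = (total_sum - sum_bg) / n_fg
        # Between-class variance, up to a constant factor of 1 / total**2.
        score = n_bg * n_fg * (mean_bg - mean_fg) ** 2
        if score > best_score:
            best_t, best_score = t, score
    return best_t


def split_spot(pixels: Sequence[int], levels: int = 256) -> Tuple[List[int], List[int]]:
    """Split a spot region into (background, signal) pixels using Otsu's threshold."""
    t = otsu_threshold(pixels, levels)
    background = [p for p in pixels if p <= t]
    signal = [p for p in pixels if p > t]
    return background, signal
```

Calling split_spot on the pixel intensities of one grid cell returns the background and signal pixels separately. Because the threshold is computed per region, each spot gets its own local threshold rather than a single global one for the whole slide.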

Below are the code package and a few of the publications related to this work.

Code

imArray software package <Download>
Please note that the tilt detector module and the agent-based automatic information update module are still being implemented, so these two modules are not included in the currently released version of the imArray software package.

Publications


A Supervised Machine Learning Approach to Extracting and Ranking Published Papers Describing Coexpression Relationships among Genes

In this project, we describe a framework for extracting coexpression relationships among genes from published literature using a supervised machine learning approach. We used a graphical model, Dynamic Conditional Random Fields (DCRFs), to train our classifier. Our approach relies on semantic analysis of the text to classify the predicates that describe coexpression relationships, rather than on learning keywords. Below are the code package and a few of the publications related to this work.

Code

Gene co-expression classification and information retrieval code package <Download>

Publications