Similarity Measures Based on the Overlap of Ranked Genes Are Effective for Comparison and Classification of Microarray Data

Serra, Fabrizio; Romualdi, Chiara; Fogolari, Federico

doi:10.1089/cmb.2015.0057

Similarity (or conversely distance) measures are at the heart of most bioinformatic applications. When the similarity involves only a small subset of features out of many, global similarity measures may be significantly affected by noise. Selecting only a subset of (putatively relevant) features for comparison is a widespread solution to the problem albeit affected by arbitrariness and manual intervention. The problem is becoming more and more important due to the increasing amount of experimental data available. In recent years measures based on ranking similarities between two datasets have been proposed. Here, we use one of the proposed rank similarity measures, sharing some aspects with the fraction enrichment score used for protein structure prediction and the gene set enrichment analysis, and test its performance in classifying experiments. The discrimination ability of the similarity measures based on the overlap of ranked genes tested here compares well or better with standard measures of similarity. This conclusion supports the use of rank-based proximity measures to gain further insight in dataset comparisons, particularly on expression data obtained by different techonologies (e.g., RNA-seq and microarrays).