Nice post. I have been surfing the internet to find an answer: what would be the most appropriate way to cluster and later visualise 380+ patients with measures for 29 proteins (in pg/L)? And shouldn’t one correct for covariates while clustering, for instance for age and sex?

Since the data scales are sometimes not overlapping, but I do want to compare the data, I Box-Cox transformed (because there are biologically plausible low levels) and then normalised the data. Then I want to cluster the most comparable proteins (in the rows) and the most comparable patients (in the columns). Intuitively I was thinking “correlation”, and that turns out to be “right”. But now about the distance and linkage… there I’m wondering what is best to use… Even after reading your post I don’t know…
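A rough sketch of the workflow described above, with simulated values standing in for the real protein measurements. The choice of average linkage here is only one common option, not a recommendation (the right linkage is exactly what is being asked):

```python
# A minimal sketch: Box-Cox transform, standardise, then cluster proteins
# (rows) and patients (columns) with correlation distance.
# The data below are simulated, not real protein measurements.
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(29, 380))  # proteins x patients

# Box-Cox requires strictly positive values; transform each protein separately
X_bc = np.array([stats.boxcox(row)[0] for row in X])

# Standardise each protein to mean 0, sd 1 so the scales are comparable
X_z = (X_bc - X_bc.mean(axis=1, keepdims=True)) / X_bc.std(axis=1, keepdims=True)

# Correlation distance (1 - Pearson r); average linkage is one common choice
protein_tree = linkage(pdist(X_z, metric="correlation"), method="average")
patient_tree = linkage(pdist(X_z.T, metric="correlation"), method="average")

# Cut the protein tree into, say, three flat clusters
clusters = fcluster(protein_tree, t=3, criterion="maxclust")
```

The same `X_z` matrix can then be passed to a heatmap function that accepts precomputed linkages for both axes.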

And another thing, I never see this done in papers: the correction for potential confounders prior to clustering. Is there a reason why?

Thanks!

Sander

Think about three clusters measured at four time points. One has means 100, 150, 200, 50; the second has means 5, 10, 10, 5; and the third has means 10, 5, 2, 1. Most genes originating from the first cluster will have a much larger distance between them than genes from the other two clusters. This causes the two low-expression clusters to always be clustered together, even though they have distinct biological causes, while the high-expression cluster is fragmented into many clusters although it should be a single one. I found that transforming the data with a regularised-logarithm transformation and then using existing algorithms for microarrays gives adequate clustering in this scenario.
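This effect is easy to reproduce numerically. The sketch below uses the cluster means from the comment with Poisson noise (an assumption; the comment does not specify a noise model) and compares Euclidean distances before and after a log transform:

```python
# Illustration of scale dominance in Euclidean distance, using the cluster
# means from the comment above. Noise model (Poisson) is an assumption.
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
means = {
    "high":  [100, 150, 200, 50],
    "low_a": [5, 10, 10, 5],
    "low_b": [10, 5, 2, 1],
}
# two noisy genes per cluster: rows 0-1 high, 2-3 low_a, 4-5 low_b
genes = np.array([rng.poisson(m) for m in means.values() for _ in range(2)])

raw = squareform(pdist(genes, metric="euclidean"))
logd = squareform(pdist(np.log1p(genes), metric="euclidean"))

# On the raw scale, two genes from the SAME high-expression cluster are
# typically farther apart than genes from the two DIFFERENT low clusters,
# so the low clusters merge; log-transforming reverses this.
within_high_raw, between_low_raw = raw[0, 1], raw[2, 4]
within_high_log, between_low_log = logd[0, 1], logd[2, 4]
```

On the raw scale the noise of the high-mean cluster swamps the real separation between the two low clusters; after `log1p` the between-cluster distance dominates again.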

Thank you for an interesting post as always. I have a question about the statement “Mahalanobis distance seeks to remedy the problem of genes with high variance among the samples dominating the distance calculation by appropriate normalization”. Are you sure that high-variance genes dominating the distance calculation is necessarily a bad thing? Intuitively, low-variance genes often won’t contain a lot of information, because the useful signal will inevitably be drowned out by measurement error.
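The trade-off can be made concrete with a toy example (simulated data, and using the diagonal-covariance special case of Mahalanobis distance, i.e. standardised Euclidean): normalisation stops the high-variance gene from dominating, but it also gives a pure-noise low-variance gene an equal vote.

```python
# Toy sketch: one high-variance gene carrying real group signal, one
# low-variance gene that is pure noise. Variance normalisation equalises
# their contributions to the squared distance.
import numpy as np

rng = np.random.default_rng(2)
n = 50
signal = np.repeat([0.0, 10.0], n // 2)           # two true groups
gene_hi = signal + rng.normal(0, 2.0, n)          # high variance, informative
gene_lo = rng.normal(0, 0.1, n)                   # low variance, noise only
X = np.column_stack([gene_hi, gene_lo])           # samples x genes

# share of total squared pairwise distance contributed by gene_hi
sq = (X[:, None, :] - X[None, :, :]) ** 2
share_plain = sq[..., 0].sum() / sq.sum()

# after dividing each gene by its variance (seuclidean-style normalisation)
sq_norm = sq / X.var(axis=0, ddof=1)
share_norm = sq_norm[..., 0].sum() / sq_norm.sum()
```

Without normalisation the informative gene contributes nearly all of the distance; with it, each gene contributes exactly half, noise included — which is the concern raised above.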

As a side note, would it make sense to test the various distance metrics on real data? I.e. apply them to some existing high-throughput datasets, like TCGA, CCLE, or GTEx, and see how well the various distance metrics recapitulate the expected structure of the data (based on cancer subtype, tissue of origin, or any similar condition that you would expect a “good” distance metric to recapitulate).

Thanks again for posting, and forgive me if I have misunderstood your point.

Quilting aside, I would argue strongly that Euclidean distance, Pearson’s correlation or Spearman’s (rank) correlation are not appropriate for data that carry only relative information… in fact, they are known (largely outside molecular bioscience) to fall prey to Spurious Correlation (https://en.wikipedia.org/wiki/Spurious_correlation)

I suggest that for this kind of compositional data *proportionality* is the appropriate measure of pairwise association between components:

https://www.researchgate.net/publication/251878910_Have_you_got_things_in_proportion_A_practical_strategy_for_exploring_association_in_high-dimensional_compositions?ev=prf_pub

…and Aitchison’s distance is the right metric to use between compositions.

Aitchison’s distance is related to the (symmetric) Kullback–Leibler divergence (as an upper bound).
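For reference, Aitchison’s distance reduces to ordinary Euclidean distance after a centred log-ratio (clr) transform. A minimal sketch, with made-up compositions:

```python
# Aitchison distance between two strictly positive compositions:
# Euclidean distance in centred log-ratio (clr) space.
import numpy as np

def clr(x):
    """Centred log-ratio transform: log(x) minus its mean."""
    logx = np.log(np.asarray(x, dtype=float))
    return logx - logx.mean()

def aitchison(x, y):
    """Aitchison distance = Euclidean distance between clr transforms."""
    return np.linalg.norm(clr(x) - clr(y))

a = [0.2, 0.3, 0.5]
b = [0.4, 0.3, 0.3]
d_ab = aitchison(a, b)
# Scale invariance: multiplying a composition by any constant leaves the
# distance unchanged, which is exactly what relative data require.
d_scaled = aitchison(np.multiply(a, 100), b)
```

The scale invariance is the key property here: counts and proportions of the same sample give the same distance, which Euclidean distance on the raw values does not.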

http://bioinformatics.oxfordjournals.org/content/30/2/197.long

Although the documentation is minimal, and some features used to make the figures in the journal article aren’t implemented. I’m also not sure the problems with using existing algorithms on this data type are serious enough to warrant a new one.
