Draft:Minimax linkage

From Wikipedia, the free encyclopedia


In computational mathematics/statistics, minimax linkage is a criterion applied in hierarchical cluster analysis. Minimax linkage hierarchical clustering is a special case of the hierarchical clustering approaches, originally first introduced by Ao et al.[1] in the AI genomics software project CLUSTAG in 2004. Medical institutions have been deploying the minimax linkage hierarchical clustering in their genomics research. Jacob Bien and Robert Tibshirani (2011)[2] investigated the theoretical properties of the minimax linkage hierarchical clustering. Xiao Hui Tai and Kayla Frisoli (2021)[3] conducted benchmarking for the minimax linkage hierarchical clustering. The development history of the minimax linkage criterion is shown as follows.

Minimax linkage criterion in tag SNPs selection[edit]

In the context of tag-SNP selection, a desirable property for a clustering algorithm would be that a cluster must contain at least one SNP (the tag SNP) that is no more than the merging distance from all the other SNPs from the same cluster. If this is the case, then by setting a cutoff merging distance of C, one can ensure that no SNP is further than C away from the tag SNP in its cluster.

In order to achieve the desired property described above, Ao et al. proposed a new definition of the distance between two clusters, as follows:

(1) For each SNP belonging to either cluster, find the maximum distance between it and all the other SNPs in the two clusters;

(2) The smallest of these maximum distances is defined as the distance between the two clusters;

(3) The corresponding SNP is defined as the tag SNP of the newly merged cluster.

Ao et al. called their first proposed method the minimax clustering[1]. It is an agglomerative method. There exists a parallel in topology in which the distance between two compact sets can be measured by a sup-inf metric known as Hausdorff distance (Barnsley, 1988[4]; Wucklidge, 1996[5]).

Minimax linkage in genomics applications[edit]

The complete linkage hierarchical clustering, minimax linkage hierarchical clustering and set cover algorithms were implemented in the program CLUSTAG for tag SNP selection. The performance of the three implemented algorithms, using SNP data from the ENCODE regions of the HapMap project, was evaluated by Ao et al.[1], according to three criteria:

(1) Compression, the ratio of clusters to SNPs,

(2) Compactness, the average distance between a SNP and the tag SNP of its cluster (1-R2), and

(3) Run time.

Their results showed that the compression ratio is roughly equivalent for the set cover and minimax clustering algorithms but substantially higher for the complete linkage. The minimax algorithm produces more compact clusters than the set cover algorithm. The run times of all three algorithms were estimated to increase in proportion to the square of the number of SNPs.

Tag SNP selection can significantly reduce both the cost and the time of mapping genome areas associated with the target disease by avoiding the need to genotype every SNP in the chromosomal region. It is found that the genotyping cost saved can be as high as 80%[1]. WCLUSTAG is a modified version of CLUSTAG that allows the user to specify variable weights for different genomic regions or SNPs[6]. This flexible feature is useful for researchers who wish to prioritize genomic regions in an association study, and to further optimize study design under budgetary constraints. Examples of the medical research institutions that have been deploying/expounding the minimax linkage hierarchical clustering via the software CLUSTAG and WCLUSTAG are listed as follows:

Research: The Analysis of 51 Genes in DSM-IV Combined Type Attention Deficit Hyperactivity Disorder: Association Signals in DRD4, DAT1 and 16 Other Genes[7]

Institutions: MRC Social Genetic Developmental and Psychiatry Centre, Institute of Psychiatry, London, UK; Department of Psychiatry, Trinity Centre for Health Sciences, St James's Hospital, Dublin, Ireland; Department of Psychiatry, Radboud University, The Netherlands; S Herzog Memorial Hospital, Jerusalem, Israel; Triversum, Alkmaar, Holland; University Clinic for Child and Adolescent Psychiatry, Essen, Germany; Departments of Experimental Clinical Health Psychology, Ghent University, Ghent, Belgium; Department of Developmental and Educational Psychology, University of Valencia, Valencia, Spain; Department of Child and Adolescent Psychiatry, University of Zurich, Zurich, Switzerland; and others

Research: Fine Mapping of the NRG1 Hirschsprung's Disease Locus[8]

Institutions: Department of Psychiatry & Department of Surgery & Genome Research Centre, University of Hong Kong, Hong Kong, China; Department of Surgery, Guiyang Medical College Affiliated Hospital, Guiyang, China; Department of Epidemiology and Biostatistics, Ministry of Education Key Lab of Environment and Health, School of Public Health, Tongji Medical College of Huazhong University of Science and Technology, Wuhan, China; and others

Research: SNP Discovery Using a Pangenome: Has the Single Reference Approach Become Obsolete?[9]

Institutions: School of Agriculture and Food Sciences, University of Queensland, Australia; School of Biological Sciences and Institute of Agriculture, University of Western Australia, Australia

Research: Establishing Multiple Omics Baselines for Three Southeast Asian Populations in the Singapore Integrative Omics Study[10]

Institutions: Saw Swee Hock School of Public Health & Life Sciences Institute & Yong Loo Lin School of Medicine, National University of Singapore, Singapore; Baker IDI Heart and Diabetes Institute, Melbourne, Australia; Department of Medical Microbiology, University of Manitoba, Winnipeg, Canada; National Microbiology Laboratory, Winnipeg, Canada; Department of Biochemistry and Molecular Biology, The University of Melbourne, Melbourne, Australia; Molecular Engineering of Biological and Chemical System/Chemical Pharmaceutical Engineering, Singapore-Massachusetts Institute of Technology Alliance, Singapore; and others

Research: A Workflow for Selection of Single Nucleotide Polymorphic Markers for Studying of Genetics of Ischemic Stroke Outcomes[11]

Institutions: Institute of Molecular Genetics of National Research Centre, Moscow, Russia

Theoretical properties of minimax linkage[edit]

Bien and Tibshirani (2011)[2] believed that Ao et al.’s proposed linkage had great potential as a tool for data analysis, and successfully showed that minimax linkage shared many of the desirable theoretical properties of the standard linkages while adding interpretative value. Thereby the minimax linkage criterion can increase the ease of interpretation.

Benchmarking the minimax linkage hierarchical clustering[edit]

Bien and Tibshirani (2011)[2] used two real datasets to demonstrate the appeal of using minimax linkage compared with other linkages.

Tai and Frisoli (2021)[3] reported that, similarly to Bien and Tibshirani (2011), minimax linkage often produced the smallest distances to prototypes, meaning that objects in a cluster were tightly clustered around their prototype. It was noted that this was true across a range of values for the total number of clusters (k).

References[edit]

  1. ^ a b c d Ao, Sio Iong; Yip, K.; Ng, M.; Cheung, D.; Fong, P.-Y.; Melhado, I.; Sham, P. C. (advance online: 2004-12-07). "CLUSTAG: hierarchical clustering and graph methods for selecting tag SNPs". Bioinformatics. 21 (8): 1735–1736.
  2. ^ a b c Bien, Jacob; Tibshirani, Robert (2011). "Hierarchical Clustering With Prototypes via Minimax Linkage". Journal of the American Statistical Association. 106 (495): 1075–1084. doi:10.1198/jasa.2011.tm10183. PMC 4527350. PMID 26257451.
  3. ^ a b Tai, Xiao Hui; Frisoli, Kayla. "Benchmarking Minimax Linkage in Hierarchical Clustering". In: Data Analysis and Rationality in a Complex World, Springer, 2021.
  4. ^ Barnsley, M. F. 1988. Fractals everywhere. Academic Press.
  5. ^ Wucklidge, W. 1996. Efficient visual recognition using the Hausdorff distance. Springer.
  6. ^ Sham, P.C.; Ao, Sio Iong; Kwan, J.S.H.; Kao, P.; Cheung, F.; Fong, P.Y.; Ng, M.K. (advance online: 2006-10-23). “Combining functional and linkage disequilibrium information in the selection of tag SNPs”. Bioinformatics, Volume 23, Issue 1, January 2007, Pages 129–131.
  7. ^ Brookes, K.; Xu, X.; Chen, W.; Zhou, K.; Neale, B.; Lowe, N.; Aneey, R.; Franke, B.; Gill, M.; Ebstein, R.; Buitelaar, J.; Sham, P.; Campbell, D.; Knight, J.; Andreou, P.; Altink, M.; Arnold, R.; Boer, F.; Buschgens, C.; Butler, L.; Christiansen, H. et al. “The Analysis of 51 Genes in DSM-IV Combined Type Attention Deficit Hyperactivity Disorder: Association Signals in DRD4, DAT1 and 16 Other Genes”. Molecular Psychiatry, 2006, 11, pages 934-953.
  8. ^ Tang, Clara Sze-Man; Tang, Wai-Kiu; So, Man-Ting; Miao, Xiao-Ping; Leung,  Brian Man-Chun; Yip,  Benjamin Hon-Kei; Leon,  Thomas Yuk-Yu; Ngan, Elly Sau-Wai; Lui,  Vincent Chi-Hang; Chen, Yan; Chan,  Ivy Hau-Yee et al. “Fine Mapping of the NRG1 Hirschsprung's Disease Locus”. PLoS ONE 2011, 6(1): e16181.
  9. ^ Hurgobin, Bhavna; Edwards, David. “SNP Discovery Using a Pangenome: Has the Single Reference Approach Become Obsolete?”. Medical Scientists: Biology 2017, 6, 21.
  10. ^ Saw, Woei-Yuh; Tantoso, Erwin; Begum, Husna; Zhou, Lihan; Zou, Ruiyang; He, Cheng; Chan, Sze Ling; Tan, Linda Wei-Lin; Wong, Lai-Ping; Xu, Wenting; Moong, Don Kyin New; Lim, Yenly; Li, Bowen; Pillai, Nisha Esakimuthu; Peterson, Trevor A. et al. “Establishing Multiple Omics Baselines for Three Southeast Asian Populations in the Singapore Integrative Omics Study”. Nature Communications 2017, volume 8, Article number: 653.
  11. ^ Khvorykh, Gennady; Khrunin,  Andrey; Filippenkov, Ivan; Stavchansky, Vasily; Dergunova, Lyudmila; Limborska,   Svetlana. “A Workflow for Selection of Single Nucleotide Polymorphic Markers for Studying of Genetics of Ischemic Stroke Outcomes”. Genes 2021, 12, 328.