
Dasgupta's objective


In the study of hierarchical clustering, Dasgupta's objective is a measure of the quality of a clustering, defined from a similarity measure on the elements to be clustered. It is named after Sanjoy Dasgupta, who formulated it in 2016. Its key property is that, when the similarity comes from an ultrametric space, the optimal clustering for this quality measure follows the underlying structure of the ultrametric space. In this sense, clustering methods that produce good clusterings for this objective can be expected to approximate the ground truth underlying the given similarity measure.
In Dasgupta's formulation, the input to a clustering problem consists of similarity scores between certain pairs of elements, represented as an undirected graph G = (V, E), with the elements as its vertices and with non-negative real weights on its edges. Large weights indicate elements that should be considered more similar to each other, while small weights or missing edges indicate pairs of elements that are not similar. A hierarchical clustering can be described as a tree (not necessarily a binary tree) whose leaves are the elements to be clustered; the clusters are then the subsets of elements descending from each tree node, and the size |C| of any cluster C is its number of elements. For each edge uv of the input graph, let w(uv) denote the weight of edge uv and let C(uv) denote the smallest cluster of a given clustering that contains both u and v. Then Dasgupta defines the cost of a clustering to be

    \sum_{uv \in E} w(uv) \cdot |C(uv)|.
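To make the definition concrete, the following is a minimal sketch that evaluates this cost directly from the definition. The encoding choices are assumptions of this illustration, not constructs from Dasgupta's paper: the graph is a map from unordered vertex pairs to weights, the cluster tree is a nested tuple whose leaves are the elements, and the names leaves and dasgupta_cost are illustrative.

    from itertools import combinations

    def leaves(node):
        """Set of elements at or below a tree node (a leaf is a bare element)."""
        if not isinstance(node, tuple):
            return {node}
        result = set()
        for child in node:
            result |= leaves(child)
        return result

    def dasgupta_cost(tree, weights):
        """Sum over graph edges uv of w(uv) * |C(uv)|, where C(uv) is the
        smallest cluster of the tree containing both u and v.
        `weights` maps frozenset({u, v}) to a non-negative weight."""
        if not isinstance(tree, tuple):
            return 0.0
        child_sets = [leaves(child) for child in tree]
        size = sum(len(s) for s in child_sets)  # |C| for edges first merged here
        total = 0.0
        # An edge uv is charged at the node where u and v first share a cluster.
        for s, t in combinations(child_sets, 2):
            for u in s:
                for v in t:
                    total += weights.get(frozenset((u, v)), 0.0) * size
        return total + sum(dasgupta_cost(child, weights) for child in tree)

    # Example: path graph a-b-c with unit edge weights on ab and bc.
    w = {frozenset("ab"): 1.0, frozenset("bc"): 1.0}
    print(dasgupta_cost((("a", "b"), "c"), w))  # ab pays |{a,b}| = 2, bc pays 3 -> 5.0
    print(dasgupta_cost(("a", ("b", "c")), w))  # bc pays 2, ab pays 3 -> 5.0

On this three-vertex path, both binary hierarchies cost 5, since whichever edge survives to the root is charged the full cluster size of 3; merging all three elements at once in a single node would charge both edges 3, for a worse cost of 6.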
The optimal clustering for this objective is NP-hard to find. However, it is possible to find a clustering that approximates the minimum value of the objective in polynomial time by a divisive (top-down) clustering algorithm that repeatedly subdivides the elements using an approximation algorithm for the sparsest cut problem, the problem of finding a partition that minimizes the ratio of the total weight of cut edges to the total number of cut pairs. Equivalently, for purposes of approximation, one may minimize the ratio of the total weight of cut edges to the number of elements on the smaller side of the cut. Using the best known approximation for the sparsest cut problem, the approximation ratio of this approach is O(√log n).
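As an illustration of the structure of this divisive scheme, the toy sketch below recursively splits each part along a sparsest cut and recurses on both sides. It is an assumption of this rewrite, not code from the cited papers: in particular, the brute-force sparsest_cut here takes exponential time, standing in for the polynomial-time approximation (such as the Arora–Rao–Vazirani algorithm cited below) on which the O(√log n) guarantee relies.

    from itertools import combinations

    def cut_weight(s, t, weights):
        """Total weight of edges crossing between vertex sets s and t."""
        return sum(weights.get(frozenset((u, v)), 0.0) for u in s for v in t)

    def sparsest_cut(vertices, weights):
        """Exact sparsest cut by brute force: minimize the ratio of cut weight
        to the size of the smaller side. Exponential time; a stand-in for a
        polynomial-time approximation algorithm."""
        vs = list(vertices)
        anchor, rest = vs[0], vs[1:]
        best_ratio, best_side = float("inf"), None
        # Every cut can be written with the anchor on the first side.
        for r in range(len(rest)):  # r = how many of the rest join the anchor
            for side in combinations(rest, r):
                s = set(side) | {anchor}
                t = set(vs) - s
                ratio = cut_weight(s, t, weights) / min(len(s), len(t))
                if ratio < best_ratio:
                    best_ratio, best_side = ratio, s
        return best_side

    def divisive_clustering(vertices, weights):
        """Build a hierarchy top-down by recursively splitting along sparse cuts."""
        vertices = set(vertices)
        if len(vertices) == 1:
            return next(iter(vertices))
        s = sparsest_cut(vertices, weights)
        return (divisive_clustering(s, weights),
                divisive_clustering(vertices - s, weights))

    # Example: path a-b-c-d, where the light c-d edge is the cheapest to cut first.
    w = {frozenset("ab"): 3.0, frozenset("bc"): 3.0, frozenset("cd"): 1.0}
    print(divisive_clustering("abcd", w))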
References

Arora, Sanjeev; Rao, Satish; Vazirani, Umesh (2009), "Expander flows, geometric embeddings and graph partitioning", Journal of the ACM, 56 (2), doi:10.1145/1502793.1502794, MR 2535878, S2CID 263871111.

Cohen-Addad, Vincent; Kanade, Varun; Mallmann-Trenn, Frederik; Mathieu, Claire (2018), "Hierarchical clustering: objective functions and algorithms", Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2018), Philadelphia, Pennsylvania: Society for Industrial and Applied Mathematics, pp. 378–397, arXiv:1704.02147, doi:10.1137/1.9781611975031.26, MR 3775814.

Dasgupta, Sanjoy (2016), "A cost function for similarity-based hierarchical clustering", Proceedings of the 48th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2016), New York, New York: ACM, pp. 118–127, arXiv:1510.05043, doi:10.1145/2897518.2897527, MR 3536559, S2CID 2262168.
