
Truth discovery


Truth discovery (also known as truth finding) is the process of choosing the actual true value for a data item when different data sources provide conflicting information on it. Several algorithms have been proposed to tackle this problem, ranging from simple methods like majority voting to more complex ones able to estimate the trustworthiness of data sources.

Truth discovery problems can be divided into two sub-classes: single-truth and multi-truth. In the first case, only one true value is allowed for a data item (e.g. the birthday of a person or the capital city of a country), while in the second case multiple true values are allowed (e.g. the cast of a movie or the authors of a book).

Typically, truth discovery is the last step of a data integration pipeline, when the schemas of different data sources have been unified and the records referring to the same data item have been detected.

General principles

The abundance of data available on the web makes it more and more likely that different sources provide (partially or completely) different values for the same data item. This, together with our increasing reliance on data to derive important decisions, motivates the need to develop good truth discovery algorithms.

Many currently available methods rely on a voting strategy to define the true value of a data item. Nevertheless, recent studies have shown that, if we rely only on majority voting, we could get wrong results for as many as 30% of the data items.

The solution to this problem is to assess the trustworthiness of the sources and give more importance to votes coming from trusted sources.

Ideally, supervised learning techniques could be exploited to assign a reliability score to sources after hand-crafted labeling of the provided values; unfortunately, this is not feasible, since the number of labeled examples needed should be proportional to the number of sources, and in many applications the number of sources can be prohibitive.

Single-truth vs multi-truth discovery

Single-truth and multi-truth discovery are two very different problems.

Single-truth discovery is characterized by the following properties:

- only one true value is allowed for each data item;
- different values provided for a given data item oppose each other;
- values and sources can either be correct or erroneous.

In the multi-truth case, instead, the following properties hold:

- the truth is composed of a set of values;
- different values could provide a partial truth;
- claiming one value for a given data item does not imply opposing all the other values;
- the number of true values for each data item is not known a priori.

Multi-truth discovery has unique features that make the problem more complex and that should be taken into consideration when developing truth-discovery solutions.

The examples below point out the main differences between the two problems. Knowing that in both examples the truth is provided by source 1, in the single-truth case (first table) we can say that sources 2 and 3 oppose the truth and, as a result, provide wrong values. On the other hand, in the second case (second table), sources 2 and 3 are neither correct nor erroneous: they provide a subset of the true values and, at the same time, they do not oppose the truth.

When was George Washington born?

Source  Name               Birthdate
S1      George Washington  1732-02-22  Correct
S2      George Washington  1738-09-17  Erroneous
S3      George Washington  1734-10-23  Erroneous

Who wrote “The nature of space and time”?

Source  Title                         Authors
S1      The nature of space and time  Stephen Hawking, Roger Penrose  Correct
S2      The nature of space and time  Stephen Hawking                 Partial truth
S3      The nature of space and time  Roger Penrose                   Partial truth
S4      The nature of space and time  J. K. Rowling                   Erroneous

Source trustworthiness

The vast majority of truth discovery methods are based on a voting approach: each source votes for a value of a certain data item and, at the end, the value with the highest vote is selected as the true one. In the more sophisticated methods, votes do not have the same weight for all the data sources: more importance is given to votes coming from trusted sources.

Source trustworthiness is usually not known a priori but estimated with an iterative approach. At each step of the truth discovery algorithm, the trustworthiness score of each data source is refined, improving the assessment of the true values, which in turn leads to a better estimation of the trustworthiness of the sources. This process usually ends when all the values reach a convergence state.

Source trustworthiness can be based on different metrics, such as the accuracy of the provided values, the copying of values from other sources, and domain coverage.

Detecting copying behaviors is very important: copying allows false values to spread easily, which makes truth discovery very hard, since many sources would vote for the wrong values. Systems usually decrease the weight of votes associated with copied values, or do not count them at all.

Single-truth methods

Most of the currently available truth discovery methods have been designed to work well only in the single-truth case.

Some of the characteristics of the most relevant types of single-truth methods, and the ways in which different systems model source trustworthiness, are reported below.

Majority voting

Majority voting is the simplest method: the most popular value is selected as the true one. Majority voting is commonly used as a baseline when assessing the performance of more complex methods.

Web-link based

These methods estimate source trustworthiness using a technique similar to the one used to measure the authority of web pages based on web links. The vote assigned to a value is computed as the sum of the trustworthiness of the sources that provide that particular value, while the trustworthiness of a source is computed as the sum of the votes assigned to the values that the source provides.

Information-retrieval based

These methods estimate source trustworthiness using similarity measures typically used in information retrieval. Source trustworthiness is computed as the cosine similarity (or other similarity measures) between the set of values provided by the source and the set of values considered true (either selected in a probabilistic way or obtained from a ground truth).

Bayesian based

These methods use Bayesian inference to define the probability of a value being true conditioned on the values provided by all the sources:

$$P(v \mid \psi(o)) = \frac{P(\psi(o) \mid v) \cdot P(v)}{P(\psi(o))}$$

where $v$ is a value provided for a data item $o$ and $\psi(o)$ is the set of the observed values provided by all the sources for that specific data item.

The trustworthiness of a source is then computed based on the accuracy of the values that it provides. Other, more complex methods exploit Bayesian inference to detect copying behaviors and use these insights to better assess source trustworthiness.

Multi-truth methods

Due to its complexity, less attention has been devoted to the study of multi-truth discovery.

Two types of multi-truth methods and their characteristics are reported below.

Bayesian based

These methods use Bayesian inference to define the probability of a group of values being true conditioned on the values provided by all the data sources. In this case, since there could be multiple true values for each data item, and sources can provide multiple values for a single data item, it is not possible to consider values individually. An alternative is to consider mappings and relations between the sets of provided values and the sources providing them. The trustworthiness of a source is then computed based on the accuracy of the values that it provides.

More sophisticated methods also consider domain coverage and copying behaviors to better estimate source trustworthiness.

Probabilistic Graphical Models based

These methods use probabilistic graphical models to automatically define the set of true values of a given data item and also to assess source quality without the need for any supervision.

Applications

Many real-world applications can benefit from the use of truth discovery algorithms. Typical domains of application include healthcare, crowd/social sensing, crowdsourcing aggregation, information extraction and knowledge base construction.

Truth discovery algorithms could also be used to revolutionize the way in which web pages are ranked in search engines, going from current methods based on link analysis, like PageRank, to procedures that rank web pages based on the accuracy of the information they provide.

See also

- Data Integration
- Information Integration
- Data Fusion (data integration)
- Data Quality

References

1. Li, Yaliang; Gao, Jing; Meng, Chuishi; Li, Qi; Su, Lu; Zhao, Bo; Fan, Wei; Han, Jiawei (2016-02-25). "A Survey on Truth Discovery". ACM SIGKDD Explorations Newsletter. 17 (2): 1–16. doi:10.1145/2897350.2897352. S2CID 9060471.
2. Wang, Xianzhi; Sheng, Quan Z.; Fang, Xiu Susie; Yao, Lina; Xu, Xiaofei; Li, Xue (2015). "An Integrated Bayesian Approach for Effective Multi-Truth Discovery". Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. Melbourne, Australia: ACM Press. pp. 493–502. doi:10.1145/2806416.2806443. hdl:2440/110033. ISBN 9781450337946. S2CID 16207808.
3. Lin, Xueling; Chen, Lei (2018). "Domain-aware Multi-truth Discovery from Conflicting Sources". VLDB Endowment. 11 (5): 635–647. doi:10.1145/3187009.3177739.
4. Dong, Xin Luna; Srivastava, Divesh (2015-02-15). "Big Data Integration". Synthesis Lectures on Data Management. 7 (1): 1–198. doi:10.2200/S00578ED1V01Y201404DTM040. ISSN 2153-5418.
5. Li, Xian; Dong, Xin Luna; Lyons, Kenneth; Meng, Weiyi; Srivastava, Divesh (2012-12-01). "Truth finding on the deep web: is the problem solved?". Proceedings of the VLDB Endowment. 6 (2): 97–108. arXiv:1503.00303. doi:10.14778/2535568.2448943. S2CID 3133027.
6. Ng, Andrew Y; Jordan, Michael I. (2001). "On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes". Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic: 841–848.
7. Dong, Xin Luna; Berti-Equille, Laure; Srivastava, Divesh (2009-08-01). "Integrating conflicting data: the role of source dependence". Proceedings of the VLDB Endowment. 2 (1): 550–561. doi:10.14778/1687627.1687690. S2CID 9664056.
8. Kleinberg, Jon M. (1999-09-01). "Authoritative sources in a hyperlinked environment". Journal of the ACM. 46 (5): 604–632. doi:10.1145/324133.324140. S2CID 221584113.
9. Galland, Alban; Abiteboul, Serge; Marian, Amélie; Senellart, Pierre (2010). "Corroborating information from disagreeing views" (PDF). Proceedings of the third ACM international conference on Web search and data mining. New York, New York, USA: ACM Press. pp. 131–140. doi:10.1145/1718487.1718504. ISBN 9781605588896. S2CID 1761360.
10. Xiaoxin Yin; Jiawei Han; Yu, P.S. (2008). "Truth Discovery with Multiple Conflicting Information Providers on the Web". IEEE Transactions on Knowledge and Data Engineering. 20 (6): 796–808. doi:10.1109/TKDE.2007.190745. ISSN 1041-4347.
11. Zhao, Bo; Rubinstein, Benjamin I. P.; Gemmell, Jim; Han, Jiawei (2012-02-01). "A Bayesian approach to discovering truth from conflicting sources for data integration". Proceedings of the VLDB Endowment. 5 (6): 550–561. arXiv:1203.0058. doi:10.14778/2168651.2168656. S2CID 8837716.
12. "The huge implications of Google's idea to rank sites based on their accuracy". www.washingtonpost.com. 2015.
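
As a rough illustration of the voting ideas this article describes, the Python sketch below contrasts plain majority voting with an iterative scheme in which the vote of a value is the sum of the trust of the sources providing it, and the trust of a source is the sum of the votes of the values it provides. The toy claims, the function names `majority_vote` and `iterative_vote`, the normalization by the maximum score, and the fixed number of rounds (used here instead of a convergence test) are all assumptions made for the example; it is not the implementation of any particular published system.

```python
from collections import Counter, defaultdict

# Hypothetical claims as (source, data item, value) triples; illustrative only.
CLAIMS = [
    ("S1", "item1", "a"), ("S2", "item1", "b"),
    ("S1", "item2", "c"), ("S2", "item2", "d"), ("S3", "item2", "c"),
    ("S1", "item3", "e"), ("S2", "item3", "f"), ("S3", "item3", "e"),
]


def majority_vote(claims):
    """Pick, for each data item, the value claimed by the most sources."""
    counts = defaultdict(Counter)
    for _source, item, value in claims:
        counts[item][value] += 1
    return {item: c.most_common(1)[0][0] for item, c in counts.items()}


def iterative_vote(claims, iterations=20):
    """Alternate between value votes and source trust for a fixed number of rounds."""
    trust = {source: 1.0 for source, _, _ in claims}
    vote = {}
    for _ in range(iterations):
        # Vote of a value: sum of the trust of the sources that provide it.
        raw_vote = defaultdict(float)
        for source, item, value in claims:
            raw_vote[(item, value)] += trust[source]
        top = max(raw_vote.values())
        vote = {key: v / top for key, v in raw_vote.items()}
        # Trust of a source: sum of the votes of the values it provides.
        raw_trust = defaultdict(float)
        for source, item, value in claims:
            raw_trust[source] += vote[(item, value)]
        top = max(raw_trust.values())
        trust = {source: t / top for source, t in raw_trust.items()}
    # For each item keep the value with the highest final vote.
    best = {}
    for (item, value), score in vote.items():
        if item not in best or score > best[item][1]:
            best[item] = (value, score)
    return {item: value for item, (value, _score) in best.items()}, trust


majority = majority_vote(CLAIMS)
truths, trust = iterative_vote(CLAIMS)
print(majority)
print(truths)
print(trust)
```

In this toy data, S1 and S2 are tied on `item1` under majority voting, whereas the iterative scheme favors the value from S1 because S1 agrees with another source on the remaining items and therefore accumulates more trust.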

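
The information-retrieval style of trust estimation mentioned in this article can be sketched in a few lines of Python. The example assumes that each value set is treated as a binary vector, so that the cosine similarity reduces to the size of the intersection divided by the square root of the product of the set sizes. The helper name `cosine_trust` and the choice to return 0.0 for empty sets are hypothetical; the data reuses the book-authorship example from the article, with the two real authors taken as the accepted set.

```python
import math


def cosine_trust(provided, accepted):
    """Cosine similarity between two value sets, treated as binary vectors."""
    provided, accepted = set(provided), set(accepted)
    if not provided or not accepted:
        return 0.0  # assumption: an empty set carries no evidence
    shared = len(provided & accepted)
    return shared / math.sqrt(len(provided) * len(accepted))


# Values currently considered true for the data item (authors of the book).
accepted = {"Stephen Hawking", "Roger Penrose"}
claims_by_source = {
    "S1": {"Stephen Hawking", "Roger Penrose"},
    "S2": {"Stephen Hawking"},
    "S3": {"Roger Penrose"},
    "S4": {"J. K. Rowling"},
}
scores = {s: cosine_trust(v, accepted) for s, v in claims_by_source.items()}
print(scores)
```

Under these assumptions a source that provides a subset of the true values receives a score strictly between 0 and 1, which matches the notion of a partial truth in the multi-truth setting.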

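
To make the Bayesian formulation $P(v \mid \psi(o)) = P(\psi(o) \mid v) \cdot P(v) / P(\psi(o))$ concrete, here is a minimal worked example in Python for the single-truth case. It rests on several assumptions that are not stated in the article: sources are independent, each source has a known accuracy, a wrong source picks uniformly among the other candidate values, and the prior over candidates is uniform. The function name `posterior` and the accuracy figures are hypothetical.

```python
def posterior(observations, accuracy):
    """Posterior over candidate values via Bayes' rule with a uniform prior."""
    candidates = sorted(set(observations.values()))
    n_false = max(len(candidates) - 1, 1)
    prior = 1.0 / len(candidates)
    joint = {}
    for v in candidates:
        likelihood = 1.0
        for source, claimed in observations.items():
            a = accuracy[source]
            # Assumption: a wrong source picks uniformly among the other candidates.
            likelihood *= a if claimed == v else (1.0 - a) / n_false
        joint[v] = likelihood * prior  # P(psi(o) | v) * P(v)
    evidence = sum(joint.values())     # P(psi(o))
    return {v: p / evidence for v, p in joint.items()}


# Observed values for one data item, with made-up source accuracies.
observations = {"S1": "1732-02-22", "S2": "1738-09-17", "S3": "1734-10-23"}
accuracy = {"S1": 0.9, "S2": 0.6, "S3": 0.6}
post = posterior(observations, accuracy)
print(post)
```

With these numbers the likelihoods are 0.9 · 0.2 · 0.2 = 0.036 for the value from S1 and 0.05 · 0.6 · 0.2 = 0.006 for each of the other two, so the value from S1 receives a posterior of 0.036 / 0.048 = 0.75.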
Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.
