Additive smoothing

Statistical technique for smoothing categorical data

For the image processing technique, see Laplacian smoothing.

In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth count data, eliminating issues caused by certain values having 0 occurrences. Given a set of observation counts {\displaystyle \mathbf {x} =\langle x_{1},x_{2},\ldots ,x_{d}\rangle } from a {\displaystyle d}-dimensional multinomial distribution with {\displaystyle N} trials, a "smoothed" version of the counts gives the estimator

{\displaystyle {\hat {\theta }}_{i}={\frac {x_{i}+\alpha }{N+\alpha d}}\qquad (i=1,\ldots ,d),}

where the smoothed count {\displaystyle {\hat {x}}_{i}=N{\hat {\theta }}_{i}} and the "pseudocount" α > 0 is a smoothing parameter, with α = 0 corresponding to no smoothing (this parameter is explained in § Pseudocount below). Additive smoothing is a type of shrinkage estimator, as the resulting estimate will be between the empirical probability (relative frequency) {\displaystyle x_{i}/N} and the uniform probability {\displaystyle 1/d}. Invoking Laplace's rule of succession, some authors have argued that α should be 1 (in which case the term add-one smoothing is also used), though in practice a smaller value is typically chosen.

From a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter α as a prior distribution. In the special case where the number of categories is 2, this is equivalent to using a beta distribution as the conjugate prior for the parameters of the binomial distribution.
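The estimator above is straightforward to compute directly; the following minimal Python sketch (not part of the original article; the function name and example counts are illustrative) shows it on a data set where one category was never observed:

    # Additive (Laplace/Lidstone) smoothing of categorical counts.
    def additive_smoothing(counts, alpha=1.0):
        """Estimate theta_i = (x_i + alpha) / (N + alpha * d) for each category."""
        n = sum(counts)   # total number of trials N
        d = len(counts)   # number of categories
        return [(x + alpha) / (n + alpha * d) for x in counts]

    counts = [0, 3, 7]    # category 0 was never observed
    print(additive_smoothing(counts, alpha=1.0))
    # [0.0769..., 0.3076..., 0.6153...] -- no category has probability zero

With α = 0 the function returns the empirical frequencies, and larger α pulls every estimate toward the uniform probability 1/d, as described above.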
History

Laplace came up with this smoothing technique when he tried to estimate the chance that the sun will rise tomorrow. His rationale was that even given a large sample of days with the rising sun, we still cannot be completely sure that the sun will still rise tomorrow (known as the sunrise problem).
Pseudocount

A pseudocount is an amount (not generally an integer, despite its name) added to the number of observed cases in order to change the expected probability in a model of those data, when not known to be zero. It is so named because, roughly speaking, a pseudo-count of value α weighs into the posterior distribution similarly to each category having an additional count of α. If the frequency of each item {\displaystyle i} is {\displaystyle x_{i}} out of {\displaystyle N} samples, the empirical probability of event {\displaystyle i} is

{\displaystyle p_{i,{\text{empirical}}}={\frac {x_{i}}{N}},}

but the posterior probability when additively smoothed is

{\displaystyle p_{i,\alpha {\text{-smoothed}}}={\frac {x_{i}+\alpha }{N+\alpha d}},}

as if to increase each count {\displaystyle x_{i}} by α a priori.
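For example (an illustrative calculation added here for concreteness, not part of the original text): if a coin has come up tails in all of N = 4 observed flips (d = 2, counts ⟨0, 4⟩), the empirical probability of heads is 0, while the α = 1 smoothed probability is (0 + 1)/(4 + 2) = 1/6 ≈ 0.167, as if one head and one tail had been observed in advance.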
Depending on the prior knowledge, which is sometimes a subjective value, a pseudocount may have any non-negative finite value. It may only be zero (or the possibility ignored) if impossible by definition, such as the possibility of a decimal digit of π being a letter, or a physical possibility that would be rejected and so not counted, such as a computer printing a letter when a valid program for π is run, or excluded and not counted because of no interest, such as if only interested in the zeros and ones. Generally, there is also a possibility that no value may be computable or observable in a finite time (see the halting problem). But at least one possibility must have a non-zero pseudocount, otherwise no prediction could be computed before the first observation. The relative values of pseudocounts represent the relative prior expected probabilities of their possibilities. The sum of the pseudocounts, which may be very large, represents the estimated weight of the prior knowledge compared with all the actual observations (one for each) when determining the expected probability.

In any observed data set or sample there is the possibility, especially with low-probability events and with small data sets, of a possible event not occurring. Its observed frequency is therefore zero, apparently implying a probability of zero. This oversimplification is inaccurate and often unhelpful, particularly in probability-based machine learning techniques such as artificial neural networks and hidden Markov models. By artificially adjusting the probability of rare (but not impossible) events so those probabilities are not exactly zero, zero-frequency problems are avoided. Also see Cromwell's rule.

The simplest approach is to add one to each observed number of events, including the zero-count possibilities. This is sometimes called Laplace's rule of succession. This approach is equivalent to assuming a uniform prior distribution over the probabilities for each possible event (spanning the simplex where each probability is between 0 and 1, and they all sum to 1).

Using the Jeffreys prior approach, a pseudocount of one half should be added to each possible outcome.

Pseudocounts should be set to one only when there is no prior knowledge at all – see the principle of indifference. However, given appropriate prior knowledge, the sum should be adjusted in proportion to the expectation that the prior probabilities should be considered correct, despite evidence to the contrary – see further analysis. Higher values are appropriate inasmuch as there is prior knowledge of the true values (for a mint-condition coin, say); lower values inasmuch as there is prior knowledge that there is probable bias, but of unknown degree (for a bent coin, say).

A more complex approach is to estimate the probability of the events from other factors and adjust accordingly.
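The effect of different pseudocount values can be seen in a short sketch (illustrative, not from the article); the values 0.5 and 1 correspond to the Jeffreys-prior and add-one choices discussed above:

    # Effect of the pseudocount alpha on a binomial estimate:
    # 3 heads out of 10 flips, smoothed toward the uniform value 1/2.
    def smoothed_heads(heads, flips, alpha):
        return (heads + alpha) / (flips + 2 * alpha)

    for alpha in (0.5, 1.0, 10.0):   # Jeffreys, add-one, strong prior
        print(alpha, smoothed_heads(3, 10, alpha))
    # 0.5 -> 0.3181..., 1.0 -> 0.3333..., 10.0 -> 0.4333...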
Examples

One way to motivate pseudocounts, particularly for binomial data, is via a formula for the midpoint of an interval estimate, particularly a binomial proportion confidence interval. The best-known is due to Edwin Bidwell Wilson, in Wilson (1927): the midpoint of the Wilson score interval corresponding to {\displaystyle z} standard deviations on either side is

{\displaystyle {\frac {n_{S}+z^{2}/2}{n+z^{2}}}.}

Taking {\displaystyle z=2} standard deviations to approximate a 95% confidence interval ({\displaystyle z\approx 1.96}) yields a pseudocount of 2 for each outcome, so 4 in total, colloquially known as the "plus four rule":

{\displaystyle {\frac {n_{S}+2}{n+4}}.}

This is also the midpoint of the Agresti–Coull interval (Agresti & Coull 1998).
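A small sketch (illustrative; the function name is invented here) of the midpoint formula and its plus-four special case:

    # Midpoint of the Wilson score interval, (n_S + z^2/2) / (n + z^2),
    # which acts like a pseudocount of z^2/2 per outcome.
    def wilson_midpoint(n_successes, n, z=1.96):
        return (n_successes + z * z / 2) / (n + z * z)

    # 0 successes in 10 trials: the empirical proportion is 0, but the
    # "plus four rule" (z = 2) gives (0 + 2) / (10 + 4) = 1/7 ~ 0.143.
    print(wilson_midpoint(0, 10, z=2.0))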
Generalized to the case of known incidence rates

Often the bias of an unknown trial population is tested against a control population with known parameters (incidence rates) {\displaystyle {\boldsymbol {\mu }}=\langle \mu _{1},\mu _{2},\ldots ,\mu _{d}\rangle .} In this case the uniform probability {\displaystyle 1/d} should be replaced by the known incidence rate of the control population {\displaystyle \mu _{i}} to calculate the smoothed estimator:

{\displaystyle {\hat {\theta }}_{i}={\frac {x_{i}+\mu _{i}\alpha d}{N+\alpha d}}\qquad (i=1,\ldots ,d).}

As a consistency check, if the empirical estimator happens to equal the incidence rate, i.e. {\displaystyle \mu _{i}=x_{i}/N,} the smoothed estimator is independent of α and also equals the incidence rate.
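A minimal sketch of the generalized estimator (illustrative names and data, not from the article), including the consistency check described above:

    # Additive smoothing toward known incidence rates mu_i instead of
    # the uniform distribution 1/d.
    def smooth_with_rates(counts, rates, alpha=1.0):
        """Estimate theta_i = (x_i + mu_i * alpha * d) / (N + alpha * d)."""
        n, d = sum(counts), len(counts)
        return [(x + mu * alpha * d) / (n + alpha * d)
                for x, mu in zip(counts, rates)]

    counts = [2, 3, 5]        # N = 10, so x_i / N = (0.2, 0.3, 0.5)
    rates = [0.2, 0.3, 0.5]   # control-population incidence rates
    # Since x_i / N equals mu_i here, the result is (0.2, 0.3, 0.5)
    # for any alpha, matching the consistency check.
    print(smooth_with_rates(counts, rates, alpha=7.0))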
Applications

Classification

Additive smoothing is commonly a component of naive Bayes classifiers.

Statistical language modelling

In a bag of words model of natural language processing and information retrieval, the data consists of the number of occurrences of each word in a document. Additive smoothing allows the assignment of non-zero probabilities to words which do not occur in the sample. Recent studies have shown that additive smoothing is more effective than other probability smoothing methods in several retrieval tasks such as language-model-based pseudo-relevance feedback and recommender systems.
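As a sketch of the language-modelling use (a toy example with an invented corpus, not taken from the cited studies), Laplace-smoothed unigram probabilities in a bag of words model can be computed as:

    from collections import Counter

    # Laplace-smoothed unigram probabilities:
    # P(w) = (count(w) + alpha) / (N + alpha * |V|).
    def unigram_probs(words, vocabulary, alpha=1.0):
        counts = Counter(words)
        n, v = len(words), len(vocabulary)
        return {w: (counts[w] + alpha) / (n + alpha * v) for w in vocabulary}

    vocab = {"the", "cat", "sat", "mat", "dog"}
    doc = ["the", "cat", "sat", "the", "mat"]
    probs = unigram_probs(doc, vocab, alpha=1.0)
    print(probs["dog"])   # 0.1 -- the unseen word still gets non-zero probability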
See also

Bayesian average
Prediction by partial matching
Categorical distribution
References

C. D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, p. 260.
Jurafsky, Daniel; Martin, James H. (June 2008). Speech and Language Processing (2nd ed.). Prentice Hall. p. 132. ISBN 978-0-13-187321-6.
Russell, Stuart; Norvig, Peter (2010). Artificial Intelligence: A Modern Approach (2nd ed.). Pearson Education, Inc. p. 863.
Lecture 5 | Machine Learning (Stanford) at 1h10m into the lecture.
Hazimeh, Hussein; Zhai, ChengXiang. "Axiomatic Analysis of Smoothing Methods in Language Models for Pseudo-Relevance Feedback". ICTIR '15 Proceedings of the 2015 International Conference on the Theory of Information Retrieval.
Valcarce, Daniel; Parapar, Javier; Barreiro, Álvaro. "Additive Smoothing for Relevance-Based Language Modelling of Recommender Systems". CERI '16 Proceedings of the 4th Spanish Conference on Information Retrieval.

Sources

Wilson, E. B. (1927). "Probable inference, the law of succession, and statistical inference". Journal of the American Statistical Association. 22 (158): 209–212. doi:10.1080/01621459.1927.10502953. JSTOR 2276774.
Agresti, Alan; Coull, Brent A. (1998). "Approximate is better than 'exact' for interval estimation of binomial proportions". The American Statistician. 52 (2): 119–126. doi:10.2307/2685469. JSTOR 2685469. MR 1628435.
External links

SF Chen, J Goodman (1996). "An empirical study of smoothing techniques for language modeling". Proceedings of the 34th annual meeting on Association for Computational Linguistics.
Pseudocounts
Bayesian interpretation of pseudocount regularizers
A video explaining the use of Additive smoothing in a Naïve Bayes classifier