Additive smoothing

Statistical technique for smoothing categorical data

For the image processing technique, see Laplacian smoothing.

In statistics, additive smoothing, also called Laplace smoothing or Lidstone smoothing, is a technique used to smooth count data, eliminating issues caused by certain values having 0 occurrences. Given a set of observation counts {\displaystyle \mathbf {x} =\langle x_{1},x_{2},\ldots ,x_{d}\rangle } from a {\displaystyle d}-dimensional multinomial distribution with {\displaystyle N} trials, a "smoothed" version of the counts gives the estimator

{\displaystyle {\hat {\theta }}_{i}={\frac {x_{i}+\alpha }{N+\alpha d}}\qquad (i=1,\ldots ,d),}

where the smoothed count {\displaystyle {\hat {x}}_{i}=N{\hat {\theta }}_{i}} and the "pseudocount" α > 0 is a smoothing parameter, with α = 0 corresponding to no smoothing (this parameter is explained in § Pseudocount below). Additive smoothing is a type of shrinkage estimator, as the resulting estimate will be between the empirical probability (relative frequency) {\displaystyle x_{i}/N} and the uniform probability {\displaystyle 1/d}. Invoking Laplace's rule of succession, some authors have argued that α should be 1 (in which case the term add-one smoothing is also used), though in practice a smaller value is typically chosen.

From a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter α as a prior distribution. In the special case where the number of categories is 2, this is equivalent to using a beta distribution as the conjugate prior for the parameters of the binomial distribution.
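The estimator above is straightforward to compute directly; the following minimal Python sketch (not part of the original article; the function name and example counts are illustrative) shows it on a data set where one category was never observed:

    # Additive (Laplace/Lidstone) smoothing of categorical counts.
    def additive_smoothing(counts, alpha=1.0):
        """Estimate theta_i = (x_i + alpha) / (N + alpha * d) for each category."""
        n = sum(counts)   # total number of trials N
        d = len(counts)   # number of categories
        return [(x + alpha) / (n + alpha * d) for x in counts]

    counts = [0, 3, 7]    # category 0 was never observed
    print(additive_smoothing(counts, alpha=1.0))
    # [0.0769..., 0.3076..., 0.6153...] -- no category has probability zero

With α = 0 the function returns the empirical frequencies, and larger α pulls every estimate toward the uniform probability 1/d, as described above.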
History

Laplace came up with this smoothing technique when he tried to estimate the chance that the sun will rise tomorrow. His rationale was that even given a large sample of days with the rising sun, we still cannot be completely sure that the sun will still rise tomorrow (known as the sunrise problem).
Pseudocount

A pseudocount is an amount (not generally an integer, despite its name) added to the number of observed cases in order to change the expected probability in a model of those data, when not known to be zero. It is so named because, roughly speaking, a pseudo-count of value α weighs into the posterior distribution similarly to each category having an additional count of α. If the frequency of each item {\displaystyle i} is {\displaystyle x_{i}} out of {\displaystyle N} samples, the empirical probability of event {\displaystyle i} is

{\displaystyle p_{i,{\text{empirical}}}={\frac {x_{i}}{N}},}

but the posterior probability when additively smoothed is

{\displaystyle p_{i,\alpha {\text{-smoothed}}}={\frac {x_{i}+\alpha }{N+\alpha d}},}

as if to increase each count {\displaystyle x_{i}} by α a priori.
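For example (an illustrative calculation added here for concreteness, not part of the original text): if a coin has come up tails in all of N = 4 observed flips (d = 2, counts ⟨0, 4⟩), the empirical probability of heads is 0, while the α = 1 smoothed probability is (0 + 1)/(4 + 2) = 1/6 ≈ 0.167, as if one head and one tail had been observed in advance.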
Depending on the prior knowledge, which is sometimes a subjective value, a pseudocount may have any non-negative finite value. It may only be zero (or the possibility ignored) if impossible by definition, such as the possibility of a decimal digit of π being a letter, or a physical possibility that would be rejected and so not counted, such as a computer printing a letter when a valid program for π is run, or excluded and not counted because of no interest, such as if only interested in the zeros and ones. Generally, there is also a possibility that no value may be computable or observable in a finite time (see the halting problem). But at least one possibility must have a non-zero pseudocount, otherwise no prediction could be computed before the first observation. The relative values of pseudocounts represent the relative prior expected probabilities of their possibilities. The sum of the pseudocounts, which may be very large, represents the estimated weight of the prior knowledge compared with all the actual observations (one for each) when determining the expected probability.

In any observed data set or sample there is the possibility, especially with low-probability events and with small data sets, of a possible event not occurring. Its observed frequency is therefore zero, apparently implying a probability of zero. This oversimplification is inaccurate and often unhelpful, particularly in probability-based machine learning techniques such as artificial neural networks and hidden Markov models. By artificially adjusting the probability of rare (but not impossible) events so those probabilities are not exactly zero, zero-frequency problems are avoided. Also see Cromwell's rule.

The simplest approach is to add one to each observed number of events, including the zero-count possibilities. This is sometimes called Laplace's rule of succession. This approach is equivalent to assuming a uniform prior distribution over the probabilities for each possible event (spanning the simplex where each probability is between 0 and 1, and they all sum to 1).

Using the Jeffreys prior approach, a pseudocount of one half should be added to each possible outcome.

Pseudocounts should be set to one only when there is no prior knowledge at all – see the principle of indifference. However, given appropriate prior knowledge, the sum should be adjusted in proportion to the expectation that the prior probabilities should be considered correct, despite evidence to the contrary – see further analysis. Higher values are appropriate inasmuch as there is prior knowledge of the true values (for a mint-condition coin, say); lower values inasmuch as there is prior knowledge that there is probable bias, but of unknown degree (for a bent coin, say).

A more complex approach is to estimate the probability of the events from other factors and adjust accordingly.
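The effect of different pseudocount values can be seen in a short sketch (illustrative, not from the article); the values 0.5 and 1 correspond to the Jeffreys-prior and add-one choices discussed above:

    # Effect of the pseudocount alpha on a binomial estimate:
    # 3 heads out of 10 flips, smoothed toward the uniform value 1/2.
    def smoothed_heads(heads, flips, alpha):
        return (heads + alpha) / (flips + 2 * alpha)

    for alpha in (0.5, 1.0, 10.0):   # Jeffreys, add-one, strong prior
        print(alpha, smoothed_heads(3, 10, alpha))
    # 0.5 -> 0.3181..., 1.0 -> 0.3333..., 10.0 -> 0.4333...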
Examples

One way to motivate pseudocounts, particularly for binomial data, is via a formula for the midpoint of an interval estimate, particularly a binomial proportion confidence interval. The best-known is due to Edwin Bidwell Wilson, in Wilson (1927): the midpoint of the Wilson score interval corresponding to {\displaystyle z} standard deviations on either side is

{\displaystyle {\frac {n_{S}+z^{2}/2}{n+z^{2}}}.}

Taking {\displaystyle z=2} standard deviations to approximate a 95% confidence interval ({\displaystyle z\approx 1.96}) yields a pseudocount of 2 for each outcome, so 4 in total, colloquially known as the "plus four rule":

{\displaystyle {\frac {n_{S}+2}{n+4}}.}

This is also the midpoint of the Agresti–Coull interval (Agresti & Coull 1998).
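A small sketch (illustrative; the function name is invented here) of the midpoint formula and its plus-four special case:

    # Midpoint of the Wilson score interval, (n_S + z^2/2) / (n + z^2),
    # which acts like a pseudocount of z^2/2 per outcome.
    def wilson_midpoint(n_successes, n, z=1.96):
        return (n_successes + z * z / 2) / (n + z * z)

    # 0 successes in 10 trials: the empirical proportion is 0, but the
    # "plus four rule" (z = 2) gives (0 + 2) / (10 + 4) = 1/7 ~ 0.143.
    print(wilson_midpoint(0, 10, z=2.0))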
Generalized to the case of known incidence rates

Often the bias of an unknown trial population is tested against a control population with known parameters (incidence rates) {\displaystyle {\boldsymbol {\mu }}=\langle \mu _{1},\mu _{2},\ldots ,\mu _{d}\rangle .} In this case the uniform probability {\displaystyle 1/d} should be replaced by the known incidence rate of the control population {\displaystyle \mu _{i}} to calculate the smoothed estimator:

{\displaystyle {\hat {\theta }}_{i}={\frac {x_{i}+\mu _{i}\alpha d}{N+\alpha d}}\qquad (i=1,\ldots ,d).}

As a consistency check, if the empirical estimator happens to equal the incidence rate, i.e. {\displaystyle \mu _{i}=x_{i}/N,} the smoothed estimator is independent of α and also equals the incidence rate.
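A minimal sketch of the generalized estimator (illustrative names and data, not from the article), including the consistency check described above:

    # Additive smoothing toward known incidence rates mu_i instead of
    # the uniform distribution 1/d.
    def smooth_with_rates(counts, rates, alpha=1.0):
        """Estimate theta_i = (x_i + mu_i * alpha * d) / (N + alpha * d)."""
        n, d = sum(counts), len(counts)
        return [(x + mu * alpha * d) / (n + alpha * d)
                for x, mu in zip(counts, rates)]

    counts = [2, 3, 5]        # N = 10, so x_i / N = (0.2, 0.3, 0.5)
    rates = [0.2, 0.3, 0.5]   # control-population incidence rates
    # Since x_i / N equals mu_i here, the result is (0.2, 0.3, 0.5)
    # for any alpha, matching the consistency check.
    print(smooth_with_rates(counts, rates, alpha=7.0))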
Applications

Classification

Additive smoothing is commonly a component of naive Bayes classifiers.

Statistical language modelling

In a bag of words model of natural language processing and information retrieval, the data consists of the number of occurrences of each word in a document. Additive smoothing allows the assignment of non-zero probabilities to words which do not occur in the sample. Recent studies have shown that additive smoothing is more effective than other probability smoothing methods in several retrieval tasks such as language-model-based pseudo-relevance feedback and recommender systems.
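As a sketch of the language-modelling use (a toy example with an invented corpus, not taken from the cited studies), Laplace-smoothed unigram probabilities in a bag of words model can be computed as:

    from collections import Counter

    # Laplace-smoothed unigram probabilities:
    # P(w) = (count(w) + alpha) / (N + alpha * |V|).
    def unigram_probs(words, vocabulary, alpha=1.0):
        counts = Counter(words)
        n, v = len(words), len(vocabulary)
        return {w: (counts[w] + alpha) / (n + alpha * v) for w in vocabulary}

    vocab = {"the", "cat", "sat", "mat", "dog"}
    doc = ["the", "cat", "sat", "the", "mat"]
    probs = unigram_probs(doc, vocab, alpha=1.0)
    print(probs["dog"])   # 0.1 -- the unseen word still gets non-zero probability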
See also

Bayesian average
Prediction by partial matching
Categorical distribution
References

C. D. Manning, P. Raghavan and H. Schütze (2008). Introduction to Information Retrieval. Cambridge University Press, p. 260.
Jurafsky, Daniel; Martin, James H. (June 2008). Speech and Language Processing (2nd ed.). Prentice Hall. p. 132. ISBN 978-0-13-187321-6.
Russell, Stuart; Norvig, Peter (2010). Artificial Intelligence: A Modern Approach (2nd ed.). Pearson Education, Inc. p. 863.
Lecture 5 | Machine Learning (Stanford) at 1h10m into the lecture.
Hazimeh, Hussein; Zhai, ChengXiang. "Axiomatic Analysis of Smoothing Methods in Language Models for Pseudo-Relevance Feedback". ICTIR '15 Proceedings of the 2015 International Conference on the Theory of Information Retrieval.
Valcarce, Daniel; Parapar, Javier; Barreiro, Álvaro. "Additive Smoothing for Relevance-Based Language Modelling of Recommender Systems". CERI '16 Proceedings of the 4th Spanish Conference on Information Retrieval.

Sources

Wilson, E. B. (1927). "Probable inference, the law of succession, and statistical inference". Journal of the American Statistical Association. 22 (158): 209–212. doi:10.1080/01621459.1927.10502953. JSTOR 2276774.
Agresti, Alan; Coull, Brent A. (1998). "Approximate is better than 'exact' for interval estimation of binomial proportions". The American Statistician. 52 (2): 119–126. doi:10.2307/2685469. JSTOR 2685469. MR 1628435.
External links

SF Chen, J Goodman (1996). "An empirical study of smoothing techniques for language modeling". Proceedings of the 34th annual meeting on Association for Computational Linguistics.
Pseudocounts
Bayesian interpretation of pseudocount regularizers
A video explaining the use of Additive smoothing in a Naïve Bayes classifier