Word n-gram language model

A word n-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network-based models, which have in turn been superseded by large language models. It is based on an assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word was considered, it was called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model. Special tokens were introduced to denote the start and end of a sentence, ⟨s⟩ and ⟨/s⟩.

Unigram model

A special case, where n = 1, is called a unigram model. The probability of each word in a sequence is independent of the probabilities of the other words in the sequence, and each word's probability in the sequence is equal to the word's probability in the entire document:

    P_{\text{uni}}(t_1 t_2 t_3) = P(t_1) P(t_2) P(t_3)

The model consists of units, each treated as one-state finite automata. The total mass of word probabilities distributed across the document's vocabulary is 1:

    \sum_{\text{word in doc}} P(\text{word}) = 1

Unigram models of different documents have different probabilities of words in them. Example of unigram models of two documents:

    Word     Its probability in Doc1    Its probability in Doc2
    a        0.1                        0.3
    world    0.2                        0.1
    likes    0.05                       0.03
    we       0.05                       0.02
    share    0.3                        0.2
    ...      ...                        ...

The probability distributions from different documents are used to generate hit probabilities for each query, and documents can be ranked for a query according to those probabilities. The probability generated for a specific query is calculated as

    P(\text{query}) = \prod_{\text{word in query}} P(\text{word})

To prevent a zero probability being assigned to unseen words, each word's probability is made slightly lower than its frequency count in the corpus. To calculate it, various methods were used, from simple "add-one" smoothing (assign a count of 1 to unseen words, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
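
The query-scoring calculation above can be sketched in a few lines of Python. This is a minimal illustration, not code from the cited sources; the toy document, the function names, and the add-alpha pseudocount are assumptions made for the example.

from collections import Counter

def unigram_probabilities(tokens, alpha=1.0):
    """Add-alpha smoothed unigram probabilities for one tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts)
    probs = {w: (c + alpha) / (total + alpha * vocab) for w, c in counts.items()}
    unseen = alpha / (total + alpha * vocab)  # mass given to a word not in the document
    return probs, unseen

def query_probability(query_tokens, probs, unseen):
    """P(query) as the product of per-word unigram probabilities."""
    p = 1.0
    for w in query_tokens:
        p *= probs.get(w, unseen)
    return p

doc = "we share a world we likes to share".split()
probs, unseen = unigram_probabilities(doc)
print(query_probability("share a world".split(), probs, unseen))
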

Bigram model

In a bigram word (n = 2) language model, the probability of the sentence I saw the red house is approximated as

    P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s\rangle) P(\text{saw} \mid \text{I}) P(\text{the} \mid \text{saw}) P(\text{red} \mid \text{the}) P(\text{house} \mid \text{red}) P(\langle/s\rangle \mid \text{house})

Trigram model

In a trigram (n = 3) language model, the approximation is

    P(\text{I, saw, the, red, house}) \approx P(\text{I} \mid \langle s\rangle, \langle s\rangle) P(\text{saw} \mid \langle s\rangle, \text{I}) P(\text{the} \mid \text{I, saw}) P(\text{red} \mid \text{saw, the}) P(\text{house} \mid \text{the, red}) P(\langle/s\rangle \mid \text{red, house})

Note that the context of the first n − 1 n-grams is filled with start-of-sentence markers, typically denoted ⟨s⟩. Additionally, without an end-of-sentence marker, the probability of an ungrammatical sequence *I saw the would always be higher than that of the longer sentence I saw the red house.

Approximation method

The approximation method calculates the probability P(w_1, \ldots, w_m) of observing the sentence w_1, \ldots, w_m:

    P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=2}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1})

It is assumed that the probability of observing the i-th word w_i (in the context window consisting of the preceding i − 1 words) can be approximated by the probability of observing it in the shortened context window consisting of the preceding n − 1 words (n-th-order Markov property). To clarify, for n = 3 and i = 2 we have P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = P(w_2 \mid w_1).

The conditional probability can be calculated from n-gram model frequency counts:

    P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})}
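
As a concrete illustration of the count-based estimate above, the sketch below builds a bigram (n = 2) model with maximum-likelihood estimates. It is a minimal example rather than a reference implementation; the toy corpus and function names are invented for the illustration.

from collections import defaultdict

def train_bigram_model(sentences):
    """Estimate P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})."""
    pair_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {pair: c / context_counts[pair[0]] for pair, c in pair_counts.items()}

model = train_bigram_model(["I saw the red house", "I saw the dog"])
print(model[("saw", "the")])   # 1.0: "the" always follows "saw" in this toy corpus
print(model[("the", "red")])   # 0.5
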

Out-of-vocabulary words

An issue when using n-gram language models are out-of-vocabulary (OOV) words. They are encountered in computational linguistics and natural language processing when the input includes words which were not present in a system's dictionary or database during its preparation. By default, when a language model is estimated, the entire observed vocabulary is used. In some cases, it may be necessary to estimate the language model with a specific fixed vocabulary. In such a scenario, the n-grams in the corpus that contain an out-of-vocabulary word are ignored. The n-gram probabilities are smoothed over all the words in the vocabulary even if they were not observed.

Nonetheless, it is essential in some cases to explicitly model the probability of out-of-vocabulary words by introducing a special token (e.g. <unk>) into the vocabulary. Out-of-vocabulary words in the corpus are effectively replaced with this special <unk> token before n-gram counts are cumulated. With this option, it is possible to estimate the transition probabilities of n-grams involving out-of-vocabulary words.
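
A minimal sketch of the <unk> replacement step described above; the vocabulary-size cutoff and the sample corpus are arbitrary choices for illustration, not part of the method described in the sources.

from collections import Counter

def replace_oov(tokenized_sentences, max_vocab=10000, unk="<unk>"):
    """Map every token outside a fixed vocabulary to the <unk> token
    before n-gram counts are accumulated."""
    freq = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {w for w, _ in freq.most_common(max_vocab)}
    return [[tok if tok in vocab else unk for tok in sent] for sent in tokenized_sentences]

corpus = [["I", "saw", "the", "red", "house"], ["I", "saw", "Zanzibar"]]
print(replace_oov(corpus, max_vocab=5))
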

n-grams for approximate matching

n-grams were also used for approximate matching. If we convert strings (with only letters in the English alphabet) into character 3-grams, we get a 26^3-dimensional space (the first dimension measures the number of occurrences of "aaa", the second "aab", and so forth for all possible combinations of three letters). Using this representation, we lose information about the string. However, we know empirically that if two strings of real text have a similar vector representation (as measured by cosine distance) then they are likely to be similar. Other metrics have also been applied to vectors of n-grams with varying, sometimes better, results. For example, z-scores have been used to compare documents by examining how many standard deviations each n-gram differs from its mean occurrence in a large collection, or text corpus, of documents (which form the "background" vector). In the event of small counts, the g-score (also known as g-test) gave better results.

It is also possible to take a more principled approach to the statistics of n-grams, modeling similarity as the likelihood that two strings came from the same source directly in terms of a problem in Bayesian inference. n-gram-based searching was also used for plagiarism detection.
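
The character-3-gram comparison described above can be sketched as follows. This is an illustrative snippet under the stated assumptions; the two sample strings and the lowercasing/letter-filtering choices are ours.

from collections import Counter
from math import sqrt

def char_trigram_vector(text):
    """Count overlapping character 3-grams of a lowercased, letters-only string."""
    s = "".join(ch for ch in text.lower() if ch.isalpha())
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine_similarity(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

v1 = char_trigram_vector("approximate matching")
v2 = char_trigram_vector("aproximate matchng")   # misspelled variant
print(round(cosine_similarity(v1, v2), 3))
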

Bias-versus-variance trade-off

To choose a value for n in an n-gram model, it is necessary to find the right trade-off between the stability of the estimate and its appropriateness. This means that trigram (i.e. triplets of words) is a common choice with large training corpora (millions of words), whereas a bigram is often used with smaller ones.

Smoothing techniques

There are problems of balance weight between infrequent grams (for example, if a proper name appeared in the training data) and frequent grams. Also, items not seen in the training data will be given a probability of 0.0 without smoothing. For unseen but plausible data from a sample, one can introduce pseudocounts. Pseudocounts are generally motivated on Bayesian grounds.

In practice it was necessary to smooth the probability distributions by also assigning non-zero probabilities to unseen words or n-grams. The reason is that models derived directly from the n-gram frequency counts have severe problems when confronted with any n-grams that have not explicitly been seen before (the zero-frequency problem). Various smoothing methods were used, from simple "add-one" (Laplace) smoothing (assign a count of 1 to unseen n-grams; see Rule of succession) to more sophisticated models, such as Good–Turing discounting or back-off models. Some of these methods are equivalent to assigning a prior distribution to the probabilities of the n-grams and using Bayesian inference to compute the resulting posterior n-gram probabilities. However, the more sophisticated smoothing models were typically not derived in this fashion, but instead through independent considerations. Commonly used methods include:

- Linear interpolation (e.g., taking the weighted mean of the unigram, bigram, and trigram; a short sketch of this method follows the list)
- Good–Turing discounting
- Witten–Bell discounting
- Lidstone's smoothing
- Katz's back-off model (trigram)
- Kneser–Ney smoothing
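
A minimal sketch of linear interpolation as named in the list above. The interpolation weights and the toy probability tables are illustrative assumptions; in practice the weights would be tuned on held-out data.

def interpolated_trigram_prob(w1, w2, w3, unigram, bigram, trigram,
                              lambdas=(0.1, 0.3, 0.6)):
    """Weighted mean of unigram, bigram and trigram estimates.
    unigram, bigram, trigram are dicts of maximum-likelihood probabilities."""
    l1, l2, l3 = lambdas
    return (l1 * unigram.get(w3, 0.0)
            + l2 * bigram.get((w2, w3), 0.0)
            + l3 * trigram.get((w1, w2, w3), 0.0))

unigram = {"house": 0.02}
bigram = {("red", "house"): 0.4}
trigram = {("the", "red", "house"): 0.9}
print(interpolated_trigram_prob("the", "red", "house", unigram, bigram, trigram))  # 0.662
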

Skip-gram language model

The skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding (i.e. word n-gram) language model faced. Words represented in an embedding vector were not necessarily consecutive anymore, but could leave gaps that are skipped over.

Formally, a k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other. For example, in the input text

    the rain in Spain falls mainly on the plain

the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain.

In the skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality. For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then

    v(\mathrm{king}) - v(\mathrm{male}) + v(\mathrm{female}) \approx v(\mathrm{queen})

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.
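
A minimal sketch of extracting k-skip-n-grams as defined above; the implementation choices are ours and not taken from the cited Guthrie et al. paper.

def skipgrams(tokens, n, k):
    """Return the set of k-skip-n-grams of a token list: length-n subsequences
    in which at most k tokens are skipped between consecutive components."""
    results = set()

    def extend(start, gram):
        if len(gram) == n:
            results.add(tuple(gram))
            return
        # the next component may skip over at most k tokens
        for nxt in range(start, min(start + k + 1, len(tokens))):
            extend(nxt + 1, gram + [tokens[nxt]])

    for i in range(len(tokens)):
        extend(i + 1, [tokens[i]])
    return results

text = "the rain in Spain falls mainly on the plain".split()
grams = skipgrams(text, n=2, k=1)
print(("rain", "Spain") in grams, ("the", "Spain") in grams)  # True False
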

Syntactic n-grams

Syntactic n-grams are n-grams defined by paths in syntactic dependency or constituent trees rather than the linear structure of the text. For example, the sentence "economic news has little effect on financial markets" can be transformed to syntactic n-grams following the tree structure of its dependency relations: news-economic, effect-little, effect-on-markets-financial.

Syntactic n-grams are intended to reflect syntactic structure more faithfully than linear n-grams, and have many of the same applications, especially as features in a vector space model. Applying syntactic n-grams to certain tasks gives better results than the use of standard n-grams, for example, for authorship attribution.

Another type of syntactic n-grams are part-of-speech n-grams, defined as fixed-length contiguous overlapping subsequences that are extracted from part-of-speech sequences of text. Part-of-speech n-grams have several applications, most commonly in information retrieval.

Other applications

n-grams find use in several areas of computer science, computational linguistics, and applied mathematics.

They have been used to:

- design kernels that allow machine learning algorithms such as support vector machines to learn from string data
- find likely candidates for the correct spelling of a misspelled word
- improve compression in compression algorithms where a small area of data requires n-grams of greater length
- assess the probability of a given word sequence appearing in text of a language of interest in pattern recognition systems, speech recognition, OCR (optical character recognition), Intelligent Character Recognition (ICR), machine translation and similar applications
- improve retrieval in information retrieval systems when it is hoped to find similar "documents" (a term for which the conventional meaning is sometimes stretched, depending on the data set) given a single query document and a database of reference documents
- improve retrieval performance in genetic sequence analysis as in the BLAST family of programs
- identify the language a text is in or the species a small sequence of DNA was taken from
- predict letters or words at random in order to create text, as in the dissociated press algorithm
- cryptanalysis

See also

- Collocation
- Feature engineering
- Hidden Markov model
- Longest common substring
- MinHash
- n-tuple
- String kernel

References

- Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (2003). "A neural probabilistic language model". The Journal of Machine Learning Research. 3: 1137–1155.
- Jurafsky, Dan; Martin, James H. (2023). "N-gram Language Models". Speech and Language Processing (3rd edition draft).
- Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich (2009). An Introduction to Information Retrieval. Cambridge University Press. pp. 237–240.
- Wołk, K.; Marasek, K.; Glinkowski, W. (2015). "Telemedicine as a special case of Machine Translation". Computerized Medical Imaging and Graphics. 46 Pt 2: 249–256. arXiv:1510.04600. doi:10.1016/j.compmedimag.2015.09.005. PMID 26617328.
- Wołk, K.; Marasek, K. (2014). "Polish-English Speech Statistical Machine Translation Systems for the IWSLT 2014". Proceedings of the 11th International Workshop on Spoken Language Translation. Tahoe Lake, USA. arXiv:1509.09097.
- Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Efficient estimation of word representations in vector space". arXiv:1301.3781.
- Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). "Distributed Representations of Words and Phrases and their Compositionality". Advances in Neural Information Processing Systems. pp. 3111–3119.
- Guthrie, David; et al. (2006). "A Closer Look at Skip-gram Modelling".
- Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana (2013). "Syntactic Dependency-Based N-grams as Classification Features". In Batyrshin, I.; Mendoza, M. G. (eds.). Advances in Computational Intelligence. Lecture Notes in Computer Science. Vol. 7630. pp. 1–11. ISBN 978-3-642-37797-6.
- Sidorov, Grigori; Velasquez, Francisco; Stamatatos, Efstathios; Gelbukh, Alexander; Chanona-Hernández, Liliana (2014). "Syntactic n-Grams as Machine Learning Features for Natural Language Processing". Expert Systems with Applications. 41: 853–860. doi:10.1016/j.eswa.2013.08.015.
- Sidorov, Grigori (2013). "Syntactic Dependency-Based n-grams in Rule Based Automatic English as Second Language Grammar Correction". International Journal of Computational Linguistics and Applications. 4 (2): 169–188.
- Figueroa, Alejandro; Atkinson, John (2012). "Contextual Language Models For Ranking Answers To Natural Language Definition Questions". Computational Intelligence. 28 (4): 528–548. doi:10.1111/j.1467-8640.2012.00426.x.
- Lioma, C.; van Rijsbergen, C. J. K. (2008). "Part of Speech n-Grams and Information Retrieval". French Review of Applied Linguistics. XIII (1): 9–22.
- U.S. Patent 6618697, "Method for rule-based correction of spelling and grammar errors".
