Stemming

In linguistic morphology and information retrieval, stemming is the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form. The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.

A computer program or subroutine that stems words may be called a stemming program, stemming algorithm, or stemmer.

A stemmer for English operating on the stem cat should identify such strings as cats, catlike, and catty. A stemming algorithm might also reduce the words fishing, fished, and fisher to the stem fish. The stem need not be a word; for example, the Porter algorithm reduces argue, argued, argues, arguing, and argus to the stem argu.

In a stochastic (trained) stemmer, the learned rules are similar in nature to those used in suffix stripping or lemmatisation. Stemming is performed by inputting an inflected form to the trained model and having the model produce the root form according to its internal ruleset, much as in suffix stripping and lemmatisation. The difference is that the decisions involved (which rule is most appropriate, whether to stem the word at all or simply return it unchanged, or whether to apply two different rules in sequence) are made on the grounds that the output word will have the highest probability of being correct (that is, the smallest probability of being incorrect, which is how it is typically measured).
Alternatively, some suffix stripping approaches maintain a database (a large list) of all known morphological word roots that exist as real words. These approaches check the list for the existence of the term prior to making a decision. Typically, if the term does not exist, alternate action is taken. This alternate action may involve several other criteria; for example, the non-existence of an output term may cause the algorithm to try alternate suffix stripping rules.
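A minimal sketch of this idea follows, assuming a hypothetical rule list and a tiny root-word lexicon (both invented for illustration, not drawn from any published stemmer):

# Suffix stripping constrained by a lexicon of known roots (illustrative only).
SUFFIX_RULES = ["ies", "ing", "ed", "ly", "s"]   # hypothetical rules, tried in order
LEXICON = {"run", "cat", "friend", "argue"}       # tiny stand-in for a real root list

def stem_with_lexicon(word: str) -> str:
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix):
            candidate = word[:-len(suffix)]
            if candidate in LEXICON:   # only accept output that is a known root
                return candidate
            # non-existent output: fall through and try an alternate rule
    return word                        # no rule produced a known root

print(stem_with_lexicon("cats"))     # -> cat
print(stem_with_lexicon("running"))  # -> running ("runn" is rejected, no other rule applies)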
A simple stemmer looks up the inflected form in a lookup table. The advantages of this approach are that it is simple, fast, and easily handles exceptions. The disadvantages are that all inflected forms must be explicitly listed in the table: new or unfamiliar words are not handled, even if they are perfectly regular (e.g. cats ~ cat), and the table may be large. For languages with simple morphology, like English, table sizes are modest, but highly inflected languages like Turkish may have hundreds of potential inflected forms for each root. A lookup approach may use preliminary part-of-speech tagging to avoid overstemming.

In the rule-based approach, the three suffix-stripping rules given as examples would be applied in succession to converge on the same solution. The brute force approach is likely to be slower, because lookup algorithms have direct access to the solution, whereas a rule-based stemmer must try several options, and combinations of them, and then choose which result seems best.
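As a sketch, with an invented three-entry table, a lookup stemmer is little more than a dictionary access:

# Lookup-table stemming: every inflected form must be listed explicitly (toy table).
STEM_TABLE = {
    "running": "run", "runs": "run", "ran": "run",   # exceptions like "ran" are trivial here
    "cats": "cat",
}

def lookup_stem(word: str) -> str:
    # Unknown or new words are returned unchanged, the main weakness of this approach.
    return STEM_TABLE.get(word, word)

print(lookup_stem("ran"), lookup_stem("catlike"))  # -> run catlike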
Such algorithms use a stem database (for example a set of documents that contain stem words). These stems, as mentioned above, are not necessarily valid words themselves (but rather common sub-strings, as the "brows" in "browse" and in "browsing"). In order to stem a word the algorithm tries to match it with stems from the database, applying various constraints, such as on the relative length of the candidate stem within the word (so that, for example, the short prefix "be", which is the stem of such words as "be", "been" and "being", would not be considered as the stem of the word "beside").
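A minimal sketch of such matching, with an invented stem set and an arbitrary relative-length threshold chosen purely for illustration:

# Match a word against a database of stems, with a relative-length constraint (illustrative).
STEM_DB = {"be", "brows", "market", "daffodil"}
MIN_STEM_RATIO = 0.6   # hypothetical: a stem must cover at least 60% of the word

def match_stem(word: str) -> str:
    candidates = [s for s in STEM_DB
                  if word.startswith(s) and len(s) / len(word) >= MIN_STEM_RATIO]
    # Prefer the longest qualifying stem; otherwise leave the word unstemmed.
    return max(candidates, key=len) if candidates else word

print(match_stem("browsing"))  # -> brows
print(match_stem("beside"))    # -> beside ("be" is rejected: 2/6 < 0.6)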
Hybrid approaches use two or more of the approaches described above in unison. A simple example is a suffix-tree algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If the word is not in the exception list, suffix stripping or lemmatisation is applied and the result is output.
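Sketched in a few lines, with both the exception table and the fallback rule invented for illustration:

# Hybrid stemming: a tiny exception table backed by a rule-based fallback (illustrative).
EXCEPTIONS = {"ran": "run", "geese": "goose"}

def rule_stem(word: str) -> str:
    return word[:-3] if word.endswith("ing") and len(word) > 5 else word

def hybrid_stem(word: str) -> str:
    # Consult the exception list first; fall back to suffix stripping otherwise.
    return EXCEPTIONS.get(word) or rule_stem(word)

print(hybrid_stem("ran"), hybrid_stem("walking"))  # -> run walk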
Some lemmatisation algorithms are stochastic in that, given a word which may belong to multiple parts of speech, a probability is assigned to each possible part. This may take into account the surrounding words, called the context, or not. Context-free grammars do not take into account any additional information.
Stochastic algorithms involve using probability to identify the root form of a word. Stochastic algorithms are trained (they "learn") on a table of root-form-to-inflected-form relations to develop a probabilistic model. This model is typically expressed in the form of complex linguistic rules, similar in nature to those used in suffix stripping or lemmatisation.
Suffix stripping algorithms may differ in results for a variety of reasons. One such reason is whether the algorithm constrains whether the output word must be a real word in the given language. Some approaches do not require the word to actually exist in the language lexicon (the set of all words in the language).
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable about the challenges of linguistics and morphology and about encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as crude, given their poor performance when dealing with exceptional relations (like 'ran' and 'run').
This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category, or being unable to produce the right category, limits the added benefit of this approach over suffix stripping algorithms.
The Paice-Husk stemmer was developed by Chris D. Paice at Lancaster University in the late 1980s; it is an iterative stemmer and features an externally stored set of stemming rules. The standard set of rules provides a 'strong' stemmer and may specify the removal or replacement of an ending. The replacement technique avoids the need for a separate stage in the process to recode or provide partial matching.
It can be the case that two or more suffix stripping rules apply to the same input term, which creates an ambiguity as to which rule to apply. The algorithm may assign (by human hand or stochastically) a priority to one rule or another. Or the algorithm may reject one rule application because it results in a non-existent term, whereas the other overlapping rule does not. For example, given the English term friendlies, the algorithm may identify the ies suffix and apply the appropriate rule and achieve the result of friendl. friendl is likely not found in the lexicon, and therefore the rule is rejected.
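A small sketch of that resolution strategy, using invented priorities and a toy lexicon:

# Two overlapping stripping rules; apply by priority and reject non-lexicon outputs (illustrative).
RULES = [("ies", 1), ("s", 2)]            # (suffix, priority) -- hypothetical priorities
LEXICON = {"friend", "cat", "pony"}

def stem(word: str) -> str:
    for suffix, _prio in sorted((r for r in RULES if word.endswith(r[0])),
                                key=lambda r: r[1]):
        candidate = word[:-len(suffix)]
        if candidate in LEXICON:
            return candidate               # accept the highest-priority rule yielding a real term
    return word                            # every applicable rule was rejected

print(stem("cats"))        # -> cat
print(stem("friendlies"))  # -> friendlies ("friendl" and "friendlie" are both rejected)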
Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of "rules" is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:

- if the word ends in 'ed', remove the 'ed'
- if the word ends in 'ing', remove the 'ing'
- if the word ends in 'ly', remove the 'ly'
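Those three rules translate almost directly into code; a minimal sketch implementing only the list above:

# A bare-bones suffix-stripping stemmer implementing only the three rules listed above.
def strip_suffix(word: str) -> str:
    for suffix in ("ed", "ing", "ly"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

print(strip_suffix("jumped"), strip_suffix("jumping"), strip_suffix("quickly"))
# -> jump jump quick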
Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", or "axes" being the plural of "axe" as well as of "axis"); but stemmers become harder to design as the morphology, orthography, and character encoding of the target language become more complex.
Google Search adopted word stemming in 2003. Previously a search for "fish" would not have returned "fishing". Other software search algorithms vary in their use of word stemming. Programs that simply search for substrings will obviously find "fish" in "fishing", but when searching for "fishes" will not find occurrences of the word "fish".
The lookup table used by a stemmer is generally produced semi-automatically. For example, if the word is "run", then the inverted algorithm might automatically generate the forms "running", "runs", "runned", and "runly". The last two forms are valid constructions, but they are unlikely.
For example, an Italian stemmer is more complex than an English one (because of a greater number of verb inflections), a Russian one is more complex still (more noun declensions), a Hebrew one is even more complex (owing to nonconcatenative morphology, a writing system without vowels, and the requirement of prefix stripping: Hebrew stems can be two, three or four characters, but not more), and so on.
Stemming is used as an approximate method for grouping words with a similar basic meaning together. For example, a text mentioning "daffodils" is probably closely related to a text mentioning "daffodil" (without the s). But in some cases, words with the same morphological stem have idiomatic meanings which are not closely related: a user searching for "marketing" will not be satisfied by most documents mentioning "markets" but not "marketing".
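In an information-retrieval setting this conflation is typically applied to both documents and queries. A toy sketch, with a deliberately naive one-line stemmer standing in for a real one:

# Index documents by stemmed terms so that "daffodils" and "daffodil" fall in the same bucket (toy).
from collections import defaultdict

def toy_stem(word: str) -> str:
    return word[:-1] if word.endswith("s") else word   # deliberately naive

docs = {1: "daffodils bloom", 2: "a single daffodil", 3: "markets opened"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[toy_stem(term)].add(doc_id)

print(sorted(index["daffodil"]))   # -> [1, 2]; both documents are grouped under one stem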
Many implementations of the Porter stemming algorithm were written and freely distributed; however, many of these implementations contained subtle flaws. As a result, these stemmers did not match their potential. To eliminate this source of error, Martin Porter released an official free-software (mostly BSD-licensed) implementation of the algorithm around the year 2000.
Diving further into the details, a common technique is to apply rules in a cyclical fashion (recursively, as computer scientists would say). After applying the suffix substitution rule in this example scenario, a second pass is made to identify matching rules on the term friendly, where the ly stripping rule is likely identified and accepted. In summary, friendlies becomes (via substitution) friendly, which becomes (via stripping) friend.
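A sketch of this cyclical application, with two invented rules and a loop that stops when no rule fires:

# Apply substitution/stripping rules repeatedly until no rule matches (illustrative rule set).
RULES = [("ies", "y"), ("ly", "")]          # hypothetical: one substitution, one stripping rule

def stem_cyclic(word: str) -> str:
    changed = True
    while changed:
        changed = False
        for suffix, repl in RULES:
            if word.endswith(suffix):
                word = word[:-len(suffix)] + repl
                changed = True
                break                        # restart the pass on the rewritten term
    return word

print(stem_cyclic("friendlies"))  # -> friend  (friendlies -> friendly -> friend)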
In either case, after assigning the probabilities to each possible part of speech, the most likely part of speech is chosen, and from there the appropriate normalization rules are applied to the input word to produce the normalized (root) form.
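Sketched with made-up probabilities and normalization rules (no real tagger or trained model is implied):

# Pick the most probable part of speech, then apply that category's normalization rule (toy numbers).
POS_PROBS = {"saw": {"NOUN": 0.4, "VERB_PAST": 0.6}}          # invented probabilities
NORMALIZE = {"NOUN": lambda w: w, "VERB_PAST": lambda w: "see" if w == "saw" else w}

def lemmatize(word: str) -> str:
    probs = POS_PROBS.get(word, {"NOUN": 1.0})
    best_pos = max(probs, key=probs.get)       # most likely part of speech
    return NORMALIZE[best_pos](word)

print(lemmatize("saw"))  # -> see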
The basic idea is that, if the stemmer is able to grasp more information about the word being stemmed, then it can apply more accurate normalization rules (which, unlike suffix stripping rules, can also modify the stem).
A more complex approach to the problem of determining a stem of a word is lemmatisation. This process involves first determining the part of speech of a word and applying different normalization rules for each part of speech. The part of speech is detected prior to attempting to find the root, since for some languages the stemming rules change depending on a word's part of speech.
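A minimal sketch of part-of-speech-conditional normalization; the tag and rules here are invented, and a real system would obtain the tag from a tagger:

# Different normalization rules per part of speech (invented rules; POS tag supplied by the caller).
def lemmatise(word: str, pos: str) -> str:
    if pos == "VERB" and word.endswith("ing"):
        return word[:-3]                       # e.g. "meeting" (verb) -> "meet"
    if pos == "NOUN" and word.endswith("s"):
        return word[:-1]                       # e.g. "meetings" -> "meeting"
    return word                                # nouns like "meeting" are left intact

print(lemmatise("meeting", "VERB"), lemmatise("meeting", "NOUN"))  # -> meet meeting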
To illustrate, the algorithm may identify that both the ies suffix stripping rule and the suffix substitution rule apply. Since the stripping rule results in a non-existent term in the lexicon, while the substitution rule does not, the substitution rule is applied instead. In this example, friendlies becomes friendly instead of friendl.
One improvement upon basic suffix stripping is the use of suffix substitution. Similar to a stripping rule, a substitution rule replaces a suffix with an alternate suffix. For example, there could exist a rule that replaces ies with y. How this affects the algorithm varies on the algorithm's design.
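A sketch of a combined rule table in which each entry either strips a suffix or substitutes one (the table itself is invented):

# Suffix rules expressed as (suffix, replacement); an empty replacement is plain stripping (illustrative).
RULES = [("ies", "y"), ("ing", ""), ("ed", "")]

def apply_rules(word: str) -> str:
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

print(apply_rules("ponies"), apply_rules("hoping"))  # -> pony hop (crude, but shows both rule forms)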
Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.
An example of understemming in the Porter stemmer is "alumnus" → "alumnu", "alumni" → "alumni", "alumna"/"alumnae" → "alumna". These English words keep their Latin morphology, and so these near-synonyms are not conflated.
There are two error measurements in stemming algorithms, overstemming and understemming. Overstemming is an error where two separate inflected words are stemmed to the same root, but should not have been—a false positive. Understemming is an error where two separate inflected words should be stemmed to the same root, but are not—a false negative. Stemming algorithms attempt to minimize each type of error, although reducing one type can lead to increasing the other.
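These counts can be made concrete: given a gold grouping of word forms, every within-group pair that ends up with different stems is an understemming error, and every cross-group pair that collides is an overstemming error. A sketch in which both the stemmer and the groups are placeholders:

# Count overstemming (false-positive) and understemming (false-negative) pairs against gold groups.
from itertools import combinations

def count_errors(groups, stem):
    words = [w for g in groups for w in g]
    group_of = {w: i for i, g in enumerate(groups) for w in g}
    over = under = 0
    for a, b in combinations(words, 2):
        same_group, same_stem = group_of[a] == group_of[b], stem(a) == stem(b)
        over += (not same_group) and same_stem      # conflated but unrelated
        under += same_group and (not same_stem)     # related but not conflated
    return over, under

toy_stem = lambda w: w[:6]                           # placeholder stemmer
print(count_errors([["universal", "universe"], ["university"]], toy_stem))  # -> (2, 0)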
The first published stemmer was written by Julie Beth Lovins in 1968. This paper was remarkable for its early date and had great influence on later work in this area. Her paper refers to three earlier major attempts at stemming algorithms: by Professor John W. Tukey of Princeton University; the algorithm developed at Harvard University by Michael Lesk under the direction of Professor Gerard Salton; and a third algorithm developed by James L. Dolby of R and D Consultants, Los Altos, California.
While much of the early academic work in this area was focused on the English language (with significant use of the Porter Stemmer algorithm), many other languages have been investigated.
Paice also developed a direct measurement for comparing stemmers based on counting the over-stemming and under-stemming errors.
The solutions produced by suffix stripping algorithms are limited to those lexical categories which have well-known suffixes with few exceptions.
For example, the widely used Porter stemmer stems "universal", "university", and "universe" to "univers". This is a case of overstemming: though these three words are etymologically related, their modern meanings are in widely different domains, so treating them as synonyms in a search engine will likely reduce the relevance of the search results.
This example also helps illustrate the difference between a rule-based approach and a brute force approach. In a brute force approach, the algorithm would search for friendlies in the set of hundreds of thousands of inflected word forms and ideally find the corresponding root form friend.
This, however, is a problem, as not all parts of speech have such a well formulated set of rules. Lemmatisation attempts to improve upon this challenge.
There are several types of stemming algorithms which differ in respect to performance and accuracy and how certain stemming obstacles are overcome.
Many commercial companies have been using stemming since at least the 1980s and have produced algorithmic and lexical stemmers in many languages.
He extended this work over the next few years by building Snowball, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.
In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. For example, given the word indefinitely, identify that the leading "in" is a prefix that can be removed.
Stemmers can be used as elements in query systems such as Web search engines. The effectiveness of stemming for English query systems was soon found to be rather limited, however, and this led early researchers to deem stemming irrelevant in general. An alternative approach, based on searching for n-grams rather than stems, may be used instead. Also, stemmers may provide greater benefits in other languages than English.
Prefix stripping may also be implemented. Of course, not all languages use prefixing or suffixing.
Stemming is used as a task in pre-processing texts before performing text mining analyses on them.
The Snowball stemmers have been compared with commercial lexical stemmers, with varying results.
A study of affix stemming for several European languages can be found here.
This stemmer was very widely used and became the de facto standard algorithm used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on stemming and information retrieval.
Some stemming techniques use the n-gram context of a word to choose the correct stem for a word.
A later stemmer was written by Martin Porter and was published in the July 1980 issue of the journal Program.
Stemming is used to determine domain vocabularies in domain analysis.
785:"Exploring New Languages with HAIRCUT at CLEF 2005" 106:. A stemming algorithm might also reduce the words 1129:A Detailed Analysis of English Stemming Algorithms 244:A simple stemmer looks up the inflected form in a 688:—implements several stemming algorithms in Python 2176: 1646: 1304:—port of Snowball stemmers for C# (14 languages) 854:, ACM Symposium on Applied Computing, SAC 2006, 682:—stemming is generally regarded as a form of NLP 302: 1443: 1103:Word segmentation by letter successor varieties 776: 1187:Jenkins, Marie-Claire; and Smith, Dan (2005); 404:. This process involves first determining the 267: 66:or subroutine that stems word may be called a 1429: 1230:, Volume 43, Issue 5 (June), pp. 384–390 1190:Conservative Stemming for Search and Indexing 1097:, Upper Saddle River, NJ: Prentice-Hall, Inc. 606: 510:), a Hebrew one is even more complex (due to 1166:, Journal of Information Science, 3: 177–183 236:(more unsolved problems in computer science) 86:A stemmer for English operating on the stem 1356:Official home page of the UEA-Lite Stemmer 1339:—including source code in several languages 1327:—stemming library in C++ released under BSD 1220:PopoviÄŤ, Mirko; and Willett, Peter (1992); 1127:Hull, D. A. & Grefenstette, G. (1996); 970:DARE: Domain Analysis and Reuse Environment 870:PopoviÄŤ, Mirko; and Willett, Peter (1992); 395: 280:if the word ends in 'ing', remove the 'ing' 258: 154:The first published stemmer was written by 1436: 1422: 1333:—with source code in a couple of languages 1150:Viewing Morphology as an Inference Process 700:—designed for creating stemming algorithms 1079:Term Conflation for Information Retrieval 902:Viera, A. F. G. & Virgil, J. (2007); 706:—linguistic definition of the term "stem" 694:—linguistic definition of the term "root" 650:—stemming is a form of reverse derivation 296:attempts to improve upon this challenge. 283:if the word ends in 'ly', remove the 'ly' 277:if the word ends in 'ed', remove the 'ed' 1228:American Society for Information Science 1100:Hafer, M. A. & Weiss, S. F. (1974); 908:, Information Research, 12(3), paper 315 880:, Volume 43, Issue 5 (June), pp. 384–390 878:American Society for Information Science 567: 517: 416: 212:The Paice-Husk Stemmer was developed by 1084:Frakes, W. B. & Fox, C. J. (2003); 834:Dolamic, Ljiljana; and Savoy, Jacques; 782: 16:Process of reducing words to word stems 2177: 1280:—includes Porter and Snowball stemmers 1137:Viewing Stemming as Recall Enhancement 1134:Kraaij, W. & Pohlmann, R. (1996); 728: 497: 488: 251:A lookup approach may use preliminary 1417: 731:"Development of a Stemming Algorithm" 2195:Tasks of natural language processing 1895:Simple Knowledge Organization System 572:Stemmers can be used as elements in 447: 227:Unsolved problem in computer science 1260:Xu, J.; & Croft, W. B. (1998); 1181:Development of a Stemming Algorithm 174:, under the direction of Professor 13: 1321:—PHP extension to the Snowball API 1072:Suffix Removal for Word Conflation 1063: 891:Stemming in Hungarian at CLEF 2005 594: 432: 51:treat words with the same stem as 43:for stemming have been studied in 14: 2216: 1910:Thesaurus (information retrieval) 1271: 1236:An Algorithm for Suffix Stripping 456: 440:Some stemming techniques use the 316:, the algorithm may identify the 2205:Information retrieval techniques 814:Jongejan, B.; and Dalianis, H.; 783:McNamee, Paul (September 2005). 625:occurrences of the word "fish". 
526: 1362:Overview of stemming algorithms 1315:—Ruby extension to Snowball API 1308:Python bindings to Snowball API 1051:The Essentials of Google Search 1043: 1032: 1021: 1000: 979: 962: 943: 924: 911: 896: 698:Snowball (programming language) 554: 181:A later stemmer was written by 59:, a process called conflation. 1491:Natural language understanding 1400:—open source stemmer for Czech 1394:—open source stemmer for Hindi 1358:—University of East Anglia, UK 1257:, Online Review, 7(4), 301–318 974:Annals of Software Engineering 940:, Springer Verlag, pp. 152–165 883: 864: 842: 828: 808: 762: 748: 722: 628: 377:which becomes (via stripping) 1: 2015:Optical character recognition 716: 303:Additional algorithm criteria 220: 166:, the algorithm developed at 1708:Multi-document summarization 1081:, Cambridge University Press 919:Modern Information Retrieval 7: 2190:Natural language processing 2038:Latent Dirichlet allocation 2010:Natural language generation 1875:Machine-readable dictionary 1870:Linguistic Linked Open Data 1445:Natural language processing 1325:Oleander Porter's algorithm 1112:How Effective is Suffixing? 756:"Porter Stemming Algorithm" 729:Lovins, Julie Beth (1968). 680:Natural language processing 636: 512:nonconcatenative morphology 373:becomes (via substitution) 268:Suffix-stripping algorithms 81: 23:and information retrieval, 10: 2221: 1790:Explicit semantic analysis 1539:Deep linguistic processing 1233:Porter, Martin F. (1980); 1155:Proceedings of ACM-SIGIR93 1054:, Web Search Help Center, 921:, ACM Press/Addison Wesley 607:Use in commercial products 149: 2200:Computational linguistics 2141: 2096: 2051: 2023: 1983: 1928: 1850: 1838: 1769: 1726: 1698: 1633:Word-sense disambiguation 1509: 1486:Computational linguistics 1451: 1352:—Lancaster University, UK 1239:, Program, 14(3): 130–137 993:14 September 2011 at the 792:CEUR Workshop Proceedings 643:Computational linguistics 2159:Natural Language Toolkit 2083:Pronunciation assessment 1985:Automatic identification 1815:Latent semantic analysis 1771:Distributional semantics 1656:Compound-term processing 1554:Named-entity recognition 1388:—implementation for Java 1208:, SIGIR Forum, 24: 56–61 1131:, Xerox Technical Report 1090:, SIGIR Forum, 37: 26–30 987:Language Extension Packs 955:, Information Retrieval 675:Morphology (linguistics) 396:Lemmatisation algorithms 259:The production technique 2063:Automated essay scoring 2033:Document classification 1700:Automatic summarization 1217:, JASIS, 47(8): 632–649 1014:17 January 2008 at the 255:to avoid overstemming. 1920:Universal Dependencies 1613:Terminology extraction 1596:Semantic decomposition 1591:Semantic role labeling 1581:Part-of-speech tagging 1549:Information extraction 1534:Coreference resolution 1524:Collocation extraction 1178:Lovins, J. B. (1968); 1093:Frakes, W. B. (1992); 1077:Frakes, W. B. (1984); 1070:Dawson, J. L. (1974); 661:—linguistic definition 253:part-of-speech tagging 47:since the 1960s. Many 2185:Linguistic morphology 1681:Sentence segmentation 1195:Paice, C. D. (1990); 1124:, JASIS, 47(1): 70–84 585:information retrieval 568:Information retrieval 518:Multilingual stemming 417:Stochastic algorithms 191:Tony Kent Strix award 90:should identify such 21:linguistic morphology 2133:Voice user interface 1844:datasets and corpora 1785:Document-term matrix 1638:Word-sense induction 1211:Paice, C. D. (1996) 1147:Krovetz, R. (1993); 1118:Hull, D. A. 
(1996); 949:Airio, Eija (2006); 164:Princeton University 2113:Interactive fiction 2043:Pachinko allocation 2000:Speech segmentation 1956:Google Ngram Viewer 1728:Machine translation 1718:Text simplification 1713:Sentence extraction 1601:Semantic similarity 1169:Lovins, J. (1971); 1109:Harman, D. (1991); 1018:, Microsoft Technet 498:Language challenges 489:Matching algorithms 469:refers to either a 2123:Question answering 1995:Speech recognition 1860:Corpus linguistics 1840:Language resources 1623:Textual entailment 1606:Sentiment analysis 1367:2011-07-02 at the 1348:2011-07-22 at the 1242:Savoy, J. (1993); 1203:2011-07-22 at the 1175:, JASIS, 22: 28–40 1157:, pp. 191–203 704:Stem (linguistics) 692:Root (linguistics) 659:Lemma (morphology) 290:lexical categories 168:Harvard University 72:stemming algorithm 37:morphological root 2172: 2171: 2128:Virtual assistant 2053:Computer-assisted 1979: 1978: 1736:Computer-assisted 1694: 1693: 1686:Word segmentation 1648:Text segmentation 1586:Semantic analysis 1574:Syntactic parsing 1559:Ontology learning 1226:, Journal of the 876:, Journal of the 448:Hybrid approaches 156:Julie Beth Lovins 2212: 2149:Formal semantics 2098:Natural language 2005:Speech synthesis 1987:and data capture 1890:Semantic network 1865:Lexical resource 1848: 1847: 1666:Lexical analysis 1644: 1643: 1569:Semantic parsing 1438: 1431: 1424: 1415: 1414: 1386:Snowball Stemmer 1144:, pp. 40–48 1058: 1047: 1041: 1036: 1030: 1025: 1019: 1004: 998: 983: 977: 976:(5), pp. 125-141 966: 960: 947: 941: 928: 922: 915: 909: 900: 894: 887: 881: 868: 862: 848:Savoy, Jacques; 846: 840: 832: 826: 812: 806: 805: 803: 802: 789: 780: 774: 766: 760: 759: 752: 746: 745: 735: 726: 228: 68:stemming program 64:computer program 45:computer science 2220: 2219: 2215: 2214: 2213: 2211: 2210: 2209: 2175: 2174: 2173: 2168: 2137: 2117:Syntax guessing 2099: 2092: 2078:Predictive text 2073:Grammar checker 2054: 2047: 2019: 1986: 1975: 1941:Bank of English 1924: 1852: 1843: 1834: 1765: 1722: 1690: 1642: 1544:Distant reading 1519:Argument mining 1505: 1501:Text processing 1447: 1442: 1369:Wayback Machine 1350:Wayback Machine 1274: 1269: 1205:Wayback Machine 1198:Another Stemmer 1066: 1064:Further reading 1061: 1048: 1044: 1037: 1033: 1026: 1022: 1016:Wayback Machine 1005: 1001: 995:Wayback Machine 984: 980: 967: 963: 948: 944: 929: 925: 916: 912: 901: 897: 888: 884: 869: 865: 847: 843: 833: 829: 813: 809: 800: 798: 787: 781: 777: 768:Yatsko, V. 
A.; 767: 763: 754: 753: 749: 733: 727: 723: 719: 639: 631: 609: 601:domain analysis 597: 595:Domain analysis 570: 557: 529: 520: 500: 491: 483:affix stripping 459: 450: 438: 419: 398: 305: 270: 261: 239: 238: 233: 230: 223: 152: 84: 57:query expansion 17: 12: 11: 5: 2218: 2208: 2207: 2202: 2197: 2192: 2187: 2170: 2169: 2167: 2166: 2161: 2156: 2151: 2145: 2143: 2139: 2138: 2136: 2135: 2130: 2125: 2120: 2110: 2104: 2102: 2100:user interface 2094: 2093: 2091: 2090: 2085: 2080: 2075: 2070: 2065: 2059: 2057: 2049: 2048: 2046: 2045: 2040: 2035: 2029: 2027: 2021: 2020: 2018: 2017: 2012: 2007: 2002: 1997: 1991: 1989: 1981: 1980: 1977: 1976: 1974: 1973: 1968: 1963: 1958: 1953: 1948: 1943: 1938: 1932: 1930: 1926: 1925: 1923: 1922: 1917: 1912: 1907: 1902: 1897: 1892: 1887: 1882: 1877: 1872: 1867: 1862: 1856: 1854: 1845: 1836: 1835: 1833: 1832: 1827: 1825:Word embedding 1822: 1817: 1812: 1805:Language model 1802: 1797: 1792: 1787: 1782: 1776: 1774: 1767: 1766: 1764: 1763: 1758: 1756:Transfer-based 1753: 1748: 1743: 1738: 1732: 1730: 1724: 1723: 1721: 1720: 1715: 1710: 1704: 1702: 1696: 1695: 1692: 1691: 1689: 1688: 1683: 1678: 1673: 1668: 1663: 1658: 1652: 1650: 1641: 1640: 1635: 1630: 1625: 1620: 1615: 1609: 1608: 1603: 1598: 1593: 1588: 1583: 1578: 1577: 1576: 1571: 1561: 1556: 1551: 1546: 1541: 1536: 1531: 1529:Concept mining 1526: 1521: 1515: 1513: 1507: 1506: 1504: 1503: 1498: 1493: 1488: 1483: 1482: 1481: 1476: 1466: 1461: 1455: 1453: 1449: 1448: 1441: 1440: 1433: 1426: 1418: 1412: 1411: 1406: 1401: 1395: 1389: 1383: 1377: 1371: 1359: 1353: 1340: 1334: 1328: 1322: 1316: 1310: 1305: 1302:Snowball on C# 1299: 1293: 1287: 1281: 1278:Apache OpenNLP 1273: 1272:External links 1270: 1268: 1267: 1258: 1249: 1240: 1231: 1218: 1209: 1193: 1185: 1176: 1167: 1158: 1145: 1132: 1125: 1116: 1107: 1098: 1091: 1082: 1075: 1067: 1065: 1062: 1060: 1059: 1042: 1031: 1020: 999: 978: 961: 942: 923: 910: 895: 882: 863: 841: 827: 823:, pp. 
145-153 807: 775: 761: 747: 720: 718: 715: 714: 713: 707: 701: 695: 689: 683: 677: 672: 667: 662: 656: 651: 645: 638: 635: 630: 627: 608: 605: 596: 593: 581:search engines 569: 566: 556: 553: 545:etymologically 538:false negative 534:false positive 528: 525: 519: 516: 499: 496: 490: 487: 458: 457:Affix stemmers 455: 449: 446: 437: 436:-gram analysis 431: 418: 415: 406:part of speech 397: 394: 304: 301: 285: 284: 281: 278: 269: 266: 260: 257: 234: 231: 225: 222: 219: 151: 148: 83: 80: 49:search engines 15: 9: 6: 4: 3: 2: 2217: 2206: 2203: 2201: 2198: 2196: 2193: 2191: 2188: 2186: 2183: 2182: 2180: 2165: 2162: 2160: 2157: 2155: 2154:Hallucination 2152: 2150: 2147: 2146: 2144: 2140: 2134: 2131: 2129: 2126: 2124: 2121: 2118: 2114: 2111: 2109: 2106: 2105: 2103: 2101: 2095: 2089: 2088:Spell checker 2086: 2084: 2081: 2079: 2076: 2074: 2071: 2069: 2066: 2064: 2061: 2060: 2058: 2056: 2050: 2044: 2041: 2039: 2036: 2034: 2031: 2030: 2028: 2026: 2022: 2016: 2013: 2011: 2008: 2006: 2003: 2001: 1998: 1996: 1993: 1992: 1990: 1988: 1982: 1972: 1969: 1967: 1964: 1962: 1959: 1957: 1954: 1952: 1949: 1947: 1944: 1942: 1939: 1937: 1934: 1933: 1931: 1927: 1921: 1918: 1916: 1913: 1911: 1908: 1906: 1903: 1901: 1900:Speech corpus 1898: 1896: 1893: 1891: 1888: 1886: 1883: 1881: 1880:Parallel text 1878: 1876: 1873: 1871: 1868: 1866: 1863: 1861: 1858: 1857: 1855: 1849: 1846: 1841: 1837: 1831: 1828: 1826: 1823: 1821: 1818: 1816: 1813: 1810: 1806: 1803: 1801: 1798: 1796: 1793: 1791: 1788: 1786: 1783: 1781: 1778: 1777: 1775: 1772: 1768: 1762: 1759: 1757: 1754: 1752: 1749: 1747: 1744: 1742: 1741:Example-based 1739: 1737: 1734: 1733: 1731: 1729: 1725: 1719: 1716: 1714: 1711: 1709: 1706: 1705: 1703: 1701: 1697: 1687: 1684: 1682: 1679: 1677: 1674: 1672: 1671:Text chunking 1669: 1667: 1664: 1662: 1661:Lemmatisation 1659: 1657: 1654: 1653: 1651: 1649: 1645: 1639: 1636: 1634: 1631: 1629: 1626: 1624: 1621: 1619: 1616: 1614: 1611: 1610: 1607: 1604: 1602: 1599: 1597: 1594: 1592: 1589: 1587: 1584: 1582: 1579: 1575: 1572: 1570: 1567: 1566: 1565: 1562: 1560: 1557: 1555: 1552: 1550: 1547: 1545: 1542: 1540: 1537: 1535: 1532: 1530: 1527: 1525: 1522: 1520: 1517: 1516: 1514: 1512: 1511:Text analysis 1508: 1502: 1499: 1497: 1494: 1492: 1489: 1487: 1484: 1480: 1477: 1475: 1472: 1471: 1470: 1467: 1465: 1462: 1460: 1457: 1456: 1454: 1452:General terms 1450: 1446: 1439: 1434: 1432: 1427: 1425: 1420: 1419: 1416: 1410: 1409:Tamil Stemmer 1407: 1405: 1402: 1399: 1398:czech_stemmer 1396: 1393: 1392:hindi_stemmer 1390: 1387: 1384: 1381: 1378: 1375: 1372: 1370: 1366: 1363: 1360: 1357: 1354: 1351: 1347: 1344: 1341: 1338: 1335: 1332: 1329: 1326: 1323: 1320: 1317: 1314: 1311: 1309: 1306: 1303: 1300: 1297: 1294: 1291: 1288: 1285: 1284:SMILE Stemmer 1282: 1279: 1276: 1275: 1265: 1264: 1259: 1256: 1255: 1250: 1247: 1246: 1241: 1238: 1237: 1232: 1229: 1225: 1224: 1219: 1216: 1215: 1210: 1207: 1206: 1202: 1199: 1194: 1192: 1191: 1186: 1183: 1182: 1177: 1174: 1173: 1168: 1165: 1164: 1159: 1156: 1152: 1151: 1146: 1143: 1139: 1138: 1133: 1130: 1126: 1123: 1122: 1117: 1114: 1113: 1108: 1105: 1104: 1099: 1096: 1092: 1089: 1088: 1083: 1080: 1076: 1073: 1069: 1068: 1057: 1053: 1052: 1046: 1040: 1035: 1029: 1024: 1017: 1013: 1010: 1009: 1003: 996: 992: 989: 988: 982: 975: 971: 965: 958: 954: 953: 946: 939: 935: 934: 927: 920: 914: 907: 906: 899: 893: 892: 886: 879: 875: 874: 867: 861: 860:1-59593-108-2 857: 853: 852: 845: 839: 838: 831: 825: 822: 817: 811: 797: 793: 786: 779: 773: 772: 765: 757: 751: 743: 739: 732: 725: 721: 711: 708: 705: 702: 699: 696: 693: 690: 
687: 684: 681: 678: 676: 673: 671: 668: 666: 665:Lemmatization 663: 660: 657: 655: 652: 649: 646: 644: 641: 640: 634: 626: 623: 622:Google Search 619: 617: 612: 604: 602: 592: 590: 586: 582: 579: 575: 574:query systems 565: 563: 552: 548: 546: 541: 539: 535: 527:Error metrics 524: 515: 513: 509: 503: 495: 486: 484: 480: 476: 472: 468: 464: 454: 445: 443: 435: 430: 426: 423: 414: 410: 407: 403: 402:lemmatisation 393: 391: 387: 382: 380: 376: 372: 368: 364: 358: 356: 352: 348: 343: 339: 335: 329: 327: 323: 319: 315: 309: 300: 297: 295: 294:Lemmatisation 291: 282: 279: 276: 275: 274: 265: 256: 254: 249: 247: 242: 237: 218: 215: 214:Chris D Paice 210: 208: 204: 200: 199:free software 194: 192: 188: 184: 183:Martin Porter 179: 177: 176:Gerard Salton 173: 169: 165: 161: 160:John W. Tukey 157: 147: 145: 141: 137: 133: 129: 125: 121: 117: 113: 109: 105: 101: 97: 93: 89: 79: 77: 73: 69: 65: 60: 58: 55:as a kind of 54: 50: 46: 42: 38: 34: 30: 26: 22: 2068:Concordancer 1675: 1464:Bag-of-words 1313:Ruby-Stemmer 1261: 1252: 1243: 1234: 1222: 1212: 1196: 1189: 1179: 1170: 1161: 1154: 1148: 1141: 1135: 1128: 1119: 1110: 1101: 1094: 1085: 1078: 1071: 1050: 1045: 1034: 1023: 1007: 1002: 986: 981: 973: 964: 956: 950: 945: 937: 931: 926: 918: 913: 904: 898: 890: 885: 872: 866: 850: 844: 836: 830: 819: 815: 810: 799:. Retrieved 795: 791: 778: 770: 764: 750: 741: 737: 724: 632: 620: 613: 610: 598: 571: 558: 555:Applications 549: 542: 530: 521: 504: 501: 492: 482: 479:indefinitely 478: 460: 451: 439: 433: 427: 420: 411: 399: 389: 385: 383: 378: 374: 370: 366: 365:, where the 362: 359: 354: 350: 346: 341: 337: 333: 330: 325: 321: 317: 313: 310: 306: 298: 286: 271: 262: 250: 246:lookup table 243: 240: 211: 195: 186: 180: 172:Michael Lesk 153: 143: 142:to the stem 139: 135: 131: 127: 123: 119: 118:to the stem 115: 111: 107: 103: 99: 95: 87: 85: 75: 71: 67: 61: 24: 18: 2025:Topic model 1905:Text corpus 1751:Statistical 1618:Text mining 1459:AI-complete 1056:Google Inc. 710:Text mining 629:Text mining 508:declensions 465:, the term 463:linguistics 353:instead of 2179:Categories 1746:Rule-based 1628:Truecasing 1496:Stop words 1380:jsSnowball 997:, dtSearch 801:2017-12-21 717:References 654:Inflection 648:Derivation 422:Stochastic 386:friendlies 371:friendlies 347:friendlies 314:friendlies 221:Algorithms 41:Algorithms 31:, base or 2055:reviewing 1853:standards 1851:Types and 1374:PTStemmer 818:, in the 771:Y-stemmer 562:idiomatic 29:word stem 1971:Wikidata 1951:FrameNet 1936:BabelNet 1915:Treebank 1885:PropBank 1830:Word2vec 1795:fastText 1676:Stemming 1365:Archived 1346:Archived 1296:Snowball 1201:Archived 1012:Archived 991:Archived 959::249–271 744:: 22–31. 637:See also 616:Snowball 576:such as 375:friendly 363:friendly 355:friendl' 351:friendly 349:becomes 207:Snowball 201:(mostly 82:Examples 53:synonyms 25:stemming 2142:Related 2108:Chatbot 1966:WordNet 1946:DBpedia 1820:Seq2seq 1564:Parsing 1479:Trigram 589:n-grams 326:Friendl 322:friendl 187:Program 150:History 136:arguing 108:fishing 100:catlike 92:strings 76:stemmer 2115:(c.f. 
1773:models 1761:Neural 1474:Bigram 1469:n-gram 1290:Themis 858:  670:Lexeme 475:suffix 471:prefix 442:n-gram 390:friend 379:friend 138:, and 132:argues 128:argued 116:fisher 114:, and 112:fished 102:, and 2164:spaCy 1809:large 1800:GloVe 1153:, in 788:(PDF) 734:(PDF) 473:or a 467:affix 336:with 140:argus 124:argue 104:catty 74:, or 1929:Data 1780:BERT 1319:PECL 856:ISBN 821:2009 796:1171 686:NLTK 614:The 144:argu 120:fish 96:cats 33:root 1961:UBY 972:", 578:Web 461:In 342:ies 334:ies 318:ies 203:BSD 170:by 162:of 94:as 88:cat 19:In 2181:: 794:. 790:. 742:11 740:. 736:. 603:. 381:. 367:ly 357:. 324:. 146:. 134:, 130:, 126:, 110:, 98:, 78:. 70:, 62:A 2119:) 1842:, 1811:) 1807:( 1437:e 1430:t 1423:v 957:9 804:. 758:. 434:n 338:y 229::
