to those in suffix stripping or lemmatisation. Stemming is performed by giving an inflected form to the trained model, which produces the root form according to its internal ruleset. This again resembles suffix stripping and lemmatisation, except that the decisions involved (whether to apply the most appropriate rule, to leave the word unstemmed, or to apply two different rules in sequence) are made so that the output word has the highest probability of being correct (that is, the smallest probability of being incorrect, which is how correctness is typically measured).
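Under such a model, rule selection can be sketched as scoring every candidate output, including the unstemmed word itself, and keeping the most probable one. The rules and probability scores below are invented placeholders, not values from any trained model:

```python
# Sketch of stochastic rule selection: every applicable rule (including
# "do nothing") produces a candidate, and an estimated probability of
# correctness decides between them. Scores here are invented.
def candidates(word):
    yield word, 0.10                      # leave the word unstemmed
    if word.endswith("ies"):
        yield word[:-3] + "y", 0.85       # replace -ies with -y
    if word.endswith("s"):
        yield word[:-1], 0.40             # strip plural -s

def stochastic_stem(word):
    # Keep whichever candidate the model scores as most likely correct.
    return max(candidates(word), key=lambda c: c[1])[0]

print(stochastic_stem("ponies"))  # pony
```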
the language). Alternatively, some suffix stripping approaches maintain a database (a large list) of all known morphological word roots that exist as real words. These approaches check the list for the existence of the term prior to making a decision. Typically, if the term does not exist, alternate action is taken. This alternate action may involve several other criteria. The non-existence of an output term may serve to cause the algorithm to try alternate suffix stripping rules.
The advantages of this approach are that it is simple, fast, and easily handles exceptions. The disadvantages are that all inflected forms must be explicitly listed in the table: new or unfamiliar words are not handled, even if they are perfectly regular (e.g. cats ~ cat), and the table may be large. For languages with simple morphology, like English, table sizes are modest, but highly inflected languages like Turkish may have hundreds of potential inflected forms for each root.
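A lookup-table stemmer can be sketched as a plain dictionary from inflected form to root; the entries below are illustrative only:

```python
# Minimal lookup-table stemmer: every inflected form must be listed explicitly.
LOOKUP = {
    "cats": "cat",
    "running": "run",
    "ran": "run",   # irregular forms are handled as easily as regular ones
}

def stem(word):
    # Unknown words fall through unchanged, even regular ones like "dogs".
    return LOOKUP.get(word, word)

print(stem("ran"))   # run
print(stem("dogs"))  # dogs (not in the table, so not stemmed)
```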
In the rule-based approach, the three rules mentioned above would be applied in succession to converge on the same solution. Chances are that the brute force approach would be faster, as lookup algorithms have direct access to the solution, while a rule-based approach must try several rules, and combinations of them, and then choose whichever result seems best.
Such algorithms use a stem database (for example a set of documents that contain stem words). These stems, as mentioned above, are not necessarily valid words themselves (but rather common sub-strings, as the "brows" in "browse" and in "browsing"). In order to stem a word the algorithm tries to match
Hybrid approaches use two or more of the approaches described above in unison. A simple example is a suffix tree algorithm which first consults a lookup table using brute force. However, instead of trying to store the entire set of relations between words in a given language, the lookup table is kept
Some lemmatisation algorithms are stochastic in that, given a word which may belong to multiple parts of speech, a probability is assigned to each possible part. This may take into account the surrounding words, called the context, or not. Context-free grammars do not take into account any additional
algorithms involve using probability to identify the root form of a word. Stochastic algorithms are trained (they "learn") on a table of root form to inflected form relations to develop a probabilistic model. This model is typically expressed in the form of complex linguistic rules, similar in nature
Suffix stripping algorithms may differ in results for a variety of reasons. One such reason is whether the algorithm constrains whether the output word must be a real word in the given language. Some approaches do not require the word to actually exist in the language lexicon (the set of all words in
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and encoding suffix stripping rules. Suffix stripping algorithms are sometimes regarded as
This approach is highly conditional upon obtaining the correct lexical category (part of speech). While there is overlap between the normalization rules for certain categories, identifying the wrong category or being unable to produce the right category limits the added benefit of this approach over
at
Lancaster University in the late 1980s, it is an iterative stemmer and features an externally stored set of stemming rules. The standard set of rules provides a 'strong' stemmer and may specify the removal or replacement of an ending. The replacement technique avoids the need for a separate stage
it with stems from the database, applying various constraints, such as on the relative length of the candidate stem within the word (so that, for example, the short prefix "be", which is the stem of words such as "be", "been" and "being", would not be considered as the stem of the word "beside").
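The length constraint can be sketched as follows; the stem list and the one-half threshold are invented for illustration, not taken from any published stemmer:

```python
# Toy stem-database matcher: the candidate stem must be a prefix of the word
# and must cover at least half of it, so the short stem "be" is rejected
# as the stem of "beside".
STEM_DB = {"be", "brows", "run"}

def match_stem(word, min_ratio=0.5):
    # Prefer the longest stem that satisfies the length constraint.
    for stem in sorted(STEM_DB, key=len, reverse=True):
        if word.startswith(stem) and len(stem) / len(word) >= min_ratio:
            return stem
    return None

print(match_stem("browsing"))  # brows
print(match_stem("beside"))    # None ("be" covers only 2/6 of the word)
```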
It can be the case that two or more suffix stripping rules apply to the same input term, which creates an ambiguity as to which rule to apply. The algorithm may assign (by human hand or stochastically) a priority to one rule or another. Or the algorithm may reject one rule application because it
Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root form relations. Instead, a typically smaller list of "rules" is stored which provides a path for the algorithm, given an input word form, to find its root form. Some examples of the rules include:
Hebrew and Arabic are still considered difficult research languages for stemming. English stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person singular present form of the verb "dry", "axes" being the plural of "axe" as well as "axis"); but stemmers
adopted word stemming in 2003. Previously a search for "fish" would not have returned "fishing". Other software search algorithms vary in their use of word stemming. Programs that simply search for substrings will obviously find "fish" in "fishing" but when searching for "fishes" will not find
The lookup table used by a stemmer is generally produced semi-automatically. For example, if the word is "run", then the inverted algorithm might automatically generate the forms "running", "runs", "runned", and "runly". The last two forms are valid constructions, but they are unlikely to appear in normal English text.
become harder to design as the morphology, orthography, and character encoding of the target language become more complex. For example, an
Italian stemmer is more complex than an English one (because of a greater number of verb inflections), a Russian one is more complex still (more noun
Stemming is used as an approximate method for grouping words with a similar basic meaning together. For example, a text mentioning "daffodils" is probably closely related to a text mentioning "daffodil" (without the s). But in some cases, words with the same morphological stem have
Many implementations of the Porter stemming algorithm were written and freely distributed; however, many of these implementations contained subtle flaws. As a result, these stemmers did not match their potential. To eliminate this source of error, Martin Porter released an official
Proceedings of the ACL-2009, Joint conference of the 47th Annual
Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, Singapore, August 2–7,
Diving further into the details, a common technique is to apply rules in a cyclical fashion (recursively, as computer scientists would say). After applying the suffix substitution rule in this example scenario, a second pass is made to identify matching rules on the term
information. In either case, after assigning the probabilities to each possible part of speech, the most likely part of speech is chosen, and from there the appropriate normalization rules are applied to the input word to produce the normalized (root) form.
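That flow, assigning probabilities to each possible part of speech and then applying that tag's own normalization rules, can be sketched as follows; the probability table and the rules are made-up placeholders:

```python
# Sketch of stochastic, part-of-speech-aware normalization: pick the most
# probable POS tag for the word, then apply that tag's own rule set.
# All probabilities and rules here are invented for illustration.
POS_PROBS = {"saw": {"verb": 0.7, "noun": 0.3}}

def normalize(word):
    tags = POS_PROBS.get(word, {"noun": 1.0})
    best_tag = max(tags, key=tags.get)      # most likely part of speech
    if best_tag == "verb" and word == "saw":
        return "see"                        # verb rule: irregular past tense
    if best_tag == "noun" and word.endswith("s"):
        return word[:-1]                    # noun rule: strip plural -s
    return word

print(normalize("saw"))   # see (the verb reading is more probable)
```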
suffix stripping algorithms. The basic idea is that, if the stemmer is able to grasp more information about the word being stemmed, then it can apply more accurate normalization rules (which unlike suffix stripping rules can also modify the stem).
of a word, and applying different normalization rules for each part of speech. The part of speech is first detected prior to attempting to find the root since for some languages, the stemming rules change depending on a word's part of speech.
suffix stripping rule as well as the suffix substitution rule apply. Since the stripping rule results in a non-existent term in the lexicon, but the substitution rule does not, the substitution rule is applied instead. In this example,
One improvement upon basic suffix stripping is the use of suffix substitution. Similar to a stripping rule, a substitution rule replaces a suffix with an alternate suffix. For example, there could exist a rule that replaces
Multilingual stemming applies morphological rules of two or more languages simultaneously instead of rules for only a single language when interpreting a search query. Commercial systems using multilingual stemming exist.
An example of understemming in the Porter stemmer is "alumnus" → "alumnu", "alumni" → "alumni", "alumna"/"alumnae" → "alumna". These English words keep their Latin morphology, and so these near-synonyms are not conflated.
There are two error measurements in stemming algorithms, overstemming and understemming. Overstemming is an error where two separate inflected words are stemmed to the same root, but should not have been—a false positive.
small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If the word is not in the exception list, apply suffix stripping or lemmatisation and output the result.
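A hybrid of this kind, an exception table consulted before an ordinary suffix-stripping pass, can be sketched as follows (the exception entries and suffix list are illustrative):

```python
# Hybrid stemmer sketch: a small exception table is consulted first;
# everything else falls through to plain suffix stripping.
EXCEPTIONS = {"ran": "run", "geese": "goose"}
SUFFIXES = ["ing", "ed", "s"]

def hybrid_stem(word):
    if word in EXCEPTIONS:            # step 1: frequent irregular forms
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:           # step 2: ordinary suffix stripping
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word

print(hybrid_stem("ran"))      # run
print(hybrid_stem("walking"))  # walk
```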
in 1968. This paper was remarkable for its early date and had great influence on later work in this area. Her paper refers to three earlier major attempts at stemming algorithms, by
Professor
While much of the early academic work in this area was focused on the
English language (with significant use of the Porter Stemmer algorithm), many other languages have been investigated.
in the process to recode or provide partial matching. Paice also developed a direct measurement for comparing stemmers based on counting the over-stemming and under-stemming errors.
crude given the poor performance when dealing with exceptional relations (like 'ran' and 'run'). The solutions produced by suffix stripping algorithms are limited to those
For example, the widely used Porter stemmer stems "universal", "university", and "universe" to "univers". This is a case of overstemming: though these three words are etymologically related, their modern meanings are in widely different domains, so treating them as synonyms in a search engine will likely reduce the relevance of the search results.
This example also helps illustrate the difference between a rule-based approach and a brute force approach. In a brute force approach, the algorithm would search for
meanings which are not closely related: a user searching for "marketing" will not be satisfied by most documents mentioning "markets" but not "marketing".
, a writing system without vowels, and the requirement of prefix stripping: Hebrew stems can be two, three or four characters, but not more), and so on.
which have well-known suffixes with few exceptions. This, however, is a problem, as not all parts of speech have such a well-formulated set of rules.
There are several types of stemming algorithms which differ in respect to performance and accuracy and how certain stemming obstacles are overcome.
Many commercial companies have been using stemming since at least the 1980s and have produced algorithmic and lexical stemmers in many languages.
, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.
, identify that the leading "in" is a prefix that can be removed. Many of the same approaches mentioned earlier apply, but go by the name
. This stemmer was very widely used and became the de facto standard algorithm used for English stemming. Dr. Porter received the
. The effectiveness of stemming for English query systems was soon found to be rather limited, however, and this has led early
of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
. In addition to dealing with suffixes, several approaches also attempt to remove common prefixes. For example, given the word
. Stemming algorithms attempt to minimize each type of error, although reducing one type can lead to increasing the other.
. How this affects the algorithm varies with the algorithm's design. To illustrate, the algorithm may identify that both the
-licensed) implementation of the algorithm around the year 2000. He extended this work over the next few years by building
CLEF 2003: Stephen
Tomlinson compared the Snowball stemmers with the Hummingbird lexical stemming (lemmatization) system
rather than stems, may be used instead. Also, stemmers may provide greater benefits in languages other than English.
1298:—free stemming algorithms for many languages, includes source code, including stemmers for five romance languages
. Understemming is an error where two separate inflected words should be stemmed to the same root, but are not—a false negative.
results in a non-existent term whereas the other overlapping rule does not. For example, given the
English term
Automatic
Training of Lemmatization Rules that Handle Morphological Changes in pre-, in- and Suffixes Alike
in the set of hundreds of thousands of inflected word forms and ideally find the corresponding root form
CLEF 2004: Stephen Tomlinson "Finnish, Portuguese and Russian Retrieval with Hummingbird SearchServer"
researchers to deem stemming irrelevant in general. An alternative approach, based on searching for
Prefix stripping may also be implemented. Of course, not all languages use prefixing or suffixing.
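A prefix-stripping pass analogous to the suffix rules can be sketched as follows; the prefix list and the length guard are invented for illustration:

```python
# Toy prefix stripper: removes a known prefix such as the "in" of
# "indefinitely"; the length guard is a crude defence against false
# hits on very short remainders.
PREFIXES = ["in", "un"]

def strip_prefix(word):
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return word[len(prefix):]
    return word

print(strip_prefix("indefinitely"))  # definitely
print(strip_prefix("unhappy"))       # happy
```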
178:, and a third algorithm developed by James L. Dolby of R and D Consultants, Los Altos, California.
Stemming is used as a task in pre-processing texts before performing text mining analyses on them.
1382:—open source JavaScript implementation of Snowball stemming algorithms for many languages
1292:—open source IR framework, includes Porter stemmer implementation (PostgreSQL, Java API)
Language-Dependent and Language-Independent Approaches to Cross-Lingual Text Retrieval
Light Stemming Approaches for the French, Portuguese, German and Hungarian Languages
1286:—free online service, includes Porter and Paice/Husk' Lancaster stemmers (Java API)
Kamps, Jaap; Monz, Christof; de Rijke, Maarten; and Sigurbjörnsson, Börkur (2004);
stemmers have been compared with commercial lexical stemmers with varying results.
The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data
The Effectiveness of Stemming for Natural-Language Access to Slovene Textual Data
1074:, Bulletin of the Association for Literary and Linguistic Computing, 2(3): 33–46
Comparative Evaluation of Arabic Language Morphological Analysers and Stemmers
. A study of affix stemming for several European languages can be found here.
Building Multilingual Solutions by using Sharepoint Products and Technologies
is the process of reducing inflected (or sometimes derived) words to their
1095:
Stemming algorithms, Information retrieval: data structures and algorithms
400:
A more complex approach to the problem of determining a stem of a word is
35:
form—generally a written word form. The stem need not be identical to the
Proceedings of the 17th ACM SIGIR conference held at Zurich, August 18–22
is likely not found in the lexicon, and therefore the rule is rejected.
. The stem need not be a word; for example, the Porter algorithm reduces
1140:, in Frei, H.-P.; Harman, D.; Schauble, P.; and Wilkinson, R. (eds.);
1115:, Journal of the American Society for Information Science 42 (1), 7–15
An Evaluation of some Conflation Algorithms for Information Retrieval
Method for Evaluation of Stemming Algorithms based on Error Counting
Journal of the American Society for Information Science, 44(1), 1–9
1160:
Lennon, M.; Pierce, D. S.; Tarry, B. D.; & Willett, P. (1981);
936:, in Peters, C.; Gonzalo, J.; Braschler, M.; and Kluck, M. (eds.);
1184:, Mechanical Translation and Computational Linguistics, 11, 22—31
1172:
Error Evaluation for Stemming Algorithms as Clustering Algorithms
938:
Comparative Evaluation of Multilingual Information Access Systems
712:—stemming algorithms play a major role in commercial NLP software
52:
1376:—A Java/Python/.Net stemming toolkit for the Portuguese language
905:
Uma revisĂŁo dos algoritmos de radicalização em lĂngua portuguesa
320:
suffix and apply the appropriate rule and achieve the result of
Stemming Algorithms – A Case Study for Detailed Evaluation
Word Normalization and Decompounding in Mono- and Bilingual IR
369:
stripping rule is likely identified and accepted. In summary,
1106:, Information Processing & Management 10 (11/12), 371–386
Strength and Similarity of Affix Removal Stemming Algorithms
232:
Is there any perfect stemming algorithm in the English language?
193:
in 2000 for his work on stemming and information retrieval.
1254:
A Practical Stemming Algorithm for Online Search Assistance
837:
Stemming Approaches for East European Languages (CLEF 2007)
917:
Baeza-Yates, Ricardo; and Ribeiro-Neto, Berthier (1999);
444:
context of a word to choose the correct stem for a word.
1263:
Corpus-Based Stemming Using Coocurrence of Word Variants
1245:
Stemming of French Words Based on Grammatical Categories
185:
and was published in the July 1980 issue of the journal
1266:, ACM Transactions on Information Systems, 16(1), 61–81
1343:
Official home page of the Lancaster stemming algorithm
1331:
Unofficial home page of the Lovins stemming algorithm
1251:
Ulmschneider, John E.; & Doszkocs, Tamas (1983);
599:
Stemming is used to determine domain vocabularies in
968:
Frakes, W.; Prieto-Diaz, R.; & Fox, C. (1998). "
738:
Mechanical Translation and Computational Linguistics
Official home page of the Porter stemming algorithm
785:"Exploring New Languages with HAIRCUT at CLEF 2005"
. A stemming algorithm might also reduce the words
1129:A Detailed Analysis of English Stemming Algorithms
A simple stemmer looks up the inflected form in a
688:—implements several stemming algorithms in Python
1304:—port of Snowball stemmers for C# (14 languages)
854:, ACM Symposium on Applied Computing, SAC 2006,
682:—stemming is generally regarded as a form of NLP
302:
1443:
1103:Word segmentation by letter successor varieties
776:
1187:Jenkins, Marie-Claire; and Smith, Dan (2005);
404:. This process involves first determining the
267:
66:or subroutine that stems word may be called a
1429:
1230:, Volume 43, Issue 5 (June), pp. 384–390
1190:Conservative Stemming for Search and Indexing
1097:, Upper Saddle River, NJ: Prentice-Hall, Inc.
606:
510:), a Hebrew one is even more complex (due to
1166:, Journal of Information Science, 3: 177–183
236:(more unsolved problems in computer science)
86:A stemmer for English operating on the stem
1356:Official home page of the UEA-Lite Stemmer
1339:—including source code in several languages
1327:—stemming library in C++ released under BSD
1220:PopoviÄŤ, Mirko; and Willett, Peter (1992);
1127:Hull, D. A. & Grefenstette, G. (1996);
970:DARE: Domain Analysis and Reuse Environment
870:PopoviÄŤ, Mirko; and Willett, Peter (1992);
395:
280:if the word ends in 'ing', remove the 'ing'
258:
154:The first published stemmer was written by
1436:
1422:
1333:—with source code in a couple of languages
1150:Viewing Morphology as an Inference Process
700:—designed for creating stemming algorithms
1079:Term Conflation for Information Retrieval
902:Viera, A. F. G. & Virgil, J. (2007);
706:—linguistic definition of the term "stem"
694:—linguistic definition of the term "root"
650:—stemming is a form of reverse derivation
296:attempts to improve upon this challenge.
283:if the word ends in 'ly', remove the 'ly'
277:if the word ends in 'ed', remove the 'ed'
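The rules above can be sketched as a minimal suffix stripper; this is a toy illustration of the technique, not any published algorithm:

```python
# Toy suffix stripper illustrating simple "if the word ends in X, remove X"
# rules. Rules are tried in order; the first matching suffix is removed.
RULES = ["ing", "ly", "ed"]

def strip_suffix(word):
    for suffix in RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word  # no rule applied: return the word unchanged

print(strip_suffix("hopping"))   # hopp  (the stem need not be a real word)
print(strip_suffix("friendly"))  # friend
print(strip_suffix("jumped"))    # jump
```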
1228:American Society for Information Science
1100:Hafer, M. A. & Weiss, S. F. (1974);
908:, Information Research, 12(3), paper 315
880:, Volume 43, Issue 5 (June), pp. 384–390
878:American Society for Information Science
212:The Paice-Husk Stemmer was developed by
1084:Frakes, W. B. & Fox, C. J. (2003);
834:Dolamic, Ljiljana; and Savoy, Jacques;
782:
16:Process of reducing words to word stems
2177:
1280:—includes Porter and Snowball stemmers
1137:Viewing Stemming as Recall Enhancement
1134:Kraaij, W. & Pohlmann, R. (1996);
251:A lookup approach may use preliminary
1417:
731:"Development of a Stemming Algorithm"
Stemmers can be used as elements in query systems such as Web search engines.
447:
227:Unsolved problem in computer science
1260:Xu, J.; & Croft, W. B. (1998);
1181:Development of a Stemming Algorithm
174:, under the direction of Professor
13:
1321:—PHP extension to the Snowball API
1072:Suffix Removal for Word Conflation
1063:
891:Stemming in Hungarian at CLEF 2005
51:treat words with the same stem as
43:for stemming have been studied in
1271:
1236:An Algorithm for Suffix Stripping
456:
440:Some stemming techniques use the
316:, the algorithm may identify the
814:Jongejan, B.; and Dalianis, H.;
783:McNamee, Paul (September 2005).
625:occurrences of the word "fish".
526:
1362:Overview of stemming algorithms
1315:—Ruby extension to Snowball API
1308:Python bindings to Snowball API
1051:The Essentials of Google Search
698:Snowball (programming language)
554:
181:A later stemmer was written by
59:, a process called conflation.
1491:Natural language understanding
1400:—open source stemmer for Czech
1394:—open source stemmer for Hindi
1358:—University of East Anglia, UK
1257:, Online Review, 7(4), 301–318
974:Annals of Software Engineering
940:, Springer Verlag, pp. 152–165
377:which becomes (via stripping)
303:Additional algorithm criteria
220:
166:, the algorithm developed at
1708:Multi-document summarization
1081:, Cambridge University Press
919:Modern Information Retrieval
1325:Oleander Porter's algorithm
1112:How Effective is Suffixing?
756:"Porter Stemming Algorithm"
729:Lovins, Julie Beth (1968).
680:Natural language processing
636:
512:nonconcatenative morphology
373:becomes (via substitution)
268:Suffix-stripping algorithms
81:
23:and information retrieval,
1233:Porter, Martin F. (1980);
1155:Proceedings of ACM-SIGIR93
1054:, Web Search Help Center,
921:, ACM Press/Addison Wesley
607:Use in commercial products
1352:—Lancaster University, UK
1239:, Program, 14(3): 130–137
993:14 September 2011 at the
792:CEUR Workshop Proceedings
643:Computational linguistics
1388:—implementation for Java
1208:, SIGIR Forum, 24: 56–61
1131:, Xerox Technical Report
1090:, SIGIR Forum, 37: 26–30
987:Language Extension Packs
955:, Information Retrieval
675:Morphology (linguistics)
396:Lemmatisation algorithms
259:The production technique
1217:, JASIS, 47(8): 632–649
1014:17 January 2008 at the
255:to avoid overstemming.
1178:Lovins, J. B. (1968);
1093:Frakes, W. B. (1992);
1077:Frakes, W. B. (1984);
1070:Dawson, J. L. (1974);
661:—linguistic definition
253:part-of-speech tagging
47:since the 1960s. Many
1195:Paice, C. D. (1990);
1124:, JASIS, 47(1): 70–84
585:information retrieval
568:Information retrieval
518:Multilingual stemming
417:Stochastic algorithms
191:Tony Kent Strix award
90:should identify such
21:linguistic morphology
1211:Paice, C. D. (1996)
1147:Krovetz, R. (1993);
1118:Hull, D. A. (1996);
949:Airio, Eija (2006);
164:Princeton University
1169:Lovins, J. (1971);
1109:Harman, D. (1991);
1018:, Microsoft Technet
498:Language challenges
489:Matching algorithms
469:refers to either a
1367:2011-07-02 at the
1348:2011-07-22 at the
1242:Savoy, J. (1993);
1203:2011-07-22 at the
1175:, JASIS, 22: 28–40
1157:, pp. 191–203
704:Stem (linguistics)
692:Root (linguistics)
659:Lemma (morphology)
290:lexical categories
168:Harvard University
72:stemming algorithm
37:morphological root
1226:, Journal of the
876:, Journal of the
448:Hybrid approaches
156:Julie Beth Lovins
1386:Snowball Stemmer
1144:, pp. 40–48
1058:
1047:
1041:
1036:
1030:
1025:
1019:
1004:
998:
983:
977:
976:(5), pp. 125-141
966:
960:
947:
941:
928:
922:
915:
909:
900:
894:
887:
881:
868:
862:
848:Savoy, Jacques;
846:
840:
832:
826:
812:
806:
805:
803:
802:
789:
780:
774:
766:
760:
759:
752:
746:
745:
735:
726:
228:
68:stemming program
64:computer program
45:computer science
1241:
1238:
1237:
1232:
1229:
1225:
1224:
1219:
1216:
1215:
1210:
1207:
1206:
1202:
1199:
1194:
1192:
1191:
1186:
1183:
1182:
1177:
1174:
1173:
1168:
1165:
1164:
1159:
1156:
1152:
1151:
1146:
1143:
1139:
1138:
1133:
1130:
1126:
1123:
1122:
1117:
1114:
1113:
1108:
1105:
1104:
1099:
1096:
1092:
1089:
1088:
1083:
1080:
1076:
1073:
1069:
1068:
1057:
1053:
1052:
1046:
1040:
1035:
1029:
1024:
1017:
1013:
1010:
1009:
1003:
996:
992:
989:
988:
982:
975:
971:
965:
958:
954:
953:
946:
939:
935:
934:
927:
920:
914:
907:
906:
899:
893:
892:
886:
879:
875:
874:
867:
861:
860:1-59593-108-2
857:
853:
852:
845:
839:
838:
831:
825:
822:
817:
811:
797:
793:
786:
779:
773:
772:
765:
757:
751:
743:
739:
732:
725:
721:
711:
708:
705:
702:
699:
696:
693:
690:
687:
684:
681:
678:
676:
673:
671:
668:
666:
665:Lemmatization
663:
660:
657:
655:
652:
649:
646:
644:
641:
640:
634:
626:
623:
622:Google Search
619:
617:
612:
604:
602:
592:
590:
586:
582:
579:
575:
574:query systems
565:
563:
552:
548:
546:
541:
539:
535:
527:Error metrics
524:
515:
513:
509:
503:
495:
486:
484:
480:
476:
472:
468:
464:
454:
445:
443:
435:
430:
426:
423:
414:
410:
407:
403:
402:lemmatisation
393:
391:
387:
382:
380:
376:
372:
368:
364:
358:
356:
352:
348:
343:
339:
335:
329:
327:
323:
319:
315:
309:
300:
297:
295:
294:Lemmatisation
291:
282:
279:
276:
275:
274:
265:
256:
254:
249:
247:
242:
237:
218:
215:
214:Chris D Paice
210:
208:
204:
200:
199:free software
194:
192:
188:
184:
183:Martin Porter
179:
177:
176:Gerard Salton
173:
169:
165:
161:
160:John W. Tukey
157:
147:
145:
141:
137:
133:
129:
125:
121:
117:
113:
109:
105:
101:
97:
93:
89:
79:
77:
73:
69:
65:
60:
58:
55:as a kind of
54:
50:
46:
42:
38:
34:
30:
26:
22:
2068:Concordancer
1675:
1464:Bag-of-words
1313:Ruby-Stemmer
1261:
1252:
1243:
1234:
1222:
1212:
1196:
1189:
1179:
1170:
1161:
1154:
1148:
1141:
1135:
1128:
1119:
1110:
1101:
1094:
1085:
1078:
1071:
1050:
1045:
1034:
1023:
1007:
1002:
986:
981:
973:
964:
956:
950:
945:
937:
931:
926:
918:
913:
904:
898:
890:
885:
872:
866:
850:
844:
836:
830:
819:
815:
810:
799:. Retrieved
795:
791:
778:
770:
764:
750:
741:
737:
724:
632:
620:
613:
610:
598:
571:
558:
555:Applications
549:
542:
530:
521:
504:
501:
492:
482:
479:indefinitely
478:
460:
451:
439:
433:
427:
420:
411:
399:
389:
385:
383:
378:
374:
370:
366:
365:, where the
362:
359:
354:
350:
346:
341:
337:
333:
330:
325:
321:
317:
313:
310:
306:
298:
286:
271:
262:
250:
246:lookup table
243:
240:
211:
195:
186:
180:
172:Michael Lesk
153:
143:
142:to the stem
139:
135:
131:
127:
123:
119:
118:to the stem
115:
111:
107:
103:
99:
95:
87:
85:
75:
71:
67:
61:
24:
18:
2025:Topic model
1905:Text corpus
1751:Statistical
1618:Text mining
1459:AI-complete
1056:Google Inc.
710:Text mining
629:Text mining
508:declensions
465:, the term
463:linguistics
353:instead of
2179:Categories
1746:Rule-based
1628:Truecasing
1496:Stop words
1380:jsSnowball
997:, dtSearch
801:2017-12-21
717:References
654:Inflection
648:Derivation
422:Stochastic
386:friendlies
371:friendlies
347:friendlies
314:friendlies
221:Algorithms
41:Algorithms
31:, base or
2055:reviewing
1853:standards
1851:Types and
1374:PTStemmer
818:, in the
771:Y-stemmer
562:idiomatic
29:word stem
1971:Wikidata
1951:FrameNet
1936:BabelNet
1915:Treebank
1885:PropBank
1830:Word2vec
1795:fastText
1676:Stemming
1365:Archived
1346:Archived
1296:Snowball
1201:Archived
1012:Archived
991:Archived
959::249–271
744:: 22–31.
637:See also
616:Snowball
576:such as
375:friendly
363:friendly
355:friendl'
351:friendly
349:becomes
207:Snowball
201:(mostly
82:Examples
53:synonyms
25:stemming
2142:Related
2108:Chatbot
1966:WordNet
1946:DBpedia
1820:Seq2seq
1564:Parsing
1479:Trigram
589:n-grams
326:Friendl
322:friendl
187:Program
150:History
136:arguing
108:fishing
100:catlike
92:strings
76:stemmer
2115:(c.f.
1773:models
1761:Neural
1474:Bigram
1469:n-gram
1290:Themis
858:
670:Lexeme
475:suffix
471:prefix
442:n-gram
390:friend
379:friend
138:, and
132:argues
128:argued
116:fisher
114:, and
112:fished
102:, and
2164:spaCy
1809:large
1800:GloVe
1153:, in
788:(PDF)
734:(PDF)
473:or a
467:affix
336:with
140:argus
124:argue
104:catty
74:, or
1929:Data
1780:BERT
1319:PECL
856:ISBN
821:2009
796:1171
686:NLTK
614:The
144:argu
120:fish
96:cats
33:root
1961:UBY
972:",
578:Web
461:In
342:ies
334:ies
318:ies
203:BSD
170:by
162:of
94:as
88:cat
19:In
2181::
794:.
790:.
742:11
740:.
736:.
603:.
381:.
367:ly
357:.
324:.
146:.
134:,
130:,
126:,
110:,
98:,
78:.
70:,
62:A
2119:)
1842:,
1811:)
1807:(
1437:e
1430:t
1423:v
957:9
804:.
758:.
434:n
338:y
229::