Morphological dictionary

In the fields of computational linguistics and applied linguistics, a morphological dictionary is a linguistic resource that contains correspondences between the surface forms and the lexical forms of words. Surface forms are the forms found in natural language text. The lexical form corresponding to a surface form is the lemma followed by grammatical information (for example, the part of speech, gender and number). In English, give, gives, giving, gave and given are surface forms of the verb give; the lexical form would be "give", verb. There are two kinds of morphological dictionaries: morpheme-aligned dictionaries and full-form (non-aligned) dictionaries.

Notable examples and formalisms

Universal Morphologies

Inspired by the success of Universal Dependencies for cross-linguistic annotation of syntactic dependencies, similar efforts have emerged for morphology, e.g., UniMorph and UDer. These use simple tab-separated formats with one form per row, together with its derivation (UDer) or its inflection information (UniMorph):

    aalen   aalend  V.PTCP;PRS
    aalen   aalen   V;IND;PRS;1;PL
    aalen   aalen   V;IND;PRS;3;PL
    aalen   aalen   V;NFIN

(UniMorph, German. Columns are LEMMA, FORM, FEATURES)

In UDer, additional information (part of speech) is encoded within the columns:

    abändern_V      Abänderung_Nf   dVN07>
    Abarbeiten_Nn   abarbeiten_V    dNV09>
    abartig_A       Abartigkeit_Nf  dAN03>
    Abart_Nf        abartig_A       dNA05>
    abbaggern_V     Abbaggern_Nn    dVN09>

(UDer, German DErivBase 0.5. Columns are BASE, DERIVED, RULE)

At the time of writing (2021), all of these are non-aligned morphological dictionaries (see below). Their simple format is particularly well suited to the application of machine learning techniques, and UniMorph in particular has been the subject of numerous shared tasks.

Finite State Transducers

Finite State Transducers (FSTs) are a popular technique for the computational handling of morphology, especially inflectional morphology. In rule-based morphological parsers, both lexicon and rules are normally formalized as finite state automata and subsequently combined. They thus require morphological dictionaries with specific processing instructions (which often have a linguistic interpretation but, technically, are treated like arbitrary string symbols). Popular FST packages such as SFST (as available from the fst package in Debian and Ubuntu) allow users to define application-specific file formats for morphological lexica that bundle different pieces of morphological information with every individual morpheme. These are thus aligned morphological dictionaries, but very rich (and also idiosyncratic) in structure.

Sample data from SMOR (German SFST grammar):

    <Base_Stems>Aachen<NN><base><nativ><Name-Neut_s>
    <Base_Stems>Aal<NN><base><nativ><NMasc_es_e>
    <Base_Stems>Aarau<NN><base><nativ><Name-Neut_s>
    <Suff_Stems><suffderiv><gebunden><kompos><NN>nom<>:e<>:n<NN><SUFF><kompos><frei>
    <Suff_Stems><suffderiv><gebunden><kompos><NN>nom<NN><SUFF><base><frei><NMasc_en_en>
    <Suff_Stems><suffderiv><gebunden><kompos><NN>nom<NN><SUFF><deriv><frei>

Interlinear Glossed Text editors

Interlinear Glossed Text (IGT) is a popular formalism in language documentation, linguistic typology and other branches of linguistics and the philologies. Although IGT can be created without any specialized software (with just a conventional editor), such specialized software has been developed, with notable examples such as Toolbox, the FieldWorks Language Explorer (FLEx) or open source alternatives such as Xigt. Toolbox and FLEx support semi-automated annotation by means of an internal morphological dictionary. Whenever a morphological segment is encountered for which an annotation can be found in the dictionary, this annotation is applied. Whenever a morphological segment is newly annotated, the annotation is stored in the dictionary. FLEx and Toolbox provide different editor functionalities for annotating text and editing dictionaries, so that additional information beyond that found in annotations can be added, but at their core, their formats provide aligned morphological dictionaries.

FLEx and Xigt are based on XML formats, whereas Toolbox uses a plain text format with idiosyncratic "markers". FLEx and Toolbox are not directly interoperable with each other, but a semi-automated converter from Toolbox to FLEx does exist. Xigt comes with FLEx and Toolbox importers, but is less widely used than either FLEx or Toolbox. The formats of FLEx and Toolbox are not intended for human consumption, nor are they well supported by any processing software other than their native tools.

OntoLex-Morph: A community standard for morphological dictionaries

OntoLex is a community standard for machine-readable dictionaries on the web. In 2019, the OntoLex-Morph module was proposed to facilitate data modelling of morphology in lexicography, as well as to provide a data model for morphological dictionaries for Natural Language Processing. OntoLex-Morph supports both aligned and non-aligned morphological dictionaries. A specific goal is to establish interoperability between and among IGT dictionaries, FST lexicons and morphological dictionaries used for machine learning.

Types and structure of morphological dictionaries

Aligned morphological dictionaries

In an aligned morphological dictionary, the correspondence between the surface form and the lexical form of a word is aligned at the character level, for example:

    (h,h) (o,o) (u,u) (s,s) (e,e) (s,⟨n⟩) (θ,⟨pl⟩)

Here θ is the empty symbol, ⟨n⟩ signifies "noun", and ⟨pl⟩ signifies "plural".

In the example, the left-hand side is the surface form (input) and the right-hand side is the lexical form (output). This order is used in morphological analysis, where a lexical form is generated from a surface form. In morphological generation this order would be reversed.

Formally, if $\Sigma$ is the alphabet of the input symbols and $\Gamma$ is the alphabet of the output symbols, an aligned morphological dictionary is a subset $A \subseteq L^{*}$, where

$L = ((\Sigma \cup \{\theta\}) \times \Gamma) \cup (\Sigma \times (\Gamma \cup \{\theta\}))$

is the alphabet of all possible alignments, including those with the empty symbol. That is, an aligned morphological dictionary is a set of strings in $L^{*}$.

Non-aligned morphological dictionaries (full-form dictionaries)

A non-aligned morphological dictionary (or full-form dictionary) is simply a set $U \subseteq \Sigma^{*} \times \Gamma^{*}$ of pairs of input and output strings. A non-aligned morphological dictionary would represent the previous example as:

    (houses, house⟨n⟩⟨pl⟩)

It is possible to convert a non-aligned dictionary into an aligned dictionary. Besides trivial alignments to the left or to the right, linguistically motivated alignments, which align characters to their corresponding morphemes, are possible.

Lexical ambiguities

Frequently, more than one lexical form is associated with a surface form of a word. For example, "house" may be a noun in the singular, /haʊs/, or a verb in the present tense, /haʊz/. As a result, it is necessary to have a function which relates input strings to their corresponding output strings.

If we define the set $E \subseteq \Sigma^{*}$ of input words such that $E = \{w : (w, w') \in U\}$, the correspondence function would be $\tau : E \rightarrow 2^{\Gamma^{*}}$, defined as $\tau(w) = \{w' : (w, w') \in U\}$.

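The trivial left alignment mentioned for converting a non-aligned dictionary into an aligned one can be sketched along the same lines. This is a minimal sketch under stated assumptions: lexical forms are taken to be plain characters followed by ASCII tags in angle brackets, and the names `tokenize_lexical` and `left_align` are hypothetical. It does not attempt a linguistically motivated alignment.

```python
import re
from itertools import zip_longest

THETA = "θ"  # the empty symbol


def tokenize_lexical(lexical):
    """Split a lexical form into symbols: <tag> units or single characters."""
    return re.findall(r"<[^<>]+>|.", lexical)


def left_align(surface, lexical):
    """Pair the i-th input symbol with the i-th output symbol.

    The shorter side is padded on the right with the empty symbol, so no
    pair consists of two empty symbols.
    """
    inputs = list(surface)
    outputs = tokenize_lexical(lexical)
    return list(zip_longest(inputs, outputs, fillvalue=THETA))


aligned = left_align("houses", "house<n><pl>")
print(" ".join(f"({i},{o})" for i, o in aligned))
```

For the pair (houses, house⟨n⟩⟨pl⟩), this left alignment happens to coincide with the aligned example given for aligned dictionaries; for other word pairs a trivial alignment may pair unrelated symbols.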

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.