Text corpus - Knowledge

In linguistics and natural language processing, a corpus (pl.: corpora) or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

Annotated, they have been used in corpus linguistics for statistical hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

Overview

A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus).

In order to make the corpora more useful for doing linguistic research, they are often subjected to a process known as annotation. An example of annotating a corpus is part-of-speech tagging, or POS-tagging, in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of tags. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.

Some corpora have further structured levels of analysis applied. In particular, smaller corpora may be fully parsed. Such corpora are usually called Treebanks or Parsed Corpora. The difficulty of ensuring that the entire corpus is completely and consistently annotated means that these corpora are usually smaller, containing around one to three million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.

Applications

Corpora are the main knowledge base in corpus linguistics. Other notable areas of application include:

Language technology, natural language processing, computational linguistics: The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part of speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. Corpora can be considered as a type of foreign language writing aid as the contextualised grammatical knowledge acquired by non-native language users through exposure to authentic texts in corpora allows learners to grasp the manner of sentence formation in the target language, enabling effective writing.

Machine translation: Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. There are two main types of parallel corpora which contain texts in two languages. In a translation corpus, the texts in one language are translations of texts in the other language. In a comparable corpus, the texts are of the same kind and cover the same content, but they are not translations of each other. To exploit a parallel text, some kind of text alignment identifying equivalent text segments (phrases or sentences) is a prerequisite for analysis. Machine translation algorithms for translating between two languages are often trained using parallel fragments comprising a first-language corpus and a second-language corpus, which is an element-for-element translation of the first-language corpus.

Philologies: Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora can be of such short duration that they provide a snapshot in time. One of the shortest corpora in time may be the 15–30 year Amarna letters texts (1350 BC). The corpus of an ancient city (for example the "Kültepe Texts" of Turkey) may go through a series of corpora, determined by their find site dates.

Some notable text corpora

Main article: List of text corpora

See also

Concordance
Corpus linguistics
Distributional–relational database
Linguistic Data Consortium
Natural language processing
Natural Language Toolkit
Parallel text
Speech corpus
Translation memory
Treebank
Zipf's law

External links

ACL SIGLEX Resource Links: Text Corpora (archived 2013-08-13 at the Wayback Machine)
Developing Linguistic Corpora: a Guide to Good Practice
Free samples (not free), web-based corpora (45-425 million words each): American (COCA, COHA, TIME), British (BNC), Spanish, Portuguese
Intercorp: Building synchronous parallel corpora of the languages taught at the Faculty of Arts of Charles University.
Sketch Engine: Open corpora with free access
TS Corpus – A Turkish Corpus freely available for academic research.
Turkish National Corpus – A general-purpose corpus for contemporary Turkish
Corpus of Political Speeches, Free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library
Russian National Corpus
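
Part-of-speech tagging, lemma annotation, and the frequency lists that can be derived from a corpus may be easier to picture with a small example. The following Python sketch is illustrative only: the two sentences are invented, the tag names (DET, NOUN, VERB) are a simplified tag set assumed for the example, and the names Token, CORPUS, lemma_frequencies and forms_by_pos are hypothetical helpers rather than part of any real corpus toolkit.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """One annotated word: surface form, part-of-speech tag, lemma."""
    form: str
    pos: str    # simplified, illustrative tag set (an assumption)
    lemma: str  # base form of the word


# A toy monolingual corpus: two hand-annotated sentences (invented data).
CORPUS = [
    [Token("The", "DET", "the"), Token("cats", "NOUN", "cat"),
     Token("sleep", "VERB", "sleep")],
    [Token("A", "DET", "a"), Token("cat", "NOUN", "cat"),
     Token("slept", "VERB", "sleep")],
]


def lemma_frequencies(corpus):
    """Return a frequency list of (lemma, count) pairs, most frequent first."""
    counts = Counter(tok.lemma for sentence in corpus for tok in sentence)
    return counts.most_common()


def forms_by_pos(corpus, pos):
    """Collect, in corpus order, the surface forms carrying a given tag."""
    return [tok.form for sentence in corpus for tok in sentence
            if tok.pos == pos]


if __name__ == "__main__":
    print(lemma_frequencies(CORPUS))
    print(forms_by_pos(CORPUS, "VERB"))
```

Counting lemmas rather than surface forms groups "cats" with "cat" and "slept" with "sleep", which is one reason lemma annotation is useful when building frequency lists. A real corpus would normally be tagged automatically or by trained annotators and stored in a standard format rather than as Python literals.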

Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.
