Language documentation tools and methods

resolution (1080p or 720p) or higher when possible, while for audio this means recording, at a minimum, in uncompressed PCM at 44,100 samples per second and 16-bit resolution. Arguably, however, good recording techniques (isolation, microphone selection and usage, using a tripod to minimize blur) are more important than resolution. A microphone that gives a clear recording of a speaker telling a folktale (high signal-to-noise ratio) in MP3 format (perhaps via a phone) is better than an extremely noisy recording in WAV format where all that can be heard are cars going by. To ensure that good recordings can be obtained, linguists should practice with their recording devices as much as possible and compare the results to see which techniques work best.
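As an illustration of the minimum audio specification above, the following sketch uses Python's standard `wave` module to check whether a file is uncompressed PCM at 44.1 kHz or higher and 16-bit or deeper. The function name and the thresholds as constants are choices made for this example, not part of any documentation standard's tooling.

```python
import wave

# Minimum values taken from the recommendation above (44.1 kHz, 16-bit PCM).
MIN_SAMPLE_RATE_HZ = 44_100
MIN_SAMPLE_WIDTH_BYTES = 2  # 2 bytes per sample = 16-bit


def meets_minimum_spec(path):
    """Return True if `path` is an uncompressed PCM WAV file at or above the minimum spec."""
    try:
        with wave.open(path, "rb") as wav:
            return (
                wav.getcomptype() == "NONE"  # the wave module reports "NONE" for plain PCM
                and wav.getframerate() >= MIN_SAMPLE_RATE_HZ
                and wav.getsampwidth() >= MIN_SAMPLE_WIDTH_BYTES
            )
    except (wave.Error, EOFError):
        # Not a WAV file the standard library can read (e.g. MP3, or a compressed WAV variant).
        return False
```

A check like this only confirms the container parameters; it says nothing about noise, clipping, or microphone placement, which the paragraph above argues matter more.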
) conversion tool. It is also possible to use Toolbox as a transcription environment. By comparison with ELAN and FLEx, Toolbox has relatively limited functionality, and is felt by some to have an unintuitive design and interface. However, a large number of projects have been carried out in the Shoebox/Toolbox environment over its lifespan, and its user base continues to enjoy its advantages of familiarity, speed, and community support. Toolbox also has the advantage of working directly with human-readable text files that can be opened in any text editor and easily manipulated and archived. Toolbox files can also be easily converted for storage in XML (recommended for archives), such as with open source Python libraries like
Most postgraduate programs that involve some form of language documentation and description require researchers to submit their proposed protocols to an internal Institutional Review Board, which ensures that research is being conducted ethically. Minimally, participants should be informed of the process and the intended use of the recordings, and give recorded audible or written permission for the audiovisual materials to be used for linguistic investigation by the researcher(s). Many participants will want to be named as consultants, but others will not; this will determine whether the data needs to be anonymized or restricted from public access.
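To make the earlier point about Toolbox's plain-text files and XML conversion concrete, here is a minimal sketch using only the Python standard library. It assumes the common backslash-marker layout with MDF-style markers (`\lx` for a headword, `\ps` for part of speech, `\ge` for an English gloss); the XML element names `lexicon`, `entry` and `field` are hypothetical choices for this example and do not follow any archive's schema.

```python
import xml.etree.ElementTree as ET


def parse_sfm(text, record_marker="lx"):
    """Split backslash-marked text into records, each a list of (marker, value) pairs."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # sketch only: blank lines and continuation lines are skipped
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = []  # a new record starts at each record marker
            records.append(current)
        if current is not None:  # fields before the first record (e.g. a file header) are ignored
            current.append((marker, value.strip()))
    return records


def records_to_xml(records):
    """Serialize parsed records as a simple XML string (element names are hypothetical)."""
    root = ET.Element("lexicon")
    for fields in records:
        entry = ET.SubElement(root, "entry")
        for marker, value in fields:
            ET.SubElement(entry, "field", marker=marker).text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative English entries, not data from any real project.
SAMPLE = "\\lx dog\n\\ps n\n\\ge dog\n\n\\lx run\n\\ps v\n\\ge run\n"
```

Calling `records_to_xml(parse_sfm(SAMPLE))` yields one `entry` element per `\lx` record. A real conversion would also need to handle multi-line fields and project-specific markers.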
challenging, not every type of recording tool is necessary or ideal, and compromises must often be struck between quality, cost and usability. It is also important to envision one's complete workflow and intended outcomes; for example, if video files are made, some amount of processing may be required to expose the audio component to processing in various ways by different software packages.
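One such processing step is extracting the audio track from a video file so that audio-only software can open it. The sketch below builds an `ffmpeg` command for this, assuming `ffmpeg` is installed and on the PATH; the function names are invented for this example. Note that converting a compressed audio stream to WAV changes only the container and encoding and does not restore quality lost in compression.

```python
import subprocess
from pathlib import Path


def build_extract_command(video_path, wav_path, sample_rate=44_100):
    """Return an ffmpeg argument list that writes the audio track as 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-i", str(video_path),    # input video file
        "-vn",                    # drop the video stream
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
        "-ar", str(sample_rate),  # output sample rate in Hz
        str(wav_path),
    ]


def extract_audio(video_path, wav_path=None):
    """Run ffmpeg to extract audio; defaults to the video's name with a .wav suffix."""
    video_path = Path(video_path)
    wav_path = Path(wav_path) if wav_path else video_path.with_suffix(".wav")
    subprocess.run(build_extract_command(video_path, wav_path), check=True)
    return wav_path
```

Keeping the command construction separate from the call that runs it makes it easy to inspect or log the exact command before processing a whole session's files.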
It is an advantage in most fieldwork situations if a condenser microphone is self-powered (via a battery); however, when power is not a major factor, phantom-powered models can also be used. A stereo microphone setup is needed whenever more than one speaker is involved in a recording; this can be achieved via an array of two mono microphones, or by a dedicated stereo microphone.
FLEx allows the user to build a "lexicon" of the language, i.e. a word-list with definitions and grammatical information, and also to store texts from the language. Within the texts, each word or part of a word (i.e. a "morpheme") is linked to an entry in the lexicon. For new projects and for students learning for the first time,
Toolbox's primary functions are construction of a lexical database, and interlinearization of texts through interaction with the lexical database. Both lexical database and texts can be exported to a word processing environment, in the case of the lexical database using the Multi-Dictionary Formatter (
or "lapel" microphones may be used in some situations; however, depending on the microphone, they can produce recordings that are inferior to those of a headset microphone for phonetic analysis, and they raise some of the same concerns as headset microphones in restricting a recording to
When using a video recorder that does not record audio in WAV format (such as most DSLR cameras), it is recommended to record audio separately on another recorder, following some of the guidelines below. As with the audio recorders described below, many video recorders also accept microphone input of
The primary functions of SayMore are: (a) audio recording (b) file import from recording device (video and/or audio) (c) file organization (d) metadata entry at session and file levels (e) association of AV files with evidence of informed consent and other supplementary objects (such as photographs)
Directional microphones should be used in most cases, in order to isolate a speaker's voice from other potential noise sources. However, omnidirectional microphones may be preferred in situations involving larger numbers of speakers arrayed in a relatively large space. Among directional microphones,
Audio-only recorders can be used in scenarios where video is impractical or otherwise undesirable. In most cases it is advantageous to combine the use of an audio-only recorder with one or more external microphones; however, many modern audio recorders include built-in microphones which are usable if
for Toolbox that allows timecodes to reference an audio file and enable playback (of a complete text or a referenced sentence) from within Toolbox - in this workflow, time-alignment of text is performed in Transcriber, and then the relevant timecodes and text are converted into a format that Toolbox
There is as yet no single software suite which is designed to or able to handle all aspects of a typical language documentation workflow. Instead, there is a large and increasing number of packages designed to handle various aspects of the workflow, many of which overlap considerably. Some of these
Recorders that record video typically record audio as well. However, the audio does not always meet the criteria of minimal needs and recommended best practices for language documentation (uncompressed WAV format, 44.1 kHz, 16-bit), and is often not useful for linguistic purposes such as
their texts, as these programs build a dictionary of forms and parsing rules to help speed up analysis. Unfortunately, media files are generally not linked by these programs (as opposed to ELAN, in which linked files are preferred), making it difficult to view or listen back to recordings to check
For many linguists the end result of making recordings is language analysis, often investigation of a language's phonological or syntactic properties using various software tools. This requires transcription of the audio, generally in collaboration with native speakers of the language in question.
Adhering to standards for formats is critical for interoperability between software tools. Many individual archives or data repositories have their own standards and requirements for data deposited on their servers - knowledge of these requirements ought to inform the data collection strategy and
Since documentation of languages is often difficult, with many languages that linguists work with being endangered (they may not be spoken in the near future), it is recommended to record at the highest quality possible given the limitations of a recorder. For video, this means recording at HD
Researchers in language documentation often conduct linguistic fieldwork to gather the data on which their work is based, recording audiovisual files that document language use in traditional contexts. Because the environments in which linguistic fieldwork often takes place may be logistically
cost or setup speed are important concerns. Digital (solid state) recorders are preferred for most language documentation scenarios. Modern digital recorders achieve a very high level of quality at a relatively low price. Some of the most popular field recorders are found in the
online tool that allows a large number of words and phrases to be recorded in a short period (up to 1,000 words/hour with a clean word list and an experienced user). It automates the classic procedure for recording audio and video pronunciation files (for
in the modern context involves a complex and ever-evolving set of tools and methods, and the study and development of their use – and, especially, identification and promotion of best practices – can be considered a sub-field of
various kinds (generally through a 1/8-inch or TRS connector); this can ensure a high-quality backup audio recording that is in sync with the recorded video, which can be helpful in some cases (e.g. for transcription).
For general transcription, media files can be played back on a computer (or other device capable of playback) and paused for transcription in a text editor. Other (cross-platform) tools to assist this process include
can be effectively used in language documentation scenarios, depending on the situation (especially, including factors such as number, position and mobility of speakers) and on budget. In general,
is particularly suitable for situations in which cost and user-friendliness are major desiderata. Other popular recorders for situations where size is a factor are the
Language resources map Searchable by Resource Type, Language(s), Language type, Modality, Resource Use, Availability, Production Status, Conference(s), Resource name
a single speaker - while other speakers may be audible on the recording, they will be backgrounded in relation to the speaker wearing the lavalier microphone.
phonetic analysis. Many video devices record instead to a compressed audio format such as AAC or MP3, which is combined with the video stream in a wrapper of
A catalog of "open-source code that would be useful for documenting, conserving, developing, preserving, or working with endangered languages".
Good quality headset microphones are comparatively expensive, but can produce recordings of extremely high quality in controlled situations.
languages). Once the recording is done, the platform automatically uploads clean, well-cut, well-named and app-friendly files directly to
which primarily focuses on the initial stages in language documentation, and aims for a relatively uncomplicated user experience.
Travelling across four states and doing extensive research, he spent twenty-five years making this multilingual dictionary.
use MPEG-4 (H.264) as an encoding or storage format, which includes an AAC audio stream (generally of up to 320 kbit/s).
ELAN is a full-featured transcription tool, particularly useful for researchers with complex annotation needs/goals.
proper. Among these are ethical and recording principles, workflows and methods, hardware tools, and software tools.
, which record to multiple video and audio resolutions/formats, most notably WAV (44.1/48/96 kHz, 16/24-bit).
and has been one of the most widely used language documentation packages for some decades. Previously known as
for engaging in documentation work. The morality of ethics protocols has itself been brought into question by
developed before the start of research. Some example guidelines from well-used repositories are given below:
Austin, Peter K. 2010. 'Communities, ethics and rights in language documentation.' In Peter K. Austin, Ed.,
Ethical practices in language documentation have been the focus of much recent discussion and debate. The
"Spectral Degradation of Speech Captured by Miniature Microphones Mounted on Persons' Heads and Chests"
van Driem, George (2016). "Endangered Language Research and the Moral Depravity of Ethics Protocols".
Language documentation may be partially automated thanks to a number of software tools, including:
"82-year-old Kerala man's Dictionary is in the four Dravidian languages. 25 long years to compile"
has published a large number of articles focusing on tools and methods in language documentation.
, a fourth standard drop-out, who compiles a multilingual dictionary connecting four major
(though in the latter case, ensure that the device can record to WAV/Linear PCM format).
packages use standard formats and are inter-operable, whereas others are much less so.
Phonetic data analysis : an introduction to fieldwork and instrumental techniques
"83-YO Kerala School Dropout Creates Unique Dictionary in 4 South Indian Languages"
which is primarily focused on ethics in the language documentation context. The
Some good quality microphones used for film-making and interviews include the
Research Network for Linguistic Diversity's page on linguistic software.
microphones are suitable for most applications; however, in some cases a
is now the best tool for interlinearising and dictionary-making.
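The lexicon-linked interlinear texts that tools such as FLEx and Toolbox maintain can be pictured with a small data structure. The sketch below is an illustration only and does not reflect FLEx's actual data model or file format; the class, field and function names are hypothetical, and the example uses English so that no language data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexEntry:
    form: str      # form of the morpheme as written in texts
    gloss: str     # short gloss used in the interlinear line
    category: str  # part of speech or morpheme type


# Lexicon: entry id -> entry. The ids are arbitrary keys chosen for this sketch.
lexicon = {
    "dog": LexEntry(form="dog", gloss="dog", category="noun"),
    "PL": LexEntry(form="s", gloss="PL", category="suffix"),
}

# A text is a list of words; each word is a list of lexicon ids, one per morpheme.
text = [["dog", "PL"]]


def interlinear_lines(word_ids, lexicon):
    """Return the (morpheme line, gloss line) for one word by looking up each id."""
    forms = "-".join(lexicon[i].form for i in word_ids)
    glosses = "-".join(lexicon[i].gloss for i in word_ids)
    return forms, glosses
```

Because each word stores references to lexicon entries rather than copies of the glosses, correcting a gloss in the lexicon updates every text that uses that morpheme, which is the main practical benefit of linking texts to a lexicon.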
SIL International formerly Summer Institute of Linguistics, Inc.
"For Keralites, door opens to three other Dravidian languages"
(it is possible to download datasets for a specific language).
Margetts, Andrew (2009). "Using Toolbox with Media Files".
Meakins, Felicity; Green, Jennifer; Turpin, Myfany (2018).
(f) AV file segmentation (g) transcription/translation (h)
(described further below) can also perform this function.
SayMore files can be further exported for annotation in
are often preferred by linguists who want to be able to
-style Careful Speech annotation and Oral Translation.
Depending on the recorder and microphone, additional
archive quality is at least WAV 44.1 kHz, 16-bit.
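The gap between archive-quality PCM and a compressed audio stream can be worked out directly from the figures in this section. The sketch below computes the bit rate and hourly storage of uncompressed PCM; the function names are invented for this example.

```python
def pcm_bitrate_kbps(sample_rate_hz, bit_depth, channels):
    """Bit rate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * bit_depth * channels / 1000


def pcm_bytes_per_hour(sample_rate_hz, bit_depth, channels):
    """Storage needed for one hour of uncompressed PCM audio, in bytes (header excluded)."""
    return sample_rate_hz * (bit_depth // 8) * channels * 3600


mono_kbps = pcm_bitrate_kbps(44_100, 16, 1)      # 705.6 kbit/s
stereo_kbps = pcm_bitrate_kbps(44_100, 16, 2)    # 1411.2 kbit/s
mono_hour = pcm_bytes_per_hour(44_100, 16, 1)    # 317,520,000 bytes, about 318 MB
```

Even mono 44.1 kHz, 16-bit PCM runs at more than twice the 320 kbit/s upper figure mentioned for AAC streams, which is one reason camera audio tracks are usually compressed and separate WAV recording is recommended.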
Chelliah, Shobhana L.; de Reuse, Willem J. (2011).
Language Documentation and Conservation 10: 243-252
Electrical power generation, storage and management
291:Røde VideoMic shotgun and the Røde lavalier series 209: 349:is a language documentation package developed by 1004: 724: 461:Tools for automating components of the workflow 848:Švec, Jan G.; Granqvist, Svante (2010-11-01). 854:American Journal of Speech-Language Pathology 847: 457:intended for computational uses of IGT data. 727:Handbook of Descriptive Linguistic Fieldwork 657:Language Documentation and Description Vol 7 816:The Oxford Handbook of Linguistic Fieldwork 441:(usually called Toolbox) is a precursor of 117: 28: 812: 798:: CS1 maint: location missing publisher ( 397:Max Planck Institute for Psycholinguistics 916:Language Documentation & Conservation 691: 668: 278:("shotgun") microphone may be preferred. 913: 901:Audio Engineering Society Convention 100 813:Thieberger, Nicholas, ed. (2011-11-24). 312: 173: 825:10.1093/oxfordhb/9780199571888.001.0001 519:Language Documentation and Conservation 1005: 894: 622: 968: 618: 616: 79:tools used, and should be part of a 322: 106:Most current archive standards for 88:Endangered Languages Archive (ELAR) 13: 768:Understanding linguistic fieldwork 412:FieldWorks Language Explorer, FLEx 372:, and metadata can be exported in 14: 1024: 613: 571:Richard Littauer's GitHub catalog 529:The 2021 Indian documentary film 305:(XLR, stereo/mono converter or a 969:Sajit, C. p. (30 October 2020). 981: 962: 948: 934: 907: 888: 866:10.1044/1058-0360(2010/09-0091) 625:Linguistic Fieldwork – Springer 262:should be selected rather than 210:Audio recorders and microphones 73: 56:First Peoples' Cultural Council 841: 806: 759: 718: 685: 662: 649: 589: 327: 1: 696:. Malden, MA: Blackwell Pub. 582: 512: 44:Linguistic Society of America 533:traces the life and work of 249:Sony Digital Voice recorders 126: 94:Max Planck Institute Archive 60:Endangered Languages Project 7: 895:Brixen, Eddy (1996-05-01). 819:. Oxford University Press. 
559: 332: 168: 10: 1029: 517:The peer-reviewed journal 433: 341: 64:Linguist's Code of Conduct 735:10.1007/978-90-481-9026-3 692:Ladefoged, Peter (2003). 190:series, particularly the 160:transcriptions. There is 37: 439:Field Linguist's Toolbox 118:Principles for recording 29:Principles and workflows 623:Bowern, Claire (2008). 524: 406: 383: 380:formats for archiving. 140:, while a program like 100:Yale University Library 1013:Language documentation 659:. London, SOAS: 34-54. 162:currently a workaround 102:audiovisual guidelines 52:Ethics Discussion Blog 23:language documentation 18:language documentation 633:10.1057/9780230590168 313:Other recording tools 309:) will be necessary. 260:condenser microphones 219:range, including the 174:Video+audio recorders 535:Njattyela Sreedharan 393:The Language Archive 81:data management plan 958:. 31 December 2020. 539:Dravidian languages 307:TRRS to TRS adapter 295:Shure headworn mics 264:dynamic microphones 50:, and maintains an 577:RNLD software page 505:Prosodylab Aligner 993:silvertalkies.com 744:978-90-481-9025-6 642:978-0-230-54538-0 597:"LD Tools Summit" 531:Dreaming of Words 497:Wikimedia Commons 420:SIL International 351:SIL International 254:Several types of 245:Olympus LS-series 1020: 997: 996: 985: 979: 978: 966: 960: 959: 952: 946: 945: 938: 932: 931: 911: 905: 904: 892: 886: 885: 845: 839: 838: 810: 804: 803: 797: 789: 763: 757: 756: 722: 716: 715: 689: 683: 682: 666: 660: 653: 647: 646: 620: 611: 610: 608: 607: 601:sites.google.com 593: 414:is developed by 391:is developed by 323:Computer systems 96:accepted formats 68:George van Driem 62:have released a 48:Ethics Statement 46:has prepared an 1028: 1027: 1023: 1022: 1021: 1019: 1018: 1017: 1003: 1002: 1001: 1000: 987: 986: 982: 967: 963: 954: 953: 949: 940: 939: 935: 912: 908: 893: 889: 846: 842: 835: 811: 807: 791: 790: 778: 764: 760: 745: 723: 719: 704: 690: 686: 667: 663: 654: 650: 643: 621: 614: 605: 603: 595: 594: 590: 585: 562: 527: 515: 463: 436: 409: 
386: 344: 335: 330: 325: 320: 315: 299:Shure lavaliers 212: 176: 171: 129: 120: 76: 40: 31: 12: 11: 5: 1026: 1016: 1015: 999: 998: 980: 961: 947: 933: 906: 887: 860:(4): 356–368. 840: 833: 805: 776: 758: 743: 717: 703:978-0631232698 702: 684: 661: 648: 641: 612: 587: 586: 584: 581: 561: 558: 526: 523: 514: 511: 510: 509: 506: 503: 500: 477: 472: 462: 459: 435: 432: 408: 405: 385: 382: 343: 340: 334: 331: 329: 326: 324: 321: 319: 316: 314: 311: 211: 208: 175: 172: 170: 167: 157:interlinearize 147:Programs like 128: 125: 119: 116: 104: 103: 97: 91: 75: 72: 39: 36: 30: 27: 9: 6: 4: 3: 2: 1025: 1014: 1011: 1010: 1008: 994: 990: 984: 976: 972: 965: 957: 951: 943: 937: 929: 925: 921: 917: 910: 902: 898: 891: 883: 879: 875: 871: 867: 863: 859: 855: 851: 844: 836: 834:9780191744112 830: 826: 822: 818: 817: 809: 801: 795: 787: 783: 779: 777:9781351330114 773: 769: 762: 754: 750: 746: 740: 736: 732: 728: 721: 713: 709: 705: 699: 695: 688: 680: 676: 672: 665: 658: 652: 644: 638: 634: 630: 626: 619: 617: 602: 598: 592: 588: 580: 578: 574: 572: 568: 566: 557: 555: 551: 547: 543: 540: 536: 532: 522: 520: 507: 504: 501: 498: 494: 490: 485: 481: 478: 476: 473: 471: 468: 467: 466: 458: 456: 452: 448: 444: 440: 431: 429: 425: 421: 417: 413: 404: 402: 398: 394: 390: 381: 379: 375: 371: 366: 364: 358: 356: 352: 348: 339: 310: 308: 304: 300: 296: 292: 287: 284: 279: 277: 276:hypercardioid 273: 267: 265: 261: 257: 252: 250: 246: 242: 238: 234: 230: 226: 222: 218: 207: 203: 201: 197: 193: 189: 184: 182: 181:various kinds 166: 163: 158: 154: 150: 145: 143: 139: 135: 124: 115: 113: 109: 101: 98: 95: 92: 89: 86: 85: 84: 82: 71: 69: 65: 61: 57: 53: 49: 45: 35: 26: 24: 19: 16:The field of 992: 983: 974: 964: 950: 936: 922:(1): 51–86. 919: 915: 909: 900: 890: 857: 853: 843: 815: 808: 767: 761: 726: 720: 693: 687: 670: 664: 656: 651: 624: 604:. 
Retrieved 600: 591: 575: 569: 563: 528: 516: 480:Lingua Libre 464: 437: 410: 387: 367: 359: 345: 336: 288: 280: 268: 253: 213: 204: 185: 177: 146: 130: 121: 105: 77: 74:Data Formats 41: 32: 15: 679:10125/24693 328:Accessories 138:Transcriber 928:10125/4426 786:1029352513 770:. London. 606:2016-06-02 583:References 513:Literature 256:microphone 165:can read. 90:guidelines 975:The Hindu 874:1058-0360 794:cite book 542:Malayalam 127:Workflows 1007:Category 882:20601621 753:60322394 712:51818554 560:See also 401:Nijmegen 333:Software 283:Lavalier 272:cardioid 247:and the 169:Hardware 134:Audacity 565:LRE Map 546:Kannada 447:Shoebox 434:Toolbox 395:at the 347:SayMore 342:SayMore 149:Toolbox 880:  872:  831:  784:  774:  751:  741:  710:  700:  639:  554:Telugu 493:signed 489:spoken 470:eSpeak 424:Dallas 355:Dallas 303:cables 239:. The 198:, and 38:Ethics 749:S2CID 550:Tamil 484:libre 112:Audio 108:video 878:PMID 870:ISSN 829:ISBN 800:link 782:OCLC 772:ISBN 739:ISBN 708:OCLC 698:ISBN 637:ISBN 552:and 525:Film 502:Maus 491:and 482:, a 455:Xigt 443:FLEx 428:FLEx 407:FLEx 389:ELAN 384:ELAN 378:IMDI 376:and 374:.csv 370:FLEx 363:BOLD 297:and 235:and 217:Zoom 188:Zoom 186:The 153:FLEx 142:ELAN 136:and 58:and 924:hdl 862:doi 821:doi 731:doi 675:hdl 629:doi 508:Sox 475:HTK 451:MDF 422:in 418:at 399:in 353:in 200:Q2n 196:Q4n 151:or 1009:: 991:. 973:. 918:. 899:. 876:. 868:. 858:19 856:. 852:. 827:. 796:}} 792:{{ 780:. 747:. 737:. 729:. 706:. 673:. 635:. 627:. 615:^ 599:. 548:, 544:, 293:, 241:H1 237:H6 233:H5 231:, 229:H4 227:, 225:H2 223:, 221:H1 194:, 192:Q8 995:. 977:. 944:. 930:. 926:: 920:3 903:. 884:. 864:: 837:. 823:: 802:) 788:. 755:. 733:: 714:. 681:. 677:: 645:. 631:: 609:.


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.