The Pile (dataset)

The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It is composed of 22 smaller datasets, including 14 new ones.

Creation

Training LLMs requires sufficiently vast amounts of data that, before the introduction of the Pile, most of the data used to train LLMs was taken from the Common Crawl. However, LLMs trained on more diverse datasets are better able to handle a wider range of situations after training. The creation of the Pile was motivated by the need for a dataset that was both large enough and drawn from a wide variety of sources and styles of writing. Compared to other datasets, the Pile's main distinguishing features are that it is a curated selection of data, chosen by researchers at EleutherAI to contain information they thought language models should learn, and that it is the only such dataset thoroughly documented by the researchers who developed it.

Artificial intelligences do not learn all they can from data on the first pass, so it is common practice to train an AI on the same data more than once, with each pass through the entire dataset referred to as an "epoch". Each of the 22 sub-datasets that make up the Pile was assigned a different number of epochs according to the perceived quality of the data. The table below shows the relative size of each of the 22 sub-datasets before and after being multiplied by the number of epochs; a code sketch of this epoch-based weighting follows the table. Numbers have been converted to GB, and asterisks are used to indicate the newly introduced datasets.
Sub-datasets of The Pile

Component                 | Original size | Epochs | Effective size
Pile-CC                   | 243.87 GB     | 1      | 243.87 GB
PubMed Central*           | 96.93 GB      | 2      | 193.86 GB
Books3                    | 108.40 GB     | 1.5    | 162.61 GB
OpenWebText2*             | 67.40 GB      | 2      | 134.80 GB
arXiv*                    | 60.36 GB      | 2      | 120.71 GB
GitHub*                   | 102.18 GB     | 1      | 102.18 GB
Free Law*                 | 54.92 GB      | 1.5    | 82.39 GB
Stack Exchange*           | 34.57 GB      | 2      | 69.14 GB
USPTO Backgrounds*        | 24.59 GB      | 2      | 49.19 GB
PubMed Abstracts*         | 20.68 GB      | 2      | 41.37 GB
Gutenberg (PG-19)         | 11.68 GB      | 2.5    | 29.20 GB
OpenSubtitles             | 13.94 GB      | 1.5    | 20.91 GB
Wikipedia (en)            | 6.85 GB       | 3      | 20.54 GB
DeepMind Mathematics      | 8.32 GB       | 2      | 16.63 GB
Ubuntu Freenode IRC logs* | 5.93 GB       | 2      | 11.84 GB
BookCorpus2*              | 6.76 GB       | 1.5    | 10.15 GB
EuroParl                  | 4.93 GB       | 2      | 9.85 GB
Hacker News*              | 4.19 GB       | 2      | 8.38 GB
YouTube Subtitles*        | 4.01 GB       | 2      | 8.02 GB
PhilPapers*               | 2.56 GB       | 2      | 5.11 GB
NIH ExPorter*             | 2.03 GB       | 2      | 4.07 GB
Enron Emails              | 0.95 GB       | 2      | 1.89 GB
Total                     | 886.03 GB     |        | 1346.69 GB
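In effect, the epoch counts act as per-source sampling weights: a sub-dataset assigned two epochs is seen twice for every single pass over a one-epoch sub-dataset. The following is a minimal sketch of that weighting, not EleutherAI's actual preprocessing pipeline; it assumes each sub-dataset is available as an in-memory list of documents and abridges the weight table to a few entries.

```python
import random

# Epochs per sub-dataset, as in the table above (abridged for brevity).
EPOCHS = {
    "Pile-CC": 1.0,
    "PubMed Central": 2.0,
    "Books3": 1.5,
    "Wikipedia (en)": 3.0,
    "Enron Emails": 2.0,
}

def build_mixture(sub_datasets, epochs=EPOCHS, seed=0):
    """Upsample each sub-dataset by its epoch count and shuffle the result.

    `sub_datasets` maps a sub-dataset name to a list of documents. A
    fractional epoch count (e.g. 1.5) includes the whole set once and then
    adds a random half of it.
    """
    rng = random.Random(seed)
    mixture = []
    for name, docs in sub_datasets.items():
        weight = epochs.get(name, 1.0)
        whole, frac = int(weight), weight - int(weight)
        mixture.extend(docs * whole)                              # full repetitions
        mixture.extend(rng.sample(docs, int(len(docs) * frac)))   # fractional part
    rng.shuffle(mixture)
    return mixture

# Toy usage with stand-in documents:
toy = {"Pile-CC": ["web doc 1", "web doc 2"], "Wikipedia (en)": ["wiki article"]}
print(len(build_mixture(toy)))  # 2 docs * 1 epoch + 1 doc * 3 epochs = 5
```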
Contents and filtering

All data used in the Pile was taken from publicly accessible sources. EleutherAI then filtered the dataset as a whole to remove duplicates. Some sub-datasets were also filtered for quality control. Most notably, the Pile-CC is a modified version of the Common Crawl in which the data was filtered to remove parts that are not text, such as HTML formatting and links.
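As a rough illustration of that kind of cleaning (the actual Pile-CC pipeline is more involved and operated on raw Common Crawl archives), the sketch below uses Python's standard-library HTML parser to keep only the visible text of a page, discarding tags, link markup, scripts, and style blocks.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style> tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0  # how deep we are inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

print(html_to_text('<p>The Pile is a <a href="/wiki/Dataset">dataset</a>.</p>'))
# -> "The Pile is a dataset ."
```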
EleutherAI chose the datasets to try to cover a wide range of topics and styles of writing, including academic writing, which models trained on other datasets were found to struggle with.

Some potential sub-datasets were excluded for various reasons, such as the US Congressional Record, which was excluded due to its racist content.

Within the sub-datasets that were included, individual documents were not filtered to remove non-English, biased, or profane text. The Pile was also not filtered on the basis of consent, meaning that, for example, the Pile-CC has all of the same ethical issues as the Common Crawl itself. However, EleutherAI has documented the amount of bias (on the basis of gender, religion, and race) and profanity, as well as the level of consent given, for each of the sub-datasets, allowing an ethics-concerned researcher to use only those parts of the Pile that meet their own standards.
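In practice, such documentation can be treated as per-sub-dataset metadata that a researcher filters against before assembling a training mixture. The sketch below is purely illustrative: the consent labels and profanity rates are invented placeholders, not EleutherAI's published figures; only the selection pattern is the point.

```python
# Hypothetical per-sub-dataset metadata in the spirit of the Pile's datasheet.
DATASHEET = {
    "PubMed Central": {"consent": "public", "profanity_rate": 0.0001},
    "Pile-CC":        {"consent": "none",   "profanity_rate": 0.0500},
    "Ubuntu IRC":     {"consent": "terms",  "profanity_rate": 0.0200},
    "Enron Emails":   {"consent": "none",   "profanity_rate": 0.0100},
}

def select_subsets(datasheet, allowed_consent, max_profanity_rate):
    """Return the sub-datasets whose documented properties meet the caller's bar."""
    return [
        name
        for name, meta in datasheet.items()
        if meta["consent"] in allowed_consent
        and meta["profanity_rate"] <= max_profanity_rate
    ]

print(select_subsets(DATASHEET,
                     allowed_consent={"public", "terms"},
                     max_profanity_rate=0.03))
# -> ['PubMed Central', 'Ubuntu IRC']
```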
Use

The Pile was originally developed to train EleutherAI's GPT-Neo models but has become widely used to train other models, including Microsoft's Megatron-Turing Natural Language Generation, Meta AI's Open Pre-trained Transformers, LLaMA, and Galactica, Stanford University's BioMedLM 2.7B, the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL, Yandex's YaLM 100B, and Apple's OpenELM.

In addition to being used as a training dataset, the Pile can also be used as a benchmark to test models and score how well they perform on a variety of writing styles.
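Benchmarking of this kind is usually done with a language-modeling metric such as perplexity on a held-out split (the Pile paper itself reports the closely related bits per byte). The sketch below, using the Hugging Face transformers library, shows the per-document computation; the model ID and the example string are placeholders, and a real evaluation would loop over a held-out Pile split and aggregate the scores.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "EleutherAI/gpt-neo-125M"  # placeholder; any causal LM can be scored this way

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

def perplexity(text: str) -> float:
    """Average per-token perplexity of `text` under the model."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels equal to the inputs, the model returns the mean
        # next-token cross-entropy loss over the sequence.
        loss = model(ids, labels=ids).loss
    return math.exp(loss.item())

print(perplexity("The Pile is a diverse, open-source dataset of English text."))
```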
DMCA takedown

The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website. In July 2023, the Rights Alliance took copies of The Pile down through DMCA notices. Users responded by creating copies of The Pile with the offending content removed.
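One such copy is the monology/pile-uncopyrighted dataset listed in the references below. Assuming it keeps the usual Pile JSON-lines layout with a "text" field per document, it can be streamed with the Hugging Face datasets library roughly as follows.

```python
from datasets import load_dataset

# Stream the copyright-filtered mirror rather than downloading all of it.
# The dataset name comes from the references; the "text" field and "train"
# split follow the usual Pile layout and may need adjusting.
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)

for i, example in enumerate(pile):
    print(example["text"][:80], "...")
    if i == 2:  # look at the first three documents only
        break
```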
See also

List of chatbots

References

Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027.
Gao, Leo; Biderman, Stella; Hoppe, Travis; Grankin, Mikhail; researcher2; trisongz; sdtblck (15 June 2021). "The Pile Replication Code". github.com.
"The Pile: An 800GB Dataset of Diverse Text for Language Modeling". EleutherAI Website.
"The Pile An 800GB Dataset of Diverse Text for Language Modeling". academictorrents.com.
Brown, Tom B; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; et al. (22 July 2020). "Language Models are Few-Shot Learners". arXiv:2005.14165.
Rae, Jack W; Borgeaud, Sebastian; Cai, Trevor; Millican, Katie; Hoffmann, Jordan; Song, Francis; Aslanides, John; Henderson, Sarah; Ring, Roman; Young, Susannah; et al. (21 January 2022). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". arXiv:2112.11446.
Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer Language Models". arXiv:2205.01068.
Touvron, Hugo; Lavril, Thibaut; Izacard, Gautier; Grave, Edouard; Lample, Guillaume; et al. (27 February 2023). "LLaMA: Open and Efficient Foundation Language Models". arXiv:2302.13971.
Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language Model for Science". arXiv:2211.09085.
Mehta, Sachin; Sekhavat, Mohammad Hossein; Cao, Qingqing; Horton, Maxwell; Jin, Yanzi; Sun, Chenfan; Mirzadeh, Iman; Najibi, Mahyar; Belenko, Dmitry (1 May 2024). "OpenELM: An Efficient Language Model Family with Open Training and Inference Framework". arXiv:2404.14619.
Lieber, Opher; Sharir, Or; Lenz, Barak; Shoham, Yoav (1 August 2021). "Jurassic-1: Technical Details and Evaluation" (PDF). AI21 Labs.
Yuan, Sha; Zhao, Hanyu; Du, Zhengxiao; Ding, Ming; Liu, Xiao; Cen, Yukuo; Zou, Xu; Yang, Zhilin; Tang, Jie (1 January 2021). "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models". AI Open. 2: 65–68. doi:10.1016/j.aiopen.2021.06.001.
Khan, Mehtab; Hanna, Alex (13 September 2022). "The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability". SSRN 4217148.
Rosset, Corby (13 February 2020). "Turing-NLG: A 17-billion-parameter language model by Microsoft". Microsoft Blog.
"Microsoft and Nvidia team up to train one of the world's largest language models". 11 October 2021.
"AI: Megatron the Transformer, and its related language models". 24 September 2021.
Grabovskiy, Ilya (2022). "Yandex publishes YaLM 100B, the largest GPT-like neural network in open source" (Press release). Yandex.
"Model Card for BioMedLM 2.7B". huggingface.co.
"GPT-Neo 125M"; "GPT-Neo 1.3B"; "GPT-Neo 2.7B". huggingface.co.
Brownlee, Jason (10 August 2022). "Difference Between a Batch and an Epoch in a Neural Network". machinelearningmastery.com.
Knibbs, Kate. "The Battle Over Books3 Could Change AI Forever". wired.com. Retrieved 13 October 2023.
"Rights Alliance removes the illegal Books3 dataset used to train artificial intelligence". Rights Alliance. 14 August 2023.
"monology/pile-uncopyrighted - Dataset at Hugging Face". 22 April 2024.
