The Pile

This article is about the dataset. For other uses, see Pile (disambiguation).

The Pile is an 886.03 GB diverse, open-source dataset of English text created as a training dataset for large language models (LLMs). It was constructed by EleutherAI in 2020 and publicly released on December 31 of that year. It is composed of 22 smaller datasets, including 14 new ones.

Creation

Training LLMs requires such vast amounts of data that, before the introduction of the Pile, most data used for training LLMs was taken from the Common Crawl. However, LLMs trained on more diverse datasets are better able to handle a wider range of situations after training. The creation of the Pile was motivated by the need for a dataset large enough to contain text from a wide variety of sources and styles of writing. Compared to other datasets, the Pile's main distinguishing features are that it is a curated selection of data, chosen by researchers at EleutherAI to contain information they thought language models should learn, and that it is the only such dataset thoroughly documented by the researchers who developed it.

Contents and filtering

Artificial intelligences do not learn all they can from data on the first pass, so it is common practice to train an AI on the same data more than once, with each pass through the entire dataset referred to as an "epoch". Each of the 22 sub-datasets that make up the Pile was assigned a number of epochs according to the perceived quality of its data. The table below shows the relative size of each of the 22 sub-datasets before and after being multiplied by its number of epochs. Numbers have been converted to GB, and asterisks indicate the newly introduced datasets.
Sub-datasets of The Pile

Component                   Original size   Epochs   Effective size
Pile-CC                     243.87 GB       1        243.87 GB
PubMed Central*             96.93 GB        2        193.86 GB
Books3                      108.40 GB       1.5      162.61 GB
OpenWebText2*               67.40 GB        2        134.80 GB
arXiv*                      60.36 GB        2        120.71 GB
GitHub*                     102.18 GB       1        102.18 GB
Free Law*                   54.92 GB        1.5      82.39 GB
Stack Exchange*             34.57 GB        2        69.14 GB
USPTO Backgrounds*          24.59 GB        2        49.19 GB
PubMed Abstracts*           20.68 GB        2        41.37 GB
Gutenberg (PG-19)           11.68 GB        2.5      29.20 GB
OpenSubtitles               13.94 GB        1.5      20.91 GB
Wikipedia (en)              6.85 GB         3        20.54 GB
DeepMind Mathematics        8.32 GB         2        16.63 GB
Ubuntu Freenode IRC logs*   5.93 GB         2        11.84 GB
BookCorpus2*                6.76 GB         1.5      10.15 GB
EuroParl                    4.93 GB         2        9.85 GB
Hacker News*                4.19 GB         2        8.38 GB
YouTube Subtitles*          4.01 GB         2        8.02 GB
PhilPapers*                 2.56 GB         2        5.11 GB
NIH ExPorter*               2.03 GB         2        4.07 GB
Enron Emails                0.95 GB         2        1.89 GB
Total                       886.03 GB                1346.69 GB
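The "Effective size" column is simply the original size multiplied by the number of epochs, so a sub-dataset trained on for 2.5 epochs contributes 2.5 times its raw size. The following minimal sketch (in Python, with the component list abridged to three rows from the table; it is not part of EleutherAI's tooling) reproduces that arithmetic:

    # Effective training size = original size x number of epochs.
    components = {
        # name: (original size in GB, epochs), values from the table above
        "Pile-CC": (243.87, 1),
        "PubMed Central": (96.93, 2),
        "Gutenberg (PG-19)": (11.68, 2.5),
    }
    for name, (size_gb, epochs) in components.items():
        print(f"{name}: {size_gb * epochs:.2f} GB effective")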
EleutherAI chose the datasets to try to cover a wide range of topics and styles of writing, including academic writing, which models trained on other datasets were found to struggle with.

All data used in the Pile was taken from publicly accessible sources. EleutherAI then filtered the dataset as a whole to remove duplicates. Some sub-datasets were also filtered for quality control. Most notably, Pile-CC is a modified version of the Common Crawl in which the data was filtered to remove parts that are not text, such as HTML formatting and links.
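As an illustration of this kind of cleaning, the sketch below strips tags, scripts, and link markup from an HTML page and keeps only the running text. It is a simplified stand-in, not EleutherAI's actual pipeline, which applied a dedicated text extractor to raw Common Crawl data:

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect visible text, skipping <script> and <style> blocks."""
        def __init__(self):
            super().__init__()
            self.parts, self._skip = [], 0
        def handle_starttag(self, tag, attrs):
            if tag in ("script", "style"):
                self._skip += 1
        def handle_endtag(self, tag):
            if tag in ("script", "style") and self._skip:
                self._skip -= 1
        def handle_data(self, data):
            if not self._skip:
                self.parts.append(data)

    extractor = TextExtractor()
    extractor.feed("<p>Some <a href='x'>linked</a> prose.</p><script>skip()</script>")
    print(" ".join("".join(extractor.parts).split()))  # -> Some linked prose.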
Some potential sub-datasets were excluded for various reasons, such as the US Congressional Record, which was excluded due to its racist content.

Within the sub-datasets that were included, individual documents were not filtered to remove non-English, biased, or profane text. The data was also not filtered on the basis of consent, meaning that, for example, Pile-CC has all of the same ethical issues as the Common Crawl itself. However, EleutherAI has documented the amount of bias (on the basis of gender, religion, and race) and profanity, as well as the level of consent given, for each of the sub-datasets, allowing an ethics-concerned researcher to use only those parts of the Pile that meet their own standards.

Use

The Pile was originally developed to train EleutherAI's GPT-Neo models but has become widely used to train other models, including Microsoft's Megatron-Turing Natural Language Generation, Meta AI's Open Pre-trained Transformers, LLaMA, and Galactica, Stanford University's BioMedLM 2.7B, the Beijing Academy of Artificial Intelligence's Chinese-Transformer-XL, Yandex's YaLM 100B, and Apple's OpenELM.

In addition to serving as a training dataset, the Pile can be used as a benchmark, scoring how well models perform on a variety of writing styles.
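Scoring of this kind is typically done by measuring perplexity (how well a model predicts each next token) on held-out text from each style. The sketch below shows such a measurement using a small Pile-trained model; the two sample strings are hypothetical stand-ins for held-out Pile documents:

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/gpt-neo-125m"  # a small model trained on the Pile
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    samples = {  # hypothetical stand-ins for per-style test sets
        "academic": "The mitochondrion is the site of oxidative phosphorylation.",
        "dialogue": "hey, did the build finish? yeah, all tests passed.",
    }
    for style, text in samples.items():
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean negative log-likelihood
        print(f"{style}: perplexity {math.exp(loss.item()):.1f}")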
DMCA takedown

The Books3 component of the dataset contains copyrighted material compiled from Bibliotik, a pirate website. In July 2023, the Rights Alliance took copies of The Pile down through DMCA notices. Users responded by creating copies of The Pile with the offending content removed.
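One such copy, monology/pile-uncopyrighted, is distributed through Hugging Face and can be read without downloading the whole archive. The sketch below assumes the copy keeps the original Pile record layout, in which each record carries a "text" field and a "meta" field naming its sub-dataset:

    from datasets import load_dataset

    # Stream records instead of downloading the full dataset up front.
    pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
    for record in pile.take(3):
        print(record["meta"], record["text"][:80])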
See also

List of chatbots

References
Gao, Leo; Biderman, Stella; Black, Sid; Golding, Laurence; Hoppe, Travis; Foster, Charles; Phang, Jason; He, Horace; Thite, Anish; Nabeshima, Noa; Presser, Shawn; Leahy, Connor (31 December 2020). "The Pile: An 800GB Dataset of Diverse Text for Language Modeling". arXiv:2101.00027.
"The Pile: An 800GB Dataset of Diverse Text for Language Modeling". EleutherAI Website. EleutherAI. 13 February 2020. Retrieved 4 June 2023.
Brown, Tom B; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; et al. (22 July 2020). "Language Models are Few-Shot Learners". arXiv:2005.14165.
Rosset, Corby (13 February 2020). "Turing-NLG: A 17-billion-parameter language model by Microsoft". Microsoft Blog. Microsoft. Retrieved 31 December 2020.
Gao, Leo; Biderman, Stella; Hoppe, Travis; Grankin, Mikhail; researcher2; trisongz; sdtblck (15 June 2021). "The Pile Replication Code". github.com. Retrieved 6 June 2023.
Khan, Mehtab; Hanna, Alex (13 September 2022). "The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability". SSRN 4217148.
Brownlee, Jason (10 August 2022). "Difference Between a Batch and an Epoch in a Neural Network". Retrieved 2 June 2023 – via machinelearningmastery.com.
"GPT-Neo 125M". huggingface.co. 8 December 2022. Retrieved 7 June 2023.
"GPT-Neo 1.3B". huggingface.co. 8 December 2022. Retrieved 7 June 2023.
"GPT-Neo 2.7B". huggingface.co. 8 December 2022. Retrieved 7 June 2023.
"Microsoft and Nvidia team up to train one of the world's largest language models". 11 October 2021. Retrieved 8 March 2023.
"AI: Megatron the Transformer, and its related language models". 24 September 2021. Retrieved 8 March 2023.
Zhang, Susan; Roller, Stephen; Goyal, Naman; Artetxe, Mikel; Chen, Moya; Chen, Shuohui; Dewan, Christopher; Diab, Mona; Li, Xian; Lin, Xi Victoria; Mihaylov, Todor; Ott, Myle; Shleifer, Sam; Shuster, Kurt; Simig, Daniel; Koura, Punit Singh; Sridhar, Anjali; Wang, Tianlu; Zettlemoyer, Luke (21 June 2022). "OPT: Open Pre-trained Transformer Language Models". arXiv:2205.01068.
Touvron, Hugo; Lavril, Thibaut; Izacard, Gautier; Grave, Edouard; Lample, Guillaume; et al. (27 February 2023). "LLaMA: Open and Efficient Foundation Language Models". arXiv:2302.13971.
Taylor, Ross; Kardas, Marcin; Cucurull, Guillem; Scialom, Thomas; Hartshorn, Anthony; Saravia, Elvis; Poulton, Andrew; Kerkez, Viktor; Stojnic, Robert (16 November 2022). "Galactica: A Large Language Model for Science". arXiv:2211.09085.
"Model Card for BioMedLM 2.7B". huggingface.co. Retrieved 5 June 2023.
Yuan, Sha; Zhao, Hanyu; Du, Zhengxiao; Ding, Ming; Liu, Xiao; Cen, Yukuo; Zou, Xu; Yang, Zhilin; Tang, Jie (1 January 2021). "WuDaoCorpora: A super large-scale Chinese corpora for pre-training language models". AI Open. 2: 65–68. doi:10.1016/j.aiopen.2021.06.001. Retrieved 8 March 2023 – via ScienceDirect.
Grabovskiy, Ilya (2022). "Yandex publishes YaLM 100B, the largest GPT-like neural network in open source" (Press release). Yandex. Retrieved 5 June 2023.
Mehta, Sachin; Sekhavat, Mohammad Hossein; Cao, Qingqing; Horton, Maxwell; Jin, Yanzi; Sun, Chenfan; Mirzadeh, Iman; Najibi, Mahyar; Belenko, Dmitry (1 May 2024). "OpenELM: An Efficient Language Model Family with Open Training and Inference Framework". arXiv:2404.14619.
Rae, Jack W; Borgeaud, Sebastian; Cai, Trevor; Millican, Katie; Hoffmann, Jordan; Song, Francis; Aslanides, John; Henderson, Sarah; Ring, Roman; Young, Susannah; et al. (21 January 2022). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". arXiv:2112.11446.
Lieber, Opher; Sharir, Or; Lenz, Barak; Shoham, Yoav (1 August 2021). "Jurassic-1: Technical Details and Evaluation" (PDF). AI21 Labs. Retrieved 5 June 2023.
Knibbs, Kate. "The Battle Over Books3 Could Change AI Forever". wired.com. Retrieved 13 October 2023.
"Rights Alliance removes the illegal Books3 dataset used to train artificial intelligence". Rights Alliance. 14 August 2023. Retrieved 29 August 2023.
"The Pile An 800GB Dataset of Diverse Text for Language Modeling". academictorrents.com. Retrieved 29 August 2023.
"monology/pile-uncopyrighted - Dataset at Hugging Face". 22 April 2024.