The Pile is an 825 GiB, diverse, open-source language modelling dataset developed by EleutherAI that combines 22 smaller datasets (listed below). The objective is to obtain text from as many modalities as possible so that models trained on The Pile have much broader generalization abilities.
## Why is the Pile a good training set?
Recent work has shown that, especially for large models, diversity in data sources improves the model's general cross-domain knowledge as well as its downstream generalization capability. In our evaluations, models trained on the Pile not only show moderate improvements on traditional language modeling benchmarks, they also show significant improvements on Pile BPB (bits per byte measured on the Pile itself, which rewards broad cross-domain modeling ability).
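Bits per byte normalizes a model's loss by the UTF-8 byte length of the text rather than the token count, so scores are comparable across tokenizers. The conversion from a mean per-token cross-entropy is simple arithmetic; the sketch below shows it in Python (the function name and example figures are illustrative, not taken from the source).

```python
import math

def bits_per_byte(loss_nats_per_token: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) to bits per byte.

    Normalizing by byte count rather than token count makes the score
    independent of the tokenizer's vocabulary.
    """
    total_nats = loss_nats_per_token * n_tokens
    return total_nats / (n_bytes * math.log(2))  # nats -> bits, per byte

# Illustrative example: a loss of 2.5 nats/token on text where the
# tokenizer emits roughly 1 token per 4 bytes gives ~0.90 bits per byte.
print(bits_per_byte(2.5, n_tokens=1_000_000, n_bytes=4_000_000))
```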
| Component | Raw Size |Weight|Epochs|Effective Size|Mean Document Size|
|------------------|----------|------|-----:|--------------|------------------|
|Pile-CC |227.12 GiB|18.11%| 1.0|227.12 GiB |4.33 KiB |
|PubMed Central |90.27 GiB |14.40%| 2.0|180.55 GiB |30.55 KiB |
|Books3 |100.96 GiB|12.07%| 1.5|151.44 GiB |538.36 KiB |
|OpenWebText2 |62.77 GiB |10.01%| 2.0|125.54 GiB |3.85 KiB |
|ArXiv |56.21 GiB |8.96% | 2.0|112.42 GiB |46.61 KiB |
|Github |95.16 GiB |7.59% | 1.0|95.16 GiB |5.25 KiB |
|FreeLaw |51.15 GiB |6.12% | 1.5|76.73 GiB |15.06 KiB |
|StackExchange |32.20 GiB |5.13% | 2.0|64.39 GiB |2.16 KiB |
|USPTO Backgrounds |22.90 GiB |3.65% | 2.0|45.81 GiB |4.08 KiB |
|PubMed Abstracts |19.26 GiB |3.07% | 2.0|38.53 GiB |1.30 KiB |
|Gutenberg (PG-19)|10.88 GiB |2.17% | 2.5|27.19 GiB |398.73 KiB |
|OpenSubtitles |12.98 GiB |1.55% | 1.5|19.47 GiB |30.48 KiB |
|Wikipedia (en) |6.38 GiB |1.53% | 3.0|19.13 GiB |1.11 KiB |
|DM Mathematics |7.75 GiB |1.24% | 2.0|15.49 GiB |8.00 KiB |
|Ubuntu IRC |5.52 GiB |0.88% | 2.0|11.03 GiB |545.48 KiB |
|BookCorpus2 |6.30 GiB |0.75% | 1.5|9.45 GiB |369.87 KiB |
|EuroParl |4.59 GiB |0.73% | 2.0|9.17 GiB |68.87 KiB |
|HackerNews |3.90 GiB |0.62% | 2.0|7.80 GiB |4.92 KiB |
|YoutubeSubtitles |3.73 GiB |0.60% | 2.0|7.47 GiB |22.55 KiB |
|PhilPapers |2.38 GiB |0.38% | 2.0|4.76 GiB |73.37 KiB |
|NIH ExPorter |1.89 GiB |0.30% | 2.0|3.79 GiB |2.11 KiB |
|Enron Emails |0.88 GiB |0.14% | 2.0|1.76 GiB |1.78 KiB |
|Total |825.18 GiB| | |1254.20 GiB |5.91 KiB |
Source: https://pile.eleuther.ai/
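The Weight, Epochs, and Effective Size columns are related by simple bookkeeping: a component's effective size is its raw size times its epoch count, and its weight is that effective size as a fraction of the 1254.20 GiB total. The snippet below is a minimal check using a few rows copied from the table; any last-digit discrepancies come from rounding in the published figures.

```python
# name: (raw size in GiB, epochs), values copied from the table above.
components = {
    "Pile-CC": (227.12, 1.0),
    "PubMed Central": (90.27, 2.0),
    "Books3": (100.96, 1.5),
    "Wikipedia (en)": (6.38, 3.0),
}
TOTAL_EFFECTIVE_GIB = 1254.20  # total effective size from the table

for name, (raw_gib, epochs) in components.items():
    effective = raw_gib * epochs            # Effective Size column
    weight = effective / TOTAL_EFFECTIVE_GIB  # Weight column
    print(f"{name:16s} effective={effective:8.2f} GiB  weight={weight:6.2%}")
```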