GPT-Code-Clippy (GPT-CC) is an open source version of GitHub Copilot, a language model (based on GPT-3, called GPT-Codex) that is fine-tuned on publicly available code from GitHub.
The dataset used to train GPT-CC is obtained from SEART GitHub Search using the following criteria:
- 10+ GitHub stars
- Must have a license
- Size < 70708 bytes

These repositories are then combined with all of the GitHub repositories contained in The Pile. A full description can be found here: https://discuss.huggingface.co/t/pretrain-gpt-neo-for-open-source-github-copilot-model/7678?u=ncoop57
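The filtering criteria above can be sketched as a simple predicate over repository metadata. This is a minimal illustration, not the project's actual pipeline; the field names (`stars`, `license`, `size_bytes`) are hypothetical and do not reflect SEART GitHub Search's real schema.

```python
def matches_criteria(repo: dict) -> bool:
    """Return True if a repository record satisfies the GPT-CC dataset filters.

    Field names are illustrative placeholders, not SEART's actual schema.
    """
    return (
        repo.get("stars", 0) >= 10               # 10+ GitHub stars
        and repo.get("license") is not None       # must have a license
        and repo.get("size_bytes", 0) < 70708     # size < 70708 bytes
    )

# Toy records showing how each criterion excludes a repository.
repos = [
    {"name": "kept",        "stars": 120, "license": "MIT", "size_bytes": 50_000},
    {"name": "too-few-stars", "stars": 3,  "license": "MIT", "size_bytes": 50_000},
    {"name": "no-license",  "stars": 50,  "license": None,  "size_bytes": 50_000},
    {"name": "too-large",   "stars": 50,  "license": "MIT", "size_bytes": 90_000},
]
selected = [r["name"] for r in repos if matches_criteria(r)]
```

Here only the first record passes all three filters, so `selected` contains just `"kept"`.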