Project CodeNet by IBM

Project CodeNet is a large dataset developed by IBM aimed at teaching AI to code, it consists of some 14M code samples and about 500M lines of code in more than 55 different programming languages, from modern ones like C++, Java, Python, and Go to legacy languages like COBOL, Pascal, and FORTRAN.
AI for Code is poised to transition from proof-of-concept to widespread adoption. To provide a catalyst for such a tipping point, researchers at IBM Research have introduced Project CodeNet, a large-scale dataset for benchmarking and experimentation. Project CodeNet has many characteristics (large scale, diveristy, etc.) similar to ImageNet, a huge dataset for imagery that had a dramatic impact on the field of computer vision research. Project CodeNet is a large scale dataset with approximately 14 million code samples, each of which is an intended solution to one of 4000 coding problems. Project CodeNet aims to do for AI for Code what ImageNet did for computer vision.