Pre-trained Contextual Embedding of Source Code
Aditya Kanade, Petros Maniatis, Gogul Balakrishnan, and Kensen Shi
Abstract:
Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in
natural-language understanding has come with the development of
pre-trained contextual embeddings, such as BERT, which can be
fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracy. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; this is the gap that this paper aims to fill. Specifically, first, we curate a massive, deduplicated
corpus of 7.4M Python files from GitHub, which we use to pre-train
CuBERT, an open-sourced code-understanding BERT model; and, second, we
create an open-sourced benchmark that comprises five classification
tasks and one program-repair task, akin to code-understanding tasks previously proposed in the literature. We fine-tune CuBERT on our
benchmark tasks, and compare the resulting models to different
variants of Word2Vec token embeddings, BiLSTM and Transformer models,
as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and fewer labeled examples. Future work on source-code embedding can benefit
from reusing our benchmark, and from comparing against CuBERT models
as a strong baseline.
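As a rough illustration of the fine-tuning step described in the abstract, the sketch below fine-tunes a BERT-style encoder on a single labeled code snippet for a binary classification task. It uses the Hugging Face transformers API purely as a stand-in; the checkpoint name, label convention, and tokenizer are placeholders of my choosing, not the released CuBERT artifacts, which use their own code-specific vocabulary and pipeline.

```python
# Illustrative sketch only: fine-tuning a BERT-style encoder for a binary
# code-classification task. Checkpoint, labels, and snippet are placeholders.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

checkpoint = "bert-base-uncased"  # placeholder; substitute a code-pre-trained checkpoint
tokenizer = BertTokenizerFast.from_pretrained(checkpoint)
model = BertForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# A toy labeled example, e.g. for a buggy/correct classification task.
snippet = "def is_empty(xs):\n    return len(xs) != 0"
inputs = tokenizer(snippet, truncation=True, max_length=512, return_tensors="pt")
labels = torch.tensor([1])  # hypothetical convention: 1 = buggy, 0 = correct

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**inputs, labels=labels)  # cross-entropy loss over [CLS] logits
outputs.loss.backward()
optimizer.step()
```

In practice one would iterate this update over the benchmark's training set and evaluate on held-out examples; the point of the sketch is only that fine-tuning adds a small classification head on top of the pre-trained encoder and updates all weights with a modest learning rate.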
Paper available as:
[Official Version]