Source Code

Source code, the fundamental building block of software, is the subject of intense research aimed at improving its analysis, generation, and security. Current work applies machine learning, particularly transformer-based models such as BERT and GPT variants as well as graph neural networks, to detect vulnerabilities, predict defects, and generate code from natural language descriptions. These advances stand to improve code quality, security, and developer productivity, while raising new challenges in code authorship attribution and the detection of AI-generated code.

Papers

January 20, 2023

Which Features are Learned by CodeBert: An Empirical Study of the BERT-based Source Code Representation Learning
Lan Zhang, Chen Cao, Zhilong Wang, Peng Liu
Natural Language Processing, Empirical Study, Feature Wise, Source Code, Code Representation, Bidirectional Encoder Representation

January 4, 2023

Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries
Ali Al-Kaswan, Toufique Ahmed, Maliheh Izadi, Anand Ashok Sawant, Premkumar Devanbu, Arie van Deursen
Source Code, Code Representation, Binary Code, Reverse Engineering, Binary Analysis

December 20, 2022

A Survey on Pretrained Language Models for Neural Code Intelligence
Yichen Xu, Yanqiao Zhu
Language Model, Timely Survey, Pretrained Language Model, Source Code, Code Summarization, Programming Community

December 18, 2022

JEMMA: An Extensible Java Dataset for ML4Code Applications
Anjan Karmakar, Miltiadis Allamanis, Romain Robbes
Source Code, Source Code Model

December 12, 2022

Parameter-Efficient Finetuning of Transformers for Source Code
Shamil Ayupov, Nadezhda Chirkova
Fine Tuning, Transformer, Parameter Efficient Fine Tuning, Pre Trained Transformer, Source Code, Efficient Fine Tuning, Parameter Efficient Finetuning

December 6, 2022

Codex Hacks HackerRank: Memorization Issues and a Framework for Code Synthesis Evaluation
Anjan Karmakar, Julian Aron Prenner, Marco D'Ambros, Romain Robbes
New Framework, Source Code, Limited Memorization, Tabular Model, Code Synthesis

November 20, 2022

The Stack: 3 TB of permissively licensed source code
Denis Kocetkov, Raymond Li, Loubna Ben Allal, Jia Li, Chenghao Mou, Carlos Muñoz Ferrandis, Yacine Jernite, Margaret Mitchell, Sean Hughes, Thomas Wolf, Dzmitry Bahdanau, Leandro von Werra, Harm de Vries
Natural Language Processing, Source Code, Text Benchmark, Software Stack

November 15, 2022

October 31, 2022

October 27, 2022

Conversing with Copilot: Exploring Prompt Engineering for Solving CS1 Problems Using Natural Language
Paul Denny, Viraj Kumar, Nasser Giacaman
Natural Language, Prompt Engineering, Source Code, OpenAI Codex, Programming Education, Open Source Code, NextGen Communication, COPILOT

October 15, 2022

Code Recommendation for Open Source Software Developers
Yiqiao Jin, Yunsheng Bai, Yanqiao Zhu, Yizhou Sun, Wei Wang
Source Code, Code Recommendation, Open Source Software

October 11, 2022

October 1, 2022

Improving ProtoNet for Few-Shot Video Object Recognition: Winner of ORBIT Challenge 2022
Li Gu, Zhixiang Chi, Huan Liu, Yuanhao Yu, Yang Wang
Source Code, Orbital Motion, Frame Level Anomaly

August 26, 2022

I still know it's you! On Challenges in Anonymizing Source Code
Micha Horlboge, Erwin Quiring, Roland Meyer, Konrad Rieck
Technical Challenge, Source Code, Obfuscation Technique, Consistent Anonymization Effect

August 23, 2022

Preprocessing Source Code Comments for Linguistic Models
Sergey Matskevich, Colin S. Gordon
Source Code, Code Summarization, Online Comment, Reference Dataset, Code Comment

Cyrus2D base: Source Code Base for RoboCup 2D Soccer Simulation League

A Hierarchical Deep Neural Network for Detecting Lines of Codes with Vulnerabilities

Unsafe's Betrayal: Abusing Unsafe Rust in Binary Reverse Engineering via Machine Learning

Automated Code Extraction from Discussion Board Text Dataset

Poison Attack and Defense on Deep Source Code Processing Models

Follow-up Attention: An Empirical Study of Developer and Neural Model Code Exploration

Code Librarian: A Software Package Recommendation System

Leveraging Artificial Intelligence on Binary Code Comprehension
