Source Code

Source code, the fundamental building block of software, is the subject of intense research aimed at improving its analysis, generation, and security. Current efforts use machine learning, particularly transformer-based models such as BERT and GPT variants, as well as graph neural networks, to detect vulnerabilities, predict defects, and generate code from natural language descriptions. These advances have significant implications for software development: they can improve code quality, security, and developer productivity, while raising new challenges around code authorship attribution and the detection of AI-generated code.

Papers

February 3, 2024

EffiBench: Benchmarking the Efficiency of Automatically Generated Code
Dong Huang, Yuhao Qing, Weiyi Shang, Heming Cui, Jie M. Zhang
High Efficiency Source Code Code Generation Model Code Efficiency

January 13, 2024

GEML: A Grammar-based Evolutionary Machine Learning Approach for Design-Pattern Detection
Rafael Barbudo, Aurora Ramírez, Francisco Servant, José Raúl Romero
Source Code Design Pattern Code Metric Classification Rule Grammar Guided Genetic Programming

January 12, 2024

Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers
Yuling Shi, Hongyu Zhang, Chengcheng Wan, Xiaodong Gu
New Machine Code Generation Real World Code Source Code Code Mixed Best Fit Line Human Programmer Distinct Pattern

January 8, 2024

Assessing AI Detectors in Identifying AI-Generated Code: Implications for Education
Wei Hung Pan, Ming Jie Chok, Jonathan Leong Shan Wong, Yung Xin Shin, Yeong Shian Poon, Zhou Yang, Chun Yong Chong, David Lo, Mei Kuan Lim
Generated Content Source Code Education Domain Future Implication AI Generated Code

December 27, 2023

Source Code is a Graph, Not a Sequence: A Cross-Lingual Perspective on Code Clone Detection
Mohammed Ataaur Rahaman, Julia Ive
Graph Drawing Source Code Sequence of Sequence Code Clone Detection Code Clone

December 18, 2023

Traces of Memorisation in Large Language Models for Code
Ali Al-Kaswan, Maliheh Izadi, Arie van Deursen
Large Language Model Real World Code Source Code Code Completion Model Finite Trace Data Extraction Attack Memorisation Profile

December 8, 2023

December 7, 2023

STraceBERT: Source Code Retrieval using Semantic Application Traces
Claudio Spiess
Source Code Reverse Engineering Code Retrieval Semantic Application

October 14, 2023

A study of the impact of generative AI-based data augmentation on software metadata classification
Tripti Kumari, Chakali Sai Charan, Ayan Das
Large Language Model Global Impact Study Feature Source Code Generative Data Augmentation Automatic Usefulness Prediction Code Comment Pair Software Classification

September 5, 2023

Revisiting File Context for Source Code Summarization
Aakash Bansal, Chia-Yi Su, Collin McMillan
Encoder Decoder Source Code Code Summarization Context Encoding Cross File Context

August 28, 2023

August 26, 2023

EditSum: A Retrieve-and-Edit Framework for Source Code Summarization
Jia Li, Yongmin Li, Ge Li, Xing Hu, Xin Xia, Zhi Jin
Source Code Code Summarization Generated Summary

August 23, 2023

Benchmarking Causal Study to Interpret Large Language Models for Source Code
Daniel Rodriguez-Cardenas, David N. Palacio, Dipin Khati, Henry Burke, Denys Poshyvanyk
Code Generation Causal Inference Causal Discovery Source Code Code Summarization

August 1, 2023

CodeBPE: Investigating Subtokenization Options for Large Language Model Pretraining on Source Code
Nadezhda Chirkova, Sergey Troshin
Large Language Model Transformer Based Language Model Source Code

July 27, 2023

CodeLens: An Interactive Tool for Visualizing Code Representations
Yuejun Guo, Seifeddine Bettaieb, Qiang Hu, Yves Le Traon, Qiang Tang
Source Code Code Representation Abstract Syntax Tree

July 5, 2023

June 22, 2023

FLAG: Finding Line Anomalies (in code) with Generative AI
Baleegh Ahmad, Benjamin Tan, Ramesh Karri, Hammond Pearce
Generative AI Real World Code Source Code Code Debugging Bug Report Line Detection Label Aggregation LLM Era

Papers

INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers

LLM Interactive Optimization of Open Source Python Libraries -- Case Studies and Generalization

Distilled GPT for Source Code Summarization

Using ChatGPT as a Static Application Security Testing Tool

An Exploratory Literature Study on Sharing and Energy Use of Language Models for Source Code

The FormAI Dataset: Generative AI in Software Security Through the Lens of Formal Verification
