CodeDJ: Reproducible Queries over Large-Scale Software Repositories
Fri 16 Jul 2021 19:00 - 19:20 at ECOOP 1 - Potpourri (time band 1) Chair(s): Omer Tripp
Analyzing massive code bases has become a staple of modern software engineering research, a welcome side effect of the advent of large public software repositories such as GitHub. Yet finding which projects to analyze is a labor-intensive process, and it can lead to biased results if the selection is not representative. The search interfaces exposed by mainstream software repositories allow researchers to formulate only very basic queries. This paper reports on CodeDJ, an infrastructure designed to assist researchers in querying such repositories and identifying projects of interest. The infrastructure is composed of two subsystems: a persistent datastore that is constantly updated with information acquired from its target large-scale software repository (in our case GitHub), and an in-memory database with a query interface written in Rust and designed to follow popular data science API principles. Our infrastructure has built-in support for reproducibility. Users can formulate historical queries that are answered deterministically using historical states of the datastore; thus researchers can always reproduce published results. To illustrate the benefits of the proposed system, we revisit a paper aiming to establish a correlation between programming languages and software defects. Using CodeDJ, we identify biases in the dataset used in the original paper. By repeating the analysis performed by the original authors with new data, we demonstrate that the results of the paper are highly sensitive to the choice of projects.
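The idea of answering a historical query deterministically can be illustrated with a minimal sketch. This is not CodeDJ's API (its query interface is written in Rust and is not shown on this page); the names `ProjectRecord`, `Datastore`, `as_of`, and `top_projects`, and the fields they use, are hypothetical. The sketch assumes an append-only store in which every record carries the time it was acquired, so a query pinned to a cutoff time returns the same answer no matter how much data arrives afterwards.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ProjectRecord:
    """One observation of a project (hypothetical schema)."""
    project: str
    language: str
    stars: int
    acquired_at: int  # time the record was acquired, as a Unix timestamp


class Datastore:
    """Append-only log: updates add records and never overwrite old ones."""

    def __init__(self) -> None:
        self._log: List[ProjectRecord] = []

    def append(self, record: ProjectRecord) -> None:
        self._log.append(record)

    def as_of(self, cutoff: int) -> List[ProjectRecord]:
        """Latest record per project among those acquired at or before cutoff."""
        latest: Dict[str, ProjectRecord] = {}
        for rec in self._log:
            if rec.acquired_at <= cutoff:
                prev = latest.get(rec.project)
                if prev is None or rec.acquired_at >= prev.acquired_at:
                    latest[rec.project] = rec
        # Sort by name so the result order does not depend on insertion order.
        return sorted(latest.values(), key=lambda r: r.project)


def top_projects(store: Datastore, cutoff: int, language: str, n: int) -> List[str]:
    """Historical query: the n most-starred projects in a language as of cutoff."""
    matching = [r for r in store.as_of(cutoff) if r.language == language]
    # Break ties on project name so the ranking is fully deterministic.
    matching.sort(key=lambda r: (-r.stars, r.project))
    return [r.project for r in matching[:n]]


store = Datastore()
store.append(ProjectRecord("alpha", "Rust", 10, acquired_at=100))
store.append(ProjectRecord("beta", "Rust", 25, acquired_at=100))
before = top_projects(store, cutoff=100, language="Rust", n=1)

# The datastore keeps growing, but the query pinned to cutoff=100 is unaffected.
store.append(ProjectRecord("alpha", "Rust", 90, acquired_at=200))
after = top_projects(store, cutoff=100, language="Rust", n=1)
current = top_projects(store, cutoff=200, language="Rust", n=1)
```

In this sketch `before` and `after` are identical even though the store changed between the two calls, while `current` reflects the newer record. Pinning the cutoff and breaking ties explicitly are the two choices that make the result repeatable.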
Fri 16 Jul | Displayed time zone: Brussels, Copenhagen, Madrid, Paris
08:00 - 09:00 | Empirical Studies / Parallelism (time band 3) | ECOOP Technical Papers at ECOOP 1 | Chair(s): Hakjoo Oh (Korea University)
08:00 | 20m | Talk | CodeDJ: Reproducible Queries over Large-Scale Software Repositories | ECOOP Technical Papers | Petr Maj (Czech Technical University), Konrad Siek (Czech Technical University in Prague), Jan Vitek (Northeastern University / Czech Technical University), Alexander Kovalenko (Czech Technical University in Prague)
08:20 | 20m | Talk | Enabling Additional Parallelism in Asynchronous JavaScript Applications | ECOOP Technical Papers
08:40 | 20m | Talk | Do Bugs Propagate? An Empirical Analysis of Temporal Correlations among Software Bugs | ECOOP Technical Papers | Xiaodong Gu (Shanghai Jiao Tong University, China), Sunghun Kim (Hong Kong University of Science and Technology), Yo-Sub Han (Yonsei University), Hongyu Zhang (University of Newcastle)
19:00 - 20:00 |
19:00 | 20m | Talk | CodeDJ: Reproducible Queries over Large-Scale Software Repositories | ECOOP Technical Papers | Petr Maj (Czech Technical University), Konrad Siek (Czech Technical University in Prague), Jan Vitek (Northeastern University / Czech Technical University), Alexander Kovalenko (Czech Technical University in Prague)
19:20 | 20m | Talk | Differential Privacy for Coverage Analysis of Software Traces | ECOOP Technical Papers | Yu Hao (Ohio State University), Sufian Latif (Ohio State University), Hailong Zhang (Fordham University), Raef Bassily (Ohio State University), Atanas Rountev (Ohio State University)
19:40 | 20m | Talk | Dealing with Variability in API Misuse Specification | ECOOP Technical Papers | Rodrigo Bonifácio (Computer Science Department - University of Brasília), Stefan Krüger (Independent Researcher), Krishna Narasimhan (TU Darmstadt), Eric Bodden (University of Paderborn; Fraunhofer IEM), Mira Mezini (TU Darmstadt, Germany)