:::info Authors:
(1) Qianou Ma (Corresponding author), Carnegie Mellon University, Pittsburgh, USA ([email protected]);
(2) Tongshuang Wu, Carnegie Mellon University, Pittsburgh, USA ([email protected]);
(3) Kenneth Koedinger, Carnegie Mellon University, Pittsburgh, USA ([email protected]).
:::
Table of Links

2. Contexts, Methods, and Tasks
3.1. Quality and 3.2 Productivity
5. Discussion and Future Work
5.1. LLM, Your pAIr Programmer?
5.2. LLM, A Better pAIr Programmer?
5.3. LLM, Students’ pAIr Programmer?
6. Conclusion, Acknowledgments, and References
ABSTRACT

The emergence of large language models (LLMs) that excel at code generation, along with commercial products such as GitHub’s Copilot, has sparked interest in human-AI pair programming (referred to as “pAIr programming”), in which an AI system collaborates with a human programmer. While traditional pair programming between humans has been extensively studied, it remains uncertain whether its findings apply to human-AI pair programming. We compare human-human and human-AI pair programming, exploring their similarities and differences in interaction, measures, benefits, and challenges. We find that the literature reports mixed results on the effectiveness of both approaches (though the measures used for pAIr programming are not as comprehensive). We summarize moderating factors on the success of human-human pair programming, which suggest opportunities for pAIr programming research. For example, mismatched expertise makes pair programming less productive, so a well-designed AI programming assistant might adapt to differences in expertise levels.
1 INTRODUCTION

Pair programming was first introduced in the 1990s as part of the Agile software development practice [9]. In its original definition, pair programming describes the practice of two programmers working together on the same task using a single computer, keyboard, and mouse. One programmer in the pair, the “driver,” performs the coding (typing) and implements the task, while the other programmer, the “navigator,” aids in planning, reviewing, debugging, and suggesting improvements and alternatives. Over time, pair programming has evolved and adapted to different contexts and purposes. It is now used in a wide range of settings, including education, industry, and open-source software development [5, 83].
\ Recent advances in code-generating large language models (LLMs) have led to the widespread popularity of commercial AI-powered programming assistance tools such as GitHub Copilot [26], which advertises itself as “your AI pair programmer.” In pAIr programming, instead of two humans working on a single computer, the programmer and the LLM-based AI work together on the same task. This paradigm shift raises several questions: Is the AI programming partner comparable to a human pair programmer? Are they applicable to the same contexts, can they achieve similar or better performance, and should people interact with them in the same way?
\ In this work, we delve into the current state of research on human-human and human-AI pair programming to uncover their similarities and differences, and we hope to inspire better evaluations and designs of code-generating LLMs as a pAIr programmer. We start by reviewing the application context, methods, and tasks for both human-human and human-AI pair programming literature (Section 2), then dive into fine-grained comparisons of their measurements of success (Section 3), as well as the contributing moderators, e.g., pair compatibility factors like expertise (Section 4).
\ We find that (1) prior work on both pair programming paradigms has observed mixed results in quality, productivity, satisfaction, learning, and cost, (2) pAIr programming has yet to develop comprehensive measurements, and (3) key factors to pAIr’s success have been largely unexplored.
\ Building on our exploration, we then discuss views and challenges of characterizing AI as a pair programmer, and elaborate on future opportunities for developing best practices and guidelines for human-AI pair programming (Section 5). First, we argue that moderating factors that bring challenges to human-human pair programming (e.g., compatibility and communication) unveil opportunities to improve human-AI pair programming. It can be promising to exploit the differences between a human and an AI partner (e.g., more customizable expertise level and more adaptable communication styles) to design for more successful human-AI pair programming experiences. Second, we encourage future research to explore the best deployment environment for human-AI pair programming. While most human-AI pair programming works have focused on assisting professional developers, we hope to inspire more future works in the learning context (or, student-AI pair programming), and we highlight potential challenges involved.
2 CONTEXTS, METHODS, AND TASKS

Human-human pair programming originated as a practice in the software engineering industry [9] and then became a popular collaborative learning practice in classrooms [83]. Therefore, in this paper, we compare human-human and human-AI pair programming in both the industry and education contexts, as they are the most common contexts.
\ We adhere to the original definition of human-human pair programming to closely resemble human-AI interaction on a single device. Other modes of interaction exist for comparing human and human-AI teams in programming tasks, such as computer-mediated collaborative learning [71] and distributed pair programming [19], but they are beyond the scope of this paper.
\ For human-AI pair programming, most current work evaluates Copilot through case studies (e.g., [12]) or experimental studies (e.g., [84]) with experienced programmers in industry. As in human-human pair programming research, researchers tried to mimic a realistic professional development environment in their task setup. For example, Barke et al. [8] invited 20 participants, mostly doctoral students and software engineers, to complete tasks such as developing a chat client and server. However, there is a lack of non-invasive field observation studies like those conducted for human-human pair programming [65, 75].
\ A few recent works have explored using LLM-based programming environments or Copilot with students. For example, Kazemitabaar et al. [39] conducted a controlled experimental study with 69 novice students from 10 to 17 years old working on 45 Python code-authoring and code-modifying tasks. However, existing work on human-AI pair programming consists mostly of lab experiments; it still lacks the large-scale studies [51] and classroom deployments [57, 87] found in the human-human pair programming literature.
\ When setting up comparison groups, existing pAIr programming works have compared the human-AI pair against human-human pairs [35] or human solo work (e.g., comparing developers’ work when they use Copilot versus the default code-completion tool) [84]. No current study sets up a three-way comparison among human-AI, human-human, and human-solo conditions.
\ Summary: In comparison to human-human pair programming works, existing pAIr studies lack realistic deployment in workplaces or classrooms, and would benefit from larger sample sizes. Researchers of both pair programming paradigms use various study designs to examine what affects the effectiveness of pair programming. In Section 3 and Section 4, we compare the variables and measurements they used to further uncover what is lacking in pAIr studies.
\
:::info This paper is available on arxiv under CC BY-NC-SA 4.0 DEED license.
:::