The introductory programming sequence has been the focus of much research in computing education. The recent advent of several viable and freely available AI-driven code generation tools presents several immediate opportunities and challenges in this domain. In this position paper we argue that the community needs to act quickly in deciding what possible opportunities can and should be leveraged and how, while also working on how to overcome or otherwise mitigate the possible challenges. Assuming that the effectiveness and proliferation of these tools will continue to progress rapidly, without quick, deliberate, and concerted efforts, educators will lose advantage in helping shape what opportunities come to be, and what challenges will endure. With this paper we aim to seed this discussion within the computing education community.
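This article explores the natural language generation capabilities of large language models, applied to two types of learning resources common in programming courses. Using OpenAI Codex as the large language model, we create programming exercises (including sample solutions and test cases) and code explanations, and assess them qualitatively and quantitatively. Our results suggest that the majority of the automatically generated content is both novel and sensible, and in some cases ready to use as is. When creating exercises, we find that it is remarkably easy to influence both the programming concepts and the contextual themes they contain simply by supplying keywords as input to the model. Our analysis suggests that massive generative machine learning models are a tool for instructors, although some oversight is still needed to ensure the quality of the generated content before it is delivered to students. We further discuss the implications of OpenAI Codex and similar tools for introductory programming education and highlight future research streams that have the potential to improve the quality of the educational experience for both teachers and students.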
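Large language models, such as OpenAI's Codex and DeepMind's AlphaCode, can generate code to solve a variety of problems expressed in natural language. This technology has already been commercialised in at least one widely used programming editor extension: GitHub Copilot. In this paper, we explore how programming with large language models (LLM-assisted programming) is similar to, and differs from, prior conceptualisations of programmer assistance. We draw upon publicly available experience reports of LLM-assisted programming, as well as prior usability and design studies. We find that while LLM-assisted programming shares some properties of compilation, pair programming, and programming via search and reuse, there are fundamental differences both in the technical possibilities and in the practical experience. Thus, LLM-assisted programming ought to be viewed as a new way of programming with its own distinct properties and challenges. Finally, we draw upon observations from a user study in which non-expert end-user programmers use LLM-assisted tools to solve data tasks in spreadsheets. We discuss the issues that might arise in applying large language models to end-user programming, particularly with users who have little or no programming expertise.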
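As artificial intelligence (AI) technologies become increasingly powerful and prominent in society, their misuse is a growing concern. In educational settings, AI technologies could be used by students to cheat on assignments and exams. In this paper we explore whether transformers can be used to solve introductory-level programming assignments while bypassing commonly used AI tools for detecting similarities between pieces of software. We find that a student using GPT-J [Wang and Komatsuzaki, 2021] can complete introductory-level programming assignments without triggering suspicion from MOSS [Aiken, 2000], a widely used software similarity and plagiarism detection tool. This holds despite the fact that GPT-J was not trained on the problems in question and is not provided with any examples to work from. We further find that the code written by GPT-J is diverse in structure, lacking any particular tells that future plagiarism detection techniques might use to identify algorithmically generated code. We conclude with a discussion of the ethical and educational implications of large language models and directions for future research.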
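Automatic program synthesis is a long-standing dream in software engineering. Recently, a promising deep learning (DL) based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some studies evaluate the correctness of Copilot's solutions and report its issues, more empirical evaluations are necessary to understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two different programming tasks: (1) generating (and reproducing) correct and efficient solutions for fundamental algorithmic problems, and (2) comparing Copilot's proposed solutions with those of human programmers on a set of programming tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems in computer science, such as those involving basic data structures. For the latter, we use a dataset of programming problems with human-provided solutions. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems; however, some solutions are buggy and non-reproducible. Moreover, Copilot has some difficulty combining multiple methods to generate a solution. Comparing Copilot to humans, our results show that the correct ratio of human solutions is greater than that of Copilot's suggestions, while the buggy solutions generated by Copilot require less effort to repair. Although, as highlighted in this study and in previous ones, Copilot shows limitations as an assistant for developers, especially in advanced programming tasks, it can generate preliminary solutions for basic programming tasks.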
This study evaluated the ability of ChatGPT, a recently developed artificial intelligence (AI) agent, to perform high-level cognitive tasks and produce text that is indistinguishable from human-generated text. This capacity raises concerns about the potential use of ChatGPT as a tool for academic misconduct in online exams. The study found that ChatGPT is capable of exhibiting critical thinking skills and generating highly realistic text with minimal input, making it a potential threat to the integrity of online exams, particularly in tertiary education settings where such exams are becoming more prevalent. Returning to invigilated and oral exams could form part of the solution. Advanced proctoring techniques and AI-text output detectors may also be effective in addressing this issue, but they are not likely to be foolproof solutions. Further research is needed to fully understand the implications of large language models like ChatGPT and to devise strategies for combating the risk of cheating using these tools. It is crucial for educators and institutions to be aware of the possibility of ChatGPT being used for cheating and to investigate measures to address it in order to maintain the fairness and validity of online exams for all students.
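Educational technologies, and the systems of schooling in which they are deployed, enact particular ideologies about what is important to know and how learners should learn. As artificial intelligence technologies, in education and beyond, may contribute to inequitable outcomes for marginalized communities, various approaches have been developed to evaluate and mitigate the harmful impacts of AI. However, we argue in this paper that the dominant paradigm of evaluating fairness on the basis of performance disparities in AI models is inadequate for confronting the systemic inequities that educational AI systems (re)produce. We draw on a lens of structural injustice informed by critical theory and Black feminist scholarship to critically interrogate several widely studied and widely adopted categories of educational AI, and explore how they are bound up in and reproduce historical legacies of structural injustice and inequity, regardless of the parity of their models' performance. We close with alternative visions for a more equitable future for educational AI.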
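There is burgeoning interest in designing AI-based systems to assist humans in designing computing systems, including tools that automatically generate computer code. The most notable of these is the first self-described "AI pair programmer", GitHub Copilot, a language model trained on open-source GitHub code. However, code often contains bugs, and so, given the vast quantity of unvetted code that Copilot has processed, it is certain that the language model will have learned from exploitable, buggy code. This raises concerns about the security of Copilot's code contributions. In this work, we systematically investigate the prevalence and conditions that can cause GitHub Copilot to recommend insecure code. To perform this analysis, we prompt Copilot to generate code in scenarios relevant to high-risk CWEs (e.g., those from MITRE's "Top 25" list). We explore Copilot's performance on three distinct code generation axes, examining how it performs given diversity of weaknesses, diversity of prompts, and diversity of domains. In total, we produce 89 different scenarios for Copilot to complete, yielding 1,689 programs. Of these, we found approximately 40% to be vulnerable.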
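Machine learning (ML) techniques are becoming increasingly prevalent in education, from predicting student dropout, to assisting in university admissions, to facilitating the rise of MOOCs. Given the rapid growth of these novel uses, there is a pressing need to investigate how ML techniques support long-standing education principles and goals. In this work, we shed light on this complex landscape, drawing on qualitative insights from interviews with education experts. These interviews comprise in-depth evaluations of ML for education (ML4Ed) papers published in preeminent applied ML conferences over the past decade. Our central research goal is to critically examine how the stated or implied education and societal objectives of these papers align with the ML problems they tackle. That is, to what extent do the technical problem formulation, objectives, approach, and interpretation of results align with the education problem at hand? We find that a cross-disciplinary gap exists and is particularly salient in two parts of the ML life cycle: the formulation of an ML problem from education goals, and the translation of predictions into interventions. We use these insights to propose an extended ML life cycle, which may also apply to the use of ML in other domains. Our work joins a growing number of meta-analytical studies across education and ML research, as well as critical analyses of the societal impact of ML. Specifically, it fills a gap between the prevailing technical understanding of machine learning and the perspective of education researchers working with students and in policy.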
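Human developers can produce code with cybersecurity weaknesses. Can emerging "smart" code completion tools help repair those weaknesses? In this work, we examine the use of large language models (LLMs) for code, such as OpenAI's Codex and AI21's Jurassic J-1, for zero-shot vulnerability repair. We investigate challenges in the design of prompts that coax LLMs into generating repaired versions of insecure code. This is difficult because natural language offers numerous ways to phrase key information, both semantically and syntactically. By performing a large-scale study of four commercially available, black-box, "off-the-shelf" LLMs, as well as a locally trained model, on a mix of synthetic, hand-crafted, and real-world security bug scenarios, our experiments show that LLMs could collectively repair 100% of our synthetically generated and hand-crafted scenarios, as well as 58% of vulnerabilities in a selection of historical bugs in real-world open-source projects.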
Computational notebooks, such as Jupyter notebooks, are interactive computing environments that are ubiquitous among data scientists to perform data wrangling and analytic tasks. To measure the performance of AI pair programmers that automatically synthesize programs for those tasks given natural language (NL) intents from users, we build ARCADE, a benchmark of 1082 code generation problems using the pandas data analysis framework in data science notebooks. ARCADE features multiple rounds of NL-to-code problems from the same notebook. It requires a model to understand rich multi-modal contexts, such as existing notebook cells and their execution states as well as previous turns of interaction. To establish a strong baseline on this challenging task, we develop PaChiNCo, a 62B code language model (LM) for Python computational notebooks, which significantly outperforms public code LMs. Finally, we explore few-shot prompting strategies to elicit better code with step-by-step decomposition and NL explanation, showing the potential to improve the diversity and explainability of model predictions.
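Deployed AI systems often do not work. They can be constructed haphazardly, deployed indiscriminately, and promoted deceptively. However, despite this reality, scholars, the press, and policymakers pay too little attention to functionality. This leads to technical and policy solutions focused on "ethical" or value-aligned deployments, often skipping over the prior question of whether a given system functions, or provides any benefits at all. To describe the harms of various types of functionality failures, we analyze a set of case studies to create a taxonomy of known AI functionality issues. We then point to policy and organizational responses that are often overlooked and that become more readily available once functionality is drawn into focus. We argue that functionality is a meaningful AI policy challenge and a necessary first step towards protecting affected communities from algorithmic harm.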
We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda capturing complex computational legal processes, and embedding them in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), leveraged as an expression of how humans communicate their goals, and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning.
Several policy options exist, or have been proposed, to further responsible artificial intelligence (AI) development and deployment. Institutions, including U.S. government agencies, states, professional societies, and private and public sector businesses, are well positioned to implement these policies. However, given limited resources, not all policies can or should be equally prioritized. We define and review nine suggested policies for furthering responsible AI, rank each policy on potential use and impact, and recommend prioritization relative to each institution type. We find that pre-deployment audits and assessments and post-deployment accountability are likely to have the highest impact but also the highest barriers to adoption. We recommend that U.S. government agencies and companies highly prioritize development of pre-deployment audits and assessments, while the U.S. national legislature should highly prioritize post-deployment accountability. We suggest that U.S. government agencies and professional societies should highly prioritize policies that support responsible AI research and that states should highly prioritize support of responsible AI education. We propose that companies can highly prioritize involving community stakeholders in development efforts and supporting diversity in AI development. We advise lower levels of prioritization across institutions for AI ethics statements and databases of AI technologies or incidents. We recognize that no one policy will lead to responsible AI and instead advocate for strategic policy implementation across institutions.
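In this work, we explain the setup for a technical, graduate-level course on Fairness, Accountability, Confidentiality, and Transparency in Artificial Intelligence (FACT-AI) at the University of Amsterdam, which teaches FACT-AI concepts through the lens of reproducibility. The focal point of the course is a group project based on reproducing existing FACT-AI algorithms from top AI conferences and writing a report about the experience. In the first iteration of the course, we created an open-source repository with the code implementations from the group projects. In the second iteration, we encouraged students to submit their group projects to the Machine Learning Reproducibility Challenge, which resulted in 9 reports from our course being accepted. We reflect on our experience teaching the course over two academic years, one of which coincided with a global pandemic, and propose guidelines for teaching FACT-AI through reproducibility in graduate-level AI programs. We hope this can be a useful resource for instructors who want to set up similar courses at their universities in the future.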
This paper provides an introductory survey to GPT-3. We cover some of the historical development behind this technology, some of the key features of GPT-3, and discuss the machine learning model and the datasets used. We survey both academic and commercial efforts applying GPT-3 in diverse domains such as developing conversational AI chatbots, software development, creative work, domain knowledge, and business productivity. We discuss some of the challenges that GPT-3 faces such as the problems of training complexity, bias, and hallucination/incorrect answers. We also discuss the future research opportunities in this area.
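The 44th International Conference on Software Engineering (ICSE 2022) was held in person from May 22 to May 27, 2022, in Pittsburgh, PA, USA. Here, we summarize the research themes and research directions in software engineering and testing that we observed at the conference.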
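AI natural language generation (NLG) is a process in which computer systems generate human-comprehensible language text from information. It can become an integral part of a human's creative writing process. Importantly, young people can learn to apply NLG in mainstream education and become better prepared for AI-enhanced writing jobs and other writing endeavors. To explore how students apply NLG to creative writing, we designed and implemented the first Human-AI Creative Writing Contest in a Hong Kong secondary school. In this contest, each student participant wrote a short story of about 500 words using the student's own words and words generated by a computer and built on open-source language models. We designed four text generators for the contest to supply the computer's text entries. Additionally, using design-based research, we developed seven workshops in which students learned to write with the four text generators and answered reflection questions. In analyzing four students' short stories and the adjudicators' scores for the stories, we found different strategies in terms of the number and type of text generator words that students used. Some strategies appeared more sophisticated than others. In analyzing students' reflections, we found that students could describe text generator input and output as units of thought. In addition, students showed preferences among text generators, and they expressed a range of feelings when writing with them. These findings not only provide design implications for NLG applications in formal schooling but also suggest pedagogical strategies for AI curricula.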
As language models (LMs) scale, they develop many novel behaviors, good and bad, exacerbating the need to evaluate how they behave. Prior work creates evaluations with crowdwork (which is time-consuming and expensive) or existing data sources (which are not always available). Here, we automatically generate evaluations with LMs. We explore approaches with varying amounts of human effort, from instructing LMs to write yes/no questions to making complex Winogender schemas with multiple stages of LM-based generation and filtering. Crowdworkers rate the examples as highly relevant and agree with 90-100% of labels, sometimes more so than corresponding human-written datasets. We generate 154 datasets and discover new cases of inverse scaling where LMs get worse with size. Larger LMs repeat back a dialog user's preferred answer ("sycophancy") and express greater desire to pursue concerning goals like resource acquisition and goal preservation. We also find some of the first examples of inverse scaling in RL from Human Feedback (RLHF), where more RLHF makes LMs worse. For example, RLHF makes LMs express stronger political views (on gun rights and immigration) and a greater desire to avoid shut down. Overall, LM-written evaluations are high-quality and let us quickly discover many novel LM behaviors.
The Codex model has demonstrated extraordinary competence in synthesizing code from natural language problem descriptions. However, in order to reveal unknown failure modes and hidden biases, such large-scale models must be systematically subjected to multiple and diverse evaluation studies. In this work, we evaluate the code synthesis capabilities of the Codex model based on a set of 115 Python problem statements from a popular competitive programming portal: HackerRank. Our evaluation shows that Codex is indeed proficient in Python, solving 96% of the problems in a zero-shot setting, and 100% of the problems in a few-shot setting. However, Codex exhibits clear signs of generating memorized code based on our evaluation. This is alarming, especially since the adoption and use of such models could directly impact how code is written and produced in the foreseeable future. With this in mind, we further discuss and highlight some of the prominent risks associated with large-scale models of source code. Finally, we propose a framework for code-synthesis evaluation using variations of problem statements based on mutations.