Food is essential to human survival, so much so that we have developed a wide variety of recipes to suit our tastes. In this work, we present a novel way to create new fine-dining recipes from scratch using transformers, specifically autoregressive language models. Given a small dataset of food recipes, we attempt to train models to recognize cooking techniques, propose novel recipes, and test the capabilities of fine-tuning with minimal data.
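As an illustration of the kind of pipeline this abstract describes, here is a minimal fine-tuning sketch for an autoregressive language model, assuming the Hugging Face transformers library; the file name recipes.txt, the base checkpoint, and all hyperparameters are illustrative assumptions, not details from the paper.

```python
# Minimal sketch: fine-tuning GPT-2 on a small recipe corpus (assumed format:
# one recipe per line in recipes.txt). Hyperparameters are placeholders.
import torch
from torch.utils.data import DataLoader
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 ships without a pad token
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

with open("recipes.txt", encoding="utf-8") as f:
    texts = [line.strip() for line in f if line.strip()]

def collate(batch):
    enc = tokenizer(batch, truncation=True, max_length=512,
                    padding=True, return_tensors="pt")
    labels = enc["input_ids"].clone()
    labels[enc["attention_mask"] == 0] = -100  # ignore padding in the loss
    enc["labels"] = labels
    return enc

loader = DataLoader(texts, batch_size=4, shuffle=True, collate_fn=collate)
model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss  # standard next-token prediction loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```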
This survey draws a broad panoramic picture of the state of the art (SoTA) of research on generative methods for the analysis of social media data. It fills a gap, as existing survey articles are either much narrower in scope or dated. We include two important aspects that are currently gaining importance in mining and modelling social media: dynamics and networks. Social dynamics are important for understanding the spreading of influence or disease, the formation of friendships, and so on. Networks, on the other hand, can capture various complex relationships, providing additional insight and identifying important patterns that would otherwise go unnoticed.
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.
The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.
AlphaZero, Leela Chess Zero and Stockfish NNUE revolutionized computer chess. This book gives a complete introduction into the technical inner workings of such engines. The book is split into four main chapters, excluding chapter 1 (introduction) and chapter 6 (conclusion): Chapter 2 introduces neural networks and covers all the basic building blocks used to construct deep networks such as those used by AlphaZero. Contents include the perceptron, back-propagation and gradient descent, classification, regression, multilayer perceptrons, vectorization techniques, convolutional networks, squeeze-and-excitation networks, fully connected networks, batch normalization and rectified linear units, residual layers, and overfitting and underfitting. Chapter 3 introduces classical search techniques used in chess engines as well as those used by AlphaZero. Contents include minimax, alpha-beta search, and Monte Carlo tree search. Chapter 4 shows how modern chess engines are designed. Aside from the ground-breaking AlphaGo, AlphaGo Zero and AlphaZero, we cover Leela Chess Zero, Fat Fritz, Fat Fritz 2, Efficiently Updatable Neural Networks (NNUE), and Maia. Chapter 5 is about implementing a miniaturized AlphaZero. Hexapawn, a minimalistic version of chess, is used as the example. Hexapawn is solved with minimax search, and training positions for supervised learning are generated. Then, as a comparison, an AlphaZero-like training loop is implemented, in which training proceeds by self-play combined with reinforcement learning. Finally, AlphaZero-like training and supervised training are compared.
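Chapter 3's classical search material centers on minimax with alpha-beta pruning; a compact sketch follows, where legal_moves, apply, evaluate, and is_terminal are assumed game-specific helpers (say, for hexapawn) rather than code from the book.

```python
# Minimax with alpha-beta pruning: prune branches that cannot change the result.
def alphabeta(state, depth, alpha, beta, maximizing):
    if depth == 0 or is_terminal(state):
        return evaluate(state)  # static evaluation from the maximizer's view
    if maximizing:
        value = float("-inf")
        for move in legal_moves(state):
            value = max(value, alphabeta(apply(state, move), depth - 1,
                                         alpha, beta, False))
            alpha = max(alpha, value)
            if alpha >= beta:  # beta cutoff: the opponent avoids this branch
                break
        return value
    else:
        value = float("inf")
        for move in legal_moves(state):
            value = min(value, alphabeta(apply(state, move), depth - 1,
                                         alpha, beta, True))
            beta = min(beta, value)
            if beta <= alpha:  # alpha cutoff
                break
        return value
```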
Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.
Current language models can generate high-quality text. Are they simply copying text they have seen before, or have they learned generalizable linguistic abstractions? To tease apart these possibilities, we introduce RAVEN, a suite of analyses for assessing the novelty of generated text, focusing on sequential structure (n-grams) and syntactic structure. We apply these analyses to four neural language models (an LSTM, a Transformer, Transformer-XL, and GPT-2). For local structure, e.g., individual dependencies, model-generated text is significantly less novel than our baseline of human-generated text from each model's test set. For larger-scale structure, e.g., overall sentence structure, model-generated text is as novel or even more novel than the human-generated baseline, but models still sometimes copy substantially, in some cases duplicating passages from the training set that are over 1,000 words long. We also perform an extensive manual analysis, showing that GPT-2's novel text is usually well-formed morphologically and syntactically but has fairly frequent semantic issues (e.g., being self-contradictory).
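A minimal sketch of the novelty measure described above, the fraction of generated n-grams absent from the training corpus, under the simplifying assumption of whitespace tokenization; RAVEN's actual analyses are more extensive.

```python
# Fraction of generated n-grams that never occur in the training data.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def novelty(generated_tokens, training_tokens, n=5):
    train = ngrams(training_tokens, n)
    gen = [tuple(generated_tokens[i:i + n])
           for i in range(len(generated_tokens) - n + 1)]
    return sum(g not in train for g in gen) / max(len(gen), 1)

train = "the cat sat on the mat".split()
gen = "the cat sat on a mat".split()
print(novelty(gen, train, n=2))  # 2 of 5 bigrams are novel -> 0.4
```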
Even though machine learning algorithms already play a significant role in data science, many current methods pose unrealistic assumptions on input data. Such methods are hard to apply because of incompatible data formats, or heterogeneous, hierarchical, or entirely missing data fragments in the dataset. As a solution, we propose a versatile, unified framework for sample representation, model definition and training, called HMill. We review in depth the multiple instance learning paradigm that the framework builds on and extends. To theoretically justify the design of the key components of HMill, we show an extension of the universal approximation theorem to the set of all functions realized by models implemented in the framework. The paper also contains a detailed discussion of technicalities and performance improvements in our implementation, which is published for download under the MIT License. The main asset of the framework is its flexibility, which makes modelling of diverse real-world data sources with the same tool possible. In addition to the standard setting in which a set of attributes is observed for each object individually, we explain how message-passing inference in graphs representing whole systems of objects can be implemented in the framework. To support our claims, we solve three different problems from the cybersecurity domain using the framework. The first use case concerns IoT device identification from raw network observations. In the second problem, we study how malicious binary files can be classified using a snapshot of the operating system represented as a directed graph. The last example is the task of domain blacklist extension through modelling interactions between entities in the network. In all three problems, the solution based on the proposed framework achieves performance comparable to specialized approaches.
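The sketch below only illustrates the underlying multiple-instance idea in PyTorch: embed every instance of a bag independently, then aggregate with a permutation-invariant pooling before classifying. It is an analogy under stated assumptions, not HMill's API.

```python
import torch
import torch.nn as nn

class BagModel(nn.Module):
    """Embed each instance, pool the bag, then classify the pooled vector."""
    def __init__(self, in_dim, hidden, n_classes):
        super().__init__()
        self.instance_net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.bag_net = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_classes))

    def forward(self, bag):         # bag: (n_instances, in_dim); n varies per sample
        z = self.instance_net(bag)  # embed every instance independently
        pooled = z.mean(dim=0)      # permutation-invariant aggregation
        return self.bag_net(pooled)

bag = torch.randn(7, 10)            # a bag of 7 instances with 10 features each
logits = BagModel(10, 32, 2)(bag)
```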
What are the units of text that we want to model? From bytes to multi-word expressions, text can be analyzed and generated at many granularities. Until recently, most natural language processing (NLP) models operated over words, treating them as discrete and atomic tokens, but starting with byte-pair encoding (BPE), subword-based approaches have become dominant in many areas, keeping vocabularies small while still allowing fast inference. Is the end of the road character-level modeling or byte-level processing? In this survey, we connect several lines of work by showing and evaluating hybrid approaches that combine word- and character-level models, as well as subword-based approaches built on learned segmentation. We conclude that there is, and likely never will be, a silver-bullet singular solution for all applications, and that thinking seriously about tokenization remains important for many of them.
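To make the subword discussion concrete, here is a toy byte-pair encoding merge loop in the spirit of the original BPE-for-NLP algorithm; production tokenizers add byte fallback, special tokens, and other machinery.

```python
# Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge(pair, words):
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in words.items()}

# Words are space-separated symbol sequences with corpus frequencies.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(4):
    vocab = merge(most_frequent_pair(vocab), vocab)
print(vocab)  # {'low': 5, 'low e r': 2, 'n e w est': 6, 'w i d est': 3}
```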
Generative Adversarial Networks (GANs) were introduced by Goodfellow in 2014, and have since become popular for building generative artificial intelligence models. However, such networks have numerous drawbacks: long training times, sensitivity to hyperparameter tuning, a proliferation of loss and optimization functions, and failure modes such as mode collapse. Current applications of GANs include generating photo-realistic human faces, animals and objects. However, I wanted to explore the artistic ability of GANs in more detail, by using existing models and learning from them. This dissertation covers the basics of neural networks and works its way up to the particular aspects of GANs, together with experimentation on and modification of existing available models, from least complex to most. The intention is to see whether state-of-the-art GANs (specifically StyleGAN2) can generate album cover art and whether it is possible to tailor them by genre. This was attempted by first familiarizing myself with three existing GAN architectures, including the state-of-the-art StyleGAN2. The StyleGAN2 code was used to train a model on a dataset of 80K album cover images, which was then used to style images by picking curated images and mixing their styles.
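For readers unfamiliar with the adversarial objective mentioned above, a generic GAN training step is sketched below in PyTorch; the two toy networks and all hyperparameters are placeholders and bear no relation to StyleGAN2.

```python
import torch
import torch.nn as nn

# Toy generator/discriminator over flat 784-dim vectors; placeholders only.
G = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real):  # real: (batch, 784) images scaled to [-1, 1]
    b = real.size(0)
    fake = G(torch.randn(b, 64))

    # Discriminator: push real toward 1, generated samples toward 0.
    d_loss = (bce(D(real), torch.ones(b, 1))
              + bce(D(fake.detach()), torch.zeros(b, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: fool the discriminator (non-saturating loss).
    g_loss = bce(D(fake), torch.ones(b, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```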
In recent years, deep learning has infiltrated every field it has touched, reducing the need for specialist knowledge and automating the process of knowledge discovery from data. This review argues that astronomy is no different, and that we are currently in the midst of a deep learning revolution that is transforming the way we do astronomy. We trace the history of astronomical connectionism from the early days of multilayer perceptrons, through the second wave of convolutional and recurrent neural networks, to the current third wave of self-supervised and unsupervised deep learning. We then predict that we will soon enter a fourth wave of astronomical connectionism, in which finetuned versions of an all-encompassing 'foundation' model will replace expertly crafted deep learning models. We argue that such a model can only be brought about through a symbiotic relationship between astronomy and connectionism, whereby astronomy provides high quality multimodal data to train the foundation model, and in turn the foundation model is used to advance astronomical research.
It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences is included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
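The first two stages of an attack of this kind can be sketched as follows, assuming the Hugging Face transformers library: sample many sequences from the model, then rank them by the model's own perplexity. The real attack adds reference models, deduplication, and other refinements, and draws vastly more samples.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean next-token negative log-likelihood
    return torch.exp(loss).item()

bos = torch.tensor([[tok.bos_token_id]])
samples = []
for _ in range(100):  # illustrative; the paper samples at far larger scale
    out = model.generate(bos, do_sample=True, top_k=40, max_length=64)
    samples.append(tok.decode(out[0], skip_special_tokens=True))

samples.sort(key=perplexity)  # low perplexity = candidate memorized sequence
print(samples[:5])
```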
This is an introductory machine learning course specifically developed for STEM students. Our goal is to provide the interested reader with the basics needed to employ machine learning in their own projects and to familiarize them with the terminology as a foundation for further reading of the relevant literature. In these lecture notes, we discuss supervised, unsupervised, and reinforcement learning. The notes start with an exposition of machine learning methods without neural networks, such as principal component analysis, t-SNE, and clustering, as well as linear regression and linear classifiers. We continue with an introduction to both basic and advanced neural-network structures such as dense feed-forward and convolutional neural networks, recurrent neural networks, restricted Boltzmann machines, (variational) autoencoders, and generative adversarial networks. Questions of interpretability are discussed for latent-space representations and using the examples of dreaming and adversarial attacks. The final section is dedicated to reinforcement learning, where we introduce the basic notions of value functions and policy learning.
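As a taste of the pre-neural-network material, here is principal component analysis implemented directly with NumPy; a minimal sketch, not the notes' own code.

```python
# PCA via SVD of the centered data matrix.
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                  # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # top-k principal directions
    explained = S[:k] ** 2 / (len(X) - 1)    # variance along each direction
    return Xc @ components.T, components, explained

X = np.random.randn(200, 5) @ np.diag([3, 2, 1, 0.1, 0.1])
Z, comps, var = pca(X, 2)  # project onto the two leading components
```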
This digital book contains a practical and comprehensive introduction to everything related to deep learning in the context of physical simulations. As much as possible, all topics come with hands-on code examples in the form of Jupyter notebooks for a quick start. Beyond standard supervised learning from data, we look at physical loss constraints, more tightly coupled learning algorithms with differentiable simulations, as well as reinforcement learning and uncertainty modeling. We live in exciting times: these methods have a huge potential to fundamentally change what computer simulations can achieve.
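The "physical loss constraints" mentioned above can be made concrete with a small sketch: penalize the residual of a differential equation at sampled points via automatic differentiation. The toy problem below (du/dx = -u with u(0) = 1, whose solution is exp(-x)) and the architecture are illustrative assumptions, not an example from the book.

```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.rand(128, 1, requires_grad=True)  # collocation points in [0, 1]
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    residual = (du + u).pow(2).mean()           # enforce du/dx = -u
    boundary = (net(torch.zeros(1, 1)) - 1.0).pow(2).mean()  # enforce u(0) = 1
    loss = residual + boundary
    opt.zero_grad(); loss.backward(); opt.step()
```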
This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.
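A minimal sketch of consuming an OPUS-MT model through the Hugging Face transformers API follows; the English-to-German checkpoint name is one of the many published Helsinki-NLP/opus-mt-* models.

```python
from transformers import MarianMTModel, MarianTokenizer

name = "Helsinki-NLP/opus-mt-en-de"
tok = MarianTokenizer.from_pretrained(name)
model = MarianMTModel.from_pretrained(name)

batch = tok(["The weather is nice today."], return_tensors="pt", padding=True)
out = model.generate(**batch)
print(tok.batch_decode(out, skip_special_tokens=True))
```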
Recent work pre-training Transformers with self-supervised objectives on large text corpora has shown great success when fine-tuned on downstream NLP tasks including text summarization. However, pre-training objectives tailored for abstractive text summarization have not been explored. Furthermore, there is a lack of systematic evaluation across diverse domains. In this work, we propose pre-training large Transformer-based encoder-decoder models on massive text corpora with a new self-supervised objective. In PEGASUS, important sentences are removed/masked from an input document and are generated together as one output sequence from the remaining sentences, similar to an extractive summary. We evaluated our best PEGASUS model on 12 downstream summarization tasks spanning news, science, stories, instructions, emails, patents, and legislative bills. Experiments demonstrate it achieves state-of-the-art performance on all 12 downstream datasets measured by ROUGE scores. Our model also shows surprising performance on low-resource summarization, surpassing previous state-of-the-art results on 6 datasets with only 1000 examples. Finally we validated our results using human evaluation and show that our model summaries achieve human performance on multiple datasets.
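A rough sketch of the gap-sentence objective follows, scoring each sentence by word overlap with the rest of the document as a crude stand-in for the ROUGE-based selection in the paper; the <mask> placeholder is likewise an assumption.

```python
# PEGASUS-style gap-sentence selection: mask the most "important" sentences and
# use them as the generation target.
def gap_sentence_mask(sentences, n_masked=1):
    def score(i):
        rest = {w for j, s in enumerate(sentences) if j != i for w in s.split()}
        words = sentences[i].split()
        return sum(w in rest for w in words) / max(len(words), 1)
    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    masked = set(ranked[:n_masked])
    inputs = [s if i not in masked else "<mask>" for i, s in enumerate(sentences)]
    target = " ".join(sentences[i] for i in sorted(masked))
    return " ".join(inputs), target

doc = ["Pegasus masks important sentences.",
       "The masked sentences form the target sequence.",
       "Important sentences overlap most with the rest of the document."]
src, tgt = gap_sentence_mask(doc, n_masked=1)
```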
Natural language processing (NLP) has become one of the leading application areas in the current artificial intelligence boom. Transfer learning has enabled large deep learning neural networks trained on the language modeling task to vastly improve performance in virtually all language tasks. Interestingly, when the models are trained with data that includes software code, they demonstrate remarkable abilities in generating functioning computer code from natural language specifications. We argue that this poses a conundrum for claims that neural models offer an alternative theory to generative phrase structure grammar in explaining how language works. Since the syntax of programming languages is determined by phrase structure grammars, successful neural models are apparently uninformative about the theoretical foundations of programming languages, and by extension, of natural languages. We argue that the term language model is misleading because deep learning models are not theoretical models of language, and propose the adoption of corpus model instead, which better reflects the genesis and contents of the model.
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
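The text-to-text casting can be sketched in a few lines; the task prefixes below follow examples from the paper, while the helper function itself is illustrative.

```python
# Every task becomes "prefix: input" -> "output text", so one encoder-decoder
# model with one loss handles them all.
def to_text_to_text(task, **fields):
    if task == "translation":
        return f"translate English to German: {fields['en']}", fields["de"]
    if task == "cola":
        return f"cola sentence: {fields['sentence']}", fields["label"]
    if task == "summarization":
        return f"summarize: {fields['document']}", fields["summary"]
    raise ValueError(f"unknown task: {task}")

src, tgt = to_text_to_text("translation", en="That is good.", de="Das ist gut.")
# -> ("translate English to German: That is good.", "Das ist gut.")
```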
We present a novel approach to generating news headlines in Finnish for a given news story. We model this as a summarization task where a model is given a news article, and its task is to produce a concise headline describing the main topic of the article. Because there are no openly available GPT-2 models for Finnish, we will first build such a model using several corpora. The model is then fine-tuned for the headline generation task using a massive news corpus. The system is evaluated by 3 expert journalists working in a Finnish media house. The results showcase the usability of the presented approach as a headline suggestion tool to facilitate the news production process.
Comments are an important part of source code and a primary source of documentation. This has sparked interest in using large bodies of comments to train or evaluate tools that consume or produce them, such as generating oracles, or even code, from comments, or automatically producing code summaries. Most of this work makes strong assumptions about the structure and quality of comments, such as assuming they consist mostly of proper English sentences. However, we know little about the actual quality of existing comments for these use cases. Comments often contain unique structures and elements that are not seen in other types of text, and filtering or extracting information from them requires extra care. This paper explores the contents and quality of Python comments drawn from the 840 most popular open-source projects on GitHub and 8,422 projects from the SriLab dataset, and the impact of naive versus in-depth filtering on using existing comments to train and evaluate systems that generate comments.
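A sketch of the naive extraction-and-filtering contrast discussed above, using Python's standard tokenize module; the prose heuristics are illustrative guesses, not the paper's filters.

```python
# Collect # comments, then drop obvious non-prose (shebangs, separators,
# commented-out code) with simple heuristics.
import io
import tokenize

def extract_comments(source):
    toks = tokenize.generate_tokens(io.StringIO(source).readline)
    return [t.string.lstrip("# ").rstrip() for t in toks
            if t.type == tokenize.COMMENT]

def looks_like_prose(comment):
    if not comment or comment.startswith("!"):      # shebang remnants
        return False
    if set(comment) <= set("-=#*~ "):               # separator lines
        return False
    if comment.endswith((":", ")", ";")):           # likely commented-out code
        return False
    return True

code = "#!/usr/bin/env python\n# Compute totals per user.\n# print(x)\nx = 1\n"
print([c for c in extract_comments(code) if looks_like_prose(c)])
# -> ['Compute totals per user.']
```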