智能论文笔记

A Survey of Task-Based Machine Learning Content Extraction Services for VIDINT

Joshua Brunk , Nathan Jermann , Ryan Sharp , Carl D. Hoover

分类：计算机视觉

2022-07-09

本文提供了当前视频内容提取工具的比较，重点是比较基于任务的机器学习服务。在过去十年中，视频智能（VIDINT）数据已成为关键情报来源。基于AI的分析和自动化工具从视频中提取和构造内容的需求已迅速成为需要大规模搜索，分析和利用视频的组织的优先事项。随着机器学习技术的快速增长，机器转录，机器翻译，主题标签和对象识别任务的成熟度以指数级的速度提高，随着新应用程序的发展，速度和准确性的性能记录破坏了。本文的每个部分审查并根据与机器学习技术从视频中提取信息相关的任务进行了比较产品，软件资源和视频分析功能。

translated by 谷歌翻译

Democratizing Machine Translation with OPUS-MT

Jörg Tiedemann , Mikko Aulamo , Daria Bakshandaeva , Michele Boggia , Stig-Arne Grönroos , Tommi Nieminen , Alessandro Raganato , Yves Scherrer , Raul Vazquez , Sami Virpioja

分类：自然语言处理

2022-12-04

This paper presents the OPUS ecosystem with a focus on the development of open machine translation models and tools, and their integration into end-user applications, development platforms and professional workflows. We discuss our on-going mission of increasing language coverage and translation quality, and also describe on-going work on the development of modular translation models and speed-optimized compact solutions for real-time translation on regular desktops and small devices.

translated by 谷歌翻译

Urdu Speech and Text Based Sentiment Analyzer

Waqar Ahmad , Maryam Edalati

分类：自然语言处理

2022-07-19

发现别人认为是我们信息收集策略的关键方面。现在，人们可以积极利用信息技术来寻找和理解他人的想法，这要归功于越来越多的意见资源（例如在线评论网站和个人博客）的越来越多。由于其在理解人们的意见方面的关键功能，因此情感分析（SA）是一项至关重要的任务。另一方面，现有的研究主要集中在英语上，只有少量研究专门研究低资源语言。对于情感分析，这项工作根据用户评估提供了一个新的多级乌尔都语数据集。高音扬声器网站用于获取乌尔都语数据集。我们提出的数据集包括10,000项评论，这些评论已被人类专家精心归类为两类：正面，负面。这项研究的主要目的是构建一个手动注释的数据集进行乌尔都语情绪分析，并确定基线结果。采用了五种不同的词典和规则的算法，包括NaiveBayes，Stanza，TextBlob，Vader和Flair，实验结果表明，其精度为70％的天赋优于其他经过测试的算法。

translated by 谷歌翻译

A Survey of Plagiarism Detection Systems: Case of Use with English, French and Arabic Languages

Mehdi Abdelhamid , Faical Azouaou , Sofiane Batata

分类：自然语言处理

2022-01-10

在学术界，抄袭肯定不是一个新兴的关注，但它随着互联网的普及和对全球内容来源的易于访问而变得更大的程度，使人类干预不足。尽管如此，由于计算机辅助抄袭检测，抄袭远远远非是一个未被解除的问题，目前是一个有效的研究领域，该研究落在信息检索（IR）和自然语言处理（NLP）领域。许多软件解决方案有助于满足这项任务，本文概述了用于阿拉伯语，法国和英语学术和教育环境的抄袭检测系统。比较在八个系统之间持有，并在检测不同来源的三个混淆水平的特征，可用性，技术方面以及它们的性能之间进行：逐字，释义和跨语言抄袭。在本研究的背景下也进行了对技术形式的抄袭技术形式的关注检查。此外，还提供了对不同作者提出的抄袭类型和分类的调查。

translated by 谷歌翻译

Proceedings of the 2nd International Workshop on Reading Music Systems

Jorge Calvo-Zaragoza , Alexander Pacha

分类：计算机视觉 | 机器学习

2022-12-01

The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 2nd International Workshop on Reading Music Systems, held in Delft on the 2nd of November 2019.

translated by 谷歌翻译

ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications

Juan Zuluaga-Gomez , Karel Veselý , Igor Szöke , Petr Motlicek , Martin Kocour , Mickael Rigault , Khalid Choukri , Amrutha Prasad , Seyyed Saeed Sarfjoo , Iuliia Nigmatulina

分类：自然语言处理 | 人工智能

2022-11-08

Personal assistants, automatic speech recognizers and dialogue understanding systems are becoming more critical in our interconnected digital world. A clear example is air traffic control (ATC) communications. ATC aims at guiding aircraft and controlling the airspace in a safe and optimal manner. These voice-based dialogues are carried between an air traffic controller (ATCO) and pilots via very-high frequency radio channels. In order to incorporate these novel technologies into ATC (low-resource domain), large-scale annotated datasets are required to develop the data-driven AI systems. Two examples are automatic speech recognition (ASR) and natural language understanding (NLU). In this paper, we introduce the ATCO2 corpus, a dataset that aims at fostering research on the challenging ATC field, which has lagged behind due to lack of annotated data. The ATCO2 corpus covers 1) data collection and pre-processing, 2) pseudo-annotations of speech data, and 3) extraction of ATC-related named entities. The ATCO2 corpus is split into three subsets. 1) ATCO2-test-set corpus contains 4 hours of ATC speech with manual transcripts and a subset with gold annotations for named-entity recognition (callsign, command, value). 2) The ATCO2-PL-set corpus consists of 5281 hours of unlabeled ATC data enriched with automatic transcripts from an in-domain speech recognizer, contextual information, speaker turn information, signal-to-noise ratio estimate and English language detection score per sample. Both available for purchase through ELDA at http://catalog.elra.info/en-us/repository/browse/ELRA-S0484. 3) The ATCO2-test-set-1h corpus is a one-hour subset from the original test set corpus, that we are offering for free at https://www.atco2.org/data. We expect the ATCO2 corpus will foster research on robust ASR and NLU not only in the field of ATC communications but also in the general research community.

translated by 谷歌翻译

JEMMA: An Extensible Java Dataset for ML4Code Applications

Anjan Karmakar , Miltiadis Allamanis , Romain Robbes

分类：机器学习

2022-12-18

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code's richly structured information. With this in mind, we introduce JEMMA, an Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50KC dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible allowing users to add new properties and representations to the dataset, and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.

translated by 谷歌翻译

The People's Speech: A Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage

Daniel Galvez , Greg Diamos , Juan Ciro , Juan Felipe Cerón , Keith Achorn , Anjali Gopi , David Kanter , Maximilian Lam , Mark Mazumder , Vijay Janapa Reddi

分类：机器学习 | (统计)机器学习

2021-11-17

人民的言论是自由下载的30,000小时，并在CC-BY-SA下进行学术和商业用途的许可的受监管的会话英语语音识别数据集（具有CC-by子集）。通过使用现有转录搜索适当许可的音频数据来通过搜索互联网来收集数据。我们描述了我们的数据收集方法，并在Apache 2.0许可证下发布了我们的数据收集系统。我们表明，在此数据集上培训的模型在Librispeech的测试清洁测试集上实现了9.98％的单词错误率。最后，我们讨论了围绕创建一个相当大量的机器学习的法律和道德问题，并计划继续维护项目的计划根据MLCommons的赞助。

translated by 谷歌翻译

Cross-language Information Retrieval

Petra Galuščáková , Douglas W. Oard , Suraj Nair

分类：人工智能 | 自然语言处理

2021-11-10

两个关键假设塑造了排名检索的通常视图：（1）搜索者可以为他们希望看到的文档中的疑问选择单词，并且（2）排名检索的文档就足以，因为搜索者将足够就足够了能够认识到他们希望找到的那些。当要搜索的文档处于搜索者未知的语言时，既不是真的。在这种情况下，需要跨语言信息检索（CLIR）。本章审查了艺术技术的交流信息检索，并概述了一些开放的研究问题。

translated by 谷歌翻译

Robust Speech Recognition via Large-Scale Weak Supervision

Alec Radford , Jong Wook Kim , Tao Xu , Greg Brockman , Christine McLeavey , Ilya Sutskever

分类：自然语言处理 | 机器学习

2022-12-06

We study the capabilities of speech processing systems trained simply to predict large amounts of transcripts of audio on the internet. When scaled to 680,000 hours of multilingual and multitask supervision, the resulting models generalize well to standard benchmarks and are often competitive with prior fully supervised results but in a zero-shot transfer setting without the need for any fine-tuning. When compared to humans, the models approach their accuracy and robustness. We are releasing models and inference code to serve as a foundation for further work on robust speech processing.

translated by 谷歌翻译

The Harvard USPTO Patent Dataset: A Large-Scale, Well-Structured, and Multi-Purpose Corpus of Patent Applications

Mirac Suzgun , Luke Melas-Kyriazi , Suproteem K. Sarkar , Scott Duke Kominers , Stuart M. Shieber

分类：自然语言处理 | 机器学习

2022-07-08

创新是经济和社会发展的主要驱动力，有关多种创新的信息嵌入了专利和专利申请的半结构化数据中。尽管在专利数据中表达的创新的影响和新颖性很难通过传统手段来衡量，但ML提供了一套有希望的技术来评估新颖性，汇总贡献和嵌入语义。在本文中，我们介绍了Harvard USPTO专利数据集（HUPD），该数据集是2004年至2004年之间提交给美国专利商业办公室（USPTO）的大型，结构化和多用途的英语专利专利申请。 2018年。HUPD拥有超过450万张专利文件，是可比的Coldia的两到三倍。与以前在NLP中提出的专利数据集不同，HUPD包含了专利申请的发明人提交的版本（不是授予专利的最终版本），其中允许我们在第一次使用NLP方法进行申请时研究专利性。它在包含丰富的结构化元数据以及专利申请文本的同时也很新颖：通过提供每个应用程序的元数据及其所有文本字段，数据集使研究人员能够执行一组新的NLP任务，以利用结构性协变量的变异。作为有关HUPD的研究类型的案例研究，我们向NLP社区（即专利决策的二元分类）介绍了一项新任务。我们还显示数据集中提供的结构化元数据使我们能够对此任务进行概念转移的明确研究。最后，我们演示了如何将HUPD用于三个其他任务：专利主题领域的多类分类，语言建模和摘要。

translated by 谷歌翻译

Deep Learning-Driven Edge Video Analytics: A Survey

Renjie Xu , Saiedeh Razavi , Rong Zheng

分类：计算机视觉 | 机器学习

2022-11-28

Video, as a key driver in the global explosion of digital information, can create tremendous benefits for human society. Governments and enterprises are deploying innumerable cameras for a variety of applications, e.g., law enforcement, emergency management, traffic control, and security surveillance, all facilitated by video analytics (VA). This trend is spurred by the rapid advancement of deep learning (DL), which enables more precise models for object classification, detection, and tracking. Meanwhile, with the proliferation of Internet-connected devices, massive amounts of data are generated daily, overwhelming the cloud. Edge computing, an emerging paradigm that moves workloads and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new intersection, edge video analytics (EVA), begins to attract widespread attention. Nevertheless, only a few loosely-related surveys exist on this topic. A dedicated venue for collecting and summarizing the latest advances of EVA is highly desired by the community. Besides, the basic concepts of EVA (e.g., definition, architectures, etc.) are ambiguous and neglected by these surveys due to the rapid development of this domain. A thorough clarification is needed to facilitate a consensus on these concepts. To fill in these gaps, we conduct a comprehensive survey of the recent efforts on EVA. In this paper, we first review the fundamentals of edge computing, followed by an overview of VA. The EVA system and its enabling techniques are discussed next. In addition, we introduce prevalent frameworks and datasets to aid future researchers in the development of EVA systems. Finally, we discuss existing challenges and foresee future research directions. We believe this survey will help readers comprehend the relationship between VA and edge computing, and spark new ideas on EVA.

translated by 谷歌翻译

Machine Learning Sensors

Pete Warden , Matthew Stewart , Brian Plancher , Colby Banbury , Shvetank Prakash , Emma Chen , Zain Asgar , Sachin Katti , Vijay Janapa Reddi

分类：机器学习

2022-06-07

机器学习传感器代表了嵌入式机器学习应用程序未来的范式转移。当前的嵌入式机器学习（ML）实例化遭受了复杂的整合，缺乏模块化以及数据流动的隐私和安全问题。本文提出了一个以数据为中心的范式，用于将传感器智能嵌入边缘设备上，以应对这些挑战。我们对“传感器2.0”的愿景需要将传感器输入数据和ML处理从硬件级别隔离到更广泛的系统，并提供一个薄的界面，以模拟传统传感器的功能。这种分离导致模块化且易于使用的ML传感器设备。我们讨论了将ML处理构建到嵌入式系统上控制微处理器的软件堆栈中的标准方法所带来的挑战，以及ML传感器的模块化如何减轻这些问题。 ML传感器提高了隐私和准确性，同时使系统构建者更容易将ML集成到其产品中，以简单的组件。我们提供了预期的ML传感器和说明性数据表的例子，以表现出来，并希望这将建立对话使我们朝着传感器2.0迈进。

translated by 谷歌翻译

Intent Recognition in Conversational Recommender Systems

Sahar Moradizeyveh

分类：自然语言处理 | 机器学习

2022-12-06

Any organization needs to improve their products, services, and processes. In this context, engaging with customers and understanding their journey is essential. Organizations have leveraged various techniques and technologies to support customer engagement, from call centres to chatbots and virtual agents. Recently, these systems have used Machine Learning (ML) and Natural Language Processing (NLP) to analyze large volumes of customer feedback and engagement data. The goal is to understand customers in context and provide meaningful answers across various channels. Despite multiple advances in Conversational Artificial Intelligence (AI) and Recommender Systems (RS), it is still challenging to understand the intent behind customer questions during the customer journey. To address this challenge, in this paper, we study and analyze the recent work in Conversational Recommender Systems (CRS) in general and, more specifically, in chatbot-based CRS. We introduce a pipeline to contextualize the input utterances in conversations. We then take the next step towards leveraging reverse feature engineering to link the contextualized input and learning model to support intent recognition. Since performance evaluation is achieved based on different ML models, we use transformer base models to evaluate the proposed approach using a labelled dialogue dataset (MSDialogue) of question-answering interactions between information seekers and answer providers.

translated by 谷歌翻译

Proceedings of the 3rd International Workshop on Reading Music Systems

Jorge Calvo-Zaragoza , Alexander Pacha

分类：计算机视觉 | 机器学习

2022-12-01

The International Workshop on Reading Music Systems (WoRMS) is a workshop that tries to connect researchers who develop systems for reading music, such as in the field of Optical Music Recognition, with other researchers and practitioners that could benefit from such systems, like librarians or musicologists. The relevant topics of interest for the workshop include, but are not limited to: Music reading systems; Optical music recognition; Datasets and performance evaluation; Image processing on music scores; Writer identification; Authoring, editing, storing and presentation systems for music scores; Multi-modal systems; Novel input-methods for music to produce written music; Web-based Music Information Retrieval services; Applications and projects; Use-cases related to written music. These are the proceedings of the 3rd International Workshop on Reading Music Systems, held in Alicante on the 23rd of July 2021.

translated by 谷歌翻译

Adapting the Tesseract Open-Source OCR Engine for Tamil and Sinhala Legacy Fonts and Creating a Parallel Corpus for Tamil-Sinhala-English

Charangan Vasantharajan , Laksika Tharmalingam , Uthayasanker Thayasivam

分类：自然语言处理

2021-09-13

Most low-resource languages do not have the necessary resources to create even a substantial monolingual corpus. These languages may often be found in government proceedings but mainly in Portable Document Format (PDF) that contains legacy fonts. Extracting text from these documents to create a monolingual corpus is challenging due to legacy font usage and printer-friendly encoding, which are not optimized for text extraction. Therefore, we propose a simple, automatic, and novel idea that can scale for Tamil, Sinhala, English languages, and many documents along with parallel corpora. Since Tamil and Sinhala are Low-Resource Languages, we improved the performance of Tesseract by employing LSTM-based training on more than 20 legacy fonts to recognize printed characters in these languages. Especially, our model detects code-mixed text, numbers, and special characters from the printed document. It is shown that this approach can reduce the character-level error rate of Tesseract from 6.03 to 2.61 for Tamil (-3.42% relative change) and 7.61 to 4.74 for Sinhala (-2.87% relative change), as well as the word-level error rate from 39.68 to 20.61 for Tamil (-19.07% relative change) and 35.04 to 26.58 for Sinhala (-8.46% relative change) on the test set. Also, our newly created parallel corpus consists of 185.4k, 168.9k, and 181.04k sentences and 2.11M, 2.22M, and 2.33M Words in Tamil, Sinhala, and English respectively. This study shows that fine-tuning Tesseract models on multiple new fonts help to understand the texts and enhances the performance of the OCR. We made newly trained models and the source code for fine-tuning Tesseract, freely available.

translated by 谷歌翻译

Survey of NLP in Pharmacology: Methodology, Tasks, Resources, Knowledge, and Tools

Dimitar Trajanov , Vangel Trajkovski , Makedonka Dimitrieva , Jovana Dobreva , Milos Jovanovik , Matej Klemen , Aleš Žagar , Marko Robnik-Šikonja

分类：自然语言处理 | 机器学习

2022-08-22

自然语言处理（NLP）是一个人工智能领域，它应用信息技术来处理人类语言，在一定程度上理解并在各种应用中使用它。在过去的几年中，该领域已经迅速发展，现在采用了深层神经网络的现代变体来从大型文本语料库中提取相关模式。这项工作的主要目的是调查NLP在药理学领域的最新使用。正如我们的工作所表明的那样，NLP是药理学高度相关的信息提取和处理方法。它已被广泛使用，从智能搜索到成千上万的医疗文件到在社交媒体中找到对抗性药物相互作用的痕迹。我们将覆盖范围分为五个类别，以调查现代NLP方法论，常见的任务，相关的文本数据，知识库和有用的编程库。我们将这五个类别分为适当的子类别，描述其主要属性和想法，并以表格形式进行总结。最终的调查介绍了该领域的全面概述，对从业者和感兴趣的观察者有用。

translated by 谷歌翻译

Survey of Generative Methods for Social Media Analysis

Stan Matwin , Aristides Milios , Paweł Prałat , Amilcar Soares , François Théberge

分类：机器学习

2021-12-13

本次调查绘制了用于分析社交媒体数据的生成方法的研究状态的广泛的全景照片（Sota）。它填补了空白，因为现有的调查文章在其范围内或被约会。我们包括两个重要方面，目前正在挖掘和建模社交媒体的重要性：动态和网络。社会动态对于了解影响影响或疾病的传播，友谊的形成，友谊的形成等，另一方面，可以捕获各种复杂关系，提供额外的洞察力和识别否则将不会被注意的重要模式。

translated by 谷歌翻译

AI in HCI Design and User Experience

Wei Xu

分类：人工智能

2023-01-03

In this chapter, we review and discuss the transformation of AI technology in HCI/UX work and assess how AI technology will change how we do the work. We first discuss how AI can be used to enhance the result of user research and design evaluation. We then discuss how AI technology can be used to enhance HCI/UX design. Finally, we discuss how AI-enabled capabilities can improve UX when users interact with computing systems, applications, and services.

translated by 谷歌翻译

A Survey of Human-in-the-loop for Machine Learning

Xingjiao Wu , Luwei Xiao , Yixuan Sun , Junhang Zhang , Tianlong Ma , Liang He

分类：机器学习

2021-08-02

通过整合人类的知识和经验，人在循环旨在以最低成本培训准确的预测模型。人类可以为机器学习应用提供培训数据，并直接完成在基于机器的方法中对管道中计算机中的难以实现的任务。在本文中，我们从数据的角度调查了人类循环的现有工作，并将它们分为三类具有渐进关系：（1）从数据处理中提高模型性能的工作，（2）通过介入模型培训提高模型性能，（3）系统的设计独立于循环的设计。使用上述分类，我们总结了该领域的主要方法;随着他们的技术优势/弱点以及自然语言处理，计算机愿景等的简单分类和讨论。此外，我们提供了一些开放的挑战和机遇。本调查打算为人类循环提供高级别的摘要，并激励有兴趣的读者，以考虑设计有效的循环解决方案的方法。

translated by 谷歌翻译