Smart Paper Notes

AI Ethics on Blockchain: Topic Analysis on Twitter Data for Blockchain Security

Yihang Fu , Zesen Zhuang , Luyao Zhang

Categories: Artificial Intelligence | Machine Learning

2022-12-14

Blockchain has empowered computer systems to be more secure using a distributed network. However, the current blockchain design suffers from fairness issues in transaction ordering. Miners are able to reorder transactions to generate profits, the so-called miner extractable value (MEV). Existing research recognizes MEV as a severe security issue and proposes potential solutions, including prominent Flashbots. However, previous studies have mostly analyzed blockchain data, which might not capture the impacts of MEV in a much broader AI society. Thus, in this research, we applied natural language processing (NLP) methods to comprehensively analyze topics in tweets on MEV. We collected more than 20000 tweets with #MEV and #Flashbots hashtags and analyzed their topics. Our results show that the tweets discussed profound topics of ethical concern, including security, equity, emotional sentiments, and the desire for solutions to MEV. We also identify the co-movements of MEV activities on blockchain and social media platforms. Our study contributes to the literature at the interface of blockchain security, MEV solutions, and AI ethics.

Translated by Google Translate

What are People Talking about in #BlackLivesMatter and #StopAsianHate? Exploring and Categorizing Twitter Topics Emerging in Online Social Movements through the Latent Dirichlet Allocation Model

Xin Tong , Yixuan Li , Jiayi Li , Rongqi Bei , Luyao Zhang

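Categories: Natural Language Processing | Machine Learning

2022-05-29

Minority groups have been using social media to organize social movements that create profound societal impact. Black Lives Matter (BLM) and Stop Asian Hate (SAH) are two successful social movements that spread on Twitter, promoting protests and activities against racism and raising public awareness of other social challenges that minority groups face. However, previous studies have mostly conducted qualitative analyses of tweets or interviews with users, which may not comprehensively and validly represent all tweets. Few studies have explored the Twitter topics within BLM and SAH conversations with a rigorous, quantified, and data-centered approach. Therefore, in this research, we adopted a mixed-methods approach to comprehensively analyze BLM and SAH Twitter topics. We implemented (1) the latent Dirichlet allocation model to understand the top high-level words and topics and (2) open-coding analysis to identify specific themes across the tweets. We collected more than one million tweets with the #BlackLivesMatter and #StopAsianHate hashtags and compared their topics. Our findings show that the tweets discussed a variety of influential topics in depth. Social justice, social movements, and emotional sentiments were common topics in both movements, although each movement had unique subtopics. Our study contributes to the topic analysis of social movements on social media platforms in particular, and to the literature on the interplay of AI, ethics, and society in general.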
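
The latent Dirichlet allocation step named in this abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: it is a small collapsed Gibbs sampler written with only the Python standard library, and the toy tweets, the function names `lda_gibbs` and `top_words`, and the hyperparameter values `alpha`, `beta`, and `iterations` are all assumptions made for illustration.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy LDA via collapsed Gibbs sampling over tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Return the n most frequent words assigned to each topic."""
    return [[w for w, _ in counts.most_common(n)] for counts in topic_word]


# Hypothetical, already-tokenized tweets (not data from the paper).
tweets = [
    "justice police protest justice".split(),
    "protest march justice police".split(),
    "love hope community support".split(),
    "support community hope love".split(),
]
doc_topic, topic_word = lda_gibbs(tweets, num_topics=2)
for topic_id, words in enumerate(top_words(topic_word)):
    print(topic_id, words)
```

On a corpus of the size described above, a tested library implementation would normally be preferred over a hand-written sampler. The sketch only shows how the "top words per topic" output of an LDA model arises from token-level topic counts.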

Twitter Corpus of the #BlackLivesMatter Movement And Counter Protests: 2013 to 2021

Salvatore Giorgi , Sharath Chandra Guntuku , McKenzie Himelein-Wachowiak , Amy Kwarteng , Sy Hwang , Muhammad Rahman , Brenda Curtis

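Categories: Natural Language Processing

2020-09-01

Black Lives Matter (BLM) is a decentralized social movement protesting violence against Black individuals and communities, with a focus on police brutality. The movement gained significant attention following the killings of Ahmaud Arbery, Breonna Taylor, and George Floyd in 2020. The #BlackLivesMatter social media hashtag has come to represent the grassroots movement, and similar hashtags, such as #AllLivesMatter and #BlueLivesMatter, are used to counter-protest it. We introduce a dataset of 63.9 million tweets from 13.0 million users in more than 100 countries, each containing one of the following keywords: BlackLivesMatter, AllLivesMatter, and BlueLivesMatter. The dataset contains all currently available tweets from the beginning of the BLM movement in 2013 to 2021. We summarize the dataset and show temporal trends in the use of both the BlackLivesMatter keyword and the keywords associated with counter-movements. Additionally, for each keyword, we create and release a set of latent Dirichlet allocation (LDA) topics (i.e., automatically clustered groups of semantically co-occurring words) to help researchers identify linguistic patterns across the three keywords.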

Twitter conversations predict the daily confirmed COVID-19 cases

Rabindra Lamsal , Aaron Harwood , Maria Rodriguez Read

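Categories: Natural Language Processing

2022-06-21

At the time of writing, COVID-19 (coronavirus disease 2019) has spread to more than 220 countries and territories. Following the outbreak, the seriousness of the pandemic has made people more active on social media, especially on microblogging platforms such as Twitter and Weibo. Pandemic-specific discourse has now been ongoing on these platforms for months. Previous studies have confirmed the contribution of such socially generated conversations to situational awareness of crisis events. Early forecasts of cases are essential for authorities to estimate the resources needed to cope with the growth of the virus. This study therefore attempts to incorporate public discourse into the design of forecasting models, targeting in particular the steep-hill region of an ongoing wave. We propose a sentiment-involved, topic-based methodology for designing multiple time series from publicly available COVID-19-related Twitter conversations. As a use case, we apply the proposed methodology to daily COVID-19 cases in Australia and to Twitter conversations generated within the country. The experimental results (i) show the presence of latent social media variables that Granger-cause the daily confirmed COVID-19 cases, and (ii) confirm that those variables offer additional predictive capability to forecasting models. Further, the results show that including social media variables in the modeling yields a 48.83% to 51.38% improvement in RMSE over a baseline model. We also release to the public MegaGeoCOV, a large-scale COVID-19-specific dataset of geotagged global tweets, anticipating that geotagged data at this scale will help in understanding the conversational dynamics of the pandemic through other spatial and temporal contexts.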
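
To make the reported RMSE improvement concrete, the sketch below shows how a relative improvement over a baseline is usually computed. It is an illustration only: the abstract does not spell out the exact formula, so the definition used here (the reduction in RMSE divided by the baseline RMSE) is an assumption, and the case counts and predictions are invented toy numbers, not data from the study.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def rmse_improvement_pct(baseline_rmse, model_rmse):
    """Relative RMSE reduction of a model versus a baseline, in percent."""
    return 100.0 * (baseline_rmse - model_rmse) / baseline_rmse


# Hypothetical daily case counts and two sets of forecasts.
daily_cases = [10, 12, 15, 20]
baseline_forecast = [8, 8, 11, 16]        # e.g. a model without Twitter variables
social_media_forecast = [9, 10, 13, 18]   # e.g. a model with Twitter variables

baseline_error = rmse(daily_cases, baseline_forecast)
model_error = rmse(daily_cases, social_media_forecast)
improvement = rmse_improvement_pct(baseline_error, model_error)
print(f"RMSE improvement: {improvement:.2f}%")
```

The toy forecasts are chosen so that the second model halves the baseline error. An improvement in the range reported above would mean that the social media variables remove roughly half of the baseline's RMSE.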

Vaccine Discourse on Twitter During the COVID-19 Pandemic

Gabriel Lindelöf , Talayeh Aledavood , Barbara Keller

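Categories: Natural Language Processing

2022-07-23

Since the onset of the COVID-19 pandemic, vaccines have been an important topic in public discourse. The discussion around vaccines is polarized: some see them as an important measure to end the pandemic, while others are hesitant or find them harmful. This study investigates posts related to COVID-19 vaccines on Twitter and focuses on those that take a negative stance toward vaccines. A dataset of 16,713,238 English tweets related to COVID-19 vaccines was collected, covering the period from March 1, 2020, to July 31, 2021. We used the Scikit-learn Python library to apply a support vector machine (SVM) classifier to identify tweets with a negative stance toward COVID-19 vaccines. A total of 5,163 tweets were used to train the classifier, of which 2,484 were manually annotated by us and made publicly available. We used the BERTopic model to extract and investigate the topics discussed in the negative tweets and how they changed over time. We show that negativity toward COVID-19 vaccines decreased over time as the vaccines were rolled out. We identify 37 topics of discussion and present their respective importance over time. We show that popular topics include conspiratorial discussions, such as 5G towers and microchips, but also concerns about vaccination safety and side effects, as well as concerns about policies. Our study shows that even unpopular opinions or conspiracy theories can become widespread when paired with a widely popular discussion topic such as COVID-19 vaccines. Understanding the concerns and the discussed topics, and how they change over time, is essential for policymakers and public health authorities to provide better and more timely information and policies, and to facilitate vaccination of the population in similar future crises.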

COVID-19 Twitter Dataset with Latent Topics, Sentiments and Emotions Attributes

Raj Kumar Gupta , Ajay Vishwanath , Yinping Yang

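Categories: Natural Language Processing

2020-07-14

This paper describes a large global dataset on people's discourse about, and responses to, the COVID-19 pandemic on the Twitter platform. From 28 January 2020 to 1 June 2022, we collected and processed Twitter posts from more than 29 million unique users using four keywords: "Corona", "Wuhan", "NCOV", and "COVID". Leveraging probabilistic topic modeling and pre-trained machine learning-based emotion recognition algorithms, we labeled each tweet with seventeen attributes, including (a) ten binary attributes indicating the tweet's relevance (1) or irrelevance (0) to each of the top ten detected topics, (b) five quantitative emotion attributes indicating the intensity of the valence or sentiment (from 0: extremely negative to 1: extremely positive) and the intensity of fear, anger, sadness, and happiness (from 0: not at all to 1: extremely intense), and (c) two categorical attributes indicating the sentiment (very negative, negative, neutral or mixed, positive, very positive) and the dominant emotion (fear, anger, sadness, happiness, no specific emotion) that the tweet mainly expresses. We discuss the technical validity and report descriptive statistics of these attributes, their temporal distribution, and their geographic representation. The paper concludes with a discussion of the dataset's uses in communication, psychology, public health, economics, and epidemiology.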

"I think this is the most disruptive technology": Exploring Sentiments of ChatGPT Early Adopters using Twitter Data

Mubin Ul Haque , Isuru Dharmadasa , Zarrin Tasnim Sworna , Roshan Namal Rajapakse , Hussain Ahmad

Categories: Natural Language Processing

2022-12-12

Large language models have recently attracted significant attention due to their impressive performance on a variety of tasks. ChatGPT developed by OpenAI is one such implementation of a large, pre-trained language model that has gained immense popularity among early adopters, where certain users go to the extent of characterizing it as a disruptive technology in many domains. Understanding such early adopters' sentiments is important because it can provide insights into the potential success or failure of the technology, as well as its strengths and weaknesses. In this paper, we conduct a mixed-method study using 10,732 tweets from early ChatGPT users. We first use topic modelling to identify the main topics and then perform an in-depth qualitative sentiment analysis of each topic. Our results show that the majority of the early adopters have expressed overwhelmingly positive sentiments related to topics such as Disruptions to software development, Entertainment and exercising creativity. Only a limited percentage of users expressed concerns about issues such as the potential for misuse of ChatGPT, especially regarding topics such as Impact on educational aspects. We discuss these findings by providing specific examples for each topic and then detail implications related to addressing these concerns for both researchers and users.

Perspectives of Non-Expert Users on Cyber Security and Privacy: An Analysis of Online Discussions on Twitter

Nandita Pattnaik , Shujun Li , Jason R. C. Nurse

Categories: Machine Learning

2022-06-05

Current research on users' perspectives of cyber security and privacy related to traditional and smart devices at home is very active, but the focus is often more on specific modern devices such as mobile and smart IoT devices in a home context. In addition, most were based on smaller-scale empirical studies such as online surveys and interviews. We endeavour to fill these research gaps by conducting a larger-scale study based on a real-world dataset of 413,985 tweets posted by non-expert users on Twitter in six months of three consecutive years (January and February in 2019, 2020 and 2021). Two machine learning-based classifiers were developed to identify the 413,985 tweets. We analysed this dataset to understand non-expert users' cyber security and privacy perspectives, including the yearly trend and the impact of the COVID-19 pandemic. We applied topic modelling, sentiment analysis and qualitative analysis of selected tweets in the dataset, leading to various interesting findings. For instance, we observed a 54% increase in non-expert users' tweets on cyber security and/or privacy related topics in 2021, compared to before the start of global COVID-19 lockdowns (January 2019 to February 2020). We also observed an increased level of help-seeking tweets during the COVID-19 pandemic. Our analysis revealed a diverse range of topics discussed by non-expert users across the three years, including VPNs, Wi-Fi, smartphones, laptops, smart home devices, financial security, and security and privacy issues involving different stakeholders. Overall negative sentiment was observed across almost all topics non-expert users discussed on Twitter in all the three years. Our results confirm the multi-faceted nature of non-expert users' perspectives on cyber security and privacy and call for more holistic, comprehensive and nuanced research on different facets of such perspectives.

OSN Dashboard Tool For Sentiment Analysis

Andreas Kilde Lien , Lars Martin Randem , Hans Petter Fauchald Taralrud , Maryam Edalati

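Categories: Natural Language Processing

2022-06-14

The amount of opinionated data on the internet is increasing rapidly. More and more people share their ideas and opinions in reviews, discussion forums, microblogs, and social media in general. Because opinions are central to all human activities, sentiment analysis has been applied to gain insights into this kind of data. Several approaches to sentiment classification have been proposed. Their major drawback is the lack of standardized solutions for classification and high-level visualization. In this study, a sentiment analyzer dashboard for online social network analysis is proposed, so that people can gain insights into topics that interest them. The tool allows users to run the desired sentiment analysis algorithm in the dashboard. In addition to providing several types of visualization, the dashboard makes the raw results of the sentiment classification available for download and further analysis.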

AI-based Monitoring and Response System for Hospital Preparedness towards COVID-19 in Southeast Asia

Tushar Goswamy , Naishadh Parmar , Ayush Gupta , Raunak Shah , Vatsalya Tandon , Varun Goyal , Sanyog Gupta , Karishma Laud , Shivam Gupta , Sudhanshu Mishra

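Categories: Natural Language Processing | Machine Learning

2020-07-30

This research paper proposes a COVID-19 monitoring and response system that identifies surges in the number of hospital patients and shortages of critical equipment, such as ventilators, in Southeast Asian countries, in order to understand the burden on health facilities. This can help authorities in these regions with resource planning, so that resources are redirected to the areas identified by the model. Because publicly available data are lacking on the influx of hospital patients, or on the shortages of equipment, ICU units, or hospital beds that these countries might be facing, we leverage Twitter data to gather this information. The approach has given accurate results for the states of India, and we are working to validate the model for the remaining countries so that it can serve as a reliable tool for authorities to monitor the burden on hospitals.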

Twitter Topic Classification

Dimosthenis Antypas , Asahi Ushio , Jose Camacho-Collados , Leonardo Neves , Vítor Silva , Francesco Barbieri

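Categories: Natural Language Processing

2022-09-20

Social media platforms host discussions about a wide variety of topics that arise every day. Making sense of all the content and organizing it into categories is an arduous task. A common way to deal with this issue is to rely on topic modeling, but topics discovered with this technique are difficult to interpret and can differ from corpus to corpus. In this paper, we present a new task based on tweet topic classification and release two associated datasets. Given a wide range of topics covering the most important discussion points in social media, we provide training and testing data from recent time periods that can be used to evaluate tweet classification models. Moreover, we perform a quantitative evaluation and analysis of current general-purpose and domain-specific language models on the task, which provides more insight into the challenges and nature of the task.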

A Python Library for Exploratory Data Analysis on Twitter Data based on Tokens and Aggregated Origin-Destination Information

Mario Graff , Daniela Moctezuma , Sabino Miranda-Jiménez , Eric S. Tellez

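Categories: Natural Language Processing

2020-09-03

Twitter is perhaps the social media platform most amenable to research. Obtaining information from it takes only a few steps, and plenty of libraries can help in this regard. Nonetheless, knowing whether a particular event is expressed on Twitter is a challenging task that requires a considerable collection of tweets. This proposal aims to make it easier for interested researchers to mine events on Twitter by opening a collection of processed information taken from Twitter since December 2015. The events could be related to natural disasters, health issues, and people's mobility, among other studies that can be pursued with the proposed library. Different applications are presented in this contribution to illustrate the library's capabilities: an exploratory analysis of the topics discovered in tweets, a study of the similarity among dialects of the Spanish language, and a mobility report on different countries. In summary, the Python library presented is applied to different domains and retrieves a wealth of information: frequencies of words and word bigrams in Arabic, English, Spanish, and Russian, as well as mobility information on the number of trips between locations for more than 200 countries or territories.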

Analyzing the State of Computer Science Research with the DBLP Discovery Dataset

Lennart Küll

Categories: Natural Language Processing

2022-12-01

The number of scientific publications continues to rise exponentially, especially in Computer Science (CS). However, current solutions to analyze those publications restrict access behind a paywall, offer no features for visual analysis, limit access to their data, only focus on niches or sub-fields, and/or are not flexible and modular enough to be transferred to other datasets. In this thesis, we conduct a scientometric analysis to uncover the implicit patterns hidden in CS metadata and to determine the state of CS research. Specifically, we investigate trends of the quantity, impact, and topics for authors, venues, document types (conferences vs. journals), and fields of study (compared to, e.g., medicine). To achieve this we introduce the CS-Insights system, an interactive web application to analyze CS publications with various dashboards, filters, and visualizations. The data underlying this system is the DBLP Discovery Dataset (D3), which contains metadata from 5 million CS publications. Both D3 and CS-Insights are open-access, and CS-Insights can be easily adapted to other datasets in the future. The most interesting findings of our scientometric analysis include that i) there has been a stark increase in publications, authors, and venues in the last two decades, ii) many authors only recently joined the field, iii) the most cited authors and venues focus on computer vision and pattern recognition, while the most productive prefer engineering-related topics, iv) the preference of researchers to publish in conferences over journals dwindles, v) on average, journal articles receive twice as many citations compared to conference papers, but the contrast is much smaller for the most cited conferences and journals, and vi) journals also get more citations in all other investigated fields of study, while only CS and engineering publish more in conferences than journals.

Multi-dimensional Racism Classification during COVID-19: Stigmatization, Offensiveness, Blame, and Exclusion

Xin Pei , Deval Mehta

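Categories: Artificial Intelligence

2022-08-29

Going beyond the binary classification of racist text, our study takes cues from social science theories to develop a multi-dimensional model for racism detection, namely stigmatization, offensiveness, blame, and exclusion. With the aid of BERT and topic modeling, this categorical detection gives insight into the underlying subtlety of racist discussion on digital platforms during COVID-19. Our study contributes to enriching the scholarly discussion of racist behavior on social media. First, a stage-wise analysis is applied to capture the dynamics of topic changes across the early stages of COVID-19, which developed from a domestic epidemic into an international public health emergency and later into a global pandemic. Furthermore, mapping this trend enables a more accurate prediction of how public opinion on racism develops in the offline world and, at the same time, the design of specific intervention strategies to combat the rise of racism during a global public health crisis like COVID-19. In addition, this interdisciplinary research points out a direction for future studies on social network analysis and mining. Integrating social science perspectives into the development of computational methods provides insights for more accurate data detection and analysis.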

2020 U.S. presidential election in swing states: Gender differences in Twitter conversations

Amir Karami , Spring B. Clark , Anderson Mackenzie , Dorathea Lee , Michael Zhu , Hannah R. Boyajieff , Bailey Goldschmidt

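Categories: Natural Language Processing

2021-08-21

Social media is commonly used by the public during election campaigns to express opinions on different issues. Among the various social media channels, Twitter provides an efficient platform for researchers and politicians to explore public opinion on a wide range of topics, such as the economy and foreign policy. The current literature mainly focuses on analyzing the content of tweets without considering the gender of users. This research collects and analyzes a large number of tweets and uses computational, human-coding, and statistical analyses to identify topics in more than 300,000 tweets posted during the 2020 U.S. presidential election. Our findings cover a wide range of topics, such as tax, climate change, and COVID-19. For more than 70% of the topics, there is a significant difference between female and male users.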

Understanding COVID-19 Vaccine Reaction through Comparative Analysis on Twitter

Yuesheng Luo , Mayank Kejriwal

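Categories: Natural Language Processing

2021-11-10

Although multiple COVID-19 vaccines have been available for several months now, vaccine hesitancy remains at high levels in the United States. In part, the issue has also become politicized, especially since the presidential election in November. Understanding vaccine hesitancy during this period in the context of social media, including Twitter, can provide valuable guidance to both computational social scientists and policymakers. Rather than studying a single Twitter corpus, this paper compares two Twitter datasets collected in two different time periods (one before the election and the other a few months after) using the same data collection and filtering methodology. Our results show a significant shift in discussion from politics to COVID-19 vaccines between fall 2020 and 2021. By using clustering and machine learning-based methods together with sampling and qualitative analysis, we uncover several fine-grained reasons for vaccine hesitancy, some of which have become more (or less) important over time. Our results also underscore the intense polarization and politicization of this issue over the last year.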

Demystifying the COVID-19 vaccine discourse on Twitter

Zainab Zaidi , Mengbin Ye , Fergus John Samon , Abdisalam Jama , Binduja Gopalakrishnan , Chenhao Gu , Shanika Karunasekera , Jamie Evans , Yoshihisa Kashima

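Categories: Natural Language Processing

2022-08-29

Understanding the public discourse on COVID-19 vaccination on social media is important not only for addressing the current COVID-19 pandemic but also for future pathogen outbreaks. We examine a Twitter dataset containing 75 million English tweets discussing COVID-19 vaccination from March 2020 to March 2021. We train a stance detection algorithm using natural language processing (NLP) techniques to classify tweets as "anti-vax" or "pro-vax", and we examine the main topics of discourse using topic modeling techniques. While pro-vax tweets (37 million) far outnumbered anti-vax tweets (10 million), a majority of tweets from both stances (63% of anti-vax and 53% of pro-vax tweets) came from dual-stance users, who posted both pro- and anti-vax tweets during the observation period. Pro-vax tweets focused mostly on vaccine development, while anti-vax tweets covered a wide range of topics, some of which included genuine concerns, although with a large amount of falsehood. A number of topics were common to both stances, although they were discussed from opposite viewpoints. Memes and jokes were among the most retweeted messages. Whereas concerns about polarization and the online prevalence of anti-vax discourse are unfounded, targeted countering of falsehoods is important.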

Detecting Anti-Vaccine Users on Twitter

Matheus Schmitz , Goran Murić , Keith Burghardt

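Categories: Natural Language Processing

2021-10-21

Vaccine hesitancy, recently driven by online narratives, significantly degrades the efficacy of vaccination strategies such as those for COVID-19. Despite broad agreement in the medical community about the safety and efficacy of available vaccines, many social media users continue to be inundated with false information about vaccines and are indecisive about, or unwilling to accept, vaccination. The goal of this study is to better understand anti-vaccine sentiment by developing a system capable of automatically identifying the users responsible for spreading anti-vaccine narratives. We introduce a publicly available Python package that analyzes Twitter profiles to assess how likely a profile is to share anti-vaccine sentiment in the future. The package is built using text embedding methods, neural networks, and automated dataset generation, and it is trained on several million tweets. We find that the model can accurately detect anti-vaccine users before they tweet anti-vaccine hashtags or keywords. We also show examples of how text analysis helps us understand anti-vaccine discussions by detecting moral and emotional differences between anti-vaccine spreaders on Twitter and regular users. Our results will help researchers and policymakers understand how users become anti-vaccine and what they discuss on Twitter. Policymakers can use this information for better-targeted campaigns that debunk harmful anti-vaccination myths.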

An Empirical Study of IoT Security Aspects at Sentence-Level in Developer Textual Discussions

Nibir Chandra Mandal , Gias Uddin

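Categories: Machine Learning

2022-06-07

The Internet of Things (IoT) is a rapidly emerging paradigm that now encompasses almost every aspect of modern life. Ensuring the security of IoT devices is therefore crucial. IoT devices can differ from traditional computing, so designing and implementing proper security measures in them can be challenging. We observed that IoT developers discuss their security-related challenges in developer forums such as Stack Overflow (SO). However, we find that IoT security discussions in SO can also be buried inside non-security discussions. In this paper, we aim to understand the challenges IoT developers face when applying security practices and techniques to IoT devices. We have two goals: (1) develop a model that can automatically find security-related IoT discussions in SO, and (2) study the model output to learn about the security-related challenges of IoT developers. First, we download 53K posts from SO that contain discussions about IoT. Second, we manually label 5,919 sentences from the 53K posts as 1 or 0. Third, we use this benchmark to investigate a suite of deep learning transformer models. The best-performing model is called SecBot. Fourth, we apply SecBot to all the posts and find around 30K security-related sentences. Fifth, we apply topic modeling to the security-related sentences, and then label and categorize the topics. Sixth, we analyze the evolution of the topics. We find that (1) SecBot is based on retraining the deep learning model RoBERTa and offers the best F1 score of 0.935; (2) there are six error categories among the samples misclassified by SecBot, which is mostly wrong when the keywords or context are ambiguous (e.g., a gateway can be a security gateway or a simple gateway); (3) there are 9 security topics grouped into three categories: software, hardware, and network; and (4) the largest number of topics belongs to software security, followed by network security.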
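
For readers unfamiliar with the metric behind the reported score, the sketch below computes an F1 score for binary sentence labels (1 or 0), matching the labeling scheme mentioned in the abstract. It is a generic illustration, not the authors' evaluation code: the abstract does not say which averaging scheme was used, so treating label 1 as the positive class is an assumption, and the gold labels and predictions are invented.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 score of the positive class for two equally long label sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("label sequences must have equal length")
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    if true_pos == 0:
        return 0.0  # no correct positive prediction, so F1 is zero
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and classifier outputs for eight sentences.
gold_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 0, 0, 1, 1, 0]
score = f1_score(gold_labels, predicted_labels)
print(f"F1 = {score:.3f}")
```

In this toy example there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75 and so is F1. A score of 0.935 therefore requires both precision and recall to be high, since the harmonic mean is pulled toward the lower of the two.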

DiPD: Disruptive event Prediction Dataset from Twitter

Sanskar Soni , Dev Mehta , Vinush Vishwanath , Aditi Seetha , Satyendra Singh Chouhan

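Categories: Natural Language Processing | Machine Learning

2021-11-25

Riots and protests, if they get out of control, can cause havoc in a country. We have seen examples of this, such as the BLM movement, climate strikes, and the CAA movement, which caused disruption on a large scale. Our motivation for creating this dataset is to use it to develop machine learning systems that give users insight into ongoing trending events and alert them to events that could lead to disruption in the country. If an event starts to get out of control, it can be handled and mitigated by monitoring it before it escalates. The dataset collects tweets about past or ongoing events known to have caused disruption and labels these tweets as 1. We also collect tweets that are considered non-eventful and label them as 0, so that they can also be used to train a classification system. The dataset contains 94855 records of unique events and 168706 records of unique non-events, for a total of 263561 records. We extract multiple features from the tweets, such as the user's follower count and the user's location, to understand the impact and reach of the tweets. This dataset may be useful in various event-related machine learning problems, such as event classification and event recognition.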
