Pedestrian safety is a priority for transportation system managers and operators, and a main focus of the Vision Zero strategy employed by the City of Austin, Texas. While many treatments and technologies can effectively improve pedestrian safety, identifying the locations where these treatments are most needed remains a challenge. Current practice requires manually observing candidate locations for limited time periods, so the identification process is time-consuming, lags behind evolving traffic patterns, and lacks scalability. Mid-block locations, which often need safety countermeasures, are especially difficult to identify and monitor. The goal of this research is to understand the correlation between bus stop locations and mid-block crossings, in order to help traffic engineers implement Vision Zero strategies for improving pedestrian safety. In prior work we developed a deep-neural-network-based model to detect pedestrian crossing events from traffic camera video. In this paper we extend that method to identify bus stop usage from video captured by off-the-shelf CCTV pan-tilt-zoom (PTZ) traffic monitoring cameras at nearby intersections. We correlate the video detection results with bus stops near the mid-block crossing and with pedestrian activity at the bus stops on each side of the mid-block crossing. We also build a web portal that facilitates manual activity verification by automating the creation of video clips showing only the crossing events, greatly improving the efficiency of the human review process.
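To make the correlation step concrete, the following sketch (hypothetical time-binning and variable names, not the study's actual pipeline) counts detected mid-block crossing events and detected bus-stop activity in common time bins and computes their Pearson correlation:

import numpy as np

def bin_counts(timestamps, start, end, bin_seconds=900):
    """Count events in fixed-width time bins (15 minutes by default)."""
    edges = np.arange(start, end + bin_seconds, bin_seconds)
    counts, _ = np.histogram(np.asarray(timestamps), bins=edges)
    return counts

def crossing_bus_correlation(crossing_times, bus_times, start, end):
    """Pearson correlation between binned crossing events and bus-stop activity."""
    x = bin_counts(crossing_times, start, end)
    y = bin_counts(bus_times, start, end)
    return np.corrcoef(x, y)[0, 1]

# Placeholder timestamps (seconds) standing in for detector output.
crossings = [120, 480, 950, 1900, 2400, 3100]
bus_usage = [100, 500, 1000, 2000, 2500, 3000]
print(crossing_bus_correlation(crossings, bus_usage, start=0, end=3600))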
Periocular refers to the region of the face that surrounds the eye socket. This is a feature-rich area that can be used by itself to determine the identity of an individual. It is especially useful when the iris or the face cannot be reliably acquired. This can be the case in unconstrained or uncooperative scenarios, where the face may appear partially occluded, or the subject-to-camera distance may be high. However, it has received revived attention during the pandemic due to masked faces, leaving the ocular region as the only visible facial area, even in controlled scenarios. This paper discusses the state-of-the-art of periocular biometrics, giving an overall framework of its most significant research aspects.
Detecting actions in untrimmed videos should not be limited to a small, closed set of classes. We present a simple, yet effective strategy for open-vocabulary temporal action detection utilizing pretrained image-text co-embeddings. Despite being trained on static images rather than videos, we show that image-text co-embeddings enable open-vocabulary performance competitive with fully-supervised models. We show that the performance can be further improved by ensembling the image-text features with features encoding local motion, like optical flow based features, or other modalities, like audio. In addition, we propose a more reasonable open-vocabulary evaluation setting for the ActivityNet data set, where the category splits are based on similarity rather than random assignment.
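A minimal sketch of the underlying idea, assuming a pretrained CLIP-style co-embedding supplies L2-normalized image and text embeddings (placeholder interfaces, not the paper's model): score each frame against arbitrary class names and threshold the smoothed scores over time to obtain temporal segments.

import numpy as np

def detect_segments(frame_embs, class_text_embs, threshold=0.3, smooth=5):
    """Open-vocabulary temporal detection from per-frame image embeddings.

    frame_embs:      (T, D) L2-normalized image embeddings, one per frame.
    class_text_embs: (C, D) L2-normalized text embeddings of class names.
    Returns a list of (class_index, start_frame, end_frame) segments.
    """
    scores = frame_embs @ class_text_embs.T               # (T, C) cosine similarities
    kernel = np.ones(smooth) / smooth
    segments = []
    for c in range(scores.shape[1]):
        s = np.convolve(scores[:, c], kernel, mode="same")  # temporal smoothing
        active = s > threshold
        t = 0
        while t < len(active):                             # extract contiguous runs
            if active[t]:
                start = t
                while t < len(active) and active[t]:
                    t += 1
                segments.append((c, start, t - 1))
            else:
                t += 1
    return segments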
Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data and models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study will be available at https://github.com/LAION-AI/scaling-laws-openclip
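Such power laws are typically fit by linear regression in log-log space; a small illustration, assuming downstream error scales as error = a * scale^(-b) (illustrative numbers only, not results from the paper):

import numpy as np

def fit_power_law(scale, error):
    """Fit error = a * scale**(-b) by least squares in log-log space."""
    log_x = np.log(np.asarray(scale, dtype=float))
    log_y = np.log(np.asarray(error, dtype=float))
    slope, intercept = np.polyfit(log_x, log_y, deg=1)
    return np.exp(intercept), -slope            # a, b

def predict_error(a, b, scale):
    return a * np.asarray(scale, dtype=float) ** (-b)

# Illustrative only: hypothetical error rates at a few training scales.
scales = [1e6, 1e7, 1e8, 1e9]
errors = [0.60, 0.45, 0.34, 0.25]
a, b = fit_power_law(scales, errors)
print(f"error ~= {a:.3f} * scale^(-{b:.3f})")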
Electronic Health Records (EHRs) hold detailed longitudinal information about each patient's health status and general clinical history, a large portion of which is stored within the unstructured text. Temporal modelling of this medical history, which considers the sequence of events, can be used to forecast and simulate future events, estimate risk, suggest alternative diagnoses or forecast complications. While most prediction approaches use mainly structured data or a subset of single-domain forecasts and outcomes, we processed the entire free-text portion of EHRs for longitudinal modelling. We present Foresight, a novel GPT3-based pipeline that uses NER+L tools (i.e. MedCAT) to convert document text into structured, coded concepts, followed by providing probabilistic forecasts for future medical events such as disorders, medications, symptoms and interventions. Since large portions of EHR data are in text form, such an approach benefits from a granular and detailed view of a patient while introducing modest additional noise. On tests in two large UK hospitals (King's College Hospital, South London and Maudsley) and the US MIMIC-III dataset, a precision@10 of 0.80, 0.81 and 0.91 was achieved for forecasting the next biomedical concept. Foresight was also validated on 34 synthetic patient timelines by 5 clinicians and achieved relevancy of 97% for the top forecasted candidate disorder. Foresight can be easily trained and deployed locally as it only requires free-text data (as a minimum). As a generative model, it can simulate follow-on disorders, medications and interventions for as many steps as required. Foresight is a general-purpose model for biomedical concept modelling that can be used for real-world risk estimation, virtual trials and clinical research to study the progression of diseases, simulate interventions and counterfactuals, and for educational purposes.
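The precision@10 figures refer to how often the true next concept appears among the model's top ten forecasts; a generic sketch of the metric (not the authors' evaluation code):

def precision_at_k(ranked_predictions, true_next_concepts, k=10):
    """Fraction of cases where the true next concept appears in the top-k forecasts.

    ranked_predictions: list of ranked candidate-concept lists, one per test case.
    true_next_concepts: list of the concept that actually occurred next.
    """
    hits = sum(
        truth in preds[:k]
        for preds, truth in zip(ranked_predictions, true_next_concepts)
    )
    return hits / len(true_next_concepts)

# Illustrative usage with placeholder concept names.
preds = [["diabetes", "hypertension", "asthma"], ["metformin", "insulin"]]
truth = ["hypertension", "insulin"]
print(precision_at_k(preds, truth, k=10))   # 1.0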
In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g. image-text pairs, question answering pairs, knowledge graph triplets, etc) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can use a diverse set of multimodal knowledge sources, which is shown to result in significant gains. We show that REVEAL achieves state-of-the-art results on visual question answering and image captioning.
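A heavily simplified sketch of the retrieve-then-fuse pattern described above (placeholder encoder, memory and generator; not REVEAL's architecture):

import numpy as np

def retrieve_top_k(query_emb, memory_embs, k=5):
    """Return indices of the k memory entries most similar to the query."""
    scores = memory_embs @ query_emb            # inner-product similarity
    return np.argsort(-scores)[:k]

def answer(query, query_encoder, memory_embs, memory_entries, generator, k=5):
    """Encode the query, retrieve knowledge entries, and let the generator fuse them."""
    q = query_encoder(query)                    # (D,) query embedding
    idx = retrieve_top_k(q, memory_embs, k)
    retrieved = [memory_entries[i] for i in idx]
    return generator(query, retrieved)          # fuse retrieved knowledge with the query

# Toy demo with random embeddings and trivial placeholder encoder/generator.
rng = np.random.default_rng(0)
mem_embs = rng.normal(size=(100, 8))
mem_entries = [f"fact_{i}" for i in range(100)]
enc = lambda text: rng.normal(size=8)
gen = lambda q, facts: f"answer({q}) using {facts}"
print(answer("what is shown in the image?", enc, mem_embs, mem_entries, gen, k=3))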
Likelihood-based deep generative models have recently been shown to exhibit pathological behaviour under the manifold hypothesis as a consequence of using high-dimensional densities to model data with low-dimensional structure. In this paper we propose two methodologies aimed at addressing this problem. Both are based on adding Gaussian noise to the data to remove the dimensionality mismatch during training, and both provide a denoising mechanism whose goal is to sample from the model as though no noise had been added to the data. Our first approach is based on Tweedie's formula, and the second on models which take the variance of added noise as a conditional input. We show that surprisingly, while well motivated, these approaches only sporadically improve performance over not adding noise, and that other methods of addressing the dimensionality mismatch are more empirically adequate.
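For reference, the standard statement of Tweedie's formula underlying the first approach: if data x is perturbed with isotropic Gaussian noise of variance sigma^2, the posterior mean of the clean data given a noisy observation y can be written through the score of the noise-perturbed density p_sigma,
\[
\mathbb{E}[x \mid y] \;=\; y + \sigma^{2}\, \nabla_{y} \log p_{\sigma}(y),
\qquad y = x + \varepsilon,\quad \varepsilon \sim \mathcal{N}(0, \sigma^{2} I),
\]
so a density model fit to the noised data directly yields an estimate of the corresponding clean point.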
The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach: building wider attention Transformers. We demonstrate that wide single layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. The impact of changing the model aspect ratio on Transformers is then studied systematically. This ratio balances the number of layers and the number of attention heads per layer while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single layer wide models perform 0.3% better than their deep counterparts. We provide an in-depth evaluation and demonstrate that wide models require a far smaller memory footprint and can run faster on commodity hardware; in addition, these wider models are also more interpretable. For example, a single layer Transformer on the IMDb byte level text classification has 3.1x faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. We therefore put forward wider and shallower models as a viable and desirable alternative for small models on NLP tasks, and as an important area of research for domains beyond this.
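To make the aspect-ratio comparison concrete, a small PyTorch sketch (illustrative hyperparameters, not the paper's configurations) builds a deep encoder of 6 layers with 8 heads each and a single-layer encoder with 48 heads, keeping the model width and the total number of attention heads fixed:

import torch
import torch.nn as nn

def make_encoder(num_layers, heads_per_layer, d_model=768, d_ff=3072):
    """Stack of Transformer encoder layers with the given aspect ratio."""
    layer = nn.TransformerEncoderLayer(
        d_model=d_model, nhead=heads_per_layer,
        dim_feedforward=d_ff, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

def count_params(model):
    return sum(p.numel() for p in model.parameters())

deep = make_encoder(num_layers=6, heads_per_layer=8)    # 6 x 8  = 48 heads total
wide = make_encoder(num_layers=1, heads_per_layer=48)   # 1 x 48 = 48 heads total

x = torch.randn(2, 128, 768)          # (batch, sequence, d_model)
print(deep(x).shape, wide(x).shape)   # both produce (2, 128, 768)
print(count_params(deep), count_params(wide))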
Pathologists diagnose and grade prostate cancer by examining tissue from needle biopsies on glass slides. The severity of the cancer and its risk of metastasis are determined by the Gleason grade, a score based on the organization and morphology of the prostate cancer glands. For the diagnostic work-up, pathologists first locate glands in the whole biopsy core and, if cancer is found, assign a Gleason grade. Despite strict diagnostic criteria, this time-consuming process remains subject to errors and considerable inter-observer variability. This paper proposes an automated workflow that follows the pathologist's \textit{modus operandi}, isolating and classifying multi-scale patches of whole slide images (WSI): (1) delineating stroma and gland boundaries separately; (2) a classifier network separating benign from cancerous tissue at high magnification; and (3) another classifier predicting the grade of each cancer at low magnification. Altogether, this process provides a gland-specific approach to prostate cancer grading, which we compare against other machine-learning-based grading methods.
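A schematic sketch of such a three-stage, gland-specific cascade (placeholder model interfaces, not the paper's implementation):

def grade_slide(wsi, segmenter, benign_vs_cancer, grader):
    """Gland-specific grading cascade over multi-scale patches of a WSI.

    segmenter(wsi)                  -> iterable of gland regions (stroma/gland boundaries)
    benign_vs_cancer(patch_highmag) -> True if the gland patch is cancerous
    grader(patch_lowmag)            -> Gleason grade for a cancerous gland
    """
    results = []
    for gland in segmenter(wsi):
        patch_high = gland.extract(magnification="high")
        if benign_vs_cancer(patch_high):
            patch_low = gland.extract(magnification="low")
            results.append((gland, grader(patch_low)))
    return results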
The growing popularity of social media has raised concerns about children's online safety. Interactions between minors and adults with predatory intent are an especially serious concern. Research on online sexual grooming has typically relied on domain experts to manually annotate conversations, which limits both scale and scope. In this work, we test how well automated methods can detect conversational behaviours and replace expert human annotation. Drawing on psychological theories of online grooming, we label 6,772 chat messages sent by child sex offenders with one of eleven predatory behaviours. We train bag-of-words and natural language inference models to classify each behaviour and show that the best-performing models classify the behaviours consistently, though not in full agreement with the human annotations.
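As a concrete example of the bag-of-words baseline, the sketch below (scikit-learn, with placeholder messages and label names) trains a multi-class classifier mapping a chat message to one of the annotated behaviour categories:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder examples only; the real study uses annotated offender chat messages
# labelled with one of eleven behaviour categories.
messages = ["example message one", "example message two",
            "example message three", "example message four"]
labels   = ["behaviour_1", "behaviour_2", "behaviour_1", "behaviour_3"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(messages, labels)
print(model.predict(["another example message"]))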