Learning Automata-Based Complex Event Patterns in Answer Set Programming

Nikos Katzouris Institute of Informatics & Telecommunications, National Center for Scientific Research (NCSR) “Demokritos”, Athens, Greece
¹,
Georgios Paliouras Institute of Informatics & Telecommunications, National Center for Scientific Research (NCSR) “Demokritos”, Athens, Greece
¹,

¹email: {nkatz,gpaliourg}@iit.demokritos.gr

Abstract

Complex Event Recognition and Forecasting (CER/F) techniques attempt to detect, or even forecast ahead of time, event occurrences in streaming input using predefined event patterns. Such patterns are not always known in advance, or they frequently change over time, making machine learning techniques, capable of extracting such patterns from data, highly desirable in CER/F. Since many CER/F systems use symbolic automata to represent such patterns, we propose a family of such automata where the transition-enabling conditions are defined by Answer Set Programming (ASP) rules, and which, thanks to the strong connections of ASP to symbolic learning, are directly learnable from data. We present such a learning approach in ASP and an incremental version thereof that trades optimality for efficiency and is capable to scale to large datasets. We evaluate our approach on two CER datasets and compare it to state-of-the-art automata learning techniques, demonstrating empirically a superior performance, both in terms of predictive accuracy and scalability.

Keywords:

Automata Learning Answer Set Programming Complex Event Recognition

1 Introduction

Complex Event Recognition and forecasting (CER/F) systems [8, 18, 2] detect, or even forecast ahead of time, occurrences of complex events (CEs) in multivariate streaming input, defined as temporal combinations of simple events, e.g. sensor data. CE patterns are typically defined by domain experts in some event specification language. However, such patterns are not always known in advance, or they frequently need to be revised, as the characteristics of the input data change over time, making machine learning techniques capable of learning such patterns from data highly desirable in CER/F.

Despite the great diversity of existing event specification languages, a minimal set of basic constructs/operators that should be present in every such language have been identified [36, 3, 18, 20], in the form of an abstract event algebra. The most important of these operators are the sequence operator and the closely related iteration operator (Kleene Closure), implying respectively that some particular events should succeed one another temporally, or that an event should occur iteratively in a sequence, and the selection operator, which filters (selects) events that satisfy a set of predefined predicates. Taken together, these three operators already point to a computational model for CER/F based on symbolic automata [9], and indeed, in most existing CER/F systems CE patterns are either defined directly as symbolic automata, or are compiled into such at runtime [35, 1, 36, 13, 11, 12, 30, 29, 7, 2]. In symbolic automata the transition-enabling conditions are predicates, rather than mere symbols, as in classical automata. In CER/F, the structure of such an automaton pattern corresponds to the conditions specified by the sequence/iteration operators in the CE pattern and their transition guards correspond to the pattern’s selection predicates.

Learning such symbolic automata-based CE patterns is a challenging task that requires to combine automata structure identification techniques with reasoning about the satisfiability of the selection predicates. To address this issue we propose a family of symbolic automata, which we call answer set automata (ASA), where the transition guards are defined by means of rules in Answer Set Programming (ASP), providing definitions for the selection predicates. Importantly, thanks to the strong connections of ASP with symbolic learning, ASA-based CE patterns are directly learnable from data. We present an approach that utilizes the power of ASP in declarative learning to automatically construct such patterns from data and we lift this approach to an incremental version that trades optimality for efficiency and allows our learning framework to scale to large datasets. We evaluate our approach on two CER datasets and compare it to state-of-the-art automata learning techniques, demonstrating empirically a superior performance, both in terms of predictive accuracy and scalability.

2 Related Work

Although the field of automata learning [10, 34] has a long history in the literature [19, 5, 28, 21, 33, 4, 17, 32, 16], existing techniques that induce automata from positive & negative traces have several shortcomings, which limit their applicability in the CER/F domain. Most such algorithms either attempt to learn a model that perfectly discriminates between the positive/negative traces [19, 5, 17, 4, 16], or they use greedy techniques for state merging [28, 21], a technique that generalizes from a large, tree-like automaton (the Prefix Tree Acceptor – PTA), generated from the entire training set. These approaches tend to learn large, overfitted models that generalize poorly. More recent techniques [31] replace the PTA generalization heuristics with exact, constraint-based automata identification methods, achieving higher generalization capacity. However, the issue remains that such techniques still need to encode the entire training set into a PTA, raising memory/scalability issues in large datasets.

All aforementioned algorithms learn classical automata. Although some of these algorithms could, in principle, be applied to the symbolic automata learning setting, e.g. via propositionalization, that would entail an explosion in alphabet size and the combinatorial complexity of the learning task. On the other hand, although some algorithms for symbolic automata induction do exist [14, 26, 6, 15], they are mostly based on “upgrading” existing classical automata identification techniques to richer alphabets, and they thus suffer from the limitations outlined above, i.e. poor generalization, intolerance to noise and limited scalability.

Learning automata and grammars has been an application domain for ILP since its early days, targeting mostly classical automata expressed as definite clause grammars. More recent ILP frameworks, such as meta-interpretative learning (MIL) and learning from answer sets (ILASP), have also been applied to the task [27, 16]. However, both these approaches learn models that perfectly discriminate between positive and negative examples, therefore, they cannot deal with noise in the data. Moreover, the MIL approach of [27] learns classical automata, and although the ILASP-based approach of [16] does learn a form of symbolic automata, the transition guards therein are restricted to propositional clauses generated from combinations of the alphabet symbols, which falls short of the CER/F requirement for arbitrary selection predicates.

In contrast to the above-mentioned approaches, our symbolic automata learning framework utilizes the full expressive power of ASP in the definitions of transition guards and ASP’s declarative learning capabilities to learn highly compressive models that generalize adequately. Moreover, its incremental learning version is able to scale to arbitrarily large datasets.

3 ASP Background

We assume some familiarity with ASP and refer to [23] for an in-depth account. In this section we review some basic ASP constructs that will be useful in what follows. Throughout, we use the Clingo¹¹1https://potassco.org/ syntax for representing ASP expressions. A choice rule is an expression of the form ${α} \leftarrow δ_{1}, \dots, δ_{n}$ , with the intuitive meaning that whenever the body $δ_{1}, \dots, δ_{n}$ is satisfied by an answer set $I$ of a program that includes the choice rule, instances of the head $α$ are arbitrarily included in $I$ (satisfied) as well. A weak constraint is an expression of the form $:\texttildelow δ1,…,δn.[w@p,t1,…,tk]$ , where $δ_{i}$ ’s are literals, called the body of the constraint, $w$ and $p$ are integers, called respectively the weight and the priority level of the constraint and $t_{1}, \dots, t_{k}$ are ASP terms. A grounding/instance of a weak constraint $c$ is an expression that results from $c$ by replacing all variables in $δ_{1}, \dots, δ_{n}, t_{1}, \dots, t_{k}$ by constants. Such an instance is satisfied by an answer set $I_{Π}$ of a program $Π$ that includes $c$ if $I_{Π}$ satisfies $c$ ’s ground body, which incurs a penalty of $w$ on $I_{Π}$ . $I_{Π}$ ’s total cost is the sum of penalties resulting from each instance of $c$ that is satisfied by $I_{Π}$ . Inclusion of weak constraints in an ASP program triggers an optimization process that yields answer sets of minimum cost. Priority levels in weak constraints model the constraints’ relative importance, since the aforementioned optimization process attempts to first minimize the total cost due to weak constraints of higher priority levels.

4 The Problem Setting and a Running Example

We next set the scene for our proposed automata-based CE pattern learning framework. CER/F applications usually deal with multivariate input, i.e., input that arrives in multiple streams, each representing a ‘‘signal’’ obtained by the evolution of a relevant domain attribute in time. We illustrate the case using a running example from the domain of precision medicine, as formulated in the context of the INFORE EU-funded project²²2https://www.infore-project.eu/.

Example 1 (Running Example).

INFORE’s precision medicine use-case utilizes CER/F techniques to assist the knowledge discovery process in personalized cancer treatment development. This process involves running a vast amount of complex, multi-cellular simulations to study the effects of various drug synergies on tumor growth. Such simulations are extremely demanding computationally and their majority ends-up in a negative result, signifying that a particular drug combination is not effective. Using CER/F to detect/forecast non-promising simulations at an early stage may thus speed up research by allowing to terminate non-interesting simulations early-on and allocate resources towards the exploration of more promising drug combinations. This calls for learning patterns of interesting/non-interesting outcomes from labeled, historical simulation data. Notably, the interpretability of such patterns is crucial in this domain.Therefore, mainstream black-box time-series classification methods, including deep learning techniques, are not an option. Figure 1 presents such a simulation generated by PhysiBoss [22], a bio-informatics simulation environment that allows to explore the results of several environmental and genetic alterations to populations of cells. Figure 1 presents tumor growth evolution over time, in terms of population sizes of three types of cell: the tumor’s alive cells, its apoptotic cells, i.e. cells that are “programmed” to die, due to apoptosis, and the tumor’s necrotic cells, i.e. cells that die due to the effects of an injected Tumor Necrosis Factor (TNF), i.e. a drug combination.

Figure 1: A multi-cellular simulation of tumor evolution from the INFORE project.

Multivariate time-series, such as the simulation data from Figure 1, may be converted into symbolic multivariate sequences (MVSs), e.g. by using the Symbolic Aggregate Approximation (SAX) algorithm [24, 25]. SAX converts time-series into symbolic sequences, by mapping numerical values to symbols, drawn from a fixed-length alphabet, such that each symbol in the converted sequence corresponds to a bin (value range) in the original time-series.

In the event pattern learning setting that we put forward we assume that the training data consist of labeled, symbolic MVSs, each representing a training example. For instance, the symbolic MVS obtained by discretizing the simulation data in Figure 1, along with a label (e.g. interesting/non-interesting simulation) represents such a training example. In what follows, by MVS we always mean a symbolic MVS.

Given an MVS $S$ of maximum length $n$ (i.e. the length of the largest sequence in $S$ ) we use a logical representation of $S$ as a sequence of interpretations $I_{1}, \dots, I_{n}$ , where each $I_{t}$ consists of ground facts that describe $S$ ’s $t$ -th coordinate. In particular, we assume that each sequence in $S$ corresponds to a domain attribute and consists of symbols from a fixed alphabet $Σ$ that represent the values of this attribute over time. Then, the interpretation $I_{t}$ that corresponds to $S$ ’s $t$ -th coordinate consists of observation facts of the form $o b s (S e q I d, a v (A, V), T)$ , meaning that the attribute $A$ of the MVS with unique id $S e q I d$ has value $V$ at time $T$ .

(a) A toy training set consisting of two MVSs. Positive example (

i d_{1}

): Negative example (

i d_{2}

): Alive cells: eeeedcbbbb Alive cells: eecdbbbbbb Necrotic cells: aabbbcccde Necrotic cells: aabbbbcccc Apoptotic cells: bbbcdghhhh Apoptotic cells: bbbcfghhhh (b) The logical representation of MVS $i d_{1}$ :

Iid11={\scriptsize obs(id1,\scriptsize av(alive,e),1). \scriptsize obs(id1,\scriptsize av(necrotic,a),1). \scriptsize obs(id1,\scriptsize av% (apoptotic,b),1).}

Iid12={\scriptsize obs(id1,\scriptsize av(alive,e),2). \scriptsize obs(id1,\scriptsize av(necrotic,a),2). \scriptsize obs(id1,\scriptsize av% (apoptotic,b),2).}

Iid13={\scriptsize obs(id1,\scriptsize av(alive,e),3). \scriptsize obs(id1,\scriptsize av(necrotic,b),3). \scriptsize obs(id1,\scriptsize av% (apoptotic,b),3).}

…

Iid110={\scriptsize obs(id1,\scriptsize av(alive,b),10). \scriptsize obs(id1,\scriptsize av% (necrotic,e),10). \scriptsize obs(id1,% \scriptsize av(apoptotic,h),10).}

Table 1: Multivariate, discrete sequences and their logical representation.

Example 2 (Labeled MVS).

Table 1 (a) presents a minimal, toy training set with a single positive and a single negative example. Each example is an MVS consisting of three length-ten sequences regarding the evolution of cell populations for the different types of cell in our running example (Example 1). These sequences are short excerpts of longer simulation sequences, which have been discretized using SAX. Each symbol in a sequence corresponds to a bin of real values.Table 1 (b) presents the logical representation of the positive example (MVS $i d_{1}$ ) as a sequence of interpretations $I_{1}^{i d}, \dots, I_{10}^{i d}$ .

5 Answer Set Automata

We next define a family of symbolic automata that may be used to express CE patterns over MVSs. The transition guard predicates, which we shall call transition features and correspond to selection predicates in a CER/F context, will be defined by ASP rules, we therefore call the resulting automata answer set automata (ASA). Intuitively, a transition in an ASA is enabled when the body of the corresponding transition feature rule is satisfied by the input MVS. Note that ASA will be non-deterministic in principle. This is because to enforce determinism in the case of symbolic automata it must be ensured that the conditions that guard all outgoing transitions from some state $q$ to a set of different states must be mutually exclusive. This is infeasible to guarantee in the ASA framework, since the transition features may encode arbitrary, domain-specific conditions that are deemed informative to synthesize automata with. The non-determinism of the ASA framework is in accordance with most event specification languages in the CER domain, where complex event patterns are defined as, or are eventually compiled into non-deterministic automata.

Definition 1 (Answer Set Automaton – ASA).

An answer set automaton is a tuple $A = ⟨ Σ, B, R, Q, q_{0}, F, δ ⟩$ , where: $Σ$ is the alphabet, represented by a set of ASP facts; $B$ is some background knowledge represented by an ASP program; $R$ is a set of ASP rules, called the transition features; $Q$ is a finite set of states; $q_{0} \in Q$ is a designated start state and $F \subseteq Q$ is a designated set of accepting states; $δ$ is a non-deterministic state transition function defined by means of a feature mapping $δ_{R} : Q \times R \to Q$ , where $δ_{R} (q, r) = q^{'}$ has the intuitive meaning that the transition from state $q$ to $q^{'}$ is guarded by the transition feature $r$ . Given a feature mapping $δ_{R}$ , the transition function $δ : Q \times 2^{Σ} \to 2^{Q} \cup {⊥}$ is defined as:

δ (q, I) = {\begin{matrix} A_{q} = {δ_{R} (q, r) \in Q | I \cup B ⊨ b o d y (r)}, if A_{q} \neq \emptyset, □ \in {{q}, {⊥}}, else. \end{matrix}

(1)

The “alphabet” $Σ$ in Definition 1 is defined as the set of ground facts that encode the simple events in the input, i.e., facts of the form $o b s (a l i v e, a)$ , as per our running example. The transition function $δ$ operates on interpretations, i.e. subsets of $Σ$ , as indicated by the powerset $2^{Σ}$ in $δ$ ’s signature. Then, given a state $q$ and an interpretation $I$ , the “if” branch of $δ$ in Eq. (1) maps $q$ to its set of “next states” $A_{q}$ . Each $q^{'} \in A_{q}$ is obtained by the feature mapping $δ_{R}$ , therefore, it is of the form $q^{'} = δ_{R} (q, r)$ and has the property that $I \cup B$ satisfies the body of the corresponding transition feature $r$ .

If the set of next states is empty (i.e., if no transition feature associated with $q$ is satisfied by $I$ ), the “else” branch encodes the different operational semantics that may be defined to govern the automaton’s behavior in this case. Typical options for such semantics is either to self-loop on the current state, in which case $δ (q, I) = {q}$ , or to reject the input, by moving to an absorbing, “dead” state, which we denote by $⊥$ . These semantics correspond to two different event consumption policies in CER/F, which are called skip-till-any-match and strict contiguity respectively [18].

The logical representation of an ASA consists of a set of facts of the form $t r a n s i t i o n (s_{1}, f, s_{2})$ , meaning that the transition from state $s_{1}$ to $s_{2}$ is guarded by condition $f$ . Figure 2 presents an example of this representation, which is explained in more detail in Example 3 below. Additionally, the accepting states in the ASA are specified via atoms of the form $a c c e p t i n g (s)$ .

Reasoning with ASA is performed via an interpreter that defines the desired ASA behavior. Such an interpreter is presented in Figure 2 as part of the background knowledge $B$ (also to be discussed in Example 3). The first rule in the interpreter states that for any example (input MVS) $S e q I d$ an ASA is initially (at time 1) in state $q_{0}$ , i.e. $q_{0}$ is the start state. The second rule states that an ASA moves from state $S_{1}$ to state $S_{2}$ at time $T$ , if there is a transition whose feature is satisfied by the current example ( $S e q I d$ ) at time $T$ . This is the meaning of the $s a t i s f i e s / 3$ atom that appears in that rule. Finally, the third rule in the interpreter defines the accepting condition for the MVS $S e q I d$ , by demanding the ASA to be in an accepting state at the end of the MVS. Rejection of an input MVS is defined implicitly via closed-world assumption. Note that this interpreter implements the strict contiguity operational semantics, i.e. it rejects the input at time $T$ if the set of next states $A_{q}$ at $T$ is empty. The skip-till-any match semantics may also be supported by adding to the interpreter an extra rule that allows to self-loop when $A_{q} = \emptyset$ , e.g. by using count aggregates in ASP:

$i n S t a t e (I d, S, T + 1) \leftarrow i n S t a t e (I d, S, T), # c o u n t {F : s a t i s f i e s (I d, F, T)} = 0.$

Transition Features. In this work the transition features are assumed to be defined beforehand (learning them, along with the corresponding ASA is a direction of future work) and are assumed to encode simple relations over attributes and values. Given that the goal is to check if the transition features are satisfied or not by the coordinates of an input MVS, we define them via the predicate $s a t i s f i e s / 3$ that is used by the interpreter. The bodies of such $s a t i s f i e s / 3$ rules consist of the conditions that should be satisfied in order for a transition to be triggered. These conditions are defined by means of attribute-value observation atoms (i.e., $o b s / 3$ atoms, as in Table 1) and comparison predicates that encode relations between such atoms.

Figure 2: An ASA that discriminates between the positive and negative MVSs of Figure 1.

Example 3 (Answer set automata & transition features).

Figure 2 presents an ASA, its logical representation and its transition features $R$ , along with some background knowledge $B$ necessary to reason with the ASA. $B$ consists of the ASA interpreter and the specification of attribute-value domain constants. We first discuss the transition features $R$ before going into the details of the ASA and its functionality. The first rule in $R$ specifies the conditions under which an MVS $S e q I d$ satisfies, at time $T$ , a predicate with signature $n e g (A, V)$ , stating that at time $T$ in $S e q I d$ (i.e., in $S e q I d$ ’s $T$ -th coordinate), the attribute $A$ does not have the value $V$ . The next rule defines a predicate $l t (A_{1}, A_{2})$ stating that at time $T$ the value of attribute $A_{1}$ is less than (hence, $l t$ ) the value of attribute $A_{2}$ . Finally, the third rule in $R$ defines $a t L e a s t (A, V)$ stating that at time $T$ the value of attribute $A$ is at least $V$ .

Given these definitions, and based on the instances of the transition features that guard the edges of the ASA in Figure 2, the behavior of the ASA is the following: It self-loops on $q_{0}$ for as long as the size of the alive cells population in the input – recall our running example – is not $b$ ; it transitions to the accepting state $q_{1}$ if an observation comes in, where the size of the alive cells population is smaller than that of the necrotic – an indication that a drug is promising; and it expects the necrotic cells population size to exceed a threshold ( $c$ ) from that point on, by self-looping on the accepting state.

The ASA in Figure 2 accepts the positive ( $i d_{1}$ ) and rejects the negative example ( $i d_{2}$ ) in Table 1. To see the former note the ASA self-loops on $q_{0}$ until time $T = 7$ , when the value of $a l i v e = b$ , causing $B \cup I_{7}^{i d_{1}}$ to no longer satisfy the self-loop condition on $q_{0}$ . However, at the same time point $n e c r o t i c = c$ , which exceeds the value of alive (recall that the symbols in the running example represent bins of values and they are ordered by the lexicographic ordering, i.e. letters later in the alphabet represent bins of larger values). Therefore, $B \cup I_{7}^{i d_{1}} ⊨ l t (a l i v e, n e c r o t i c)$ , causing the ASA to transition to the accepting state $q_{1}$ . The values of necrotic in the remaining time steps are at least $c$ , thus the ASA self-loops on the accepting state for the rest of the input.

To see that the ASA rejects $i d_{2}$ note that at time $T = 5$ in $i d_{2}$ $a l i v e = b$ , causing the condition that allows to self-loop on $q_{0}$ up to that point to fail. Moreover, the condition that allows to transition from $q_{0}$ to $q_{1}$ is also not satisfied, since $n e c r o t i c = b$ . Therefore, since the interpreter implements the strict contiguity semantics, the ASA is in no state at time $T = 6$ (i.e. it implicitly moves to the dead state) and the input is rejected.

6 Learning Answer Set Automata

We now turn to our approach to Answer Set Automata Learning (ASAL) in ASP. Our learning objective may be formulated as follows: Given a training set consisting of positive and negative MVSs $S^{+}$ and $S^{-}$ , a set of transition features $R$ , a “state budget” $N$ and potentially, a set of structural and regularization constraints $S C$ and $R C$ respectively, learn an ASA $A$ that uses the transition features in $R$ , respects $S C$ , optimizes $R C$ , does not exceed $N$ and minimizes the training error, defined as: $\sum_{s \in S^{+}} r e j e c t s (A, s) + \sum_{s \in S^{-}} a c c e p t s (A, s)$ , i.e. the number of misclassified training examples.

Structural constraints may include e.g. requirements for accepting states being absorbing ones, starting states not being accepting states etc. Regularization constraints are typically related to learning simpler models, where simplicity in our case is measured by the total number of states and transitions in an automaton. Several other simplicity criteria may be considered. For instance, an earliness bias would favor automata that accept/reject their input as early as possible, the intuition being that initial segments of the input are often enough to learn a good model, while trying to fit the entire input could yield more complicated automata with inferior generalization capacity.

We cast the automata learning problem as an abductive reasoning task in ASP. In its simplest form, such a task may be defined as a tuple $⟨ Π, I C, G, A ⟩$ , where $Π$ is a logic program that represents some background knowledge, $I C$ is a set of constraints that must be respected, $G$ is a set of ground constraints, often called “goals” and $A$ is a set of predicate signatures called abducibles. A solution to an abductive reasoning task is a set of ground logical facts $Δ$ , such that $B \cup I C \cup Δ ⊨ G$ .

In our case, the background knowledge program $Π$ contains the ASA interpreter, the definitions of the transition features $R$ and the training MVSs $S^{+} \cup S^{-}$ in logical form; $I C$ are the structural and regularization constraints, $G$ is a set of ground constraints related to the acceptance/rejection of each MVS in $S^{+}, S^{-}$ respectively and $A = {t r a n s i t i o n / 3, a c c e p t i n g / 1}$ .

\emph{Generate ASA:}{transition(S1,F,S2)}←state(S1),state(S2),feature(F).feature(ϕ)←type1(var1ϕ),…,typen(varnϕ).{states(S)}←maxStates(S).{accepting(S)}←state(S).maxStates(1..qN).   start(1).\emph{Minimize the training error:}:\texttildelow  accepted(SeqId), negative(SeqId). [wfp@2,SeqId]:\texttildelow  not accepted(SeqId), positive(SeqId). [wfn@2,SeqId]\emph{Example of regularization constraints:}:\texttildelow  transition(S1,X,S2). [1@1,S1,S2,X]:\texttildelow  accepted(SeqId,T). [T@1,SeqId,T]accepted(SeqId,T)←inState(SeqId,S,T),accepting(S).accepted(SeqId)←accepted(SeqId,_).\emph{Example of structural constraints:}←transition(S,,S2),accepting,S2≠S.

Table 2: Abductive ASA learning in ASP.

The abductive task is straightforward to specify and solve in ASP, via its generate-and-test methodology, according to which we generate automata via choice rules and test their performance via (weak) constraints. Table 2 presents an example formulation. The choice rules in the first block of rule generate ASA as collections of $t r a n s i t i o n / 3$ and $a c c e p t i n g / 3$ facts (as in Figure 2). The $f e a t u r e / 1$ rule is just a “type rule” added for each transition feature $r \in R$ . For instance, for the transition feature $l t (A, V)$ from Figure 1, the corresponding type rule is $f e a t u r e (l t (A, V)) \leftarrow a t t (A), v a l (V) .$

The next block of rules is a set of weak constraints that minimize the training error. Note that the $a c c e p t e d / 1$ predicate is defined in the ASA interpreter in Figure 2 and the $p o s i t i v e / 1$ and $n e g a t i v e / 1$ predicates are facts that carry the label of each training MVS. The constraints may be weighted differently, accounting e.g. for imbalances between positive/negative examples in the training set. The weights are $w_{f p}$ for the first rule, which is the price paid for each false positive, and $w_{f n}$ for the second rule, the price paid for each false negative. These constraints have a higher priority, relative to the ones that follow in Table 2, making training error minimization the primary objective.

The next block of rules presents an example of regularization biases in the form of weak constraints. The first constraint attempts to compress the generated automata as much as possible, by penalizing $t r a n s i t i o n / 3$ facts. The second rule attempts to maximize earliness by minimizing the length of the prefixes that an ASA needs to process before accepting³³3Maximizing earliness for input rejection, in addition to acceptance, would also be possible by modifying the ASA interpreter to not handle rejection via the closed-world assumption, but via explicit, absorbing dead states, and adding appropriate regularization constraints.. The $a c c e p t e d / 2$ predicate that is used in the earliness constraint simply monitors the number of steps needed to reach an accepting state, while the next rule defines acceptance in terms of such “partial” acceptance. To use the earliness bias constraint in ASAL these two rules should replace the third rule in the ASA interpreter in Figure 2, while accepting states should be treated as absorbing. This essentially forces ASAL to learn an automaton that accepts from prefixes of the input, whose length is to be minimized by the earliness constraint. The last rule in Table 2 is a structural constraint ensuring that accepting states are absorbing.

Example 4 (Asal in Action).

Let $Π$ be the program consisting of the rules in Table 2, $B$ and $R$ from Figure 2 and the logical representation of the two training examples in Table 1, as the union of the $I_{t}^{i d_{1}}$ ’s and the $I_{t}^{i d_{2}}$ ’s, along with the facts $p o s i t i v e (i d_{1})$ and $n e g a t i v e (i d_{2})$ . Running $C l i n g o$ on $Π$ and filtering the $t r a n s i t i o n / 3$ and the $a c c e p t i n g / 1$ facts from the generated solutions yields the learnt ASA. One of these ASA is the one illustrated in Figure 2, which is suboptimal, based on the constraints in Table 2. It can be seen that optimal ASA consist of two states and two transitions. e.g. the ASA

$t r a n s i t i o n (q_{0}, a t_l e a s t (a l i v e, e), q_{0}) . t r a n s i t i o n (q_{0}, a t_l e a s t (a p o p t o t i c, d), q_{1}) .$

$a c c e p t i n g (q_{1}) .$

In addition to being smaller, it can be seen that this ASA accepts the positive example $i d_{1}$ at step $6$ , in contrast to the ASA in Figure 2, which accepts at step 8. If we drop the earliness constraint in Table 2 we may obtain a single-state ASA:

$t r a n s i t i o n (q_{0}, n e g (a p o p t o t i c, f), q_{0}) .$

$a c c e p t i n g (q_{0}) .$

Although this ASA correctly classifies the examples, it is degenerate and has an unconventional behavior, always starting from an accepting state and failing later on to reject the negative examples (e.g. it moves to the implicit dead state at step 5 to reject $i d_{2}$ ).

6.1 Incremental Learning & Automata Revision

ASAL is guaranteed to find an optimal, constraint-compliant solution to the abductive ASA learning task, given enough time and memory. The main drawback, however, is that the requirements for such resources grow exponentially with the complexity of the learning task (e.g. alphabet size, number of transition features etc.) and the size of the input (e.g. number and length of the training examples), making the approach infeasible for larger datasets. To alleviate this issue we give-up the requirement for optimal ASA and opt for an incremental learning strategy that simply learns ASA with a good fit in the training data. Additional strategies for improving ASAL’s scalability, such as incorporating symmetry-breaking constraints, in order to ignore isomorphic ASA, are future work directions.

ASAL’s incremental learning version operates on mini-batches of the training data, sufficiently small to allow for fast ASA induction from each batch and, ideally, sufficiently large and diverse to allow for learning relatively good ASA from samples of the training set. This is paired with an ASA revision technique, which tries to apply minimal structural modifications on existing automata, to improve their performance on new mini-batches. The algorithm – we omit the pseudocode due to space limitations – is a greedy, iterative hill-climbing search that works as follows: At each point in time, an initially empty, best-so-far ASA $A$ is maintained. At each mini-batch $D$ , if $A$ ’s local classification performance on $D$ is lower than a given error threshold, $A$ is revised by running $C l i n g o$ on a program $Π$ similar to the one described in Example 4, augmented as follows: For each $t r a n s i t i o n (q_{i}, ϕ_{i}, q_{j})$ and each $a c c e p t i n g (q_{i})$ fact in the logical specification of $A$ we add to $Π$ the following:

$e x i s t i n g (t r a n s i t i o n (q_{i}, ϕ_{i}, q_{j})) .$

$e x i s t i n g (a c c e p t i n g (q_{i})) .$

$:\texttildelow not transition(qi,ϕi,qj),existing(transition(qi,ϕi,qj)). [−wi@1]$

$:\texttildelow not accepting(qi),existing(accepting(qi)). [1@1]$

The first two facts simply state which facts already exist in the structural description of $A$ , while the weak constraints penalize their removal, thus fostering minimal revisions. The weight $w_{i}$ in the first weak constraint is defined as $w_{i} = n - p$ , where $n, p$ are respectively the number of negative/positive examples throughout the entire training set, which are accepted by $A$ and the acceptance paths use the corresponding transition feature $ϕ_{i}$ . Note the “-” sign in front of $w_{i}$ , which makes $w_{i}$ positive (i.e., a penalty in the optimization process) if $n > p$ and negative (i.e. a reward) in the opposite case.

An ASP solver runs on the augmented program $Π$ for a given time $t$ , and the $k$ best solutions found in that time are preserved, where the time-out $t$ and $k$ are run-time parameters. Subsequently, these $k$ locally best solutions are evaluated on the entire training set (updating the aforementioned $p, n$ counts for each $t r a n s i t i o n / 3$ fact in these ASA). If an ASA $A^{'}$ is found after this process has a better global performance than $A$ , it replaces $A$ as the new best ASA. The process is repeated for a given number of iterations, by shuffling and re-partitioning the training set into new mini-batches at the beginning of each iteration. The current best ASA that results from this process is returned in the output.

Method

F_{1}

-score #States #Transitions Time (min) Bio Small-U RPNI 0.702 13 292 0.05 EDSM 0.722 12 278 0.05 DISC 0.833 10 13 51

A S A L_{c l a s s i c}^{b a t c h}

0.887 3 35 1 (time-out)

A S A L_{c l a s s i c}^{i n c r}

0.882 3 41 0.566 Bio Large-U

A S A L_{c l a s s i c}^{i n c r}

0.858 3 57 13 Bio Small

A S A L_{c l a s s i c}^{b a t c h}

0.902 3 53 5 (time-out)

A S A L_{c l a s s i c}^{i n c r}

0.889 3 60 2.7

A S A L_{s y m b}^{b a t c h}

0.968 3 5 5 (time-out)

A S A L_{s y m b}^{i n c r}

0.924 3 7 3.4 Bio Large

A S A L_{c l a s s i c}^{i n c r}

0.852 3 12 25

A S A L_{s y m b}^{i n c r}

0.942 3 8 39 Maritime

A S A L_{c l a s s i c}^{i n c r}

0.892 3 15 6

A S A L_{s y m b}^{i n c r}

0.952 3 8 18

Table 3: Experimental results

7 Experimental evaluation

We empirically assess ASAL on two CER datasets from the domains of precision medicine and maritime monitoring. The former has already been outlined in Section 3. It consists of 644 time series, each containing three signals related to the alive, necrotic and apoptotic attributes, of length 49 each. The positive class (interesting simulations) amounts to the $20 %$ of the data. In addition to this dataset, to which we refer as Bio-small, in order to test the scalability of the incremental version of ASAL, we also used a significantly larger dataset with the same characteristics, but with 50K training simulations.

The maritime dataset contains data from nine vessels that cruised around the port of Brest, France. There are five attribute signals for longitude, latitude, speed, heading, and course over ground. The positive class is related to whether a vessel eventually enters the port of Brest and there are 2,980 negative and 2,269 positive examples, a total of 5,249 multivariate examples, each of length 30. The maritime dataset has been previously used in CER research, we therefore refer to [2] for more details. Both datasets were discretized using SAX with ten bins.

We compared the following algorithms: (a) $A S A L_{c l a s s i c}^{b a t c h}$ and $A S A L_{c l a s s i c}^{i n c r}$ , the batch and incremental version of ASAL that learns ASA that resemble classical automata, by using no transition feature other that equality, i.e. an attribute having a particular value found in the data. This is similar to using a symbol for each attribute-value. This version of ASAL was evaluated to assess the merits of using relational transition features; (b) $A S A L_{s y m b}^{b a t c h}$ and $A S A L_{s y m b}^{i n c r}$ , the batch and incremental versions respectively of ASAL, as described in the previous sections. These were used with five predefined transition features, similar to those presented in Figure 2; (c) RPNI [28] and an improved version thereof, EDSM [21], two widely used algorithms of the state-merging (SM) family, that compress a PTA (see Section 3) using greedy heuristics. Note that although these algorithms are quite old, the main ideas behind them are the SoA in SM-style learning and their $\mathsf{LearrnLib}\lx@note{footnote}{\url{https://learnlib.de/}}$⁴⁴4https://learnlib.de/ implementation used in the experiments is extremely efficient and frequently used by practitioners; (d) DISC [31], a recent algorithm that translates the PTA compression problem into a Mixed Integer Linear Programming problem, which it delegates to the off-the-shelf Gurobi solver. DISC is similar to ASAL in the sense that it is able to learn optimal automata, given enough time and memory.

The experimental setting was a five-fold cross validation with $80 / 20$ training/testing set splits. The experiments were carried-out on a 3.6GHz processor (4 cores, 8 threads) and 16GB of RAM. Whenever $A S A L_{c l a s s i c}^{b a t c h}$ was used it was given a timeout (maximum allowed time), in order to obtain a solution in a feasible amount of time.

The results are presented in Table 3 in the form of average testing set $F_{1}$ -scores, number of states and transitions, as well as training times for the algorithms compared. Note that RPNI, EDSM and DISC deal with single strings and cannot handle multivariate input. To compare to these algorithms we used a univariate version of the bio dataset, which contains the alive attribute only, as it is informative enough to learn a useful model. No such attribute has this property in the maritime dataset, we therefore did not experiment with RPNI, EDSM and DISC on it. Moreover, only the “classic” version of ASAL was used on the univarate bio dataset, since there are no cross-feature relations that could be captured by transition features in the symbolic version.

The first block in Table 3 concerns the small version of the univariate dataset. It can be seen that RPNI and EDSM are lightning-fast, learning a model in approx. three secs. On the other hand, they have significantly inferior predictive performance as compared to all other algorithms and they are significantly more complicated, as indicated by their size. DISC has a better $F_{1}$ -score and learns slightly simpler models, as indicated by its states number (the reduced number of transitions for DISC is misleading, since DISC omits self loops). On the other hand, it takes a little less than an hour on average to train. $A S A L_{c l a s s i c}^{b a t c h}$ has the best predictive performance, achieved within 1 min. Its incremental version closely follows in predictive performance in almost half the time.

In the large version of the univiariate bio dataset (second block in Table 3) $A S A L_{c l a s s i c}^{i n c r}$ was the only usable algorithm, since RPNI, EDSM and DISC terminated with memory errors, due to the size of the dataset. In contrast, thanks to its incremental learning strategy that never loads the entirety of the training data into memory, ASAL was able to learn a good model in approx. 13 minutes.

The results from the small version of the full bio dataset (containing all the features) in the next block in Table 3), highlight the advantages of learning symbolic automata. Indeed, both the batch and the incremental version of ASAL achieve significantly better results than the classical version of ASAL and learn automata with much fewer transitions. Note that the $F_{1}$ -scores of the two versions of the batch ASAL version were achieved with the same time-out value and the incremental versions of ASAL have comparative training times. Therefore, the symbolic version learns better models without significantly compromising efficiency. The results from the large bio dataset (full, all features used) and the maritime dataset also seem to confirm this claim.

8 Conclusions and Future Work

We presented a methodology for learning symbolic automata where the transition guards are defined via ASP rules, and evaluated our approach on two CER datasets, demonstrating its efficacy. There are several directions for future work, including a more thorough experimental assessement on more datasets and settings, a formal characterization of the expressive power of ASA in relation to common event algebras used in CER, scalability improvements, e.g. via symmetry breaking, and jointly learning the transition features definitions, along with the automaton.

Acknowledgements

This work is supported by the project entitled “ARIADNE - AI Aided D-band Network for 5G Long Term Evolution”, which has received funding from the European Union’s Horizon 2020 research & innovation programme under grant agreement No 871464, and by the project entitled “INFORE: Interactive Extreme-Scale Analytics and Forecasting”, which has received funding from the European Union’s Horizon 2020 research & innovation programme under grant agreement No 825070.

References

[1] J. Agrawal, Y. Diao, D. Gyllstrom, and N. Immerman (2008) Efficient pattern matching over event streams. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pp. 147–160. Cited by: §1.
[2] E. Alevizos, A. Artikis, and G. Paliouras (2022) Complex event forecasting with prediction suffix trees. The VLDB Journal 31 (1), pp. 157–180. Cited by: §1, §1, §7.
[3] E. Alevizos, A. Skarlatidis, A. Artikis, and G. Paliouras (2017) Probabilistic complex event recognition: a survey. ACM Computing Surveys (CSUR) 50 (5), pp. 1–31. Cited by: §1.
[4] D. Angluin, S. Eisenstat, and D. Fisman (2015) Learning regular languages via alternating automata. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §2.
[5] D. Angluin (1987) Learning regular sets from queries and counterexamples. Information and computation 75 (2), pp. 87–106. Cited by: §2.
[6] G. Argyros and L. D’Antoni (2018) The learnability of symbolic automata. In International Conference on Computer Aided Verification, pp. 427–445. Cited by: §2.
[7] G. Cugola and A. Margara (2010) TESLA: a formally defined event specification language. In Proceedings of the Fourth ACM International Conference on Distributed Event-Based Systems, pp. 50–61. Cited by: §1.
[8] G. Cugola and A. Margara (2012) Processing flows of information: from data stream to complex event processing. ACM Computing Surveys (CSUR) 44 (3), pp. 15. Cited by: §1.
[9] L. D’Antoni and M. Veanes (2017) The power of symbolic automata and transducers. In International Conference on Computer Aided Verification, pp. 47–67. Cited by: §1.
[10] C. De la Higuera (2010) Grammatical inference: learning automata and grammars. Cambridge University Press. Cited by: §2.
[11] A. Demers, J. Gehrke, M. Hong, M. Riedewald, and W. White (2006) Towards expressive publish/subscribe systems. In International Conference on Extending Database Technology, pp. 627–644. Cited by: §1.
[12] A. J. Demers, J. Gehrke, B. Panda, M. Riedewald, V. Sharma, W. M. White, et al. (2007) Cayuga: a general purpose event monitoring system.. In Cidr, Vol. 7, pp. 412–422. Cited by: §1.
[13] Y. Diao, N. Immerman, and D. Gyllstrom (2007) Sase+: an agile language for kleene closure over event streams. UMass Technical Report. Cited by: §1.
[14] S. Drews and L. D’Antoni (2017) Learning symbolic automata. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems, pp. 173–189. Cited by: §2.
[15] D. Fisman, H. Frenkel, and S. Zilles (2021) Inferring symbolic automata. arXiv preprint arXiv:2112.14252. Cited by: §2.
[16] D. Furelos-Blanco, M. Law, A. Jonsson, K. Broda, and A. Russo (2021) Induction and exploitation of subgoal automata for reinforcement learning. Journal of Artificial Intelligence Research 70, pp. 1031–1116. Cited by: §2, §2.
[17] G. Giantamidis, S. Tripakis, and S. Basagiannis (2021) Learning moore machines from input–output traces. International Journal on Software Tools for Technology Transfer 23 (1), pp. 1–29. Cited by: §2.
[18] N. Giatrakos, E. Alevizos, A. Artikis, A. Deligiannakis, and M. N. Garofalakis (2020) Complex event recognition in the big data era: a survey. VLDB J. 29 (1), pp. 313–352. Cited by: §1, §1, §5.
[19] E. M. Gold (1967) Language identification in the limit. Information and control 10 (5), pp. 447–474. Cited by: §2.
[20] A. Grez, C. Riveros, and M. Ugarte (2019) A formal framework for complex event processing. In 22nd International Conference on Database Theory (ICDT 2019), Cited by: §1.
[21] K. J. Lang, B. A. Pearlmutter, and R. A. Price (1998) Results of the abbadingo one dfa learning competition and a new evidence-driven state merging algorithm. In International Colloquium on Grammatical Inference, pp. 1–12. Cited by: §2, §7.
[22] G. Letort, A. Montagud, G. Stoll, R. Heiland, E. Barillot, P. Macklin, A. Zinovyev, and L. Calzone (2019) PhysiBoSS: a multi-scale agent-based modelling framework integrating physical dimension and cell signalling. Bioinformatics 35 (7), pp. 1188–1196. Cited by: Example 1.
[23] V. Lifschitz (2019) Answer set programming. Springer. Cited by: §3.
[24] J. Lin, E. Keogh, S. Lonardi, and B. Chiu (2003) A symbolic representation of time series, with implications for streaming algorithms. In Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, pp. 2–11. Cited by: §4.
[25] J. Lin, E. Keogh, L. Wei, and S. Lonardi (2007) Experiencing sax: a novel symbolic representation of time series. Data Mining and knowledge discovery 15 (2), pp. 107–144. Cited by: §4.
[26] O. Maler and I. Mens (2017) A generic algorithm for learning symbolic automata from membership queries. In Models, Algorithms, Logics and Tools, pp. 146–169. Cited by: §2.
[27] S. H. Muggleton, D. Lin, N. Pahlavi, and A. Tamaddoni-Nezhad (2014) Meta-interpretive learning: application to grammatical inference. Machine learning 94 (1), pp. 25–49. Cited by: §2.
[28] J. Oncina and P. Garcia (1992) Identifying regular languages in polynomial time. In Advances in structural and syntactic pattern recognition, pp. 99–108. Cited by: §2, §7.
[29] P. R. Pietzuch, B. Shand, and J. Bacon (2003) A framework for event composition in distributed systems. In ACM/IFIP/USENIX International Conference on Distributed Systems Platforms and Open Distributed Processing, pp. 62–82. Cited by: §1.
[30] N. P. Schultz-Møller, M. Migliavacca, and P. Pietzuch (2009) Distributed complex event processing with query rewriting. In Proceedings of the Third ACM International Conference on Distributed Event-Based Systems, pp. 1–12. Cited by: §1.
[31] M. Shvo, A. C. Li, R. T. Icarte, and S. A. McIlraith (2021) Interpretable sequence classification via discrete optimization. In Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI), pp. 9647–9656. Cited by: §2, §7.
[32] R. Smetsers, P. Fiterău-Broştean, and F. Vaandrager (2018) Model learning as a satisfiability modulo theories problem. In International Conference on Language and Automata Theory and Applications, pp. 182–194. Cited by: §2.
[33] V. Ulyantsev, I. Zakirzyanov, and A. Shalyto (2015) BFS-based symmetry breaking predicates for dfa identification. In International Conference on Language and Automata Theory and Applications, pp. 611–622. Cited by: §2.
[34] W. Wieczorek (2017) Grammatical inference. Springer. Cited by: §2.
[35] E. Wu, Y. Diao, and S. Rizvi (2006) High-performance complex event processing over streams. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pp. 407–418. Cited by: §1.
[36] H. Zhang, Y. Diao, and N. Immerman (2014) On complexity and optimization of expensive queries in complex event processing. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pp. 217–228. Cited by: §1.