Itemset Utility Maximization with Correlation Measure

Jiahui Chen    Yixin Xu    Shicheng Wan    Wensheng Gan*    and Jerry Chun-Wei Lin     \IEEEmembershipSenior Member, IEEE This research was supported in part by the National Natural Science Foundation of China (Grant Nos. 61902079 and 62002136), Guangzhou Basic and Applied Basic Research Foundation (Grant Nos. 202102020928 and 202102020277), National Key-Research and Development Program of China (Grant No. 2020YFB2104003), and the Young Scholar Program of Pazhou Lab (Grant No. PZL2021KF0023).Jiahui Chen, Yixin Xu, and Shicheng Wan are with the School of Computer Science and Technology, Guangdong University of Technology, Guangzhou 510006, China. (E-mail: csjhchen@gmail.com, yxxu1998@gmail.com, and scwan1998@gmail.com)Wensheng Gan is with the College of Cyber Security, Jinan University, Guangzhou 510632, China; and also with Pazhou Lab, Guangzhou 510330, China. (E-mail: wsgan001@gmain.com)Jerry Chun-Wei Lin is with the Department of Computer Science, Electrical Engineering and Mathematical Sciences, Western Norway University of Applied Sciences, Bergen, Norway. (E-mail: jerrylin@ieee.org)Corresponding author: Wensheng Gan
Abstract

As an important data mining technology, high utility itemset mining (HUIM) is used to find out interesting but hidden information (e.g., profit and risk). HUIM has been widely applied in many application scenarios, such as market analysis, medical detection, and web click stream analysis. However, most previous HUIM approaches often ignore the relationship between items in an itemset. Therefore, many irrelevant combinations (e.g., {gold, apple} and {notebook, book}) are discovered in HUIM. To address this limitation, many algorithms have been proposed to mine correlated high utility itemsets (CoHUIs). In this paper, we propose a novel algorithm called the Itemset Utility Maximization with Correlation Measure (CoIUM), which considers both a strong correlation and the profitable values of the items. Besides, the novel algorithm adopts a database projection mechanism to reduce the cost of database scanning. Moreover, two upper bounds and four pruning strategies are utilized to effectively prune the search space. And a concise array-based structure named utility-bin is used to calculate and store the adopted upper bounds in linear time and space. Finally, extensive experimental results on dense and sparse datasets demonstrate that CoIUM significantly outperforms the state-of-the-art algorithms in terms of runtime and memory consumption.

{IEEEImpStatement}

This article contributes to a utility-based correlated pattern discovery method for transaction databases. It studies the relationship between consisted items of a profitable itemset. Compared with the state-of-the-art algorithms, the designed CoIUM method has greatly improved the performance in terms of execution time and memory consumption. The proposed method is good at dealing with massive database, especially dense databases, which is very beneficial to dealing with commodity sales in real life. Based on large-scale testing, it is found that this method has good scalability in terms of runtime and memory consumption, which indicates that CoIUM is suitable for different types of databases (include sparse and dense). Since the traditional high utility itemset mining has been successfully applied in many applications and various domains, this correlated utility mining problem formulation and efficient algorithm can be applied in many applications, such as product recommendation, market basket analysis, e-learning, text mining and web click stream analysis.

{IEEEkeywords}

Data mining, correlated itemset, high utility itemset, upper bounds.

\IEEEpeerreviewmaketitle

1 Introduction

Data mining techniques (e.g., pattern mining [10, 15]) play a vital role in many real-life applications, such as market analysis [1] and website click stream analysis [8]. In previous studies on data mining, a fundamental task called frequent itemset mining (FIM) [2, 18, 25] is used to discover itemsets that frequently occur in a transaction database. FP-Growth [18] is one of the most famous FIM algorithms. It maintains the original database in an FP-tree structure [18]. However, the limitation of traditional FIM algorithms is that different items have identical importance (e.g., unit profit or weight). For instance, the profit of a piece of bread is equal to that of a diamond in the FIM domain. This assumption is unrealistic in real business life.

To address the above limitations, a new framework known as high utility itemset mining (HUIM) [4, 16, 23] was proposed. Compared to FIM, HUIM considers both the quantity and profit of an item. An itemset is referred to as a high utility itemset (HUI) if its utility value is higher than or equal to the user-specified minUtil. HUIM has attracted considerable attention in recent decades because discovering profitable patterns is more valuable than discovering frequent patterns for enterprises. In FIM, the support of an item has an anti-monotonic property, which means that the support of an itemset is not less than that of its supersets. This property can efficiently reduce the search space because supersets of infrequent itemsets are not considered. However, the utility of an item in HUIM does not hold this property. For example, in a supermarket, assume that the amount of the daily sales of {bread} is $100, but the amount of daily sales of its supersets (i.e., {bread,milk}) may be less than, equal to, or higher than $100. This indicates that the utility is neither monotonic nor anti-monotonic. To address this issue, transaction-weighted utilization (abbreviated as TWU), which has the anti-monotonic property, was proposed by Liu et al. [24]. Based on the TWU measure, HUIM can prune the search space effectively. After that, a lot of HUIM algorithms [23, 13, 37] adopted TWU to improve their performance. In general, HUIM approaches can be summarized into two types: two-phase and single-phase algorithms. The two-phase algorithms are characterized by discovering HUIs in two steps. In the first step, numerous candidates are generated by overestimating the utility of itemsets. Then, they calculate the real utility of candidates to filter out HUIs. These two-phase algorithms can discover a complete set of HUIs, there are two shortcomings that cannot be ignored: 1) generating many candidates in the first phase; and 2) repeatedly scanning the database. As a result, two-phase algorithms suffer from long execution times and memory consumption. To solve these issues, the first single-phase HUIM algorithm called HUI-Miner [23] was proposed. Compared to the two-phase algorithms, the single-phase algorithms do not need to repeatedly scan the database. The existing single-phase algorithms include: FHM [11], EFIM [43], MEFIM [28], etc.

However, a key disadvantage of HUIM algorithms is that they usually ignore the relationships between items. This case will usually reveal numerous irrelevant items in high-utility itemsets. For example, in market analysis, assume that {notebook, book} and {phone, TV} are high utility patterns. Obviously, these products (i.e., notebook, book, phone) are weakly correlated. Instead, in real life, retailers often hope to earn more revenue by making related products into product combinations, such as {milk, bread}, {gold, diamond}, and {computer, mouse}. In essence, these products have a strong connection because they often meet the needs of most customers. To discover more correlated high utility itemsets (CoHUIs), some interesting measures such as Kulc [41], bond [30] and coherence [6] were proposed as standards. Based on the definition of Kulc, many algorithms like CoHUIM [14], CoUPM [13] and ECoHUPM [34] were adopted to discover CoHUIs. These algorithms improved the mining performance by using novel data structures and pruning strategies. Recently, Vo et al. [38] presented the CoHUI-Miner algorithm. They used projection techniques to reduce the expense of database scanning and utilized several pruning strategies to mine CoHUIs effectively. However, CoHUI-Miner only used loose upper bound (TWU and remaining utility) to prune the search space, which led to numerous unpromising candidates being generated. Therefore, it suffered in terms of runtime and memory consumption.

To this end, we propose an efficient projection-based algorithm, namely Itemset Utility Maximization with Correlation Measure (CoIUM). Compared to other state-of-the-art algorithms, CoIUM can discover a complete set of CoHUIs more effectively with less execution time and memory consumption. The main contributions of our work are as follows:

  • We put forward a single-phase algorithm which utilizes two tighter upper-bounds to discover all CoHUIs without candidate generation. Also, the utility-bin structure [43] is adopted to calculate and store the upper bounds of itemsets.

  • The CoIUM algorithm uses a projection technique that can effectively prune the search space.

  • The CoIUM algorithm adopts a measure called as the correlation factor. It is used as a metric to distinguish whether items are highly correlated.

  • Massive experiments are conducted on several datasets to evaluate the performance of CoIUM. We also compare the performance of CoIUM with the state-of-the-art algorithms. The results show that the CoIUM is the most efficient in terms of runtime and memory consumption.

The rest of this paper is organized as follows. Some related works about HUIM and CoHUIM are introduced in Section 2, respectively. In Section 3, we introduce the preliminaries and problem statement of correlated high-utility itemset mining. In addition, a novel CoIUM algorithm is proposed in Section 4. Several experimental results are presented in Section 5. Finally, the conclusion and future work are presented in Section 6.

2 Related Work

2.1 High Utility Itemset Mining

A key difference between high utility itemset mining (HUIM) [16, 23, 31, 29] and frequent itemset mining (FIM) [18, 25] is that frequency has the anti-monotonic property, but utility does not. Hence, if itemset is unpromising, the HUIM cannot easily filter out all its supersets in advance. To address this limitation, transaction-weight utilization (TWU), which is anti-monotonic, was first proposed in Two-Phase [24]. Based on the TWU measure, if TWU of a subset is lower than a given minUtil threshold, all its supersets are not HUIs [24]. Therefore, almost HUIM algorithms reduce the search space by using the TWU measure. However, TWU is a loose upper bound for eliminating candidates. Thus, this measure always overestimates the utilities of itemsets, which may result in massive candidates. At the same time, IHUP [4] constructed a tree structure to discover HUIs. Compared with Two-Phase, this new structure can save more runtime and memory. Thus, the experimental results showed that IHUP was about three times faster than Two-Phase in runtime. However, the upper bound of IHUP was not tight, which may have retained numerous unpromising candidates. To reduce the generation of unpromising candidates and identify HUIs effectively, Tseng et al. [37] proposed UP-Growth. It also constructed a pattern tree to extract HUIs and utilized several novel strategies to discover HUIs effectively. UP-Growth can reduce the numerous unpromising candidates and enhance mining performance, so it outperformed IHUP in terms of runtime and memory.

The above-mentioned algorithms are two-phase algorithms, and they face the problems of generating numerous unpromising candidates and scanning the database at least twice. To address these shortcomings, a single-phase approach was proposed to discover HUIs. As a single-phase approach, HUI-Miner [23] algorithm utilized a new data structure called utility-list, which can retain key information about HUIs without scanning the database multiple times. HUI-Miner used a depth-first search to discover all HUIs without omission, and constructed the utility-list of each itemset by joining the utility-lists of its subsets. HUI-Miner can reduce numerous candidates because it uses the remaining utility, which is more compact than TWU. As for performance, HUI-Miner is superior to all two-phase algorithms. Meanwhile, another algorithm called HUP was presented by Liu et al. [22]. HUI-Miner and HUP were proposed at the same time, but their key pruning strategies are conceptually equivalent. However, using join operations to create utility-lists is time- and memory-consuming. To reduce the expense of join operations, the FHM algorithm [11] used the EUCP strategy to effectively reduce the generation of join operations. The experiments showed that the performance of FHM is up to 6 times faster than that of HUI-Miner. Then, an algorithm called IMHUP was proposed [32], and utilized an indexed list-based structure to discover HUIs without requiring numerous comparison operations. Based on this structure, two techniques were developed to improve mining performance. In addition, ULB-Miner [9] utilized a utility-list buffer structure to store information and introduced a new index segment to access the stored data. Thus, the performance of ULB-Miner is better than that of HUI-Miner and FHM. EFIM [43] utilized two tighter upper bounds to discover the HUIs effectively. Furthermore, it utilized two new techniques (projection technique and transaction merging) to deal with long transactions. The experiments showed that the performance of EFIM is faster than that of HUI-Miner and FHM on spare and dense databases. The above algorithms are devised under the condition that each item in the database has a different positive utility. Then, the FHN algorithm [20] can discover the HUIs with negative unit profit, and Gan et al. [17] proposed a top- HUIM algorithm to solve the threshold setting issue. To summarize, many utility mining algorithms have been developed to solve practical problems, including on-shelf utility mining [7], and high average-utility itemset mining [36, 39]. Up until now, a lot of HUIM algorithms with various data have been developed, such as high utility association rules [26], HUIM with fuzzy itemsets [40], targeted utility querying [27], HUIM on massive data [19], HUIM on noisy data [5], and HUIM on stream data [35].

2.2 Correlated High Utility Itemset Mining

Most HUIM algorithms aim to discover a complete set of HUIs, but do not consider the relationships between items in an itemset. To find potential and strong affinity patterns, WIP [42] was proposed, which used a weighted confidence measure to generate strong affinity patterns. Besides, there are three other interesting measures for correlation: any-confidence, all-confidence, and bond [3]. These measures are useful in discovering the degree of connection between items. However, these measures cannot be properly applied to databases that contain numerous empty transactions. To solve this problem, a null-transaction invariant [41] measure was proposed. It was a credible measure because it did not consider how many null-transactions appeared in the database. There are five null-(transaction) invariant measures: all-confidence [30], coherence [30], cosine [41], kulczynsky [41], and max-confidence [30]. Compared with previous studies on traditional data mining approaches, the proposed affinity/correlation patterns can mine more useful information efficiently.

To discover more interesting information, Ahmed et al. [3] first proposed a high-utility itemset mining algorithm to discover HUIs with frequency affinity. The HUIPM algorithm utilized a tree structure UFTA to store HUIs and used a pruning strategy to reduce the generation of unpromising candidates. However, the limitation of HUIPM is that it is time-consuming because it has to establish tree nodes and generate lots of candidates. Thus, Lin et al. [21] presented a single-phase algorithm known as FDHUP which uses two data structures called EI-table and FU-tree, to discover correlated itemsets efficiently. The FDHUP utilized three pruning strategies, which were based on RAU (remaining affinitive utility) to prune the candidates. As a result, FDHUP performed better than HUIPM in terms of runtime and memory. FCHM [12] also discovered the correlated high utility itemsets. It has two versions, FCHM depended on bond measure and FCHM relied on all-confidence measure. The algorithm used two structures, utility-list and EUCS, to calculate the TWU values. Then, Gan et al. [14] proposed CoHUIM. This algorithm used a projection mechanism to improve the performance of the scanning database and utilized the property of Kulc to prune the unpromising candidates. They also developed the CoUPM algorithm [13], which utilized utility-list structure to store promising itemsets and used Kulc to calculate the support of itemsets. The algorithm also utilized some properties of utility and correlation to effectively explore the search space. Then, Rashad et al. [34] proposed the ECoHUPM algorithm to evaluate the correlation of HUIs. This algorithm used a new data structure named CoUTlist to store utility and support for each itemset. It also introduced three new pruning properties to prune unpromising candidates. Recently, Vo et al. [38] introduced CoHUI-Miner to extract correlated items that provide numerous benefits. This algorithm applied projection technique to reduce the expense of scanning the database and applied the prefix utility of projection technique to compute the utility value. The experiments showed that it was efficient, especially on dense databases.

3 Preliminaries and Problem Formulation

In this section, we present the basic definitions and problem formulation. More definition details can be referred to previous work [38, 13, 43]. Let = {, , , } be a set of distinct items in a transaction database = {, , , }, where is the number of transactions in . An -itemset is a subset of which contains different items. Each transaction consists of some distinct items and has a unique identifier of called Tid. Each item owns an external utility (e.g., unit profit) that is represented by . And each item in transaction has an internal utility (e.g., purchase quantity) which denoted as . In addition, Table 1 is a simple transaction database (the third column is transaction utility) which consists of six distinct items (i.e., , , , , , and ), and Table 2 shows the external utilities of all items.

Tid Transaction Utility
(,1) (,2) (,2) (,3) $19
(,2) (,1) (,3) (,2) $18
(,4) (,3) (,2) $23
(,3) (,2) (,3) $17
(,3) (,4) (,5) $25
(,2) (,3) (,3) (,5) $21
(,1) (,2) (,2) (,3) (,5) (,2) $31
(,2) (,3) (,1) $13
Table 1: A transaction database
Definition 1 (Utility of an itemset)

In a transaction , the utility of an item is defined as = , where . Similarly, the utility of an itemset is denoted as = . Furthermore, in database , the utility of an itemset is denoted as = .

For example, the utility of item in transaction is $4 2 = $8. The utility of itemset in transaction is = + = $8 + $2 = $10. At the same time, = + + = $8 + $10 + $8 = $26.

Item
Utility($) 4 2 1 3 2 1
Table 2: External utility of each item
Definition 2 (Utility of database)

The utility of a transaction is defined as = , where is the -th item in . Obviously, the total utility of a database is the summation of all transactions it contains, where TU = .

For example, the utility of is = + + + = $4 + $4 + $2 + $9 = $19. Then, we can calculate the utility of other seven transactions with the same manner. And the total utility of database is $167. Based on the transaction-weighted utilization [24], we have TWU() = + + = $19 + $18 + $31 = $68. Table 3 lists TWU of all 1-itemsets.

Item
TWU $108 $114 $149 $109 $85 $87
Support 5 5 7 5 4 4
Table 3: Support and TWU of all 1-itemsets

There are many measures [3] used to evaluate the correlation between different items, but some of them will be affected by empty transactions. In this paper, we introduce the Kulczynsky measure [41] (abbreviated as Kulc), which is not affected by null transactions.

Definition 3 (Kulc)

The frequency of an itemset in database is the amount of occurs. We use sup() to represent the frequency of . For an -itemset = , its Kulc is defined as follows: Kulc() = .

For example, Kulc({}) = () = = 0.6. Obviously, the range of Kulc is [0,1] [14]. In fact, Kulc represents the average of frequency, so if Kulc() becomes high, related items in appear frequently. Compared to other correlation measures, Kulc holds the null (transaction)-invariant property. It is more suitable for applications in different fields because it is independent on database size [14].

Definition 4 (Correlated high-utility itemset)

An itemset can be assumed as a correlated high utility itemset (CoHUI) when it meets two conditions: (1) minUtil TU; (2) Kulc() minCor, where minCor is a user-defined minimum correlation threshold.

For example, in a database, minUtil and minCor are set to 0.1 and 0.7, respectively. The itemset is an HUI because the utility of is not less than minUtil TU ( = $26 minUtil TU = 0.1 $168 = $16.8), but it is not a CoHUI since its Kulc is less than minCor (Kulc({}) = 0.6 0.7).

Problem formulation: In a transaction database, given two user-specified minimum thresholds (minUtil and minCor), the purpose of CoIUM is to find all CoHUIs that their utilities are greater than minUtil TU and their Kulc is higher than minCor.

4 Proposed CoIUM Algorithm

In this section, we present a single-phase algorithm to extract all CoHUIs. Subsection 4.1 discusses several upper bounds and pruning strategies. Subsection 4.2 introduces an effective projection technology to reduce the cost of database scanning. Subsection 4.3 introduces the pseudocode of the CoIUM algorithm. In Subsection 4.4, we use a detailed example to discuss how CoIUM works.

4.1 Several Strategies for Pruning Search Space

The proposed CoIUM employs a depth-first method to explore the search space. Given a set of distinct items , then the search space () can be represented as a set-enumeration tree [33]. The algorithm starts with an empty set. To discover all CoHUIs of itemsets , CoIUM recursively appends its extension to through the order, and this makes becomes larger. We assume the items = {, , , , , } and the total order is TWU ascendant order for the running example. Thus, the order is { }.

Property 1

Given an itemset , TWU of is always higher than or equal to its utility (TWU() ).

Strategy 1

Based on Property 1, for any itemset , if TWU() minUtil TU, then all supersets of are low-utility itemsets. Thus, we can remove the low utility itemsets from the search space. Proof details are provided in study [24].

Definition 5 (Extension of an itemset [43])

Consider an itemset , and () is represented as its extension according to the order. E() = { , }. Given an -itemset , we denote its extensions containing items as -extension of .

For instance, consider the database of our running example and an itemset = {}. The 1-extension of are {}, {}, {}, and the itemset {} is called a 2-extension of {}. To prune more unpromising candidates, it is important to use a more compact upper bound to limit the production of unpromising candidates. We present two upper bounds (TWU and re) that have been used in previous work, and introduce two more compact upper bounds (su and lu) used in this work. Using different strategies based on these upper bounds can prune different numbers of candidates. In addition, we introduce the property of Kulc.

Definition 6 (Remaining utility [23])

Given an itemset in transaction . We define as a total order on items from . The remaining utility of in is denoted as = .

For example, we select an itemset from transaction . Based on TWU ascendant order, items {} and {} appear after itemset . The remaining utility of is = + = $4 + $2 = $6.

Definition 7 (Local utility [43] and subtree utility [43])

Given an item and an itemset , where . The local utility of w.r.t. is defined as = . The subtree utility of and is defined as = + .

For example, we select = {}. The local utility of and extension item is = + + + = $19 + $19 = $38. The subtree utility of and extension item is = ( + + $0) + ( + + $0) + ( + + $0) + ( + + $0) = [$4 + $2] + [$16 + $3] + [$12 + $2] + [$4 + $2] = $45.

Property 2

Given an item and itemset , where . We get relationship , where = .

Strategy 2

Based on Property 2, given an item and itemset , where . If minUtil TU, all extensions of containing are low-utility in a subtree. This means that the item should be removed in the search space of . Proof details are provided in study [43].

For example, if the algorithm discovers itemset {} having local utility value is less than minUtil TU, all extensions of {}, e.g., {} and {}, can be ignored.

Property 3

Given an item and itemset , where . The relationship always holds on. It also applies to which is an extension of , that is .

Strategy 3

Based on Property 3, given an item and itemset , where . If minUtil TU, all extensions of containing are low-utility. In the set-enumeration tree, this means that it is unnecessary to explore and all of its supersets. Proof details are provided in study [43].

Based on Strategy 2, the unpromising itemsets will be filtered out from the subtrees of an item . To reduce the cost of exploring search space, Strategy 3 is proposed to identify whole subtrees of which can be extended.

Definition 8 (Primary and secondary items [43])

Given an item and itemset , where . The items of are given as Primary() = { Secondary() minUtil TU}. The secondary items of are given as Secondary() = { minUtil TU}. Note that Primary() Secondary().

Consider the running example and = {}. Primary() = , Secondary() = . In other words, the prefixes and should be explored. Furthermore, only the items {} should be considered as its extensions. To quickly compute the utility of the upper bounds at any time, we use a utility-bin array structure called UA. This structure is initialized with 0 before calculation and can be reused, which greatly reduces memory consumption.

Definition 9 (Utility-bin array [43])

From database , there is a set of items . The utility value of the item is stored in the array UA[], and the length of UA is .

Using UA to calculate lu(). First, we initialize the UA by filling all elements with 0. Second, the utility-bin array UA[] is updated as UA[] = UA[] + + , where . After database scanning, , UA[] = lu(,).

Using UA to calculate su(). First, we initialize the UA by filling all elements with 0. Second, the utility-bin array UA[] is updated as UA[] = UA[] + , where . After database scanning, , UA[] = su(,).

The above pruning strategies are based on the upper bounds to prune the space. Next, we will introduce the relevant strategies on the correlation.

Property 4

For any itemset , assume is an -extension itemset of , where . If is a related pattern, its extension is also a related pattern. The measure is anti-monotonic: Kulc() Kulc().

Strategy 4

Based on Property 4, assume that the symbol is sorted in TWU ascending order. When utilizing the depth-first search strategy in the search space, if the Kulc value of any itemset is less than the minCor value, any of its extension will be regarded as unpromising and irrelevant itemset. They are not CoHUIs which will be ignored. Proof details are provided in the study [13].

For example, set minCor = 0.7, but the Kulc of an itemset is 0.6. Thus, all its supersets cannot be CoHUIs, and we can remove them from the search space. This anti-monotonic property of Kulc is also known as the sorted downward closure property. In other words, if an itemset is not a strongly related itemset, all its supersets will also not be strongly correlated itemsets.

4.2 Projection Technology

The CoIUM algorithm employs a projection technique to reduce the memory overhead. When an itemset is considered extensible, the algorithm only needs to calculate the utility of . And other items which are not belonged to will be pruned. The database without these items () can be called projected database. Note that all items are sorted based on the TWU ascendant order before creating the projected database. The details of the projection mechanism are described below.

Definition 10 (Projected database [43])

The order is defined as the lexicographical order when reading all transactions from back to frond. Given an itemset and a database . The projection of is denoted as -, where - = {--}. For any itemset , the projected transaction is denoted as -, where - = {}. The prefix utility of the projected transaction is defined as (-) = .

Let us consider two transactions: = {, , , } and = {, , , }, and there are four situations of . Note that all items are sorted based on TWU ascendant order. First, if two transaction have same items, when the TID of is greater than that of . Second, we compare the last item between and . If , when . Third, if the current items are equal after comparing, identify the previous items, when (0 min()), . Finally, .

Database projection is a very important technology for reducing database size. As the discovered itemsets become larger, the size of the projected database becomes smaller. The database projection used in CoIUM performs the following. It needs to sort all items based on the TWU ascendant order. If it creates the projected transaction of , an offset pointer will be set to to determine the position of . Then, all transaction containing will be found, and they are cut into a new projected transaction staring from . Other items which do not belong to the extension of will not be considered in the new projected transaction.

For example, assume = . The projected database w.r.t. is: - = , - = . The prefix utility of the projected database in corresponding transactions is ({}-) = = $8, ({}-) = = $12. Thus, if the algorithm discovers itemset {} is a CoHUI, the utility of {} will be kept in the prefix utility. The prefix utility value of each transaction may be different. When the algorithm expands the itemset {}, it only needs to select items from the projected database {}- for calculation. Compared to the original database, the size of the projected database is smaller, so it improves the speed of mining CoHUIs. When the algorithm expands the itemset {}, it only needs to update the value in prefix utility without recalculating the utility of {}. Thus, using prefix utilities to obtain information can improve the efficiency of algorithm mining CoHUIs.

4.3 Procedures of CoIUM

Input: : transaction database; minUtil: a minimum threshold set by user, minCor: a positive correlation threshold.
Output: a set of CoHUIs.
1 initialize = ; for each item I do
2       scan all transactions, using utility-bin array to compute ; calculate the sup;
3 end for
4construct Secondary() = { I minUtil TU}; update sup and sup = {sup Secondary()}; for each item I do
5       sort Secondary() in of TWU ascending order; eliminate items Secondary from transactions; remove all empty transactions; sort all remaining transactions according to ;
6 end for
construct Primary() = { Secondary minUtil TU}; call Search (, , , Secondary(), minUtil, minCor); return CoHUIs
Algorithm 1 The CoIUM algorithm
Input: : the current itemset, -: projected database w.r.t. , Primary(): the primary items of , Secondary(): the secondary items of , minUtil, minCor.
Output: CoHUIs: a set of CoHUIs.
1 for each {1}-length item Secondary do
2       if u() minUtil TU then
3             CoHUIs ; // the of is 1 when = 1;
4       end if
5      
6 end for
7for each item  do
8       = {}; scan - and compute ; if sup() 0 then
9             compute ;
10       end if
11      if  minUtil TU and minCor then
12             CoHUIs ;
13       end if
14      create - and scan - to compute and , where Secondary(); construct Primary() = { Secondary() su(,) minUtil TU}; construct Secondary() = { Secondary() lu(,) minUtil TU}; call Search(, -, Primary(), Secondary(), minUtil, minCor);
15 end for
Algorithm 2 The Search procedure

In this subsection, we introduce the details of CoIUM. The main procedure of CoIUM is described in Algorithm 1. Here, three parameters are used as input: two minimum thresholds, minUtil and minCor, one transaction database . Firstly, CoIUM defines as an empty set in line 1. In lines 2-4, it scans the database to compute the support of all items and uses an array-based structure to compute the local utility. Note that if = is taken as the precondition, the value of local utility is equal to the TWU value. In line 5, the items whose local utility is greater than or equal to minUtil TU are selected to form the secondary set. In other words, only secondary items will be considered in extensions of . In line 6, the support of items should be updated. In lines 7-10, the algorithm sorts Secondary() based on TWU ascending order and removes unpromising items that do not belong to Secondary(). To reduce the cost of memory, it removes the empty transactions and sorts the remaining transactions based on . In line 11, CoIUM reuses the array-based structure to compute the subtree utility from the secondary items. If the subtree utility of items is no less than the minUtil TU, they are obtained as primary items. In line 12, the algorithm carries out depth-first exploration from the start of in a recursive procedure Search.

The Search procedure (Algorithm 2) takes six parameters as input: the itemset , the projected database w.r.t. , the primary and secondary items w.r.t. , the minUtil and minCor threshold. In lines 1-5, the algorithm first identifies whether {l}-length items are CoHUIs. According to Kulc, we can know that the Kulc values of {l}-length items are always equal to 1. Therefore, the algorithm only needs to compare its own utility value. In lines 6-13, the algorithm starts with item Primary(), and these items are considered as extensible items. To obtain the extension of , the algorithm recursively adds items to form a new itemset . Based on the projected database -D, the utility of is computed. If the support of itemset is no less than zero, the algorithm should compute its correlation value. Then, if the utility of is greater than minUtil TU and the value of is greater than the , can be output as CoHUIs. In lines 14-16, the algorithm creates a projected database - and scans this database to compute the local utility and subtree utility w.r.t. . The purpose is to find all the extensions of from the secondary items with respect to . Then, it puts the promising items into Primary and Secondary. In line 17, the Search procedure finds the extension of recursively by depth-first method. A complete set of CoHUIs has been output when the CoIUM algorithm terminates.

4.4 An Example of CoIUM

In this subsection, we provide a detailed example to discuss how the algorithm works. Assume the database that are displayed in Table 1 and Table 2 with minUtil = 0.2 and minCor = 0.3. In Algorithm 1, it defines as an empty set (Line 1), then computes the local utility of each item (Line 3). The result of is that = $108, = $114, = $149, = $109, = $85, = $87. The values of is equal to TWU values when is empty set. The minUtil TU is (0.2 $167 =) $33.4, so all the items can be obtained to Secondary() and Secondary() = {} (Line 5). As shown in Table 3, the support of these items are sup({}) = 5, sup({}) = 5, sup({}) = 7, sup({}) = 5, sup({}) = 4, and sup({}) = 4 (Line 3). The support of items can be seen as the number of occurrences of items. Based on TWU ascending order , Secondary() = {} (Line 8). Because no items are removed in Secondary(), the database is invariant (Line 9). The transactions are sorted in and the results are shown in Table 4 (Line 9). The result of subtree utility is: = ($4 + $4) + ($4 + $9 + $4 + $2) + ($16 + $3) + ($12 + $2) + ($4 + $9 +$4 + $2) = $79, = $4 + ($4 + $2) + ($4 + $3) + ($4 + $2) + ($6 + $4) = $33, = $2 + $3 + $2 + $2 + $3 + $2 + $4 = $18, = ($9 + $4 + $2) + ($9 + $2) + ($9 + $4 + $3) + ($9 + $4 + $2) + $25 = $82, = $18 + $31 + $23 + $13 = $85, = $12 + $21 + $17 + $21 = $71. Thus, the Primary() = {}, which means that only {} will be explored through the depth-first search.

Tid Transaction
Table 4: Sort transactions and items by and
Tid Transaction Prefix utility
$4
$4
$16
$12
$4
Table 5: The projected database on {}

In Algorithm 2, it uses the elements of Primary() to execute a depth-first search. Because = $44, = $20, = $18, = $51, = $22, = $12, only and can be output as CoHUIs. The reason is that = $44 ( minUtil TU) and Kulc({}) = 1 ( 0.3). We assume = {} (Line 7), and construct the database projection {}- in Table 5. Then, using two utility-bin arrays to compute and . The results are: = + + = $18 + $31 + $19 = $68, = $8 + $10 + $10 = $28, = + + + = $90, = $6 + $19 + $14 + $6 = $45, = + = $50, and = $38. The primary items and secondary items are updated as: Primary({})= {} and Secondary({}) = {} (Lines 15–16). This means that only the {} should be extended, so only items {} can be considered for the depth-first search. Therefore, the algorithm only needs to identify whether itemsets {}, {} and their supersets ({}, {}) are CoHUIs.

5 Experimental Study

In this section, massive experiments were presented to assess the efficiency of CoIUM. Because CoUMP and CoHUI-Miner are the state-of-the-art algorithms for correlated high-utility itemset mining, we selected CoUPM and CoHUI-Miner as the benchmark algorithms. The experiments were performed on a computer with a 3.0 GHz i7-9700 Intel Core Processor with 16 GB of main memory running on Windows 10 (64-bit operating system). All three algorithms involved were implemented by the Java language.

5.1 Data Description and Experimental Setup

We evaluated the efficiency of CoIUM algorithm on six spare and dense datasets. All datasets (BMSPOS, Chess, Mushroom, Retail, T10I4D100K and T40I10D100K) are downloaded from the SPMF website111SPMF: https://www.philippe-fournier-viger.com/spmf/. Table 6 reveals the detailed features of the experimental datasets. Note that #Trans represents the quantity of transactions, #Items represents the summation of different items, #AvgLen represents the average quantity of items, and #Type represents two different types: dense and sparse datasets.

Dataset #Trans #Items #AvgLen #Type
Chess 3,196 75 37.0 Dense
Mushroom 8,142 119 23.0 Dense
T40I10D100K 100,000 942 39.6 Dense
T10I4D100K 100,000 870 10.1 Sparse
Retail 88,162 16,470 10.3 Sparse
BMSPOS 515,366 1,656 6.51 Sparse
Table 6: Experimental dataset characteristics

To assess the efficiency of CoIUM, minCor is adjusted twice on each sparse and dense dataset. Consequently, the minCor threshold setting is as follows: (1) in Chess dataset: 0.2, 0.3; (2) in Mushroom dataset: 0.4, 0.5; (3) in T40I10D100K dataset: 0.3, 0.4; (4) in T10I4D100K dataset: 0.5, 0.6; (5) in Retail dataset: 0.6, 0.7; and (6) in BMSPOS dataset: 0.1, 0.2.

Runtime on dense datasets under changed
Figure 1: Runtime on dense datasets under changed minUtil.
Runtime on sparse datasets under changed
Figure 2: Runtime on sparse datasets under changed minUtil.
Memory on dense datasets under changed
Figure 3: Memory on dense datasets under changed minUtil.
Memory on sparse datasets under changed
Figure 4: Memory on sparse datasets under changed minUtil.

5.2 Experiments on Runtime

We compare the runtime of CoIUM with two other algorithms: CoUPM and CoHUI-Miner. The runtime of all three algorithms on dense datasets with various minCor values was shown in Fig. 1. The results indicate that CoIUM is faster than CoUPM and CoHUI-Miner on all dense datasets. For example, on the Chess dataset with minCor = 0.2, the runtime of CoIUM is less than 5 seconds, while that of CoUPM is more than 10 seconds and that of CoHUI-Miner is about 8 seconds. On the T40I10D100K dataset with minCor = 0.3, the performance of CoIUM is 2X and 3X faster than that of CoUPM and CoHUI-Miner, respectively. On the Mushroom dataset with minCor = 0.4, the performance of CoIUM is 4X and 2X faster than that of CoHUI-Miner and CoUPM, respectively. We can see that the execution time of CoUPM and CoHUI-Miner increases dramatically as the utility of each dense dataset decreases to a certain point. On the contrary, when the utility of each dense dataset decreases to a certain extent, the execution time of CoIUM remains stable within a certain range.

The runtime of all three algorithms on sparse datasets with different minUtil and various minCor values was shown in Fig. 2. The results demonstrate that CoIUM is faster than CoUPM and CoHUI-Miner on BMSPOS (up to 1.2X and 2.3X, respectively) and Retail (up to 1.5X and 2X, respectively). On the T10I4D100K dataset, it is interesting that CoIUM has similar performance as CoHUI-Miner and the performance of CoIUM is 3X faster than that of CoUPM. The minCor threshold can also affect the efficiency of three algorithms, especially on dense datasets. For example, on the Mushroom dataset with minCor = 0.4, CoIUM takes about 5 seconds to complete the progress, while CoUPM requires 10 seconds and CoHUI-Miner requires 20 seconds. When minCor = 0.5, CoIUM takes about 2 seconds to complete the progress, whereas CoUPM requires 4 seconds and CoHUI-Miner requires 6 seconds. As can be seen, with the increase of minCor, the performance of the three algorithms also increases.

The most important reason why CoIUM is often faster than CoUPM and CoHUI-Miner is that CoIUM utilizes two effective upper bounds ( and ). Compared with the other two algorithms, and can prune more space. Therefore, CoIUM only utilizes a few itemsets to discover HUIs. However, CoHUI-Miner has a large size of a projected dataset, which will produce more candidates, especially on dense datasets. CoUPM also takes a lot of time to construct the utility-list structure.

5.3 Experiments on Memory Usage

Memory consumption is an important indicator for evaluating the merits of data mining algorithms, so in this subsection, we will compare the memory usage between CoIUM, CoUPM, and CoHUI-Miner. Based on the same variable setting as shown in Fig. 1, the detailed results on the dense datasets are shown in Fig. 3. It can be clearly obtained that the memory consumption of CoIUM is much less than that of CoHUI-Miner on all dense datasets. For example, on the T40I10D100K dataset with minCor = 0.3, the memory loss of CoHUI-Miner is always maintained at about 1400 MB, while the memory usage of CoIUM is less than 1000 MB and remains below 500 MB. This result proves that the performance of CoIUM is much better than that of CoHUI-Miner, because CoHUI-Miner produces too many candidates and requires a lot of memory to store information. Then, on the Chess dataset, the memory usage of CoIUM is nearly the same but lower than CoUPM. However, the memory usage of CoIUM is often lower than that of CoUPM on T40I10D100K and Mushroom datasets. For instance, on the Mushroom dataset with minCor = 0.4, the memory usage of CoIUM tends to remain at 220 MB, while CoUPM remains between 200 MB and 600 MB. In most cases, the memory consumption of CoUPM exceeds 220 MB.

Based on the same variable setting as displayed in Fig. 2, the detailed results on the sparse datasets are shown in Fig. 4. As we can observe from the Retail dataset, the memory usage of CoIUM is much lower than that of both CoUPM and CoHUI-Miner (less than 2X and 3X, respectively). On the BMSPOS and T10I4D100K datasets, the memory usage of CoIUM is lower than that of CoHUI-Miner, and close to that of CoUPM when minUtil increases. Because of sparse datasets, the size of the transaction dataset is small, so both CoUPM and CoHUI-Miner require only a small amount of memory to conserve candidates.

The main reason why the memory consumption of CoIUM is lower than that of CoUPM in all datasets is that the projected technique can reduce dataset size more effectively than the utility-list used in the CoUPM algorithm. Although both CoIUM and CoHUI-Miner use the same projected technique, the consumption of CoIUM is lower than CoHUI-Miner on all datasets. The reason is that both of them use different strategies to explore CoHUIs. CoIUM uses two upper bounds ( and ) to reduce the number of candidates generated. These two upper bounds are more compact than TWU utilized in CoHUI-Miner, so they can remove more unpromising candidates than TWU. The number of candidates produced by CoIUM is lower than that of CoUPM and CoHUI-Miner. As a result, the memory consumption of CoIUM is lower than that of CoUPM and CoHUI-Miner.

Dataset minCor # Algorithm # candidate generation under different thresholds
#CoIUM 59,039 28,384 13,808 6,614 3,187 1,501
0.2 #CoUPM 200,606 111,453 63,757 37,323 22,593 13,794
#CoHUI-Miner 2620,117 1240,899 595,812 288,709 140,696 67,589
Chess #CoIUM 59,039 28,384 13,808 6,614 3,187 1,501
0.3 #CoUPM 200,606 111,453 63,757 37,323 22,593 13,794
#CoHUI-Miner 2,620,117 1,240,899 595,812 288,709 140,696 67,589
#CoIUM 152,930 15,935 3,738 1,475 957 796
0.3 #CoUPM 447,807 140,222 69,670 35,506 21,179 13,068
#CoHUI-Miner 17435,132 7761,519 3077,564 1362,147 863,037 626,828
T40I10D100K #CoIUM 112,925 14,147 3,549 1,468 957 796
0.4 #CoUPM 245,896 95,410 52,987 31,469 19,752 12,724
#CoHUI-Miner 2449,760 1563,543 1107,732 823,317 673,769 585,371
#CoIUM 276,518 269,858 253,590 97,812 88,277 45,928
0.4 #CoUPM 281,139 277,600 267,469 105,995 96,669 57,134
#CoHUI-Miner 13,045,179 12,697,922 12,367,803 4,615,310 4,083,339 2,015,229
Mushroom #CoIUM 98,887 96,816 88,517 33,401 30,094 17,187
0.5 #CoUPM 102,875 101,615 97,757 38,210 34,701 22,059
#CoHUI-Miner 3,170,082 3,093,407 3,004,022 1,073,346 952,846 525,577
#CoIUM 3,747,571 3,420,817 2,659,589 233,859 48,056 6,779
0.6 #CoUPM 134,722,599 129,039,485 120,942,326 73,168,998 49,245,865 10,939,731
#CoHUI-Miner 533,955,947 485,002,701 389,635,998 157,096,263 107,204,879 30,288,360
Retail #CoIUM 3,615,594 3,297,971 2,570,798 231,801 47,899 6,773
0.7 #CoUPM 134,588,149 128,906,771 120,821,509 73,158,370 49,245,259 10,939,665
#CoHUI-Miner 311,399,670 292,452,873 264,220,651 154,067,657 106,915,555 30,200,504
Table 7: # Candidate generation under varying minUtil with fixed minCor

5.4 Experiments on Candidates Generation

Table 7 shows the candidate generation results of CoIUM, CoUPM, and CoHUI-Miner, under different minUtil and minCor. Note that represents a different minUtil threshold, and its specific scope is from Fig. 1(a) to (c) and Fig. 2(a) to (c). The results indicate that the candidate generation by CoIUM is lower than CoUPM and CoHUI-Miner on dense datasets. For example, on T40I10D100K dataset with minCor = 0.3, the quantity of candidates increased by CoUPM is more than 2X higher than that of CoIUM, and the quantity of candidates generated by CoHUI-Miner is up to 10 orders of magnitude higher than that of CoIUM. With the increase of minUtil, the decrease of CoIUM is significantly greater than that of CoUPM and CoHUI-Miner. On the Mushroom dataset with minCor = 0.4, the quantity of candidates increased by CoIUM is always close to but lower than that of CoUPM, and the quantity of candidates produced by CoHUI-Miner is up to four orders of magnitude higher than that of CoIUM. These results prove that both CoUPM and CoHUI-Miner discover numerous meaningless Itemsets, which is why the runtime of CoIUM is better than CoUPM and CoHUI-Miner on all dense datasets. The reason why CoIUM generated fewer candidates than CoUPM and CoHUI-Miner on dense datasets is that the CoIUM algorithm adopts two tighter upper bounds (lu and su). These upper bounds can be used to remove more unpromising itemsets that do not satisfy the properties of CoHUIs.

On a spare dataset, it is found that the sum of candidates produced by CoIUM is also lower than that of CoUPM and CoHUI-Miner, especially on Retail dataset. For example, on Retail dataset with minCor = 0.6, the sum of candidates generated by CoUPM and CoHUI-Miner is up to 4 and 10 orders of magnitude higher than CoIUM respectively. And with the increase of minUtil, the gap between the number of candidates they produce becomes larger. With the increase of minCor on different datasets except Chess, the number of candidates generated by the three algorithms decreases correspondingly. It can be seen that minCor also has an impact on the discovery of HUIs. Because the data of Chess dataset is too dense, the small promotion of minCor cannot effectively filter out unpromising candidates.

5.5 Experiments on Scalability Test

Moreover, we test the scalability of CoIUM, CoUPM, and CoHUI-Miner. We adopt the size of the Retail dataset from 20% (= 17,632 transactions) to 100% (= 88,162 transactions) with minCor = 0.7. Fig. 5 shows the details of execution time and memory usage. From Fig. 5(a), We can know that the runtime of CoIUM, CoUPM, and CoHUI-Miner respectively increases linearly with the increasing dataset size. Compared with the other two algorithms, CoIUM has a smaller runtime increment because lu and su can reduce more unpromising candidates than that of TWU. At the same time, the gap in runtime between CoIUM and the other two algorithms is becoming larger. This conclusion proves that the pruning strategies of CoIUM are obviously better than those of the other two algorithms. In addition, we can see from Fig. 5(b) that with the growth of dataset size, the memory usage of CoIUM increases slowly and remains at about 100 MB, and the memory use of CoHUI-Miner also increases slowly and remains at about 1000 MB after a significant increase, while the memory use of CoUPM remains a decrease-increase-decrease phenomenon. We can see that with the increase in dataset size, the memory trend of CoIUM is more gentle than that of the other two algorithms. The results demonstrate that CoIUM has preferable scalability in runtime and memory consumption compared with CoUPM and CoHUI-Miner. In addition, the execution time and memory usage of CoIUM increase slowly when the data size increases. These results indicate that CoIUM is more suitable for large-scale datasets.

Runtime and memory scalability on
Figure 5: Runtime and memory scalability on Retail.

6 Conclusions

In this study, we present an efficient algorithm named CoIUM to extract the set of CoHUIs. The proposed CoIUM algorithm adopts two new upper bounds called local utility and subtree utility, which can prune more unpromising candidates effectively. Then, a novel array-based structure known as UA is applied to store useful information when pruning the search space. A novel technique called database projection is designed to reduce the cost of datasets in memory. In addition, we utilize TWU ascending order to sort all items. Extensive experiments on different datasets show that the performance of the CoIUM algorithm is obviously better than the CoUPM and CoHUI-Miner algorithms.

In the future, we would like to explore other novel data structures to reduce memory consumption effectively. We will extend the designed algorithm to different fields such as on-shelf utility mining, HUIM without setting minUtil and fuzzy utility mining. We also plan to apply the proposed algorithm to resolve other types of complex data (e.g., episodes, streaming data), not just static data.

References

  • [1] C. C. Aggarwal, M. A. Bhuiyan, and M. A. Hasan (2014) Frequent pattern mining algorithms: a survey. In Frequent pattern mining, pp. 19–64. Cited by: §1.
  • [2] R. Agrawal, R. Srikant, et al. (1994) Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, Vol. 1215, pp. 487–499. Cited by: §1.
  • [3] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and H. J. Choi (2011) A framework for mining interesting high utility patterns with a strong frequency affinity. Information Sciences 181 (21), pp. 4878–4894. Cited by: §2.2, §2.2, §3.
  • [4] C. F. Ahmed, S. K. Tanbeer, B. S. Jeong, and Y. K. Lee (2009) Efficient tree structures for high utility pattern mining in incremental databases. IEEE Transactions on Knowledge and Data Engineering 21 (12), pp. 1708–1721. Cited by: §1, §2.1.
  • [5] Y. Baek, U. Yun, H. Kim, J. Kim, B. Vo, T. Truong, and Z. Deng (2021) Approximate high utility itemset mining in noisy environments. Knowledge-Based Systems 212, pp. 106596. Cited by: §2.1.
  • [6] M. Barsky, S. Kim, T. Weninger, and J. Han (2011) Mining flipping correlations from large datasets with taxonomies. bioinformatics 5 (4), pp. 370–381. Cited by: §1.
  • [7] J. Chen, X. Guo, W. Gan, C. M. Chen, W. Ding, and G. Chen (2022) On-shelf utility mining from transaction database. Engineering Applications of Artificial Intelligence 107, pp. 104516. Cited by: §2.1.
  • [8] C. J. Chu, V. S. Tseng, and T. Liang (2008) An efficient algorithm for mining temporal high utility itemsets from data streams. Journal of Systems and Software 81 (7), pp. 1105–1117. Cited by: §1.
  • [9] Q. H. Duong, P. Fournier-Viger, H. Ramampiaro, K. Nørvåg, and T. L. Dam (2018) Efficient high utility itemset mining using buffered utility-lists. Applied Intelligence 48 (7), pp. 1859–1877. Cited by: §2.1.
  • [10] P. Fournier-Viger, W. Gan, Y. Wu, M. Nouioua, W. Song, T. Truong, and H. Duong (2022) Pattern mining: current challenges and opportunities. In International Conference on Database Systems for Advanced Applications, pp. 34–49. Cited by: §1.
  • [11] P. Fournier-Viger, C. W. Wu, S. Zida, and V. S. Tseng (2014) FHM: faster high-utility itemset mining using estimated utility co-occurrence pruning. In Proceedings of the International Symposium on Methodologies for Intelligent Systems, pp. 83–92. Cited by: §1, §2.1.
  • [12] P. Fournier-Viger, Y. Zhang, J. C. W. Lin, D. T. Dinh, and H. Bac Le (2020) Mining correlated high-utility itemsets using various measures. Logic Journal of the IGPL 28 (1), pp. 19–32. Cited by: §2.2.
  • [13] W. Gan, J. C. W. Lin, H. C. Chao, T. P. Hong, and P. S. Yu (2018) CoUPM: correlated utility-based pattern mining. In IEEE International Conference on Big Data, pp. 2607–2616. Cited by: §1, §1, §2.2, §3, Strategy 4.
  • [14] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. C. Chao, and H. Fujita (2018) Extracting non-redundant correlated purchase behaviors by utility measure. Knowledge-Based Systems 143, pp. 30–41. Cited by: §1, §2.2, §3.
  • [15] W. Gan, J. C. W. Lin, P. Fournier-Viger, H. Chao, and P. S. Yu (2019) A survey of parallel sequential pattern mining. ACM Transactions on Knowledge Discovery from Data 13 (3), pp. 1–34. Cited by: §1.
  • [16] W. Gan, J. C. Lin, P. Fournier-Viger, H. Chao, V. S. Tseng, and P. S. Yu (2021) A survey of utility-oriented pattern mining. IEEE Transactions on Knowledge and Data Engineering 33 (4), pp. 1306–1327. Cited by: §1, §2.1.
  • [17] W. Gan, S. Wan, J. Chen, C. M. Chen, and L. Qiu (2020) TopHUI: top- high-utility itemset mining with negative utility. In Proceedings of the IEEE International Conference on Big Data, pp. 5350–5359. Cited by: §2.1.
  • [18] J. Han, J. Pei, Y. Yin, and R. Mao (2004) Mining frequent patterns without candidate generation: a frequent-pattern tree approach. Data Mining and Knowledge Discovery 8 (1), pp. 53–87. Cited by: §1, §2.1.
  • [19] X. Han, X. Liu, J. Li, and H. Gao (2021) Efficient top- high utility itemset mining on massive data. Information Sciences 557, pp. 382–406. Cited by: §2.1.
  • [20] J. C. W. Lin, P. Fournier-Viger, and W. Gan (2016) FHN: an efficient algorithm for mining high-utility itemsets with negative unit profits. Knowledge-Based Systems 111, pp. 283–298. Cited by: §2.1.
  • [21] J. C. W. Lin, W. Gan, P. Fournier-Viger, T. P. Hong, and H. C. Chao (2017) FDHUP: fast algorithm for mining discriminative high utility patterns. Knowledge and Information Systems 51 (3), pp. 873–909. Cited by: §2.2.
  • [22] J. Liu, K. Wang, and B. C. Fung (2012) Direct discovery of high utility itemsets without candidate generation. In Proceedings of the IEEE 12th International Conference on Data Mining, pp. 984–989. Cited by: §2.1.
  • [23] M. Liu and J. Qu (2012) Mining high utility itemsets without candidate generation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pp. 55–64. Cited by: §1, §2.1, §2.1, Definition 6.
  • [24] Y. Liu, W. k. Liao, and A. Choudhary (2005) A two-phase algorithm for fast discovery of high utility itemsets. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 689–695. Cited by: §1, §2.1, §3, Strategy 1.
  • [25] J. M. Luna, P. Fournier-Viger, and S. Ventura (2019) Frequent itemset mining: a 25 years review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9 (6), pp. e1329. Cited by: §1, §2.1.
  • [26] T. Mai, B. Vo, and L. T. Nguyen (2017) A lattice-based approach for mining high utility association rules. Information Sciences 399, pp. 81–97. Cited by: §2.1.
  • [27] J. Miao, S. Wan, W. Gan, J. Sun, and J. Chen (2022) Targeted high-utility itemset querying. IEEE Transactions on Artificial Intelligence, DOI: 10.1109/TAI.2022.3171530. Cited by: §2.1.
  • [28] L. T. Nguyen, P. Nguyen, T. D. Nguyen, B. Vo, P. Fournier-Viger, and V. S. Tseng (2019) Mining high-utility itemsets in dynamic profit databases. Knowledge-Based Systems 175, pp. 130–144. Cited by: §1.
  • [29] T. D. Nguyen, L. T. Nguyen, L. Vu, B. Vo, and W. Pedrycz (2021) Efficient algorithms for mining closed high utility itemsets in dynamic profit databases. Expert Systems with Applications 186, pp. 115741. Cited by: §2.1.
  • [30] E. R. Omiecinski (2003) Alternative interest measures for mining associations in databases. IEEE Transactions on Knowledge and Data Engineering (1), pp. 57–69. Cited by: §1, §2.2.
  • [31] J. F. Qu, P. Fournier-Viger, M. Liu, B. Hang, and F. Wang (2020) Mining high utility itemsets using extended chain structure and utility machine. Knowledge-Based Systems 208, pp. 106457. Cited by: §2.1.
  • [32] H. Ryang and U. Yun (2017) Indexed list-based high utility pattern mining with utility upper-bound reduction and pattern combination techniques. Knowledge and Information Systems 51 (2), pp. 627–659. Cited by: §2.1.
  • [33] R. Rymon (1992) Search through systematic set enumeration. In Proceedings of the 3rd International Conference on Principles of Knowledge Representation and Reasoning, pp. 539–550. Cited by: §4.1.
  • [34] R. Saeed, A. Rauf, F. H. Quradaa, and S. M. Asim (2021) Efficient utility tree-based algorithm to mine high utility patterns having strong correlation. Complexity 2021. Cited by: §1, §2.2.
  • [35] W. Song, C. Fang, and W. Gan (2021) TopUMS: top- utility mining in stream data. In The International Conference on Data Mining Workshops, pp. 615–622. Cited by: §2.1.
  • [36] W. Song, L. Liu, and C. Huang (2021) Generalized maximal utility for mining high average-utility itemsets. Knowledge and Information Systems 63 (11), pp. 2947–2967. Cited by: §2.1.
  • [37] V. S. Tseng, C. W. Wu, B. E. Shie, and P. S. Yu (2010) UP-Growth: an efficient algorithm for high utility itemset mining. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 253–262. Cited by: §1, §2.1.
  • [38] B. Vo, L. V. Nguyen, V. V. Vu, M. T. Lam, T. T. Duong, L. T. Manh, T. T. Nguyen, L. T. Nguyen, and T. P. Hong (2020) Mining correlated high utility itemsets in one phase. IEEE Access 8, pp. 90465–90477. Cited by: §1, §2.2, §3.
  • [39] S. Wan, J. Chen, P. Zhang, W. Gan, and T. Gu (2022) Discovering top- profitable patterns for smart manufacturing. In Companion Proceedings of the Web Conference, pp. 1–9. Cited by: §2.1.
  • [40] S. Wan, W. Gan, X. Guo, J. Chen, and U. Yun (2021) FUIM: fuzzy utility itemset mining. arXiv preprint, arXiv:2111.00307. Cited by: §2.1.
  • [41] T. Wu, Y. Chen, and J. Han (2010) Re-examination of interestingness measures in pattern mining: a unified framework. Data Mining and Knowledge Discovery 21 (3), pp. 371–397. Cited by: §1, §2.2, §3.
  • [42] U. Yun (2007) Efficient mining of weighted interesting patterns with a strong weight and/or support affinity. Information Sciences 177 (17), pp. 3477–3499. Cited by: §2.2.
  • [43] S. Zida, P. Fournier-Viger, J. C. W. Lin, C. W. Wu, and V. S. Tseng (2017) EFIM: a fast and memory efficient algorithm for high-utility itemset mining. Knowledge and Information Systems 51 (2), pp. 595–625. Cited by: 1st item, §1, §2.1, §3, Strategy 2, Strategy 3, Definition 10, Definition 5, Definition 7, Definition 8, Definition 9.