#
EHAUSM: An efficient algorithm for high average utility sequence mining

EHAUSM: An efficient algorithm for high average utility sequence mining

Identifying high utility sequences in a quantitative sequence database is an important data mining task. However, a key problem of current approaches is that extensions of a high utility sequence often have a high utility. Hence, traditional techniques are often biased toward finding long patterns. To circumvent this problem, this paper proposes techniques for the problem of high average-utility sequence (HAUS) mining (HAUSM). HAUSs are more meaningful than high utility sequences because the former are found using the average-utility measure, which consider the length of patterns in utility calculations. HAUSM is more general than high average-utility itemset mining but it is also a more difficult problem because the downward-closure property used for search space reduction does not hold for the average-utility. To overcome that challenge, this paper introduces two upper bounds and a weak upper bound on the average-utility measure, and four width pruning, depth pruning, reducing, and tightening strategies. These strategies are designed to eliminate candidate HAUSs to speed up HAUS mining. Based on these theoretical results, a novel algorithm named EHAUSM is proposed for HAUSM. Experiments on both real-life and synthetic quantitative sequence databases have confirmed its efficiency in terms of memory consumption and runtime.