Concepts and Techniques

(3rd ed.)

— Chapter 6 —

Jiawei Han, Micheline Kamber, and Jian Pei

University of Illinois at Urbana-Champaign &

Simon Fraser University

©2013 Han, Kamber & Pei. All rights reserved.

Adapted for CSE 347-447, Lecture 5b, Spring 2015

1

Chapter 6: Mining Frequent Patterns, Association and Correlations: Basic Concepts and Methods n Basic Concepts n Frequent Itemset Mining Methods n Which Patterns Are Interesting?—Pattern

Evaluation Methods n Summary

2

What Is Frequent Pattern Analysis? u Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

u

First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining

u

u

Motivation: Finding inherent regularities in data u What products were often purchased together?— Beer and diapers?!

u

What are the subsequent purchases after buying a PC?

u

What kinds of DNA are sensitive to this new drug?

u

Can we automatically classify web documents?

Applications u Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

3

Why Is Frequent Pattern Mining Important? n n

Frequent patterns: An intrinsic and important property of datasets Foundation for many essential data mining tasks n n n n n n n n

Association, correlation, and causality analysis

Sequential, structural (e.g., sub-graph) patterns

Pattern analysis in spatiotemporal, multimedia, time-series, and stream data

Classification: discriminative, frequent pattern analysis

Cluster analysis: frequent pattern-based clustering

Data warehousing: iceberg cube and cube-gradient

Semantic data compression: fascicles

Broad applications

4

Basic Concepts: Frequent Patterns

Tid

Items bought

10

Beer, Nuts, Diaper

20

Beer, Coffee, Diaper

30

Beer, Diaper, Eggs

40

Nuts, Eggs, Milk

50

Nuts, Coffee, Diaper, Eggs, Milk

Customer

buys both

Customer buys diaper

n

n n n

n

Customer buys beer

itemset: A set of one or more items k-itemset X = {x1, …, xk}

(absolute) support, or, support count of X: Frequency or occurrence of an itemset X

(relative) support, s, is the fraction of transactions that contains X (i.e., the probability that a transaction contains X)

An itemset X is frequent if X’s support is no less than a minsup threshold 5

Basic Concepts: Association Rules

Tid

Items bought

10

Beer, Nuts, Diaper

20

Beer, Coffee, Diaper

30

Beer, Diaper, Eggs

40

50

Nuts, Eggs, Milk

n

Nuts, Coffee, Diaper, Eggs, Milk

Customer

buys both

Customer buys beer

Customer buys diaper

Find all the rules X à Y with minimum support and confidence n support, s, probability that a transaction contains X and Y n confidence, c, conditional probability that a transaction having X also contains Y

Let minsup = 50%, minconf = 50%

Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3,

{Beer, Diaper}:3 n Association rules: (many more!) n Beer à Diaper (60%, 100%) n Diaper à Beer (60%, 75%)

6

Closed Patterns and Max-Patterns n n n n

A long pattern contains a combinatorial number of sub100

100

patterns, e.g., {a1, …, a100} contains ( 1 ) + ( 2 ) + … +

100 – 1 = 1.27*1030 sub-patterns!

(100

100) = 2

Solution: Mine closed patterns and max-patterns instead

An itemset X is closed if X is frequent and there exists no super-pattern Y כX, with the same support as X

(proposed by Pasquier, et al. @ ICDT’99)

An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y כX (proposed by

Bayardo @ SIGMOD’98)

7

Closed Patterns and Max-Patterns n n

n

Exercise: Suppose a DB contains only two transactions n <a1, …, a100>, <a1, …, a50>

n

Let min_sup = 1

What is the set of closed itemsets? n {a1, …, a100}: 1

n

{a1, …, a50}: 2

What is the set of max-patterns? n n

{a1, …, a100}: 1

What is