How to Mine Interesting Information from Nominal Data


09 October 2018

Turning Data into Interesting Information



Contents

  1. Introduction
  2. Association Rule Mining
  3. Association Rule Mining on Titanic Data
  4. Algorithm Evaluation
  5. References

Introduction


In the real world, we deal with many types of data, for example dates, currency, stock rates, categories, and ranks. These are not all the same data type, and it is not easy to tie them together into a single line of information. Data mining offers many methods for extracting associations and information from such complex data. Some of them are:

  • Classification
  • Estimation
  • Prediction
  • Affinity Grouping or Association Rules
  • Clustering
  • Anomaly Detection

In this post, I walk through the data mining process on a nominal data set.
The technique for extracting interesting information from nominal (categorical) data is Association Rule Mining.

Association Rule Mining

Algorithms:


  • Apriori
  • FP Growth

Parameters:


  1. Support
    • The ratio of the number of observations containing an item to the total number of observations.
    • In other words, the percentage of observations in which the item appears.
    • Range: [0, 1]
    \[Support(B) = {Observations\ containing\ B \over Total\ Observations}\]
  2. Confidence
    • How often the consequent appears among the observations that contain the antecedent.
    • Range: [0, 1]
    \[Confidence(A \to B) = {Observations\ containing\ both\ A\ and\ B \over Observations\ containing\ A}\]
  3. Lift
    • How much more likely the items are to occur together than they would be if they were independent.
    • Range: [0, ∞)
    • If lift > 1, the rule is an interesting scenario to consider.
    \[Lift(A \to B) = {Confidence(A \to B) \over Support(B)}\]
  4. Leverage
    • The difference between the observed co-occurrence of A and B and the co-occurrence expected if they were independent.
    • Range: [-1, 1]
    • If leverage = 0, the two items are independent.
    \[Leverage(A \to B) = Support(A \to B) - Support(A) \times Support(B)\]
  5. Conviction
    • A measure of how strongly the consequent depends on the antecedent.
    • Range: [0, ∞)
    • If conviction = 1, the items are independent; high confidence combined with low consequent support yields high conviction, meaning the consequent depends heavily on the antecedent. (A worked example computing all five metrics follows this list.)
    \[Conviction(A \to B) = {1 - Support(B) \over 1 - Confidence(A \to B)}\]
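
To make these definitions concrete, here is a minimal sketch (not from the original post) that computes all five metrics for the hypothetical rule bread → milk over a toy five-transaction basket data set:

transactions = [
    {'bread', 'milk'},
    {'bread', 'butter'},
    {'milk', 'butter'},
    {'bread', 'milk', 'butter'},
    {'milk'},
]
n = float(len(transactions))

def support(*items):
    # Fraction of transactions that contain every item in `items`
    return sum(1 for t in transactions if set(items) <= t) / n

s_a, s_b, s_ab = support('bread'), support('milk'), support('bread', 'milk')
confidence = s_ab / s_a
lift = confidence / s_b
leverage = s_ab - s_a * s_b
conviction = (1 - s_b) / (1 - confidence) if confidence < 1 else float('inf')

print 'support    :', s_ab        # 0.4
print 'confidence :', confidence  # 0.666...
print 'lift       :', lift        # 0.833... (< 1: co-occur less than expected)
print 'leverage   :', leverage    # -0.08
print 'conviction :', conviction  # 0.6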

Association Rule Mining on Titanic Data


Ready Up


  • Algorithm : Apriori
  • Language : Python 2.7.15
  • Data Set : Titanic data from Kaggle

Import Packages


import matplotlib.pyplot as plt
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

import warnings
warnings.filterwarnings("ignore")
import seaborn as sns

Loading Data-set


titanic = pd.read_csv('train.csv')
nominal_cols = ['Embarked', 'Pclass', 'Age', 'Survived', 'Sex']
cat_cols = ['Embarked', 'Pclass', 'Age', 'Survived', 'Title']
# Extract the title (e.g. "Mr.", "Mrs.") that follows the comma in the Name column
titanic['Title'] = titanic.Name.str.extract(r'\, ([A-Z][^ ]*\.)', expand=False)
titanic['Title'].fillna('Title_UK', inplace=True)
titanic['Embarked'].fillna('Unknown', inplace=True)
titanic['Age'].fillna(0, inplace=True)
# Replace the binary Survived flag with string labels
rep = {0: "Dead", 1: "Survived"}
titanic.replace({'Survived': rep}, inplace=True)
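
To verify what the regex captured, a quick sanity check (not in the original post; output omitted) is to look at the distribution of the extracted titles:

# Distribution of extracted titles; unmatched names fall under Title_UK
print titanic['Title'].value_counts()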

Binning Age Column


## Binning method to categorize continuous variables
def binning(col, cut_points, labels=None):
  minval = col.min()
  maxval = col.max()
  # Bin edges: data minimum, the supplied cut points, data maximum
  break_points = [minval] + cut_points + [maxval]
  if not labels:
    labels = range(len(cut_points) + 1)
  colBin = pd.cut(col, bins=break_points, labels=labels, include_lowest=True)
  return colBin

cut_points = [1, 10, 20, 50]
labels = ["Unknown", "Child", "Teen", "Adult", "Old"]
titanic['Age'] = binning(titanic['Age'], cut_points, labels)
in_titanic = titanic[nominal_cols]
cat_titanic = titanic[cat_cols]
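
As a quick illustration (a toy check, not part of the original post), applying `binning` to a small hand-made series shows how each value lands in a bucket:

# Hypothetical ages, one per bucket, given cut_points = [1, 10, 20, 50]
demo = pd.Series([0, 5, 15, 30, 70])
print binning(demo, cut_points, labels).tolist()
# ['Unknown', 'Child', 'Teen', 'Adult', 'Old']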

The Age column is thus converted from numeric to categorical via binning; its value set is now ["Unknown", "Child", "Teen", "Adult", "Old"], which ensures that every column holds only nominal data. The data set is then split into two views:

  • Gender Data
  • Title Data

Gender Data


in_titanic.head()


   Embarked  Pclass    Age  Survived     Sex
0         S       3  Adult      Dead    male
1         C       1  Adult  Survived  female
2         S       3  Adult  Survived  female
3         S       1  Adult  Survived  female
4         S       3  Adult      Dead    male

Title Data


cat_titanic.head()


   Embarked  Pclass    Age  Survived  Title
0         S       3  Adult      Dead    Mr.
1         C       1  Adult  Survived   Mrs.
2         S       3  Adult  Survived  Miss.
3         S       1  Adult  Survived   Mrs.
4         S       3  Adult      Dead    Mr.

Data Visualization with Plots


for x in ['Embarked', 'Pclass','Age', 'Sex', 'Title']:
    sns.set(style="whitegrid")
    ax = sns.countplot(y=x, hue="Survived", data=titanic)
    plt.ylabel(x)
    plt.title('Survival Plot')
    plt.show()

[Five count plots titled "Survival Plot": survival counts by Embarked, Pclass, Age, Sex, and Title]

Analysis - Methodology


  1. Gender Wise
  2. Title Wise

A title is a keyword that also reveals a person's gender, so analysing both fields together would simply produce rules with 100% association between them.

Example:


  • (Mr.) is always associated with Male.
  • (Mrs.) is always associated with Female.

Putting these two fields together therefore adds no information, so the analysis is split into two parts.


Gender Analysis


dataset = []
for i in range(0, in_titanic.shape[0]):    # iterate over every row
    dataset.append([str(in_titanic.values[i, j]) for j in range(0, in_titanic.shape[1])])

# One-hot encode the transactions into a boolean DataFrame
oht = TransactionEncoder()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
print df.head()

Output:


       1      2      3  Adult      C  Child   Dead    Old      Q      S  \
0  False  False   True   True  False  False   True  False  False   True
1   True  False  False   True   True  False  False  False  False  False
2  False  False   True   True  False  False  False  False  False   True
3   True  False  False   True  False  False  False  False  False   True
4  False  False   True   True  False  False   True  False  False   True

   Survived   Teen  Unknown  female   male
0     False  False    False   False   True
1      True  False    False    True  False
2      True  False    False    True  False
3      True  False    False    True  False
4     False  False    False   False   True

All Nominal Values


print oht.columns_

Output:


['1', '2', '3', 'Adult', 'C', 'Child', 'Dead', 'Old', 'Q', 'S', 'Survived', 'Teen', 'Unknown', 'female', 'male']
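
As an aside, the row-by-row loop that builds `dataset` can be replaced with a one-liner; a sketch that is equivalent under the same string conversion:

# Convert every cell to a string and emit one transaction (list) per row
dataset = in_titanic.astype(str).values.tolist()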

Implementing Apriori Algorithm:


output = apriori(df, min_support=0.2, use_colnames=True)
print output.head()

    support itemsets
0  0.242697      (1)
1  0.206742      (2)
2  0.550562      (3)
3  0.528090  (Adult)
4  0.615730   (Dead)
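
The frequent-itemset table can also be narrowed before generating rules; for instance, keeping only itemsets with at least two items (a sketch):

# Keep only frequent itemsets of length >= 2 (pairs and larger)
multi = output[output['itemsets'].apply(lambda s: len(s) >= 2)]
print multi.head()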

Rules Configuration


config = [
    ('antecedent support', 0.7),
    ('support', 0.5),
    ('confidence', 0.8),
    ('conviction', 3)
]

for metric_type, th in config:
    rules = association_rules(output, metric=metric_type, min_threshold=th)
    if rules.empty:
        print 'Empty Data Frame For Metric Type : ',metric_type,' on Threshold : ',th
        continue
    print rules.columns.values
    print '-------------------------------------'
    print 'Configuration : ', metric_type, ' : ', th
    print '-------------------------------------'
    print (rules)

    support=rules.as_matrix(columns=['support'])
    confidence=rules.as_matrix(columns=['confidence'])

    plt.scatter(support, confidence, edgecolors='red')
    plt.xlabel('support')
    plt.ylabel('confidence')
    plt.title(metric_type+' : '+str(th))
    plt.show()

Output : Config 1: antecedent support = 0.7

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  antecedent support  :  0.7
-------------------------------------
  antecedents                consequents  antecedent support  \
0         (S)                     (male)            0.723596
1         (S)              (Adult, Dead)            0.723596
2         (S)  (female, Adult, Survived)            0.723596
3         (S)               (male, Dead)            0.723596
...

[Scatter plot: support vs. confidence for rules with antecedent support ≥ 0.7]


Output : Config 2: support = 0.5

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  support  :  0.5
-------------------------------------
  antecedents consequents  antecedent support  consequent support   support  \
0      (male)      (Dead)            0.647191            0.615730  0.524719
1      (Dead)      (male)            0.615730            0.647191  0.524719

   confidence      lift  leverage  conviction
0    0.810764  1.316752  0.126224    2.030636
1    0.852190  1.316752  0.126224    2.386905

[Scatter plot: support vs. confidence for rules with support ≥ 0.5]


Output : Config 3: confidence = 0.8

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  confidence  :  0.8
-------------------------------------
     antecedents consequents  antecedent support  \
0    (1, female)  (Survived)            0.105618
1  (Adult, Dead)         (S)            0.319101
2      (2, male)      (Dead)            0.121348
3      (2, Dead)      (male)            0.108989
...

[Scatter plot: support vs. confidence for rules with confidence ≥ 0.8]


Output : Config 4: conviction = 3

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  conviction  :  3
-------------------------------------
   antecedents consequents  antecedent support  consequent support   support  \
0  (1, female)  (Survived)            0.105618            0.384270  0.102247
1    (2, Dead)      (male)            0.108989            0.647191  0.102247
...

[Scatter plot: support vs. confidence for rules with conviction ≥ 3]


Gender Result


Interesting Information: Gender Analysis


  • Persons with Sex : female and Pclass : 1 have 96.80% confidence of Survived : True.
  • Persons with Pclass : 2 and Survived : False have 93.81% confidence of Sex : male.

Common Information:


  • Persons with Survived : False and Age : Unknown have 81.88% confidence of Pclass : 3.
  • Persons with Age : Adult and Pclass : 2 have 90.2% confidence of Embarked : S.
  • Persons with Survived : False, Age : Adult, and Pclass : 3 have 86.36% confidence of Embarked : S.
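
To pull a specific rule such as the first one out of the rules DataFrame, match on the antecedent frozenset; a sketch (mlxtend stores antecedents and consequents as frozensets):

# Rules whose antecedent is exactly {'1', 'female'} (first-class female passengers)
target = frozenset(['1', 'female'])
print rules[rules['antecedents'] == target]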

Title Analysis


dataset = []
in_titanic = cat_titanic
for i in range(0, in_titanic.shape[0]):    # iterate over every row
    dataset.append([str(in_titanic.values[i, j]) for j in range(0, in_titanic.shape[1])])

# One-hot encode the transactions into a boolean DataFrame
oht = TransactionEncoder()
oht_ary = oht.fit(dataset).transform(dataset)
df = pd.DataFrame(oht_ary, columns=oht.columns_)
print df.head()

Output:


       1      2      3  Adult      C  Capt.  Child   Col.   Dead   Don.  \
0  False  False   True   True  False  False  False  False   True  False
1   True  False  False   True   True  False  False  False  False  False
2  False  False   True   True  False  False  False  False  False  False
3   True  False  False   True  False  False  False  False  False  False
....

[5 rows x 30 columns]

All Nominal Values:


print oht.columns_

Output:


['1', '2', '3', 'Adult', 'C', 'Capt.', 'Child', 'Col.', 'Dead', 'Don.', 'Dr.', 'Jonkheer.', 'Lady.', 'Major.', 'Master.', 'Miss.', 'Mlle.', 'Mme.', 'Mr.', 'Mrs.', 'Ms.', 'Old', 'Q', 'Rev.', 'S', 'Sir.', 'Survived', 'Teen', 'Title_UK', 'Unknown']

Implementing Apriori Algorithm:


output = apriori(df, min_support=0.2, use_colnames=True)
print output.head()

    support itemsets
0  0.242697      (1)
1  0.206742      (2)
2  0.550562      (3)
3  0.528090  (Adult)
4  0.615730   (Dead)


Rules Configuration


config = [
    ('antecedent support', 0.7),
    ('support', 0.5),
    ('confidence', 0.8),
    ('conviction', 3)
]

for metric_type, th in config:
    rules = association_rules(output, metric=metric_type, min_threshold=th)
    if rules.empty:
        print 'Empty Data Frame For Metric Type : ',metric_type,' on Threshold : ',th
        continue
    print rules.columns.values
    print '-------------------------------------'
    print 'Configuration : ', metric_type, ' : ', th
    print '-------------------------------------'
    print (rules)

    support=rules.as_matrix(columns=['support'])
    confidence=rules.as_matrix(columns=['confidence'])

    plt.scatter(support, confidence, edgecolors='red')
    plt.xlabel('support')
    plt.ylabel('confidence')
    plt.title(metric_type+' : '+str(th))
    plt.show()

Output : Config 1: antecedent support = 0.7

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  antecedent support  :  0.7
-------------------------------------
  antecedents    consequents  antecedent support  consequent support  \
0         (S)  (Adult, Dead)            0.723596            0.319101
1         (S)          (Mr.)            0.723596            0.579775
2         (S)         (Dead)            0.723596            0.615730
3         (S)        (Adult)            0.723596            0.528090
...

[Scatter plot: support vs. confidence for rules with antecedent support ≥ 0.7]


Output : Config 2: support = 0.5

Empty Data Frame For Metric Type :  support  on Threshold :  0.5

Output : Config 3: confidence = 0.8

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  confidence  :  0.8
-------------------------------------
     antecedents consequents  antecedent support  consequent support  \
0  (Adult, Dead)         (S)            0.319101            0.723596
1       (3, Mr.)      (Dead)            0.357303            0.615730
2       (S, Mr.)      (Dead)            0.446067            0.615730
3   (Mr., Adult)         (S)            0.328090            0.723596
...

[Scatter plot: support vs. confidence for rules with confidence ≥ 0.8]

Output : Config 4: conviction = 3

['antecedents' 'consequents' 'antecedent support' 'consequent support'
 'support' 'confidence' 'lift' 'leverage' 'conviction']
-------------------------------------
Configuration :  conviction  :  3
-------------------------------------
   antecedents consequents  antecedent support  consequent support   support  \
0     (3, Mr.)      (Dead)            0.357303             0.61573  0.316854
1  (S, Mr., 3)      (Dead)            0.275281             0.61573  0.244944

   confidence      lift  leverage  conviction
0    0.886792  1.440229  0.096851    3.394382
1    0.889796  1.445107  0.075445    3.486891

[Scatter plot: support vs. confidence for rules with conviction ≥ 3]


Title Result


Interesting Information - Title Analysis:

  • Persons with Title : Mr., Pclass : 3, and Embarked : S have 88.98% confidence of Survived : Dead.

How to Filter? A Simple Demo


# Display the weakest and strongest rules by confidence
print rules[rules['confidence'] == rules['confidence'].min()]
print rules[rules['confidence'] == rules['confidence'].max()]

Output Tables:


  antecedents consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
8      (True)    (female)             0.38427            0.352809  0.261798    0.681287  1.931035  0.126224    2.030636


   antecedents consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
12  (1, female)      (True)            0.105618             0.38427  0.102247    0.968085  2.519286  0.061661   19.292884



rules = association_rules(output, metric='support', min_threshold=0.1)
print rules[rules['confidence'] == rules['confidence'].min()]
print rules[rules['confidence'] == rules['confidence'].max()]

Output Tables:


    antecedents            consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
274         (S)  (True, Adult, female)            0.723596             0.14382  0.103371    0.142857  0.993304 -0.000697    0.998876


   antecedents consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction
55  (1, female)      (True)            0.105618             0.38427  0.102247    0.968085  2.519286  0.061661   19.292884
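
Beyond matching the minimum and maximum, rules can be ranked by any metric column; for example, sorting by lift (a sketch):

# Top five rules ordered by lift, strongest associations first
print rules.sort_values('lift', ascending=False).head()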

Algorithm Evaluation


Use this Python script to evaluate the algorithms Apriori and FP Growth.
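
The linked script is not reproduced here, but a minimal sketch of such a timing harness, assuming a newer mlxtend (≥ 0.16) that ships an FP Growth implementation as mlxtend.frequent_patterns.fpgrowth, might look like:

import time
from mlxtend.frequent_patterns import apriori, fpgrowth

def time_algorithm(fn, df, min_support=0.2):
    # Wall-clock seconds taken to mine the frequent itemsets
    start = time.time()
    fn(df, min_support=min_support, use_colnames=True)
    return time.time() - start

print 'Apriori      :', time_algorithm(apriori, df)
print 'FP-Algorithm :', time_algorithm(fpgrowth, df)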

The evaluation output looks like this:

For Data Matrix : 891 x 5
Number of Individuals : 15
Apriori      : 0.872148990631
FP-Algorithm : 0.0637619495392
--------------------------
For Data Matrix : 17999 x 5
Number of Individuals : 25
Apriori      : 0.493063926697
FP-Algorithm : 0.621915102005
--------------------------
For Data Matrix : 35998 x 5
Number of Individuals : 25
Apriori      : 0.990983963013
FP-Algorithm : 1.18582415581

Conclusion:


Apriori and FP Growth differ in how they read the data: FP Growth scans the data set only twice, so in theory it is more efficient than Apriori on larger data. In my experiments, however, both behaved much the same and took almost the same time on a given data set; the difference depends on the data and on the number of nominal values. In any case, before committing to either algorithm, run the Algorithm-Evaluation script mentioned above and pick the one that suits your work.

Also published on Kaggle.

References:


Thanks to the sources:

  • Apriori
  • FP Growth
  • Association Rule Mining via Apriori Algorithm in Python
  • Mining Frequent Items Using the Apriori Algorithm
  • Finding Frequent Patterns
  • Efficient-Apriori (Python 3.6)
  • Data Mining with Apriori

If you find any typos, inaccuracies, or doubtful content in this post, feel free to point them out in the comments.

Relevant and useful comments are always welcome. Let's make this tech community wonderful!

Share and Thanks
