In this R tutorial, we focus on grocery purchases and on impulse buying. Many of us go into a store for one thing and end up buying several other items, sometimes even forgetting what we came in for; well, at least I do. These impulse buys are no coincidence: retailers use sophisticated data analysis techniques to identify patterns that drive purchasing behavior.
In the past, such recommendation systems were based on the subjective intuition of marketing professionals and inventory managers or buyers. With the spread of bar-code scanners, computerized inventory systems, and online shopping, machine learning has increasingly been applied to this task.
What are Association Rules?
The building blocks of a market basket analysis are the items that may appear in any given transaction. Groups of one or more items are surrounded by brackets to indicate that they form a set, or itemset, that appears in the data with some regularity. The result of a market basket analysis is a collection of association rules that specify patterns found in the relationships among those items.
Association rules are most often used for market basket analysis, but there are other potential applications:
- Searching for interesting and frequently occurring patterns of DNA and protein sequences in cancer data.
- Finding patterns of purchases or medical claims that occur in combination with fraudulent credit card or insurance use.
- Identifying combinations of behavior that precede customers dropping their cellular phone service or upgrading their cable television package.
This tutorial focuses on market basket analysis.
Install and Load Packages
Below are the packages that we install and load for this tutorial. Only arules is used by the code that follows; it provides the transactions data structure and the apriori() function.
Input:
# Install the packages (only needed once)
install.packages("C50")
install.packages("gmodels")
install.packages("car")
install.packages("arules")

# Load the packages into the current session
library(C50)
library(gmodels)
library(car)
library(arules)
Import the Grocery Dataset
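We will be using the groceries dataset, so you need to download it first. The dataset is already packaged and available for an easy download from the dataset page. Save it as groceries.csv in your R working directory so that the call below can find it. Each line of the file is one transaction, with its items separated by commas.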
Input:
groceries <- read.transactions("groceries.csv", sep = ",")
View the Grocery Dataset
str() function
Input:
str(groceries)
Output:
Formal class 'transactions' [package "arules"] with 3 slots
  ..@ data       :Formal class 'ngCMatrix' [package "Matrix"] with 5 slots
  .. .. ..@ i       : int [1:43367] 29 88 118 132 33 157 167 166 38 91 ...
  .. .. ..@ p       : int [1:9836] 0 4 7 8 12 16 21 22 27 28 ...
  .. .. ..@ Dim     : int [1:2] 169 9835
  .. .. ..@ Dimnames:List of 2
  .. .. .. ..$ : NULL
  .. .. .. ..$ : NULL
  .. .. ..@ factors : list()
  ..@ itemInfo   :'data.frame': 169 obs. of 1 variable:
  .. ..$ labels: chr [1:169] "abrasive cleaner" "artif. sweetener" "baby cosmetics" "baby food" ...
  ..@ itemsetInfo:'data.frame': 0 obs. of 0 variables
summary() function
Input:
summary(groceries)
Output:
transactions as itemMatrix in sparse format with
 9835 rows (elements/itemsets/transactions) and
 169 columns (items) and a density of 0.02609146

most frequent items:
      whole milk other vegetables       rolls/buns             soda           yogurt
            2513             1903             1809             1715             1372
         (Other)
           34055

element (itemset/transaction) length distribution:
sizes
   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19   20
2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46   29   14   14    9
  21   22   23   24   26   27   28   29   32
  11    4    6    1    1    1    1    3    1

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   4.409   6.000  32.000

includes extended item information - examples:
            labels
1 abrasive cleaner
2 artif. sweetener
3   baby cosmetics
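The numbers in this summary fit together, and a little arithmetic makes the density figure easier to read. The sketch below uses only values copied from the str() and summary() output above; the variable names are our own.

```r
# Figures copied from the str() and summary() output above
n_transactions <- 9835
n_items <- 169
density <- 0.02609146

# Total number of cells in the sparse matrix, and how many are filled
n_cells <- n_transactions * n_items
n_purchased <- round(n_cells * density)

# Average number of items per transaction
mean_basket <- n_purchased / n_transactions

# Share of transactions that contain whole milk (2513 purchases)
milk_support <- 2513 / n_transactions

print(n_purchased)             # matches the length of slot i in str()
print(round(mean_basket, 3))   # matches the Mean in the size distribution
print(round(milk_support, 4))
```

In other words, the density is simply the share of matrix cells that hold a purchase, and whole milk shows up in roughly a quarter of all transactions.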
inspect() function
We can use the inspect() function to view the first five transactions.
Input:
inspect(groceries[1:5])
Output:
    items
[1] {citrus fruit,
     margarine,
     ready soups,
     semi-finished bread}
[2] {coffee,
     tropical fruit,
     yogurt}
[3] {whole milk}
[4] {cream cheese,
     meat spreads,
     pip fruit,
     yogurt}
[5] {condensed milk,
     long life bakery product,
     other vegetables,
     whole milk}
itemFrequency() function
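Also, we can use the itemFrequency() function to view the proportion of transactions that contain each item. Items are stored in alphabetical order, so the call below returns the support of the first three items alphabetically, not of the most frequently bought ones.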
Input:
itemFrequency(groceries[, 1:3])
Output:
abrasive cleaner artif. sweetener   baby cosmetics
    0.0035587189     0.0032536858     0.0006100661
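These proportions are easier to judge as counts. The short sketch below multiplies the printed values by the number of transactions; it uses base R only, with numbers copied from the output above. With arules loaded, itemFrequency(groceries, type = "absolute") should return such counts directly.

```r
# Proportions copied from the itemFrequency() output above
n_transactions <- 9835
item_support <- c("abrasive cleaner" = 0.0035587189,
                  "artif. sweetener" = 0.0032536858,
                  "baby cosmetics"   = 0.0006100661)

# Support is count / number of transactions, so count = support * n
item_counts <- round(item_support * n_transactions)
print(item_counts)
```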
Top 20 Grocery Items Plot
This plot shows the 20 grocery items with the highest support, in decreasing order of support.
Input:
itemFrequencyPlot(groceries, topN = 20)
Output:
Grocery Items Transaction Visualization
The plots below may be hard to read. The first shows the sparse matrix for the first five transactions, with a filled cell for every item purchased. I like this visualization because it’s very useful for data exploration.
Columns that are filled all the way down would indicate items that are purchased in every transaction, a problem that could arise if a retailer’s name or ID were included in the data.
In addition, patterns in the plot can reveal seasonal effects: depending on the season or holiday, toys, candy, or a turkey could be more common. This type of visualization is especially powerful if the items are also sorted into categories.
Matrix with 5 rows and 169 columns:
Matrix with 100 rows and 100 columns:
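The code that produced these two plots is not shown. One plausible way to draw this kind of sparse-matrix plot is the image() method that arules provides for transactions, combined with sample() to pick random transactions. The sketch below assumes that approach and uses a handful of made-up baskets rather than the groceries data; on the tutorial data the equivalent calls would presumably be image(groceries[1:5]) and image(sample(groceries, 100)).

```r
library(arules)

# Hypothetical toy baskets (not the groceries data)
baskets <- list(
  c("whole milk", "yogurt"),
  c("berries", "yogurt", "whole milk"),
  c("rolls/buns"),
  c("berries", "whipped/sour cream"),
  c("whole milk", "rolls/buns", "soda"),
  c("soda")
)
trans <- as(baskets, "transactions")

# One row per transaction, one column per item; filled cells mark purchases
image(trans[1:5])

# A random sample of transactions gives a quick overview of a larger dataset
set.seed(42)
trans_sample <- sample(trans, 3)
image(trans_sample)
```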
Train the Model with Apriori Algorithm
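We now use the Apriori algorithm to search the groceries data for association rules. As the output below shows, running apriori() with its default settings (support = 0.1 and confidence = 0.8) finds no rules, because only 8 items appear in at least 10 percent of the transactions.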
Input:
apriori(groceries)
Output:
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE       5     0.1      1     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 983

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [8 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 done [0.00s].
writing ... [0 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
set of 0 rules
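The "Absolute minimum support count" line translates the relative support threshold into a number of transactions. The sketch below reproduces the reported counts in base R. Rounding down is our assumption: it matches the two values printed in this tutorial (983 for a support of 0.1 and 59 for a support of 0.006), but it is not taken from the arules source.

```r
n_transactions <- 9835

# Hypothetical helper: turn a relative support threshold into a count.
# Rounding down is an assumption that matches the printed values.
min_support_count <- function(support, n) floor(support * n)

print(min_support_count(0.1, n_transactions))    # default threshold
print(min_support_count(0.006, n_transactions))  # threshold used later
```

With the default, an itemset has to appear in roughly a thousand transactions to be considered at all, which is why so few items survive.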
Association rules learning with Apriori Algorithm
The Apriori algorithm is the most widely used approach for efficiently searching large databases for rules. Its name derives from the fact that it utilizes a simple prior (that is, a priori) belief about the properties of frequent itemsets: all subsets of a frequent itemset must also be frequent. Below are a few strengths and weaknesses of Apriori:
Strengths
- Capable of working with large amounts of transactional data (such as retail data)
- Results in rules that are easy to understand
- Great for data mining and discovering unexpected knowledge in data
Weaknesses
- Not very helpful for small data
- Requires effort to separate the true insight from common sense
- Easy to draw spurious conclusions from random patterns
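The prior belief behind Apriori is usually stated as: every subset of a frequent itemset must itself be frequent. That lets the search skip any larger itemset that contains an infrequent item. The base R sketch below illustrates the idea for itemsets of size one and two on made-up baskets; it is a teaching sketch, not how arules implements the algorithm, and all names in it are our own.

```r
# Hypothetical toy baskets
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter", "milk"),
  c("butter", "milk"),
  c("bread", "jam"),
  c("bread", "butter", "milk")
)
min_support <- 0.3

# Support of an itemset: share of transactions containing all of its items
itemset_support <- function(itemset) {
  mean(vapply(transactions, function(basket) all(itemset %in% basket), logical(1)))
}

# Level 1: keep only the single items that meet the threshold
items <- sort(unique(unlist(transactions)))
item_support <- vapply(items, itemset_support, numeric(1))
frequent_items <- items[item_support >= min_support]

# Level 2: build candidate pairs from frequent items only, so pairs that
# contain an infrequent item (here "jam") are never counted
candidate_pairs <- combn(frequent_items, 2, simplify = FALSE)
pair_support <- vapply(candidate_pairs, itemset_support, numeric(1))
frequent_pairs <- candidate_pairs[pair_support >= min_support]

print(frequent_items)
cat("pairs checked:", length(candidate_pairs),
    "of", choose(length(items), 2), "possible\n")
```

Even in this tiny example, half of the possible pairs are never examined; on 169 grocery items the savings are far larger.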
The support of an itemset or rule measures how frequently it occurs in the data. A rule’s confidence is a measurement of its predictive power or accuracy. For a rule X => Y, it is defined as the support of the itemset containing both X and Y divided by the support of the itemset containing only X. Rules like {peanut butter} => {jelly} are known as strong rules because they have both high support and high confidence.
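To make these two definitions concrete, here is a minimal base R sketch that computes them by hand for {peanut butter} => {jelly}. The five baskets are invented for illustration, and the helper name support() is our own.

```r
# Hypothetical toy baskets
transactions <- list(
  c("peanut butter", "jelly", "bread"),
  c("peanut butter", "jelly"),
  c("peanut butter", "bread"),
  c("jelly", "milk"),
  c("bread", "milk")
)

# Support: share of transactions that contain every item in the itemset
support <- function(itemset) {
  mean(vapply(transactions, function(basket) all(itemset %in% basket), logical(1)))
}

x <- "peanut butter"
y <- "jelly"

support_xy <- support(c(x, y))   # 2 of 5 baskets contain both
support_x  <- support(x)         # 3 of 5 baskets contain peanut butter

# Confidence of X => Y: support(X and Y) / support(X)
confidence_xy <- support_xy / support_x

print(support_xy)
print(round(confidence_xy, 3))
```

Here two out of every three baskets with peanut butter also contain jelly, so the rule has a confidence of about 0.667.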
Input:
groceryrules <- apriori(groceries,
                        parameter = list(support = 0.006, confidence = 0.25, minlen = 2))
Output:
Apriori

Parameter specification:
 confidence minval smax arem  aval originalSupport maxtime support minlen maxlen target   ext
       0.25    0.1    1 none FALSE            TRUE       5   0.006      2     10  rules FALSE

Algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

Absolute minimum support count: 59

set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
sorting and recoding items ... [109 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.00s].
writing ... [463 rule(s)] done [0.00s].
creating S4 object  ... done [0.00s].
Input:
groceryrules
Output:
set of 463 rules
Input:
summary(groceryrules)
Output:
set of 463 rules

rule length distribution (lhs + rhs):sizes
  2   3   4
150 297  16

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   2.000   3.000   2.711   3.000   4.000

summary of quality measures:
    support           confidence          lift            count
 Min.   :0.006101   Min.   :0.2500   Min.   :0.9932   Min.   : 60.0
 1st Qu.:0.007117   1st Qu.:0.2971   1st Qu.:1.6229   1st Qu.: 70.0
 Median :0.008744   Median :0.3554   Median :1.9332   Median : 86.0
 Mean   :0.011539   Mean   :0.3786   Mean   :2.0351   Mean   :113.5
 3rd Qu.:0.012303   3rd Qu.:0.4495   3rd Qu.:2.3565   3rd Qu.:121.0
 Max.   :0.074835   Max.   :0.6600   Max.   :3.9565   Max.   :736.0

mining info:
      data ntransactions support confidence
 groceries          9835   0.006       0.25
Input:
inspect(groceryrules[1:3])
Output:
    lhs                rhs               support     confidence lift     count
[1] {potted plants} => {whole milk}      0.006914082 0.4000000  1.565460 68
[2] {pasta}         => {whole milk}      0.006100661 0.4054054  1.586614 60
[3] {herbs}         => {root vegetables} 0.007015760 0.4312500  3.956477 69
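The lift column is not defined in the text. Assuming the standard definition, lift(X => Y) = confidence(X => Y) / support(Y), the first rule can be checked by hand from numbers already printed in this tutorial: the rule's count and confidence, and the 2513 whole milk purchases from summary(groceries). This is a base R sketch with our own variable names.

```r
n_transactions   <- 9835
whole_milk_count <- 2513   # from summary(groceries)
rule_count       <- 68     # {potted plants} => {whole milk}
rule_confidence  <- 0.4

# Support of the rule: share of transactions containing both sides
rule_support <- rule_count / n_transactions

# Support of the right-hand side alone
rhs_support <- whole_milk_count / n_transactions

# Lift: how much more likely whole milk is, given potted plants,
# than it is in a typical transaction
rule_lift <- rule_confidence / rhs_support

print(round(rule_support, 6))
print(round(rule_lift, 4))
```

A lift above 1 means the two sides occur together more often than they would if they were independent; a lift near 1 means the rule says little beyond how popular the right-hand item already is.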
Improve the model
The results are easiest to improve by sorting and evaluating the rules on different criteria. Depending on the objectives of the market basket analysis, the most useful rules might be those with the highest support, confidence, or lift. By combining the sort() function with vector operators, we can obtain a specific number of interesting rules.
The best five rules according to the lift statistic can be examined using the code below.
Input:
inspect(sort(groceryrules, by = "lift")[1:5])
Output:
    lhs                                             rhs                  support     confidence
[1] {herbs}                                      => {root vegetables}    0.007015760 0.4312500
[2] {berries}                                    => {whipped/sour cream} 0.009049314 0.2721713
[3] {other vegetables,tropical fruit,whole milk} => {root vegetables}    0.007015760 0.4107143
[4] {beef,other vegetables}                      => {root vegetables}    0.007930859 0.4020619
[5] {other vegetables,tropical fruit}            => {pip fruit}          0.009456024 0.2634561
    lift     count
[1] 3.956477 69
[2] 3.796886 89
[3] 3.768074 69
[4] 3.688692 78
[5] 3.482649 93
Subset of Berries
If one wanted to put together a campaign to sell berries, one would need to learn whether berries are often purchased with other items. We can do this with the subset() function, which provides a method to search for subsets of transactions, items, and rules.
One can see in the output below that berries are often purchased with whipped/sour cream, yogurt, other vegetables, and whole milk.
This kind of query is very useful for marketing, because it shows which other items to promote to customers who buy a given product.
The function subset() is very powerful; below are a few points to remember:
- The keyword items matches an item appearing anywhere in the rule. One can limit the match to the left- or right-hand side by using lhs or rhs instead.
- The operator %in% means that at least one of the items must be found in the list you defined. For instance, to search for rules containing berries or yogurt, one could write items %in% c("berries", "yogurt").
- Additional operators are available for partial matching (%pin%) and complete matching (%ain%).
- Subsets can also be limited by support, confidence, or lift. For example, one could search for rules with confidence greater than 50 percent using confidence > 0.50. Several criteria can be combined in one condition with the standard R logical operators &, |, and !.
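The operators above can be tried out safely on a small example. The sketch below mines rules from six made-up baskets (not the groceries data) and applies each kind of filter; the thresholds and object names are chosen only for illustration.

```r
library(arules)

# Hypothetical toy baskets
baskets <- list(
  c("berries", "yogurt", "whole milk"),
  c("berries", "yogurt"),
  c("berries", "whipped/sour cream"),
  c("yogurt", "whole milk"),
  c("berries", "yogurt", "whole milk"),
  c("whole milk", "rolls/buns")
)
trans <- as(baskets, "transactions")

toy_rules <- apriori(trans,
                     parameter = list(support = 0.3, confidence = 0.6, minlen = 2),
                     control = list(verbose = FALSE))

# berries anywhere in the rule
with_berries <- subset(toy_rules, items %in% "berries")

# berries on the left-hand side only
lhs_berries <- subset(toy_rules, lhs %in% "berries")

# partial match on the item name
partial <- subset(toy_rules, items %pin% "berr")

# rules that contain both berries and yogurt
both <- subset(toy_rules, items %ain% c("berries", "yogurt"))

# item condition and quality threshold joined with & in ONE condition
strong_both <- subset(toy_rules,
                      confidence > 0.7 & items %ain% c("berries", "yogurt"))
inspect(strong_both)
```

Note that the last call joins both criteria with & inside a single condition, so that subset() receives them as one logical expression.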
Input:
berryrules <- subset(groceryrules, items %in% "berries")
inspect(berryrules)
Output:
    lhs          rhs                  support     confidence lift     count
[1] {berries} => {whipped/sour cream} 0.009049314 0.2721713  3.796886  89
[2] {berries} => {yogurt}             0.010574479 0.3180428  2.279848 104
[3] {berries} => {other vegetables}   0.010269446 0.3088685  1.596280 101
[4] {berries} => {whole milk}         0.011794611 0.3547401  1.388328 116
Additional Subsets for Groceries
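Let’s now keep only the rules with a confidence greater than 60 percent. The call below also passes an item condition for berries and yogurt, but as a separate argument instead of joining it to the confidence condition with &. As the output shows, the item condition is not applied: all eight rules have a confidence above 0.60, and none of them contains berries.
Input:
berryrules_a <- subset(groceryrules, confidence > 0.60,
                       items %ain% c("berries", "yogurt"))
inspect(berryrules_a)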
Output:
    lhs                                          rhs          support     confidence lift     count
[1] {curd,tropical fruit}                     => {whole milk} 0.006507372 0.6336634  2.479936 64
[2] {butter,whipped/sour cream}               => {whole milk} 0.006710727 0.6600000  2.583008 66
[3] {butter,tropical fruit}                   => {whole milk} 0.006202339 0.6224490  2.436047 61
[4] {butter,root vegetables}                  => {whole milk} 0.008235892 0.6377953  2.496107 81
[5] {butter,yogurt}                           => {whole milk} 0.009354347 0.6388889  2.500387 92
[6] {domestic eggs,tropical fruit}            => {whole milk} 0.006914082 0.6071429  2.376144 68
[7] {other vegetables,tropical fruit,yogurt}  => {whole milk} 0.007625826 0.6198347  2.425816 75
[8] {other vegetables,root vegetables,yogurt} => {whole milk} 0.007829181 0.6062992  2.372842 77
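What would a query for rules that contain both berries and yogurt and exceed 60 percent confidence return? This can be worked out from the four berry rules printed earlier, since those are all the rules in which berries appear. The base R sketch below re-enters their right-hand sides and confidence values by hand; the arules call shown in the comment is how we would expect the combined condition to be written, not something taken from the original tutorial.

```r
# With arules, the combined query would presumably be written as:
#   subset(groceryrules,
#          confidence > 0.60 & items %ain% c("berries", "yogurt"))

# The four rules containing berries, copied from inspect(berryrules);
# each has {berries} on the left-hand side
berry_rules <- data.frame(
  rhs = c("whipped/sour cream", "yogurt", "other vegetables", "whole milk"),
  confidence = c(0.2721713, 0.3180428, 0.3088685, 0.3547401),
  stringsAsFactors = FALSE
)

# A rule contains both berries and yogurt only if its right-hand side is yogurt
has_both <- berry_rules$rhs == "yogurt"

# Apply both conditions together
matches <- berry_rules[has_both & berry_rules$confidence > 0.60, ]
print(nrow(matches))
```

Only one rule contains both items, and its confidence is about 0.32, so no rule in groceryrules satisfies both conditions at once. A lower confidence threshold would be needed for this query to return anything.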
Saving Association Rules
By using the write() function, we can export the grocery rules to a CSV file. The rules can also be found on the R-ALGO Engineering Big Data dataset page.
Input:
write(groceryrules, file = "groceryrules.csv", sep = ",", quote = TRUE, row.names = FALSE)
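To see what the exported file looks like, the sketch below runs the same write() call on rules mined from a few made-up baskets, writes to a temporary file instead of the working directory, and reads the result back. It also shows an alternative that we believe arules supports: converting the rules to a data frame with as(), which is handy when you want to keep working with them in R.

```r
library(arules)

# Hypothetical toy baskets
baskets <- list(
  c("berries", "yogurt", "whole milk"),
  c("berries", "yogurt"),
  c("yogurt", "whole milk"),
  c("berries", "yogurt", "whole milk"),
  c("whole milk", "rolls/buns")
)
toy_rules <- apriori(as(baskets, "transactions"),
                     parameter = list(support = 0.3, confidence = 0.6, minlen = 2),
                     control = list(verbose = FALSE))

# Same call as above, but writing to a temporary file
out_file <- tempfile(fileext = ".csv")
write(toy_rules, file = out_file, sep = ",", quote = TRUE, row.names = FALSE)

# Read the file back to check its structure
saved <- read.csv(out_file)
str(saved)

# Alternative: keep the rules in R as a data frame
rules_df <- as(toy_rules, "data.frame")
head(rules_df)
```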
Association rules are frequently used to find useful insights in large transactional databases. They let users extract knowledge from a database without any prior idea of which patterns to look for. When we applied the Apriori algorithm to the grocery data, we set minimum thresholds of interestingness, and the algorithm reported only the associations meeting those criteria.