February 1, 2017

This page replicates the codes written by Grun & Leish (2007) in ‘FlexMix: An R package for finite mixture modelling’, University of Wollongong, Australia. My intent here was to learn the flexmix package by replicating the results by the authors.

Model Based Clustering

The model based clustering on the whiskey dataset. The whiskey dataset is from the Simmons Study of Media and Markets (Fall 1997), and contains the incidence matrix for scotch brands in households who reported consuming scotch for period of 1 year. The dataset is taken from Edwards and Allenby (2003).

Load the necessary packges first: tidyverse and flexmix.

library(tidyverse)
library(flexmix)

Quick EDA

Quick look at the data itself. The dataframe consists of 2 elements - frequency (numeric vector), and the incidence matrix. There are total of 484 observations.

data("whiskey")
df <- whiskey
set.seed(1802)
str(df)
## 'data.frame':    484 obs. of  2 variables:
##  $ Freq     : int  1 1 10 14 10 23 9 8 1 12 ...
##  $ Incidence: num [1:484, 1:21] 1 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr  "1" "2" "3" "4" ...
##   .. ..$ : chr  "Singleton" "Knockando" "White Horse" "Scoresby Rare" ...

The column names of the df$Incidence matrix are the brands of whiskey.

colnames(df$Incidence)
##  [1] "Singleton"                  "Knockando"                 
##  [3] "White Horse"                "Scoresby Rare"             
##  [5] "Ushers"                     "Macallan"                  
##  [7] "Grant's"                    "Passport"                  
##  [9] "Black & White"              "Clan MacGregor"            
## [11] "Ballantine"                 "Pinch (Haig)"              
## [13] "Other brands"               "Cutty Sark"                
## [15] "Glenfiddich"                "Glenlivet"                 
## [17] "J&B"                        "Dewar's White Label"       
## [19] "Johnnie Walker Black Label" "Johnnie Walker Red Label"  
## [21] "Chivas Regal"

The incidence matrix shows a relationship between two classes of variables - in this case: freqencies of the brand of whiskey in the past year, and the brand of whiskey itself. Quick look at a portion of the matrix:

df$Incidence[sample(x = 1:484,size = 10),sample(1:21,3)]
##     Ballantine Johnnie Walker Red Label White Horse
## 304          1                        1           0
## 404          0                        0           0
## 465          0                        1           0
## 349          0                        0           0
## 19           0                        0           0
## 331          0                        1           0
## 370          0                        1           0
## 367          0                        1           0
## 82           0                        1           0
## 384          1                        1           0

The popularity of the whiskeys can be seen here. Chivas Regal seems to be a favourite, which puts my personal preference in line with a larger population :)

c <- colSums(df$Incidence)
d1 <- data.frame(Brand=names(c),counts=c,row.names = NULL)
d1 <- d1 %>% left_join(whiskey_brands) %>% arrange(-counts)
ggplot(d1,aes(reorder(Brand,counts),counts,fill=Type))+geom_bar(stat='identity')+coord_flip()+labs(y='Counts',x='Whiskey Brand')

Model building

The first model in the paper is a stepped Flexmix model, specific for binary variables using the FLXMcvbinary() model. Since the objective is to cluster the model based on the Incidence counts and Frequencies, the formula used is Incidence ~ 1. The frequencies themselves are input as weights in the formula.

wh_mix <- stepFlexmix(Incidence ~ 1,
                      weights = ~ Freq, 
                      data = df,
                      model = FLXMCmvbinary(truncated = TRUE),
                      control = list(minprior = 0.005),
                      k=1:7,
                      nrep=5)
## 1 : * * * * *
## 2 : * * * * *
## 3 : * * * * *
## 4 : * * * * *
## 5 : * * * * *
## 6 : * * * * *
## 7 : * * * * *
summary(wh_mix)
##      Length       Class        Mode 
##           1 stepFlexmix          S4

A top model can be selecting using BIC or AIC criteria. The BIC criteria selects a model with 5 clusters.

plot(BIC(wh_mix),type='b',ylab='BIC')
points(x = which.min(BIC(wh_mix)),min(BIC(wh_mix)),col='red',pch=20)

wh_best <- getModel(wh_mix,'BIC')
print(wh_best)
## 
## Call:
## stepFlexmix(Incidence ~ 1, weights = ~Freq, data = df, model = FLXMCmvbinary(truncated = TRUE), 
##     control = list(minprior = 0.005), k = 5, nrep = 5)
## 
## Cluster sizes:
##   1   2   3   4   5 
## 804 941 163 286  24 
## 
## convergence after 73 iterations

The proportions of the observations in each cluster are shown here:

round(prop.table(table(wh_best@cluster)),2)
## 
##    1    2    3    4    5 
## 0.19 0.27 0.23 0.26 0.04

The parameter estimates plotted for model with k=5 is shown below graphically. Component 3 (4% of households) contain the largest number of different brands. Component 1 (25% of households) seen to prefer single malt whiskeys. Component 4 (23% of households) are across the board with Brands, but perhaps show lesser of an interest in single malts, just like Component 5 (29% of the households).

# wh_best.prior <- prior(wh_best)
wh_best.param <- parameters(wh_best)
wh_best.param <- data.frame(Brand=stringr::str_replace(rownames(wh_best.param),pattern = 'center.',replacement = ''),
wh_best.param,row.names = NULL) 
wh_best.param <- wh_best.param %>% gather(Components,Value,Comp.1:Comp.5)
wh_best.param <- wh_best.param %>% left_join(y = whiskey_brands,by = 'Brand')
ggplot(wh_best.param,aes(y=Value,x=Brand,fill=Type))+
    geom_bar(stat='identity')+
    coord_flip()+
    facet_grid(.~Components)

Mixtures of Regressions

The next example in the paper is the patent data in Wang et al. (1998). The help file ?patent notes that the data consists of the number of patents, R&D spending and sales in millions of dollar for 70 pharmaceutical and biomedical companies in 1976, taken from the National Bureau of Economic Research R&D Masterfile.

Quick EDA

The dependant variable here is Patents. Independant variable is lgRD which is the log of R&D spending. The objective in this exercise is to try and find how many may clusters exist within this bi-variate dataset. When I started this exercise, it seemed quite moot to me, since visually, I couldn’t really tell any distict clusters. But, the results show otherwise.

data("patent")
df_patent <- tbl_df(patent)
df_patent
## # A tibble: 70 x 4
##                                     Company Patents    RDS    lgRD
##  *                                    <chr>   <int>  <dbl>   <dbl>
##  1 ABBOTT LABORATORIES                           42 0.0549  4.0869
##  2 AFFILIATED HOSPITAL PRDS                       1 0.0032 -2.0794
##  3 ALBERTO-CULVER CO                              3 0.0078  0.1187
##  4 ALCON LABORATORIES                             2 0.0803  1.8796
##  5 ALLERGAN PHARMACEUTICALS INC                   3 0.0686  1.1033
##  6 ALZA CORP-CL A                                40 3.3319  2.0794
##  7 AMERICAN HOME PRODUCTS CORP                   60 0.0243  4.0953
##  8 AMERICAN HOSPITAL SUPPLY                      30 0.0128  2.8333
##  9 AMERICAN STERILIZER CO                         7 0.0252  1.3915
## 10 AVON PRODUCTS                                  3 0.0094  2.6048
## # ... with 60 more rows
plot(Patents~lgRD,df_patent)

Model Building

The paper mentions that Wang et al. (1998) chose a finite mixture of three Poisson regression models to represent the data. The FLXMRglm() is used for the Poisson model with a concomitant variable modeled using FLXPmultinom().

pat_mix <- flexmix(Patents ~ lgRD, k = 3, data = df_patent, model = FLXMRglm(family = "poisson"), concomitant = FLXPmultinom(~RDS))
pat_mix
## 
## Call:
## flexmix(formula = Patents ~ lgRD, data = df_patent, k = 3, 
##     model = FLXMRglm(family = "poisson"), concomitant = FLXPmultinom(~RDS))
## 
## Cluster sizes:
##  1  2  3 
## 37 25  8 
## 
## convergence after 21 iterations

The clusters obtained from the analysis are given by a cluster() function.

clusters(pat_mix)
##  [1] 2 1 1 2 1 3 2 1 1 2 1 1 1 1 2 2 2 1 1 2 2 2 3 1 1 1 1 1 1 2 1 1 1 2 3
## [36] 1 2 2 3 2 1 1 2 1 2 1 1 1 3 1 1 2 1 2 1 1 1 2 2 2 1 2 1 1 3 3 1 2 3 2

Results

The data is replotted but with colors for the clusters and additional splines. As we can see, the model beautifully models three lines through three clusters in the data.

Components <- factor(clusters(pat_mix))
xyplot(Patents~lgRD,groups = Components,df_patent,type=c('p','spline'))

Further investigation

The flexmix package has a function to plot rootograms of the posterior probabilities of observations. Observations where the a-posteriori probability is large for component #1 and #3 are indicated. As we can see where component #1 has highest probabilities indicated in the 1st bucket, they are lowest in #2 and #3 buckets.

plot(pat_mix,mark=1)

plot(pat_mix,mark=3)

A summary of the mixture model results show the estimated priors, number of observations within each cluster (size), number of observations with p>10^-4 (post>0), and a ratio of the two. The rations of 0.58, 0.42 and 0.18 indicate big overlaps of the clusters. This can also be observed by the large portion of values in the mid-section of the rootogram above.

summary(pat_mix)
## 
## Call:
## flexmix(formula = Patents ~ lgRD, data = df_patent, k = 3, 
##     model = FLXMRglm(family = "poisson"), concomitant = FLXPmultinom(~RDS))
## 
##        prior size post>0 ratio
## Comp.1 0.497   37     63 0.587
## Comp.2 0.380   25     60 0.417
## Comp.3 0.124    8     45 0.178
## 
## 'log Lik.' -202.7633 (df=10)
## AIC: 425.5265   BIC: 448.0115

Tests of significance of the coefficients are obtained by the refit(). In each cluster the intercept and lgRD are both statistically significant at the 0.05 level. The black bars in the plot are 95% CI over the point estimates.

rm <- refit(pat_mix)
summary(rm)
## $Comp.1
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.64189    0.28348  2.2643  0.02355 *  
## lgRD         0.92618    0.06562 14.1143  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $Comp.2
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.32675    0.52622 -2.5213  0.01169 *  
## lgRD         1.29118    0.12463 10.3601  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## $Comp.3
##             Estimate Std. Error z value  Pr(>|z|)    
## (Intercept)  1.95769    0.41935  4.6684 3.036e-06 ***
## lgRD         0.70556    0.12009  5.8754 4.217e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(rm,bycluster=F)

The authors note that “estimates vary between all components even though the co- efficients for lgRD are similar for the first and third component”.

Notes

  • For both the models, although I’m using the exact datasets and seed values, I observe different values for proportions in each cluster (for model #1) as well as P values for significance etc (for model #2). This could be due to some changes in the underlying flexmix codes since 2007.

September 27, 2017

September 17, 2017
comments powered by Disqus