fuzzy-clustering

library(fussclust)

1 Introduction

The fussclust package provides methods for distance-based fuzzy clustering, including both unsupervised and semi-supervised approaches.

The original motivation for developing the package was the lack of implementation of semi-supervised fuzzy clustering methods in R. However, both unsupervised and semi-supervised fuzzy clustering share a common optimization framework and estimation procedure. Consequently, fussclust implements a unified estimation framework for fuzzy clustering models: both unsupervised and semi-supervised.

The current release of fussclust includes four classical models:

A typical workflow consists of the following steps:

  1. Prepare a numerical feature matrix X.
  2. For semi-supervised models, prepare a binary supervision matrix superF (see Section 4.1).
  3. Fit the selected clustering model.
  4. Inspect the estimated memberships U and prototypes V.

2 Methodology

The fuzzy clustering methods implemented in fussclust are distance-based. Membership values for \(N\) observations and \(C\) clusters are estimated under the assumption that similarity can be represented by distances in a metric space: objects located closer to each other are considered more similar.

The optimization problem is formulated as

\[ \min Q(U, V; X) \]

where:

The optimization problem does not admit a closed-form solution. Therefore, an iterative approximation procedure known as the Alternating Optimization (AO) algorithm is used.

The AO procedure minimizes the corresponding partial objective functions, yielding the matrix-valued update functions \(\hat{U}\) for estimating memberships and \(\hat{V}\) for estimating cluster prototypes.

At each iteration, the algorithm alternates between computing the estimated membership matrix \(\tilde{U}\) using the update function \(\hat{U}\) and computing the estimated prototype matrix \(\tilde{V}\) using the update function \(\hat{V}\).

For additional methodological details, see Kmita et al. (2024).

The AO algorithm can be summarized as follows:

  1. Initialize the algorithm with an initial membership matrix \(U^{(0)}\).
  2. Set the iteration counter \(l = 1\).
  3. Estimate prototypes \(\tilde{V}^{(l)}\) using the formula for optimal prototypes \(\hat{V}\) and current values of memberships.
  4. Estimate memberships \(\tilde{U}^{(l)}\) using the formula for optimal memberships \(\hat{U}\) and current values of prototypes.
  5. Check the convergence criterion. If the difference between \(\tilde{U}^{(l)}\) and \(\tilde{U}^{(l-1)}\) is smaller than a predefined threshold, stop the algorithm. Otherwise, increase \(l\) by one and repeat Steps 3–5.

3 Unsupervised models

The following examples use the well-known iris dataset (R.A. Fisher, 1936). We consider two features: sepal length and sepal width.

3.1 Preparing the data

fussclust requires the input to be of class matrix.

X <- iris[, c("Sepal.Length", "Sepal.Width")] |> as.matrix()

cat(
  paste0(
    "Class of `iris`: ", class(iris),
    "; class of `X`: ",
    paste(class(X), collapse = " & ")
  )
)
#> Class of `iris`: data.frame; class of `X`: matrix & array

The iris dataset contains three species of iris flowers. Therefore, we set the number of clusters to C = 3. In practice, however, the user may specify any number of clusters such that \(C > 1\).

model_fcm <- fussclust::FCM(X = X, C = 3)
model_pcm <- fussclust::PCM(X = X, C = 3)

The estimated cluster prototypes obtained from the FCM model are:

model_fcm$V
#>      Sepal.Length Sepal.Width
#> [1,]     6.814646    3.070350
#> [2,]     4.979951    3.355452
#> [3,]     5.830402    2.762639

The estimated memberships for the first ten observations obtained from the PCM model are:

model_pcm$U[1:10, ]
#>            [,1]      [,2]      [,3]
#>  [1,] 0.5153643 0.5153737 0.5153653
#>  [2,] 0.4969738 0.4969839 0.4969748
#>  [3,] 0.3983422 0.3983501 0.3983430
#>  [4,] 0.3671885 0.3671957 0.3671893
#>  [5,] 0.4485335 0.4485415 0.4485344
#>  [6,] 0.4674081 0.4674135 0.4674087
#>  [7,] 0.3451158 0.3451224 0.3451165
#>  [8,] 0.4966804 0.4966900 0.4966814
#>  [9,] 0.3058682 0.3058740 0.3058688
#> [10,] 0.4925528 0.4925630 0.4925539

4 Semi-supervised models

4.1 Encoding partial supervision

As described in Section 1, semi-supervised models in fussclust require partial supervision encoded as a binary matrix superF.

The package assumes that the number of clusters \(C\) equals the number of distinct classes represented in the supervision information.

In many applications, class labels are stored in a response variable. This is also the case for the iris dataset:

iris[1:3, ]
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa

The class labels are stored in the Species column:

unique(iris$Species)
#> [1] setosa     versicolor virginica 
#> Levels: setosa versicolor virginica

To construct the supervision matrix, the user must first define the arbitrary ordering of the classes. Let

\[ \mathcal{Y} = (y^1 = \texttt{setosa}, y^2 = \texttt{versicolor}, y^3 = \texttt{virginica}) \]

denote the ordered set of class labels.

The supervision matrix superF is defined as follows. Let \(F = [f_{jk}]\) denote the binary supervision matrix, where:

The matrix satisfies the following properties:

The matrix can be constructed as follows:

superF <- matrix(0, nrow = nrow(iris), ncol = 3)

superF[iris$Species == "setosa", 1] <- 1
superF[iris$Species == "versicolor", 2] <- 1
superF[iris$Species == "virginica", 3] <- 1

4.2 Fitting the models

When fitting a semi-supervised model, the user must provide the scaling parameter \(\alpha\) (alpha), which regulates the impact of partial supervision.

There is no universally optimal choice of \(\alpha\), as the appropriate value depends on the dataset and application. For further discussion, see Kmita et al. (2024).

For this example, we set \(\alpha = 1\).

Semi-Supervised Fuzzy c-Means:

model_ssfcm <- fussclust::SSFCM(
  X = X,
  C = 3,
  alpha = 1,
  superF = superF
)

Semi-Supervised Possibilistic c-Means:

model_sspcm <- fussclust::SSPCM(
  X = X,
  C = 3,
  alpha = 1,
  superF = superF
)

The estimated prototypes are:

model_ssfcm$V
#>      Sepal.Length Sepal.Width
#> [1,]     4.999393    3.402151
#> [2,]     5.898059    2.766335
#> [3,]     6.646567    3.004991
model_sspcm$V
#>      Sepal.Length Sepal.Width
#> [1,]     5.212768    3.274814
#> [2,]     5.964629    2.865122
#> [3,]     6.335417    2.957180

5 Additional details

5.1 The \(\Gamma\) parameter in possibilistic models

The vector of cluster-specific hyperparameters

\[ \Gamma = (\gamma_1, \ldots, \gamma_C) \]

is specific to possibilistic clustering methods.

This is an optional argument in both fussclust::PCM() and fussclust::SSPCM(). If not provided, the default value is a unit vector:

\[ \Gamma = (1, \ldots, 1) \]

Alternatively, the user may apply the initialization strategy proposed by Krishnapuram and Keller (1993), which computes

\[ \gamma_k = \frac{ \sum_{j=1}^N u_{jk}^2 d_{jk}^2 }{ \sum_{j=1}^N u_{jk}^2 } \]

using values obtained from an initial FCM fit.

This strategy can be enabled by setting initFCM = TRUE.

model_pcm_init <- fussclust::PCM(
  X = X,
  C = 3,
  initFCM = TRUE
)

The user may also choose to specify the values of \(\Gamma\) manually, for example:

model_pcm_custom_gammas <- fussclust::PCM(
  X = X,
  C = 3,
  gammas = c(1, 2, 3)
)

5.2 Retrieving intermediate results

In addition to the final estimated memberships and prototypes (accessible via model$U and model$V), users may inspect intermediate results produced during the AO optimization procedure.

The following histories can be stored:

By default, these objects are set to NULL.

model_fcm$U_history
#> NULL

To store the optimization history, set store_history = TRUE.

model_fcm_history <- fussclust::FCM(
  X = X,
  C = 3,
  store_history = TRUE
)

cat(
  paste0(
    "Class of `model_fcm_history$U_history`: ",
    class(model_fcm_history$U_history),
    "; length: ",
    length(model_fcm_history$U_history),
    "; number of AO iterations: ",
    model_fcm_history$counter
  )
)
#> Class of `model_fcm_history$U_history`: list; length: 32; number of AO iterations: 32

To retrieve a specific intermediate result, for example the prototype matrix \(\hat{V}^{(3)}\) obtained at the third AO iteration, index the corresponding history object directly:

model_fcm_history$V_history[[3]]
#>      Sepal.Length Sepal.Width
#> [1,]     5.521657    3.103216
#> [2,]     5.967252    3.095491
#> [3,]     6.038396    2.977654