Title: | Gaussian Mixture Modeling Algorithms and the Belief-Based Mixture Modeling |
---|---|
Description: | Two partially supervised mixture modeling methods: soft-label and belief-based modeling are implemented. For completeness, we equipped the package also with the functionality of unsupervised, semi- and fully supervised mixture modeling. The package can be applied also to selection of the best-fitting from a set of models with different component numbers or constraints on their structures. For detailed introduction see: Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software <doi:10.18637/jss.v047.i03>. |
Authors: | Przemyslaw Biecek and Ewa Szczurek |
Maintainer: | Przemyslaw Biecek <[email protected]> |
License: | GPL-3 |
Version: | 1.8.6 |
Built: | 2024-10-13 03:02:33 UTC |
Source: | https://github.com/pbiecek/bgmm |
This package implements partially supervised mixture modeling methods: soft-label and belief-based modeling, the semi-supervised methods and for completeness also unsupervised and fully supervised methods for mixture modeling.
For short overview see the webpage http://bgmm.molgen.mpg.de/rapBGMM/.
Przemyslaw Biecek and Ewa Szczurek
Maintainer: Przemyslaw Biecek <[email protected]>
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
Package for unsupervised learning of Gaussian mixture model link{mclust}
,
methods for supervised learning link{MASS::lda()}
, link{MASS::qda()}
.
## Do not run ## It could take more than one minute #demo(bgmm)
## Do not run ## It could take more than one minute #demo(bgmm)
Time course expression data for 384 cell cycle genes (Cho et al., 1998). Literature examples of genes that should, and of genes that should not peak at each time point are given. For each cycle phase, there is a characteristic binary profile, stating when the phase occurs.
data(CellCycle)
data(CellCycle)
CellCycleData list: 17x384 CellCycleBeliefs list: 17x (35x2) CellCycleCenters matrix: 5x17 CellCycleClass vector: 384
CellCycleData
: A list, where each entry corresponds to one
time-point. A given time point entry contains a vector with expression
ratios for 384 cell cycle genes measured in this time point.
CellCycleBeliefs
: A list, where each entry corresponds to one
time-point. A given time point entry gives the certainty
(belief/plausibility) for each out of 35 example genes. Out of the
genes, seven are known to peak in this time point and the remaining
28 are known to peak in other cycle phases.
CellCycleCenters
: A matrix, where the columns are the 17
time-points and the rows to the five cell phase clusters. A given
entry in the matrix is equal to 1 if the genes from the cluster
should peak in the time point, and 0 otherwise.
CellCycleClass
:Gives the true cluster for each gene. Each
cluster corresponds to a cell cycle phase.
Ewa Szczurek
Cho, R., Campbell, M., Winzeler, E., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T., Gabrielian, A., Landsman, D., Lockhart, D., and Davis, R. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Molecular Cell, 2(1), 65–73.
library(bgmm) data(CellCycle) print(CellCycleData) print(CellCycleBeliefs) print(CellCycleCenters) print(CellCycleClass)
library(bgmm) data(CellCycle) print(CellCycleData) print(CellCycleBeliefs) print(CellCycleCenters) print(CellCycleClass)
The function chooseModels extracts a sublist of models that match constraints on the number of components or on the model structure. The function chooseOptimal returns the model which is the best according the given model selection criteria.
chooseModels(models, kList = NULL, struct = NULL) chooseOptimal(models, penalty=2, ...)
chooseModels(models, kList = NULL, struct = NULL) chooseOptimal(models, penalty=2, ...)
models |
an object of the class |
kList |
a vector which specifies the requested numbers of Gaussian components (constraints on the number of components). |
struct |
a vector which specifies four letter abbreviations of names of the requested model structures (constraints on the model structure). |
penalty |
a penalty parameter in the GIC criteria. This parameter can be a single number or a string, either "BIC" or "AIC". |
... |
other arguments that will be passed to the getGIC function. |
The function chooseModels() extracts a sublist of models from the models
argument.
The returned sublist is also an object of the class mModelList
and is composed of models that simultaneously satisfy the restrictions of the number of Gaussian components defined by kList
and restrictions of the model structure defined by struct
.
If the argument kList
is set to NULL then no restrictions of the number of components are applied, same with the argument struct
.
The function chooseOptimal() returns an object of the class mModel
which is the single model that has the best (smallest) GIC score.
An object of the class mModelList
or mModel
.
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
simulated = simulateData(d=2, k=3, n=100, m=50, cov="0", within="E", n.labels=2) models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, kList=2:4, mean="D", within="D") plotGIC(models3, penalty="BIC") ## Do not run ## It could take more than one minute # simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) # # models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D") # plotGIC(models3, penalty="BIC") # # models4 = chooseModels(models3, kList=2:5, struct=c("DDDD","DDED","DDE0")) # plot(models4) # plotGIC(models4, penalty="BIC") # # model4 = chooseOptimal(models3, "BIC") # plot(model4)
simulated = simulateData(d=2, k=3, n=100, m=50, cov="0", within="E", n.labels=2) models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, kList=2:4, mean="D", within="D") plotGIC(models3, penalty="BIC") ## Do not run ## It could take more than one minute # simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) # # models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D") # plotGIC(models3, penalty="BIC") # # models4 = chooseModels(models3, kList=2:5, struct=c("DDDD","DDED","DDE0")) # plot(models4) # plotGIC(models4, penalty="BIC") # # model4 = chooseOptimal(models3, "BIC") # plot(model4)
The function crossval()
performes k-fold cross-validation.
crossval(model = NULL, X = NULL, knowns = NULL, class = NULL, k = length(unique(class)), B = NULL, P = NULL, model.structure = getModelStructure(), ..., folds = 2, fun = belief)
crossval(model = NULL, X = NULL, knowns = NULL, class = NULL, k = length(unique(class)), B = NULL, P = NULL, model.structure = getModelStructure(), ..., folds = 2, fun = belief)
model |
an object of the class |
X |
a data.frame with unknown realizations. If not supplied |
knowns |
a data.frame with labeled realizations. If not supplied |
class , B , P
|
a vector of classes, beliefs and plausibilities. If not supplied they will be extracted from the |
fun |
function that will be used for modeling, one of |
model.structure , k , ...
|
arguments that will be passed to |
folds |
number of folds in k-fold cross validation. Cannot be grated that number of labeled samples. |
The function crossval()
divides the dataset into k
equal subsets, the number of labeled cases versus number of unlabeled cases is keep as close to constant as possible (the subset are generated with stratification).
Then each subset is used as test set against a train set build from all remaining sets. In total k
new models are estimated thus this procedure is time consuming.
For each model the error is calculated as average absolute differences between the distribution of estimated posteriors and distribution of beliefs/plausibilities for labeled cases.
The list with three vectors: errors calculated as mean absolute differences between estimated posteriors and initial beliefs for known cases, indexes of folds for both labeled and unlabeled cases.
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
set.seed(1313) simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) amodel = belief(X=simulated$X, knowns=simulated$knowns, B=simulated$B, k=4) str(crossval(model=amodel, folds=6)) amodel = supervised(knowns=rbind(simulated$X, simulated$knowns), class=simulated$Ytrue) str(crossval(model=amodel, folds=6, fun=supervised))
set.seed(1313) simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) amodel = belief(X=simulated$X, knowns=simulated$knowns, B=simulated$B, k=4) str(crossval(model=amodel, folds=6)) amodel = supervised(knowns=rbind(simulated$X, simulated$knowns), class=simulated$Ytrue) str(crossval(model=amodel, folds=6, fun=supervised))
The DEprobs
function is an application of mixture modeling to differential gene expression analysis. The function takes as input a two- or three-component model of one-dimensional gene expression data. The data is assumed to represent log fold change expression values and be negative when the corresponding genes are down-regulated. The function calculates probabilities of differential expression for the data and gives them a sign according to the sign of the data.
DEprobs(model, verbose=FALSE)
DEprobs(model, verbose=FALSE)
model |
an object of the class |
verbose |
indicates whether log messeges should be prited out. |
Given the input model
, the function identifies the component which corresponds to the differentially expressed genes as the one which looks differential according to the posterior probabilities.
For input models with two Gaussian components the differential component should be the one with a broader range (encompassing the other), or the one with higher deviation from 0 (we assume the data are centered around 0).
For input models with three Gaussian components there are two differential components: one corresponding to the down-regulated genes, and one corresponding to the up-regulated genes. Those components are identified as the ones with the lowest and the highest mean, respectively.
For verbose=TRUE
the index of the differential component is printed out.
An list with the following elements:
diff.p.X |
a vector with the calculated signed differential expression probabilities for the unlabeled observations in the dataset |
diff.p.knowns |
a vector with the calculated signed differential expression probabilities for the unlabeled observations in the dataset |
diff.c |
the index (or two indexes, in case of a three-component input model) of the identified differential component. |
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
data(Ste12) X = Ste12Data[ match(names(Ste12Data), rownames(Ste12Beliefs), nomatch = 0) ==0 ] knowns = Ste12Data[rownames(Ste12Beliefs)] model = belief(X=X, knowns=knowns, B=Ste12Beliefs) dep=DEprobs(model) str(dep)
data(Ste12) X = Ste12Data[ match(names(Ste12Data), rownames(Ste12Beliefs), nomatch = 0) ==0 ] knowns = Ste12Data[rownames(Ste12Beliefs)] model = belief(X=X, knowns=knowns, B=Ste12Beliefs) dep=DEprobs(model) str(dep)
The genotypes dataset describes 333 SNPs. Each SNP is characterized by the presence of one of its two possible alleles (or the presence of both of them). Therefore, the SNPs can be divided into three types. The first type corresponds to the SNPs with the first possible allele, the second type with the second allele, and the third with both alleles. The presence of the alleles is measured experimentally with fluorescence intensities. The dataset contains the intensities in the slots X
and knowns
.
15 SNPs in the dataset are given their correct type. These 'known' SNPs can be used as the input for the semi-, partially and fully superised modeling methods. They were selected by taking at random five SNPs per each type. Their intensities are contained in the slot knowns
. Their belief/plausibility values (given in the slot B
) of the most probable type (the slot labels
) are set to 0.95, and of the other two types are equal 0.025.
The remaining 318 SNPs are kept unlabeled.
data(genotypes)
data(genotypes)
X : a matrix with 318 rows (unlabeled SNPs) and 2 columns (alleles) knowns : a matrix with 15 rows (known SNPs) and 2 columns (alleles) B : a matrix with 15 rows (known SNPs) and 3 columns (types) labels : a vector of length 15 (types of the known SNPs)
The rows of both the slots X
and knowns
correspond to the SNPs. For each SNP, the values in the columns represent the intensities of the fluorescence signal corresponding to the alleles of the SNP. The slot B
corresponds to the belief matrix while labels
contains the true types for the labeled SNPs.
Takitoh, S. Fujii, S. Mase, Y. Takasaki, J. Yamazaki, T. Ohnishi, Y. Yanagisawa, M. Nakamura, Y. Kamatani, N., Accurate automated clustering of two-dimensional data for single-nucleotide polymorphism genotyping by a combination of clustering methods: evaluation by large-scale real data, Bioinformatics (2007) Vol. 23, 408–413.
library(bgmm) data(gnotypes)
library(bgmm) data(gnotypes)
This function creates an object which describes constraints over the model parameters.
getModelStructure(mean = "D", between = "D", within = "D", cov = "D")
getModelStructure(mean = "D", between = "D", within = "D", cov = "D")
Each argument is a single character, by default equal "D" (Different). If argument is set to "E" (or "0" for the argument cov
) then the given parameter is constrained. By default all arguments are set to "D".
mean |
|
between |
|
within |
|
cov |
|
List of four elements specifying the constraints on 1) relations between the component means, 2) relations between the covariance matrices of the model components, 3) relations within each covariance matrix and 4) the covariances within each matrix. By default, the function returns an unconstrained structure.
Ewa Szczurek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
getModelStructure() getModelStructure(mean="E")
getModelStructure() getModelStructure(mean="E")
Methods for the initiation of model parameters for the EM algorithm. Two initiation procedures are implemented. The first procedure is available by setting the argument method='knowns'
. It takes into account only labeled observations and is thus suitable for datasets with a high percentage of labeled cases. The second is available by setting method='all'
and does not take the labeling into account.
init.model.params(X = NULL, knowns = NULL, class = NULL, k = length(unique(class)), method = "all", B = P, P = NULL)
init.model.params(X = NULL, knowns = NULL, class = NULL, k = length(unique(class)), method = "all", B = P, P = NULL)
X |
a |
knowns |
a |
B |
a beliefs matrix with the distribution of beliefs for the labeled observations. If not specified and the argument |
P |
a matrix of plausibilities, specified only for the labeled observations. The function assumes that the remaining observations are unlabeled and gives them
uniformly distributed plausibilities by default. If not specified and the argument |
class |
|
k |
the desired number of model components. |
method |
a method for parameter initialization, one of following |
For method='knowns'
, the initialization is based only on the labeled observations. i.e. those observations which have certain or probable components assigned. The initial model parameters
for each component are estimated in one step from the observations that are assigned to this component (as in fully supervised learning).
If method='all'
(default), the initialization is based on all observations. In this case, to obtain the initial set of model components, we start by clustering the data using the k-means algorithm (repeated 10 times to get stable results). The only exception is for one dimensional data. In such a case the clusters are identified by dividing the data into k
equal subsets of observations, where the subsets are separated by empirical quantiles c(1/2k, 3/2k, 5/2k, ..., (2k-1)/2k). After this initial clustering each cluster is linked to one model component and initial values for the model parameters are derived from the clustered observations.
For the partially and semi-supervised methods, correspondence of labels from the initial clustering algorithm and labels for the observations in the knowns
dataset rises a technical problem. The cluster corresponding to component y
should be as close as possible to the set of labeled observations with label y
.
Note that for the unsupervised modeling this problem is irrelevant and any cluster may be used to initialize any component.
To mach the cluster labels with the labels of model components a greedy heuristic is used. The heuristic calculates weighted distances between all possible pairs of cluster centers and sets of observations grouped by their labels. In each step, the pair with a minimal distance is chosen (the pair: a group of observations with a common label and a cluster, for which the center of the group is the closest to the center of the cluster). For the chosen pair, the cluster is labeled with the same label as the group of observations. Then, this pair is removed and the heuristic repeats for the reduced set of pairs.
A list with the following elements:
pi |
a vector of length |
mu |
a matrix with the means' vectors with the initial values for |
cvar |
a three-dimensional matrix with the covariance matrices with the initial values for |
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
data(genotypes) initial.params = init.model.params(X=genotypes$X, knowns=genotypes$knowns, class = genotypes$labels) str(initial.params)
data(genotypes) initial.params = init.model.params(X=genotypes$X, knowns=genotypes$knowns, class = genotypes$labels) str(initial.params)
miRNA transfection data (Lim et al., 2005) and knowledge from computational miRNA target predictions.
data(miRNA)
data(miRNA)
miR1Data vector: 117, miR124Data vector: 117, miRNABeliefs matrix of example certainty: 26 x 2, miRNAClass vector: 117
miR1Data
Log2 expression ratios of miR1 transfection versus
wild type, for 117 genes.
miR124Data
Log2 expression ratios of miR124 transfection versus
wild type, for 117 genes.
miRNABeliefs
Gives the certainty (belief/plausibility) for
each out of 26 example miRNA targets to belong to their cluster.
miRNAClass
Gives the true cluster for each gene. Cluster 1
corresponds to the experimentally verified targets of miR1. Cluster 2
corresponds to the targets of miR124.
Ewa Szczurek
Lim, L. P., Lau, N. C., Garrett-Engele, P., Grimson, A., Schelter, J. M., Castle, J., Bartel, D. P., Linsley, P. S., and Johnson, J. M. (2005). Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature, 433(7027).
library(bgmm) data(miRNA) print(miR1Data) print(miR124Data) print(miRNABeliefs) print(miRNAClass)
library(bgmm) data(miRNA) print(miR1Data) print(miR124Data) print(miRNABeliefs) print(miRNAClass)
These functions fit different variants of Gaussian mixture models. These variants differ in the fraction of knowledge utilized into the the fitting procedure.
belief(X, knowns, B = NULL, k = ifelse(!is.null(B), ncol(B), ifelse(!is.null(P), ncol(P), length(unique(class)))), P = NULL, class = map(B), init.params = init.model.params(X, knowns, B = B, P = P, class = class, k = k), model.structure = getModelStructure(), stop.likelihood.change = 10^-5, stop.max.nsteps = 100, trace = FALSE, b.min = 0.025, all.possible.permutations=FALSE, pca.dim.reduction = NA) soft(X, knowns, P = NULL, k = ifelse(!is.null(P), ncol(P), ifelse(!is.null(B), ncol(B), length(unique(class)))), B = NULL, class = NULL, init.params = init.model.params(X, knowns, class = class, B = P, k = k), model.structure = getModelStructure(), stop.likelihood.change = 10^-5, stop.max.nsteps = 100, trace = FALSE, b.min = 0.025, all.possible.permutations=FALSE, pca.dim.reduction = NA, ...) semisupervised(X, knowns, class = NULL, k = ifelse(!is.null(class), length(unique(class)), ifelse(!is.null(B), ncol(B), ncol(P))), B = NULL, P = NULL, ..., init.params = NULL, all.possible.permutations=FALSE, pca.dim.reduction = NA) supervised(knowns, class = NULL, k = length(unique(class)), B = NULL, P = NULL, model.structure = getModelStructure(), ...) unsupervised(X, k, init.params=init.model.params(X, knowns=NULL, k=k), model.structure=getModelStructure(), stop.likelihood.change=10^-5, stop.max.nsteps=100, trace=FALSE, ...)
belief(X, knowns, B = NULL, k = ifelse(!is.null(B), ncol(B), ifelse(!is.null(P), ncol(P), length(unique(class)))), P = NULL, class = map(B), init.params = init.model.params(X, knowns, B = B, P = P, class = class, k = k), model.structure = getModelStructure(), stop.likelihood.change = 10^-5, stop.max.nsteps = 100, trace = FALSE, b.min = 0.025, all.possible.permutations=FALSE, pca.dim.reduction = NA) soft(X, knowns, P = NULL, k = ifelse(!is.null(P), ncol(P), ifelse(!is.null(B), ncol(B), length(unique(class)))), B = NULL, class = NULL, init.params = init.model.params(X, knowns, class = class, B = P, k = k), model.structure = getModelStructure(), stop.likelihood.change = 10^-5, stop.max.nsteps = 100, trace = FALSE, b.min = 0.025, all.possible.permutations=FALSE, pca.dim.reduction = NA, ...) semisupervised(X, knowns, class = NULL, k = ifelse(!is.null(class), length(unique(class)), ifelse(!is.null(B), ncol(B), ncol(P))), B = NULL, P = NULL, ..., init.params = NULL, all.possible.permutations=FALSE, pca.dim.reduction = NA) supervised(knowns, class = NULL, k = length(unique(class)), B = NULL, P = NULL, model.structure = getModelStructure(), ...) unsupervised(X, k, init.params=init.model.params(X, knowns=NULL, k=k), model.structure=getModelStructure(), stop.likelihood.change=10^-5, stop.max.nsteps=100, trace=FALSE, ...)
X |
a data.frame with the unlabeled observations. The rows correspond to the observations while the columns to variables/dimensions of the data. |
knowns |
a data.frame with the labeled observations. The rows correspond to the observations while the columns to variables/dimensions of the data. |
B |
a beliefs matrix which specifies the distribution of beliefs for the labeled observations. The number of rows in B should equal the number of rows in the data.frame |
P |
a matrix of plausibilities, i.e., weights of the prior probabilities for the labeled observations. If matrix |
class |
a vector of classes/labels for the labeled observations. The number of its unique values has to be less or equal |
k |
a number of components, by default equal to the number of columns of |
init.params |
initial values for the estimates of the model parameters (means, variances and mixing proportions), by default derived with the
use of the |
stop.likelihood.change , stop.max.nsteps
|
the parameters for the EM algorithms defining the stop criteria, i.e., the minimum required improvement of loglikelihood and the maximum number of steps. |
trace |
if |
model.structure |
an object returned by the |
b.min |
this argument is passed to the |
... |
these arguments will be passed tothe |
all.possible.permutations |
If equal |
pca.dim.reduction |
Since the fitting for high dimensional space is numerically a bad idea an attempt to PCA will be performed if |
In the belief()
function, if the argument B
is not provided, it is
by default initialized from the argument P
. If the argument P
is not
provided, B
is derived from the class
argument with the use of the function get.simple.beliefs()
which assigns 1-(k-1)*b.min
to the component given by class
and
b.min
to all remaining components.
In the soft()
function, if the argument P
is not provided, it is
by default initialized from the argument B
. If the argument B
is not
provided, P
is derived from the class
argument as in the belief()
function.
In the supervised()
function, if the argument class
is not provided,
it is by default initialized from argument B
or P
, taking the label of each
observation as its most believed or plausible component (by the MAP rule).
The number of columns in the beliefs matrix B
or in the matrix of
plausibilities P
may be smaller than the number of model components
defined by the argument k
. Such situation corresponds to the scenario
when the user does not know any examples for some component. In other words, this component is not used as a label for
any observation, and thus can be omitted from the beliefs matrix. An
equivalent would be to include a column for this component and fill it
with beliefs/plausibilities equal 0.
Slots in the returned object are listed in section Value.
The returned object differs slighty with respect to the used function. Namely, the belief()
function returns an object with the slot B
. The function soft()
returns an object with a slot P
, while the functions supervised()
and semisupervised()
return objects with a slot class
instead.
The object returned by the function supervised()
does not have the slot X
.
An object of the class mModel
, with the following slots:
pi |
a vector with the fitted mixing proportions |
mu |
a matrix with the means' vectors, fitted for all components |
cvar |
a three-dimensional matrix with the covariance matrices, fitted for all components |
X |
the unlabeled observations |
knowns |
the labeled observations |
B |
the beliefs matrix |
n |
the number of all observations |
m |
the number of the unlabeled observations |
k |
the number of fitted model components |
d |
the data dimension |
likelihood |
the log-likelihood of the fitted model |
n.steps |
the number of steps performed by the EM algorithm |
model.structure |
the set of constraints kept during the fitting process. |
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
data(genotypes) modelSupervised = supervised(knowns=genotypes$knowns, class=genotypes$labels) plot(modelSupervised) modelSemiSupervised = semisupervised(X=genotypes$X, knowns=genotypes$knowns, class = genotypes$labels) plot(modelSemiSupervised) modelBelief = belief(X=genotypes$X, knowns=genotypes$knowns, B=genotypes$B) plot(modelBelief) modelSoft = soft(X=genotypes$X, knowns=genotypes$knowns, P=genotypes$B) plot(modelSoft) modelUnSupervised = unsupervised(X=genotypes$X, k=3) plot(modelUnSupervised)
data(genotypes) modelSupervised = supervised(knowns=genotypes$knowns, class=genotypes$labels) plot(modelSupervised) modelSemiSupervised = semisupervised(X=genotypes$X, knowns=genotypes$knowns, class = genotypes$labels) plot(modelSemiSupervised) modelBelief = belief(X=genotypes$X, knowns=genotypes$knowns, B=genotypes$B) plot(modelBelief) modelSoft = soft(X=genotypes$X, knowns=genotypes$knowns, P=genotypes$B) plot(modelSoft) modelUnSupervised = unsupervised(X=genotypes$X, k=3) plot(modelUnSupervised)
These functions fit collection of models of one particular variant/class. Models to be fitted may differ in the requested number of Gaussian components or in the requested model structure.
mModelList(X, knowns, B = NULL, P = NULL, class = NULL, kList = ncol(B), init.params = NULL, stop.likelihood.change = 10^-5, stop.max.nsteps = 100, trace = FALSE, mean = c("D", "E"), between = c("D", "E"), within = c("D", "E"), cov = c("D", "0"), funct = belief, all.possible.permutations = FALSE, ...) beliefList(..., funct=belief) softList(..., funct=soft) semisupervisedList(..., funct=semisupervised) unsupervisedList(X, kList = 2, ...)
mModelList(X, knowns, B = NULL, P = NULL, class = NULL, kList = ncol(B), init.params = NULL, stop.likelihood.change = 10^-5, stop.max.nsteps = 100, trace = FALSE, mean = c("D", "E"), between = c("D", "E"), within = c("D", "E"), cov = c("D", "0"), funct = belief, all.possible.permutations = FALSE, ...) beliefList(..., funct=belief) softList(..., funct=soft) semisupervisedList(..., funct=semisupervised) unsupervisedList(X, kList = 2, ...)
X |
a data.frame with the unlabeled observations. The rows correspond to the observations while the columns to variables/dimensions of the data. |
knowns |
a data.frame with the labeled observations. The rows correspond to the observations while the columns to variables/dimensions of the data. |
B |
a beliefs matrix which specifies the distribution of beliefs for the labeled observations. The number of rows in B should equal the number of rows in the data.frame |
P |
a matrix of plausibilities, i.e., weights of the prior probabilities for the labeled observations. If matrix |
class |
a vector of classes/labels for the labeled observations. The number of its unique values has to be less or equal |
kList |
a vector or a list with numbers of Gaussian components to fit. By default it is one number equal to the number of columns of |
init.params |
initial values for the estimates of the model parameters (means, variances and mixing proportions). The initial parameters are internally passed to the |
stop.likelihood.change , stop.max.nsteps , trace
|
the parameters for the EM algorithm. Internally, these parameters are passed to the |
mean , between , within , cov
|
four vectors which define the model structures for models to be fitted. For example, if |
funct |
a function which fits a variant of Gaussian mixture model, one of the: |
... |
arguments that are passed to function |
all.possible.permutations |
If equal |
Arguments kList
, as well as mean
, between
, within
, and cov
define the list of models to be fitted. All combinations of specified model sizes and model structures are considered. List of fitted models is returned as a result.
The argument funct
defines which variant of Gaussian mixture models should be used for model fitting. One can use the wrappers beliefList()
, softList()
, semisupervisedList()
, unsupervisedList()
which call the mModelList()
function and have a prespecified argument funct
.
An object of the class mModelList, with the following slots:
models |
a list of models, each of the class |
loglikelihoods |
a vector with log likelihoods of the models from list |
names |
a vector with names of the models from list |
params |
a vector with the number of parameters of models from list |
kList |
equals the input argument |
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
simulated = simulateData(d=2, k=3, n=100, m=60, cov="0", within="E", n.labels=2) models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, kList=3:4, mean=c("D","E"), between="D", within="D", cov="0", funct=belief) plot(models1) plotGIC(models1, penalty="BIC") ## Do not run ## It could take more than one minute # simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) # # models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=3, mean=c("D","E"), between=c("D","E"), within=c("D","E"), # cov=c("D","0"), funct=belief) # plot(models1) # plotGIC(models1, penalty="BIC") # # models2 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D", between="D", within="E", cov="0") # plot(models2) # plotGIC(models2, penalty="BIC") # # models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D") # plotGIC(models3, penalty="BIC")
simulated = simulateData(d=2, k=3, n=100, m=60, cov="0", within="E", n.labels=2) models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, kList=3:4, mean=c("D","E"), between="D", within="D", cov="0", funct=belief) plot(models1) plotGIC(models1, penalty="BIC") ## Do not run ## It could take more than one minute # simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) # # models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=3, mean=c("D","E"), between=c("D","E"), within=c("D","E"), # cov=c("D","0"), funct=belief) # plot(models1) # plotGIC(models1, penalty="BIC") # # models2 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D", between="D", within="E", cov="0") # plot(models2) # plotGIC(models2, penalty="BIC") # # models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D") # plotGIC(models3, penalty="BIC")
The generic function plot
is used to visualize the data set and Gaussian model components fitted to this data. On the resulting plot the observations without labels are presented with black points, whereas the labeled observations are marked by different colors and different symbols.
The fitted Gaussian components are represented by ellipses into the two-dimensional case and by densities in the one dimensional case.
If data has more than two dimensions thus graphs are presented on the subspace generated by first two PCA components. Note that the estimation is done in higher dimension and the reduction to 2D is done only for illustration.
That gives different results than data reduction prior to modeling process.
## S3 method for class 'mModel' plot(x, ...)
## S3 method for class 'mModel' plot(x, ...)
x |
an object of the class |
... |
graphical arguments that are passed to the underlying |
For one dimensional data the width of the density corresponds to standard deviation of the fitted Gaussian component. Fitted means are marked by vertical dashed lines.
For two dimensional data ellipses represents covariances for the corresponding model components.
For more dimensional points and ellipses are projected into 2D subspace spanned by first two PCA components.
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
data(genotypes) modelSupervised = supervised(knowns=genotypes$knowns, class=genotypes$labels) plot(modelSupervised) # semi-supervised modeling modelSemiSupervised = semisupervised(X=genotypes$X, knowns=genotypes$knowns, class = genotypes$labels) plot(modelSemiSupervised) # belief-based modeling modelBelief = belief(X=genotypes$X, knowns=genotypes$knowns, B=genotypes$B) plot(modelBelief) # soft-label modeling modelSoft = soft(X=genotypes$X, knowns=genotypes$knowns, P=genotypes$B) plot(modelSoft) # unsupervised modeling modelUnSupervised = unsupervised(X=genotypes$X, k=3) plot(modelUnSupervised)
data(genotypes) modelSupervised = supervised(knowns=genotypes$knowns, class=genotypes$labels) plot(modelSupervised) # semi-supervised modeling modelSemiSupervised = semisupervised(X=genotypes$X, knowns=genotypes$knowns, class = genotypes$labels) plot(modelSemiSupervised) # belief-based modeling modelBelief = belief(X=genotypes$X, knowns=genotypes$knowns, B=genotypes$B) plot(modelBelief) # soft-label modeling modelSoft = soft(X=genotypes$X, knowns=genotypes$knowns, P=genotypes$B) plot(modelSoft) # unsupervised modeling modelUnSupervised = unsupervised(X=genotypes$X, k=3) plot(modelUnSupervised)
The function plot.mModelList()
creates a grid of panels and then plots a set of input fitted models in the consecutive panels. The plot.mModel()
function is used to plot each single model.
## S3 method for class 'mModelList' plot(x, ...)
## S3 method for class 'mModelList' plot(x, ...)
x |
an object of the class |
... |
graphical arguments that are passed to underlying |
The argument x
is a list of models. If these models differ both by component numbers and by the model structures,
in the resulting grid of panels columns correspond to the different model structures while rows correspond to the different component numbers.
If considered models differ only by component numbers or only by the model structures, the grid of panels is as close to square as possible and consecutive panels contain consecutive models from the list of models x
.
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
simulated = simulateData(d=2, k=3, n=100, m=60, cov="0", within="E", n.labels=2) models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, kList=3:4, mean=c("D","E"), between="D", within="D", cov="0", funct=belief) plot(models1) ## Do not run ## It could take more than one minute # simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) # # models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=3, mean=c("D","E"), between=c("D","E"), within=c("D","E"), # cov=c("D","0"), funct=belief) # plot(models1) # # models2 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D", between="D", within="E", cov="0") # plot(models2) # # models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D") # plot(models3)
simulated = simulateData(d=2, k=3, n=100, m=60, cov="0", within="E", n.labels=2) models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, kList=3:4, mean=c("D","E"), between="D", within="D", cov="0", funct=belief) plot(models1) ## Do not run ## It could take more than one minute # simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) # # models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=3, mean=c("D","E"), between=c("D","E"), within=c("D","E"), # cov=c("D","0"), funct=belief) # plot(models1) # # models2 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D", between="D", within="E", cov="0") # plot(models2) # # models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D") # plot(models3)
The function plotGIC()
plots the GIC scores for an input collection of models. The function getGIC()
extracts GIC for given model and penalty function. The function getDF()
extracts the number of degree of freedom for model parameters.
plotGIC(models, penalty = 2, plot.it = TRUE, ...) getGIC(model, p = 2, whichobs="unlabeled") getDF(model)
plotGIC(models, penalty = 2, plot.it = TRUE, ...) getGIC(model, p = 2, whichobs="unlabeled") getDF(model)
models |
an object of the class |
model |
an object of the class |
penalty |
a penalty for the GIC criteria. This parameter can be a single number or a string, on of the "BIC", "AIC", "AIC3", "AIC4", "AICc", "AICu", "CAIC", "BIC", "MDL", "CLC", "ICL-BIC", "AWE". |
p |
same as |
whichobs |
one of "unlabeled", "labeled", "all". This parameter specify which observations should be used in the likelihood and gic score calculation, |
plot.it |
a logical value, if |
... |
other arguments that will be passed to the getGIC function. |
The function plotGIC()
calculates the GIC scores for each model from the models
list and, given plot.it=TRUE
, plots a dotchart with the calculated GIC scores.
As a result the function plotGIC()
returns a matrix with the calculated GIC scores. This matrix or its submatrix can be used in next call of the plotGIC()
function as models
argument. The columns of the matrix correspond to different component numbers of the models, while the rows correspond to their structures. The structures are coded with four-letter strings. The letters refer, in order from left to right: first, the relation between the means' vectors of the components, which can either be equal (letter "E") or unconstrained ("D"). Second, the relation between covariance matrices, which can all either be equal ("E"), or unconstrained ("D"). Third, the relation between the data vector components (corresponding to data dimensions) within each covariance matrix, i.e. each covariance matrix can either have all variances equal to some constant and all covariances equal to some constant ("E") or can be unconstrained ("D"). Fourth, the covariances in each covariance matrix, which can either all be forced to equal 0 ("0") or be unconstrained ("D").
The best model, i.e. model with the smallest GIC score is marked with a star on the plotted chart.
The matrix with GIC scores calculated for the list of models specified by the models
argument.
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
simulated = simulateData(d=2, k=3, n=100, m=60, cov="0", within="E", n.labels=2) models1 = mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, kList=3:4, mean=c("D","E"), between="D", within="D", cov="0", funct=belief) plotGIC(models1, penalty="BIC") ## Do not run ## It could take more than one minute # simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) # # models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=3, mean=c("D","E"), between=c("D","E"), within=c("D","E"), # cov=c("D","0"), funct=belief) # plotGIC(models1, penalty="BIC") # # models2 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D", between="D", within="E", cov="0") # plotGIC(models2, penalty="BIC") # # models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D") # plotGIC(models3, penalty="BIC")
simulated = simulateData(d=2, k=3, n=100, m=60, cov="0", within="E", n.labels=2) models1 = mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, kList=3:4, mean=c("D","E"), between="D", within="D", cov="0", funct=belief) plotGIC(models1, penalty="BIC") ## Do not run ## It could take more than one minute # simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) # # models1=mModelList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=3, mean=c("D","E"), between=c("D","E"), within=c("D","E"), # cov=c("D","0"), funct=belief) # plotGIC(models1, penalty="BIC") # # models2 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D", between="D", within="E", cov="0") # plotGIC(models2, penalty="BIC") # # models3 = beliefList(X=simulated$X, knowns=simulated$knowns, B=simulated$B, # kList=2:7, mean="D") # plotGIC(models3, penalty="BIC")
For every row in the matrix X
the posterior probability of belonging to class i
is calculated.
## S3 method for class 'mModel' predict(object, X, knowns = NULL, B = NULL, P = NULL, ...)
## S3 method for class 'mModel' predict(object, X, knowns = NULL, B = NULL, P = NULL, ...)
object |
an object of the class |
X |
a |
knowns |
a |
P |
a matrix with plausibilities for object |
B |
a matrix with beliefs for object |
... |
all other arguments will be neglected. |
The matrix tij of posterior probabilities is calculated as normalized products of priors pi's and density of model components in values specified by rows of the matrix X
.
If arguments knowns
and B
are specified then the priors's for objects in knowns
are replaced by belief matrix B
.
If arguments knowns
and P
are specified then the priors's for objects in knowns
are multiplied by plausibility matrix P
.
An list with the following elements:
tij.X , tij.knowns
|
the matrix tij.X is a matrix with number of rows equal to number of rows in the matrix |
class.X , class.knowns
|
vactors of labels/classes obtained with the MAP rule. The vector |
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
http://bgmm.molgen.mpg.de
data(genotypes) modelSoft = soft(X=genotypes$X, knowns=genotypes$knowns, P=genotypes$B) preds = predict(modelSoft, X = genotypes$X) str(preds)
data(genotypes) modelSoft = soft(X=genotypes$X, knowns=genotypes$knowns, P=genotypes$B) preds = predict(modelSoft, X = genotypes$X) str(preds)
The function simulateData
generates an artificial dataset from a mixture of Gaussian components with a given set of parameters.
simulateData(d = 2, k = 4, n = 100, m = 10, mu = NULL, cvar = NULL, s.pi = rep(1/k, k), b.min = 0.02, mean = "D", between = "D", within = "D", cov = "D", n.labels = k)
simulateData(d = 2, k = 4, n = 100, m = 10, mu = NULL, cvar = NULL, s.pi = rep(1/k, k), b.min = 0.02, mean = "D", between = "D", within = "D", cov = "D", n.labels = k)
d |
the dimension of the data set, |
k |
the number of the model components, |
n |
the total number of observations, both labeled and unlabeled, |
mu |
a matrix with |
cvar |
a three-dimensional array with the dimensions ( |
s.pi |
a vector of |
mean , between , within , cov
|
constraints on the model structure. By default all are equal to "D". If other values are set, the parameters |
m |
the number of the observations, for which the beliefs are to be calculated, |
b.min |
the belief that an observation does not belong to a component. Formally, the belief bij for the observation i to belong to component j is equal |
n.labels |
the number of components used as labels, defining the number of columns in the resulting beliefs matrix. By default |
An list with the following elements:
X |
the matrix of size n-m rows and d columns with generated values of unlabeled observations, |
knowns |
the matrix of size m rows and d columns with generated values of labeled observations, |
B |
the belief matrix of the size m rows and k columns derived for knowns matrix, |
model.params |
the list of model parameters, |
Ytrue |
indexes of the true Gaussian components from which each observation was generated. Lables for knowns go first. |
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) model = belief(X = simulated$X, knowns = simulated$knowns, B=simulated$B) plot(model) simulated = simulateData(d=1, k=2, n=300, m=60, n.labels=2) model = belief(X = simulated$X, knowns = simulated$knowns, B=simulated$B) plot(model)
simulated = simulateData(d=2, k=3, n=300, m=60, cov="0", within="E", n.labels=2) model = belief(X = simulated$X, knowns = simulated$knowns, B=simulated$B) plot(model) simulated = simulateData(d=1, k=2, n=300, m=60, n.labels=2) model = belief(X = simulated$X, knowns = simulated$knowns, B=simulated$B) plot(model)
Ste12 knockout expression data (Roberts et al., 2002) and knowledge from a Ste12 binding experiment (Harbison et al., 2004) used for identifying Ste12 target genes under pheromone treatment.
data(Ste12)
data(Ste12)
Ste12Data vector: 601 Ste12Beliefs matrix of example certainty: 42 x 2 Ste12Binding vector: 42
Ste12Data
Log2 expression ratios of Ste12 knockout versus
wild type, both under 50nM alpha-factor treatment for 30min. This
data is for 601 genes that had more than 1.5 fold change in expression
after pheromone treatment versus wild type.
Ste12Beliefs
: Gives the certainty (belief/plausibility) for
each out of 42 example Ste12 targets to belong to their cluster.
Ste12Beliefs
: Gives the certainty (belief/plausibility) for
each out of 42 example Ste12 targets to belong to their cluster. The
42 examples were chosen to meet two criteria: (1) Had a binding
p-value <0.0001 (see Ste12Binding
), and (2) Had a 2-fold change in response to pheromone treatment (versus wild-type)
Ste12Binding
: Gives the binding p-value for each example Ste12 target (see Ste12Belief
).
Ewa Szczurek
Roberts, C. J., Nelson, B., Marton, M. J., Stoughton, R., Meyer, M. R., Bennett, H. A., He, Y. D., Dai, H., Walker, W. L., Hughes, T. R., Tyers, M., Boone, C., and Friend, S. H. (2000). Signaling and Circuitry of Multiple MAPK Pathways Revealed by a Matrix of Global Gene Expression Profiles. Science, 287(5454), 873–880.
Harbison, C. T., Gordon, D. B., Lee, T. I., Rinaldi, N. J., Macisaac, K. D., Danford, T. W., Hannett, N. M., Tagne, J.-B., Reynolds, D. B., Yoo, J., Jennings, E. G., Zeitlinger, J., Pokholok, D. K., Kellis, M., Rolfe, P. A., Takusagawa, K. T., Lander, E. S., Gifford, D. K., Fraenkel, E., and Young, R. A. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004), 99–104.
data("Ste12") print(Ste12Data) print(Ste12Beliefs) print(Ste12Binding)
data("Ste12") print(Ste12Data) print(Ste12Beliefs) print(Ste12Binding)
Set of supplementary functions for bgmm
package.
## S3 method for class 'numeric' determinant(x, logarithm = TRUE, ...) map(B) loglikelihood.mModel(model, X)
## S3 method for class 'numeric' determinant(x, logarithm = TRUE, ...) map(B) loglikelihood.mModel(model, X)
x |
a single number. |
X |
a data.frame with the unlabeled observations, the rows correspond to the observations and the columns to the dimensions of the data. |
B |
a beliefs matrix with the distribution of beliefs for the labeled observations. |
model |
an object of the class mModel. |
logarithm , ...
|
these arguments are ignored. |
Przemyslaw Biecek
Przemyslaw Biecek, Ewa Szczurek, Martin Vingron, Jerzy Tiuryn (2012), The R Package bgmm: Mixture Modeling with Uncertain Knowledge, Journal of Statistical Software.
data(genotypes) map(genotypes$B)
data(genotypes) map(genotypes$B)