Module mixture :: Class MixtureModel

Class MixtureModel

source code


Class for context-specific independence (CSI) mixture models. The components are naive Bayes models (i.e., ProductDistribution objects).

Instance Methods
 
__init__(self, G, pi, components, compFix=None, struct=0, identifiable=1)
Constructor
source code
 
__eq__(self, other)
Interface for the '==' operation
source code
 
__copy__(self)
Interface for the copy.copy function
source code
 
__str__(self)
String representation of the mixture model
source code
 
initStructure(self)
Initializes the CSI structure.
source code
 
modelInitialization(self, data, rtype=1, missing_value=None)
Perform model initialization given a random assignment of the data to the components.
source code
 
pdf(self, x)
Density function.
source code
 
sample(self)
Samples a single value from the distribution.
source code
 
sampleSet(self, nr)
Samples several values from the distribution.
source code
 
sampleDataSet(self, nr)
Returns a DataSet object of size 'nr'.
source code
 
sampleDataSetLabels(self, nr)
Samples a DataSet of size 'nr' and returns the DataSet and the true component labels.
source code
 
sampleSetLabels(self, nr)
Same as sampleSet, but the component labels are returned as well.
source code
 
EM(self, data, max_iter, delta, silent=False, mix_pi=None, mix_posterior=None, tilt=0, EStep=None, EStepParam=None)
Reestimation of mixture parameters using the EM algorithm.
source code
 
EStep_old(self, data, mix_posterior=None, mix_pi=None, EStepParam=None)
[Old implementation, kept around for regression testing]
source code
 
EStep(self, data, mix_posterior=None, mix_pi=None, EStepParam=None)
Expectation step of the EM algorithm.
source code
 
randMaxEM(self, data, nr_runs, nr_steps, delta, tilt=0, silent=False)
Performs `nr_runs` normal EM runs with random initial parameters and returns the model which yields the maximum likelihood.
source code
 
structureEM(self, data, nr_repeats, nr_runs, nr_steps, delta, tilt=0, silent=False)
EM training for models with CSI structure.
source code
 
MStep(self, posterior, data, mix_pi=None)
Maximization step of the EM procedure.
source code
 
mapEM(self, data, prior, max_iter, delta, silent=False, mix_pi=None, mix_posterior=None, tilt=0)
Reestimation of maximum a posteriori (MAP) mixture parameters using the EM algorithm.
source code
 
classify(self, data, labels=None, entropy_cutoff=None, silent=0, EStep=None, EStepParam=None)
Classification of input 'data'.
source code
 
isValid(self, x)
Exhaustive check whether a given DataSet is compatible with the model.
source code
 
formatData(self, x)
Formats samples 'x' for inclusion into a DataSet object.
source code
 
reorderComponents(self, order)
Reorders the components into a new order.
source code
 
identifiable(self)
To provide identifiability, the components are ordered by the mixture coefficient in ascending order.
source code
 
flatStr(self, offset)
Returns the model parameters as a string compatible with the WriteMixture/ReadMixture flat file format.
source code
 
printClusterEntropy(self, data)
Print out cluster stability measured by the entropy of the component membership posterior.
source code
 
posteriorTraceback(self, x)
Returns the decoupled posterior distribution for each sample in 'x'.
source code
 
printTraceback(self, data, z, en_cut=1.01)
Prints out the posterior traceback.
source code
 
update_suff_p(self)
Updates the .suff_p field.
source code
 
updateStructureGlobal(self, data, silent=1)
Updates the CSI structure by choosing the merge with the smallest KL distance, optimizing the AIC score.
source code
 
minimalStructure(self)
Finds redundant components in the model structure and collapses the structure to a minimal representation.
source code
 
removeComponent(self, ind)
Deletes a component from the model.
source code
 
merge(self, dlist, weights)
Merges 'self' with the distributions in 'dlist' by a convex combination of the parameters, as determined by 'weights'.
source code
 
printStructure(self, data=None)
Pretty-prints the model structure.
source code
 
updateFreeParams(self)
Updates the number of free parameters for the current group structure.
source code
 
validStructure(self)
Checks whether the CSI structure is syntactically correct.
source code
 
sufficientStatistics(self, posterior, data)
Returns sufficient statistics for a given data set and posterior.
source code
Method Details

__init__(self, G, pi, components, compFix=None, struct=0, identifiable=1)
(Constructor)

source code 

Constructor

Parameters:
  • G - number of components
  • pi - mixture weights
  • components - list of ProductDistribution objects, each entry is one component
  • compFix - list of optional flags for fixing components in the reestimation. The following values are supported: 1 = distribution parameters are fixed, 2 = distribution parameters and mixture coefficients are fixed
  • struct - Flag for CSI structure, 0 = no CSI structure, 1 = CSI structure
Overrides: ProbDistribution.__init__
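
A minimal construction sketch. ProductDistribution components are documented above; the Gaussian atomic distribution NormalDistribution(mu, sigma) used here is an assumption for illustration and may differ in name or signature:

    import mixture

    # Two univariate components, each a ProductDistribution wrapping one
    # (assumed) NormalDistribution(mean, standard deviation).
    c1 = mixture.ProductDistribution([mixture.NormalDistribution(0.0, 1.0)])
    c2 = mixture.ProductDistribution([mixture.NormalDistribution(4.0, 1.5)])

    # G = 2 components with mixture weights pi = [0.6, 0.4]
    m = mixture.MixtureModel(2, [0.6, 0.4], [c1, c2])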

__eq__(self, other)
(Equality operator)

source code 

Interface for the '==' operation

Parameters:
  • other - object to be compared
Overrides: ProbDistribution.__eq__
(inherited documentation)

__copy__(self)

source code 

Interface for the copy.copy function

Overrides: ProbDistribution.__copy__
(inherited documentation)

__str__(self)
(Informal representation operator)

source code 

String representation of the mixture model

Returns:
string representation
Overrides: ProbDistribution.__str__
(inherited documentation)

modelInitialization(self, data, rtype=1, missing_value=None)

source code 

Perform model initialization given a random assignment of the data to the components.

Parameters:
  • data - DataSet object
  • rtype - type of random assignment: 0 = fuzzy assignment, 1 = hard assignment
  • missing_value - missing symbol to be ignored in parameter estimation (if applicable)
Returns:
posterior assignments
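
A usage sketch, continuing the construction example above. The DataSet.fromList initializer used here is an assumption; consult the DataSet documentation for the supported ways of filling a DataSet:

    # Hypothetical one-dimensional data set
    dat = mixture.DataSet()
    dat.fromList([[0.3], [0.1], [4.2], [3.9], [-0.5]])   # assumed initializer

    # Random hard assignment (rtype=1) of samples to components, followed by
    # parameter estimation from that assignment; returns the posterior assignments.
    post = m.modelInitialization(dat, rtype=1)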

pdf(self, x)

source code 

Density function. MUST accept either a numpy array or a DataSet object of appropriate values. We use numpy arrays as input for the atomic distributions for efficiency reasons. (The cleaner solution would be to construct DataSet subset objects for the different features, and we might switch over to doing that eventually.)

Parameters:
  • data - DataSet object or numpy array
Returns:
log-value of the density function for each sample in 'data'
Overrides: ProbDistribution.pdf
(inherited documentation)
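
Continuing the sketches above (model 'm', DataSet 'dat'):

    # One log-density value per sample in 'dat'
    logp = m.pdf(dat)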

sample(self)

source code 

Samples a single value from the distribution.

Returns:
sampled value
Overrides: ProbDistribution.sample
(inherited documentation)

sampleSet(self, nr)

source code 

Samples several values from the distribution.

Parameters:
  • nr - number of values to be sampled.
Returns:
sampled values
Overrides: ProbDistribution.sampleSet
(inherited documentation)

sampleDataSet(self, nr)

source code 

Returns a DataSet object of size 'nr'.

Parameters:
  • nr - size of DataSet to be sampled
Returns:
DataSet object

sampleDataSetLabels(self, nr)

source code 

Samples a DataSet of size 'nr' and returns the DataSet and the true component labels

Parameters:
  • nr - size of DataSet to be sampled
Returns:
tuple of DataSet object and list of labels

sampleSetLabels(self, nr)

source code 

Same as sampleSet, but the component labels are returned as well. Useful mostly for testing purposes.
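
A sampling sketch for the methods above, reusing the model 'm' from the construction example:

    x = m.sample()                              # single sampled value
    xs = m.sampleSet(10)                        # list of 10 sampled values
    dat2 = m.sampleDataSet(100)                 # DataSet with 100 samples
    dat3, labels = m.sampleDataSetLabels(100)   # DataSet plus true component labels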

EM(self, data, max_iter, delta, silent=False, mix_pi=None, mix_posterior=None, tilt=0, EStep=None, EStepParam=None)

source code 

Reestimation of mixture parameters using the EM algorithm.

Parameters:
  • data - DataSet object
  • max_iter - maximum number of iterations
  • delta - minimal difference in likelihood between two iterations before convergence is assumed.
  • silent - 0/1 flag, toggles verbose output
  • mix_pi - [internal use only] necessary for the reestimation of mixtures as components
  • mix_posterior - [internal use only] necessary for the reestimation of mixtures as components
  • tilt - 0/1 flag, toggles the use of a deterministic annealing in the training
  • EStep - function implementing the EStep, by default self.EStep
  • EStepParam - additional parameters for more complex EStep implementations
Returns:
tuple of posterior matrix and log-likelihood from the last iteration
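
A training sketch with arbitrary illustrative settings, reusing 'm' and 'dat' from above:

    # At most 40 iterations; stop once the log-likelihood improves by less
    # than 0.1 between iterations.
    posterior, loglik = m.EM(dat, 40, 0.1)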

EStep_old(self, data, mix_posterior=None, mix_pi=None, EStepParam=None)

source code 

[Old implementation, kept around for regression testing]

Expectation step of the EM algorithm.

Parameters:
  • data - DataSet object
  • mix_posterior - [internal use only] necessary for the reestimation of mixtures as components
  • mix_pi - [internal use only] necessary for the reestimation of mixtures as components
  • EStepParam - additional parameters for more complex EStep implementations; ignored in this implementation
Returns:
tuple of log likelihood matrices and sum of log-likelihood of components

EStep(self, data, mix_posterior=None, mix_pi=None, EStepParam=None)

source code 

Expectation step of the EM algorithm.

Parameters:
  • data - DataSet object
  • mix_posterior - [internal use only] necessary for the reestimation of mixtures as components
  • mix_pi - [internal use only] necessary for the reestimation of mixtures as components
  • EStepParam - additional parameters for more complex EStep implementations; ignored in this implementation
Returns:
tuple of log likelihood matrices and sum of log-likelihood of components

randMaxEM(self, data, nr_runs, nr_steps, delta, tilt=0, silent=False)

source code 

Performs `nr_runs` normal EM runs with random initial parameters and returns the model which yields the maximum likelihood.

Parameters:
  • data - DataSet object
  • nr_runs - number of repeated random initializations
  • nr_steps - maximum number of steps in each run
  • delta - minimum difference in log-likelihood before convergence
  • tilt - 0/1 flag, toggles the use of a deterministic annealing in the training
  • silent - 0/1 flag, toggles verbose output
Returns:
log-likelihood of winning model
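
A usage sketch with arbitrary illustrative settings, reusing 'm' and 'dat' from above:

    # 5 EM runs from random initializations, at most 30 steps each;
    # 'm' ends up with the parameters of the best run.
    best_loglik = m.randMaxEM(dat, 5, 30, 0.1)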

structureEM(self, data, nr_repeats, nr_runs, nr_steps, delta, tilt=0, silent=False)

source code 

EM training for models with CSI structure. First a candidate model is generated using the randMaxEM procedure; then the structure is trained.

Parameters:
  • data - DataSet object
  • nr_repeats - number of candidate models to be generated
  • nr_runs - number of repeated random initializations
  • nr_steps - maximum number of steps for the long training run
  • delta - minimum difference in log-likelihood before convergence
  • tilt - 0/1 flag, toggles the use of deterministic annealing in the training
  • silent - 0/1 flag, toggles verbose output
Returns:
log-likelihood of winning model
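
A usage sketch, assuming the model was constructed with struct=1 so that it carries a CSI structure; the numeric settings are arbitrary illustrations:

    # 3 candidate models, each picked from 5 random initializations,
    # followed by structure learning on the winning candidate.
    best_loglik = m.structureEM(dat, 3, 5, 30, 0.1)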

MStep(self, posterior, data, mix_pi=None)

source code 

Maximization step of the EM procedure. Reestimates the distribution parameters using the posterior distribution and the data.

MUST accept either a numpy array or a DataSet object of appropriate values. numpy arrays are used as input for the atomic distributions for efficiency reasons.

Parameters:
  • posterior - posterior distribution of component membership
  • data - DataSet object or numpy array of samples
  • mix_pi - mixture weights, necessary for MixtureModels as components.
Overrides: ProbDistribution.MStep
(inherited documentation)

mapEM(self, data, prior, max_iter, delta, silent=False, mix_pi=None, mix_posterior=None, tilt=0)

source code 

Reestimation of maximum a posteriori (MAP) mixture parameters using the EM algorithm.

Parameters:
  • data - DataSet object
  • max_iter - maximum number of iterations
  • prior - an appropriate MixtureModelPrior object
  • delta - minimal difference in likelihood between two iterations before convergence is assumed.
  • silent - 0/1 flag, toggles verbose output
  • mix_pi - [internal use only] necessary for the reestimation of mixtures as components
  • mix_posterior - [internal use only] necessary for the reestimation of mixtures as components
  • tilt - 0/1 flag, toggles the use of a deterministic annealing in the training
Returns:
tuple of posterior matrix and log-likelihood from the last iteration

classify(self, data, labels=None, entropy_cutoff=None, silent=0, EStep=None, EStepParam=None)

source code 

Classification of input 'data'. Assignment to mixture components by maximum likelihood over the component membership posterior. No parameter reestimation.

Parameters:
  • data - DataSet object
  • labels - optional sample IDs
  • entropy_cutoff - entropy threshold for the posterior distribution. Samples which fall above the threshold will remain unassigned
  • silent - 0/1 flag, toggles verbose output
  • EStep - function implementing the EStep, by default self.EStep
  • EStepParam - additional paramenters for more complex EStep implementations
Returns:
list of class labels
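
A usage sketch; the entropy cutoff of 0.5 is an arbitrary illustration:

    # Assign each sample in 'dat' to a component; samples whose posterior
    # entropy exceeds 0.5 remain unassigned.
    labels = m.classify(dat, entropy_cutoff=0.5)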

isValid(self, x)

source code 

Exhaustive check whether a given DataSet is compatible with the model. If 'self' is a lower-hierarchy mixture, 'x' is a single data sample in external representation.

Parameters:
  • x - single sample in external representation, i.e., an entry of DataSet.dataMatrix
Returns:
True/False flag
Overrides: ProbDistribution.isValid

formatData(self, x)

source code 

Formats samples 'x' for inclusion into a DataSet object. Used by DataSet.internalInit().

Parameters:
  • x - list of samples
Returns:
two element list: first element = dimension of self, second element = sufficient statistics for samples 'x'
Overrides: ProbDistribution.formatData
(inherited documentation)

reorderComponents(self, order)

source code 

Reorders the components into a new order.

Parameters:
  • order - list of indices giving the new order
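
A usage sketch for the two-component model built above:

    # Swap the two components; the list gives the new order by old index.
    m.reorderComponents([1, 0])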

flatStr(self, offset)

source code 

Returns the model parameters as a string compatible with the WriteMixture/ReadMixture flat file format.

Parameters:
  • offset - number of ' ' characters to be used in the flatfile.
Overrides: ProbDistribution.flatStr
(inherited documentation)
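
A persistence sketch. The module-level writeMixture/readMixture helpers and their signatures are assumptions based on the flat file format mentioned above; flatStr itself is mainly used internally by the writer:

    s = m.flatStr(0)                          # flat-file string, no indentation
    mixture.writeMixture(m, 'model.mix')      # assumed helper
    m2 = mixture.readMixture('model.mix')     # assumed helper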

printClusterEntropy(self, data)

source code 

Print out cluster stability measured by the entropy of the component membership posterior.

Parameters:
  • data - DataSet object

posteriorTraceback(self, x)

source code 

Returns the decoupled posterior distribution for each sample in 'x'. Used for analysis of clustering results.

Parameters:
  • x - list of samples
Returns:
decoupled posterior
Overrides: ProbDistribution.posteriorTraceback
(inherited documentation)

printTraceback(self, data, z, en_cut=1.01)

source code 

Prints out the posterior traceback, i.e. a detailed account of the contribution to the component membership posterior of each sample in each feature ordered by a clustering.

Parameters:
  • data - DataSet object
  • z - class labels
  • en_cut - entropy threshold

update_suff_p(self)

source code 

Updates the .suff_p field.

Overrides: ProbDistribution.update_suff_p
(inherited documentation)

updateStructureGlobal(self, data, silent=1)

source code 

Updates the CSI structure by choosing the merge with the smallest KL distance, optimizing the AIC score. This was the first approach implemented for CSI structure learning; using the Bayesian approach instead is strongly recommended.

Parameters:
  • data - DataSet object
  • silent - verbosity flag
Returns:
number of structure changes

removeComponent(self, ind)

source code 

Deletes a component from the model.

Parameters:
  • ind - index of the component to be removed

merge(self, dlist, weights)

source code 

Merges 'self' with the distributions in 'dlist' by a convex combination of the parameters, as determined by 'weights'.

Parameters:
  • dlist - list of distribution objects of the same type as 'self'
  • weights - list of weights; they need not sum to one
Overrides: ProbDistribution.merge
(inherited documentation)

validStructure(self)

source code 

Checks whether the CSI structure is syntactically correct. Mostly for debugging.

sufficientStatistics(self, posterior, data)

source code 

Returns sufficient statistics for a given data set and posterior.

Parameters:
  • posterior - numpy vector of component membership posteriors
  • data - numpy vector holding the data
Returns:
list with dot(posterior, data) and dot(posterior, data**2)
Overrides: ProbDistribution.sufficientStatistics
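
A small numerical illustration of the returned statistics, computed directly with numpy rather than through the method:

    import numpy

    posterior = numpy.array([0.2, 0.7, 0.1])   # component membership posteriors
    data = numpy.array([1.0, 2.0, 3.0])        # one-dimensional data vector

    stats = [numpy.dot(posterior, data), numpy.dot(posterior, data ** 2)]
    # stats[0] = posterior-weighted sum of the data,
    # stats[1] = posterior-weighted sum of the squared data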