cluster
Bindings for protocols::cluster namespace
-
class pyrosetta.rosetta.protocols.cluster.APCluster
Bases: pybind11_builtins.pybind11_object
Public interface for doing affinity propagation clustering.
Based on Frey and Dueck, “Clustering by Passing Messages Between Data Points”, Science 315 (2007). Useful for choosing a set of representative data points (exemplars) out of a large set (e.g. all decoys from a large Rosetta run) given a measure of similarity (e.g. RMSD, maxsub, GDT, …).
As I understand it, this procedure tries to maximize the sum of similarities between each data point and its exemplar, while balancing that against the total number of clusters. Reasonable measures of similarity would be negative RMSD, log-likelihoods, or negative squared distance (i.e. negative squared error), depending on what the points represent. Note there is no requirement for symmetry: s(i,j) need not equal s(j,i). The self-similarity s(k,k) (“preference”) for each point controls the likelihood it will be selected as an exemplar, and thus indirectly controls the total number of clusters. There is no way to specify the number of clusters directly. The authors suggest that using the median of all other similarities will give a moderate number of clusters, and using the minimum of the other similarities will give a small number of clusters.
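The trade-off described above can be illustrated without PyRosetta. The following is a minimal pure-Python sketch, not the APCluster implementation: the helper names (`negative_squared_distance`, `off_diagonal`, `net_similarity`) and the six 1-D points are made up for illustration, and indices are 0-based here for convenience. It scores a few hand-written assignments by summing s(i,k) for each point and its exemplar, with each exemplar contributing its preference s(k,k).

```python
from statistics import median

def negative_squared_distance(xs):
    """Dense similarity matrix s[i][k] = -(x_i - x_k)^2 for 1-D points."""
    return [[-(a - b) ** 2 for b in xs] for a in xs]

def off_diagonal(sim):
    """All similarities s(i,k) with i != k."""
    n = len(sim)
    return [sim[i][k] for i in range(n) for k in range(n) if i != k]

def net_similarity(sim, exemplar_of, preference):
    """Sum of s(i, exemplar_of[i]); an exemplar contributes its preference."""
    total = 0.0
    for i, k in enumerate(exemplar_of):
        total += preference if i == k else sim[i][k]
    return total

# Two well-separated groups of three points each (hypothetical data).
xs = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
sim = negative_squared_distance(xs)
others = off_diagonal(sim)

moderate_pref = median(others)  # suggested for a moderate number of clusters
small_pref = min(others)        # suggested for a small number of clusters

# Hand-written candidate assignments: exemplar_of[i] is the exemplar of point i.
one_cluster = [1, 1, 1, 1, 1, 1]
two_clusters = [1, 1, 1, 4, 4, 4]
six_clusters = [0, 1, 2, 3, 4, 5]

for name, assignment in [("one", one_cluster), ("two", two_clusters), ("six", six_clusters)]:
    print(name, net_similarity(sim, assignment, moderate_pref))
```

With the median preference, the two-cluster assignment scores highest of the three. Raising the preference toward zero makes the six-cluster assignment win instead, which is the sense in which the preference indirectly controls the number of clusters.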
This implementation is designed for clustering very large numbers of data points with sparse similarity [ s(i,k) = -Inf for most i,k ]. Similarities for each input point are kept in a heap so that you can limit to only the L highest for each. (This scheme is quite likely to break symmetry, as some points will have more close neighbors than others.) Alternatively, you may choose to do your own pre-filtering and only enter the G globally highest similarities between any points in the data set. Run time (per cycle) is linear in the number of similarities, or O(N^2) in the limit of a dense similarity matrix.
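To show how a heap can cap the number of similarities kept per point, here is a small sketch using Python's standard `heapq` module. It illustrates the general bounded-heap technique only; it is not taken from the Rosetta source, and `keep_top_similarities` and the sample values are hypothetical.

```python
import heapq

def keep_top_similarities(candidates, max_sims):
    """Keep the max_sims highest (s_ik, k) pairs for one point.

    candidates is an iterable of (k, s_ik). A min-heap of bounded size holds
    the current best entries, so the weakest retained similarity is always at
    heap[0] and can be evicted cheaply when a better one arrives.
    """
    heap = []
    for k, s_ik in candidates:
        if len(heap) < max_sims:
            heapq.heappush(heap, (s_ik, k))
        elif s_ik > heap[0][0]:
            heapq.heapreplace(heap, (s_ik, k))
    return sorted(heap, reverse=True)  # highest similarity first

# Candidate exemplars k with similarities s(i,k) for a single point i.
candidates = [(2, -1.5), (3, -0.2), (4, -7.0), (5, -0.9)]
top2 = keep_top_similarities(candidates, max_sims=2)
print(top2)
```

Because each point keeps its own top-L list, point i may retain s(i,k) while point k drops s(k,i), which is the loss of symmetry mentioned above.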
I follow the conventions of the original paper, where “i” is the index of some generic data point, and “k” is the index of a data point being considered as an exemplar (cluster center).
-
__delattr__
¶ Implement delattr(self, name).
-
__dir__
() → list¶ default dir() implementation
-
__eq__
¶ Return self==value.
-
__format__
()¶ default object formatter
-
__ge__
¶ Return self>=value.
-
__getattribute__
¶ Return getattr(self, name).
-
__gt__
¶ Return self>value.
-
__hash__
¶ Return hash(self).
-
__init__
(*args, **kwargs)¶ Overloaded function.
- __init__(self: pyrosetta.rosetta.protocols.cluster.APCluster, arg0: int) -> None
- __init__(self: pyrosetta.rosetta.protocols.cluster.APCluster, total_pts: int, max_sims_per_pt: int) -> None
-
__init_subclass__
()¶ This method is called when a class is subclassed.
The default implementation does nothing. It may be overridden to extend subclasses.
-
__le__
¶ Return self<=value.
-
__lt__
¶ Return self<value.
-
__ne__
¶ Return self!=value.
-
__new__
()¶ Create and return a new object. See help(type) for accurate signature.
-
__reduce__
()¶ helper for pickle
-
__reduce_ex__
()¶ helper for pickle
-
__repr__
¶ Return repr(self).
-
__setattr__
¶ Implement setattr(self, name, value).
-
__sizeof__
() → int¶ size of object in memory, in bytes
-
__str__
¶ Return str(self).
-
__subclasshook__
()¶ Abstract classes can override this to customize issubclass().
This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).
-
assign
(self: pyrosetta.rosetta.protocols.cluster.APCluster, : pyrosetta.rosetta.protocols.cluster.APCluster) → pyrosetta.rosetta.protocols.cluster.APCluster¶ C++: protocols::cluster::APCluster::operator=(const class protocols::cluster::APCluster &) –> class protocols::cluster::APCluster &
-
cluster
(self: pyrosetta.rosetta.protocols.cluster.APCluster, maxits: int, convits: int, lambda: float) → bool¶ Run the actual clustering algorithm.
C++: protocols::cluster::APCluster::cluster(unsigned long, unsigned long, double) –> bool
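One practical point about the signature above: `lambda` is a reserved word in Python, so writing `cluster(..., lambda=0.9)` is a syntax error. The sketch below passes all three arguments positionally instead. `run_clustering` is a hypothetical helper, and `ap` is assumed to be an APCluster instance (or any object exposing the same `cluster` method).

```python
import keyword

# `lambda` cannot appear as a keyword argument name in a call expression.
LAMBDA_IS_RESERVED = keyword.iskeyword("lambda")

def run_clustering(ap, maxits, convits, damping):
    """Run ap.cluster with positional arguments and return its bool result.

    `damping` is forwarded as the third argument, which the binding names
    `lambda`.
    """
    return ap.cluster(maxits, convits, damping)
```

A call would then look like `run_clustering(ap, maxits, convits, damping)`, with values chosen for the data set at hand.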
-
get_all_exemplars
(self: pyrosetta.rosetta.protocols.cluster.APCluster, exemplars: pyrosetta.rosetta.utility.vector1_unsigned_long) → None¶ Return the indices of data points chosen as exemplars (cluster centers).
C++: protocols::cluster::APCluster::get_all_exemplars(class utility::vector1<unsigned long, class std::allocator<unsigned long> > &) const –> void
-
get_cluster_for
(self: pyrosetta.rosetta.protocols.cluster.APCluster, k: int, cluster: pyrosetta.rosetta.utility.vector1_unsigned_long) → None¶ Returns the indices of points with the specified exemplar k. Note that k is the index of an (input) data point that was chosen as an exemplar, not some “cluster index” between 1 and get_num_exemplars().
C++: protocols::cluster::APCluster::get_cluster_for(unsigned long, class utility::vector1<unsigned long, class std::allocator<unsigned long> > &) const –> void
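Both get_all_exemplars and get_cluster_for return None and fill a vector passed in by the caller. A minimal sketch of gathering the results into a plain dictionary follows. `collect_clusters` is a hypothetical helper; `make_vector` is assumed to be a callable that returns an empty, iterable vector, which in PyRosetta would presumably be `pyrosetta.rosetta.utility.vector1_unsigned_long`.

```python
def collect_clusters(ap, make_vector):
    """Return {exemplar index: [member indices]} for a clustered APCluster.

    ap          -- object exposing get_all_exemplars and get_cluster_for
    make_vector -- callable returning an empty output vector (assumption:
                   the vector supports iteration from Python)
    """
    exemplars = make_vector()
    ap.get_all_exemplars(exemplars)  # fills `exemplars` in place

    clusters = {}
    for k in exemplars:
        members = make_vector()
        ap.get_cluster_for(k, members)  # fills `members` in place
        clusters[int(k)] = [int(i) for i in members]
    return clusters
```

The dictionary keys are the exemplars' data point indices, in line with the note above that k is not a cluster index.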
-
get_exemplar_for
(self: pyrosetta.rosetta.protocols.cluster.APCluster, i: int) → int¶ Return the index of the point that is the exemplar for point i.
C++: protocols::cluster::APCluster::get_exemplar_for(unsigned long) const –> unsigned long
-
get_net_sim
(self: pyrosetta.rosetta.protocols.cluster.APCluster) → float¶ The sum of similarities s(i,k) between every data point i and its exemplar k, plus the self-preferences of the exemplars. The algorithm should maximize this value; if it dips and climbs again, increase lambda.
C++: protocols::cluster::APCluster::get_net_sim() const –> double
-
get_num_exemplars
(self: pyrosetta.rosetta.protocols.cluster.APCluster) → int¶ The number of exemplars selected (number of clusters). Monotonically related to the self-preferences s(k,k).
C++: protocols::cluster::APCluster::get_num_exemplars() const –> unsigned long
-
load_binary
(self: pyrosetta.rosetta.protocols.cluster.APCluster, filename: str) → bool¶ Wipes all currently held data and reads in similarity values and cluster assignments. Afterwards, points may be re-clustered with different parameters if desired. File format is custom binary and is not portable (host endianness).
C++: protocols::cluster::APCluster::load_binary(const class std::basic_string<char> &) –> bool
-
num_pts
(self: pyrosetta.rosetta.protocols.cluster.APCluster) → int¶ C++: protocols::cluster::APCluster::num_pts() const –> unsigned long
-
save_binary
(self: pyrosetta.rosetta.protocols.cluster.APCluster, filename: str) → bool¶ Saves the (sparse) similarity matrix and current cluster assignments (if any), but not the accumulated evidence from the last clustering [ r(i,k) and a(i,k) ]. File format is custom binary and is not portable (host endianness).
C++: protocols::cluster::APCluster::save_binary(const class std::basic_string<char> &) const –> bool
-
set_sim
(self: pyrosetta.rosetta.protocols.cluster.APCluster, i: int, k: int, sim: float) → None¶ How appropriate is k as an exemplar for i?
C++: protocols::cluster::APCluster::set_sim(unsigned long, unsigned long, double) –> void
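As a rough sketch of populating the similarities, the hypothetical helper below enters the negative of a distance for every ordered pair and a common preference on the diagonal. It assumes point indices run from 1 to the number of points, which is typical of Rosetta's vector1 containers but is not stated in this reference. `distance` is any user-supplied function, for example a pairwise RMSD.

```python
def fill_similarities(ap, distance, n_points, preference):
    """Enter a dense similarity matrix into ap via set_sim.

    ap         -- object exposing set_sim(i, k, sim)
    distance   -- callable distance(i, k) -> float (need not be symmetric)
    n_points   -- number of data points; indices are assumed to be 1..n_points
    preference -- self-similarity s(k,k) given to every point
    """
    for i in range(1, n_points + 1):
        for k in range(1, n_points + 1):
            if i == k:
                ap.set_sim(i, i, preference)
            else:
                # A larger distance means k is a less suitable exemplar for i.
                ap.set_sim(i, k, -distance(i, k))
```

For large data sets the double loop would normally be restricted to pairs that pass a pre-filter, so that most similarities stay unset.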
-
-
class pyrosetta.rosetta.protocols.cluster.DataPoint
Bases: pybind11_builtins.pybind11_object
Data structure for one input data point for affinity propagation clustering.
There should be one instance of this class for each input point. Fields are public because it’s a glorified struct – clients shouldn’t use this directly.
-
__delattr__
¶ Implement delattr(self, name).
-
__dir__
() → list¶ default dir() implementation
-
__eq__
¶ Return self==value.
-
__format__
()¶ default object formatter
-
__ge__
¶ Return self>=value.
-
__getattribute__
¶ Return getattr(self, name).
-
__gt__
¶ Return self>value.
-
__hash__
¶ Return hash(self).
-
__init__
(self: pyrosetta.rosetta.protocols.cluster.DataPoint, i_in: int) → None¶
-
__init_subclass__
()¶ This method is called when a class is subclassed.
The default implementation does nothing. It may be overridden to extend subclasses.
-
__le__
¶ Return self<=value.
-
__lt__
¶ Return self<value.
-
__ne__
¶ Return self!=value.
-
__new__
()¶ Create and return a new object. See help(type) for accurate signature.
-
__reduce__
()¶ helper for pickle
-
__reduce_ex__
()¶ helper for pickle
-
__repr__
¶ Return repr(self).
-
__setattr__
¶ Implement setattr(self, name, value).
-
__sizeof__
() → int¶ size of object in memory, in bytes
-
__str__
¶ Return str(self).
-
__subclasshook__
()¶ Abstract classes can override this to customize issubclass().
This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).
-
add_similarity
(self: pyrosetta.rosetta.protocols.cluster.DataPoint, k: int, s_ik: float, max_sims: int) → None¶ Set similarity s(i,k), the suitability of point k to be an exemplar for this point.
C++: protocols::cluster::DataPoint::add_similarity(unsigned long, double, unsigned long) –> void
-
is_set_s_kk
(self: pyrosetta.rosetta.protocols.cluster.DataPoint) → bool¶ C++: protocols::cluster::DataPoint::is_set_s_kk() const –> bool
-
-
class pyrosetta.rosetta.protocols.cluster.Exemplar
Bases: pybind11_builtins.pybind11_object
Data structure for one similarity measurement (s_ik) for affinity propagation clustering.
There will be one instance of this class for each (finite) similarity between two input points, up to a maximum of N*N instances if the similarity matrix is fully populated.
-
__delattr__
¶ Implement delattr(self, name).
-
__dir__
() → list¶ default dir() implementation
-
__eq__
¶ Return self==value.
-
__format__
()¶ default object formatter
-
__ge__
¶ Return self>=value.
-
__getattribute__
¶ Return getattr(self, name).
-
__gt__
¶ Return self>value.
-
__hash__
¶ Return hash(self).
-
__init__
(self: pyrosetta.rosetta.protocols.cluster.Exemplar, k_in: int, s_ik_in: float) → None¶
-
__init_subclass__
()¶ This method is called when a class is subclassed.
The default implementation does nothing. It may be overridden to extend subclasses.
-
__le__
¶ Return self<=value.
-
__lt__
¶ Return self<value.
-
__ne__
¶ Return self!=value.
-
__new__
()¶ Create and return a new object. See help(type) for accurate signature.
-
__reduce__
()¶ helper for pickle
-
__reduce_ex__
()¶ helper for pickle
-
__repr__
¶ Return repr(self).
-
__setattr__
¶ Implement setattr(self, name, value).
-
__sizeof__
() → int¶ size of object in memory, in bytes
-
__str__
¶ Return str(self).
-
__subclasshook__
()¶ Abstract classes can override this to customize issubclass().
This is invoked early on by abc.ABCMeta.__subclasscheck__(). It should return True, False or NotImplemented. If it returns NotImplemented, the normal algorithm is used. Otherwise, it overrides the normal algorithm (and the outcome is cached).
-
min_heap
(a: pyrosetta.rosetta.protocols.cluster.Exemplar, b: pyrosetta.rosetta.protocols.cluster.Exemplar) → bool¶ “Less than” (actually greater than) comparator for making a heap of exemplars
C++: protocols::cluster::Exemplar::min_heap(class protocols::cluster::Exemplar, class protocols::cluster::Exemplar) –> bool
-