core

The PyRosettaCluster Python class facilitates scalable and reproducible job distribution of user-defined PyRosetta protocols efficiently parallelized on the user’s local workstation, high-performance computing (HPC) cluster, or elastic cloud computing infrastructure with available compute resources.

Warning: This class uses the pickle module to deserialize pickled Pose objects. Using the pickle module is not secure, so please only run with input files you trust. Learn more about the pickle module and its security in the Python documentation.

Args:
tasks: dict[str, Any] | Iterable[dict[str, Any]] | Callable[…, Iterable[dict[str, Any]]] | None

A list object of JSON-serializable dict objects, a callable returning an iterable of JSON-serializable dict objects, or a generator that yields JSON-serializable dict objects. During simulation execution, each task dictionary is automatically unpacked and collected by the variadic keyword parameter of each user-defined PyRosetta protocol; as a result, items may be accessed dynamically during PyRosetta protocol execution via the dictionary of keyword arguments, enabling dynamic control flow in PyRosetta protocols. To initialize PyRosetta with user-specified Rosetta command-line options at the start of each user-defined PyRosetta protocol, either or both of the “extra_options” and “options” keys may be defined in each task dictionary, where each value is either a str, list, or dict object representing the Rosetta command-line options.

Default: [{}]
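
For illustration, a minimal plain-Python sketch (no PyRosetta required) of the accepted forms of the tasks keyword argument value, and of how a task dictionary is unpacked into a protocol's variadic keyword parameter. The protocol name, option strings, and the "scaffold" key are hypothetical:

```python
import json

# Three equivalent ways to supply tasks: a list of dicts, a callable
# returning an iterable of dicts, or a generator yielding dicts.
tasks_list = [
    {"options": "-ex1", "extra_options": "-out:level 300", "scaffold": "A"},
    {"options": "-ex1 -ex2", "scaffold": "B"},
]

def tasks_callable():
    return iter(tasks_list)

def tasks_generator():
    for task in tasks_list:
        yield task

# Every task dictionary must be JSON-serializable.
for task in tasks_list:
    json.dumps(task)

# Each task dictionary is unpacked into the protocol's variadic keyword
# parameter, so items are accessible by key at protocol runtime.
def my_protocol(packed_pose=None, **kwargs):
    return kwargs["scaffold"]

assert my_protocol(None, **tasks_list[0]) == "A"
```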

nstruct: int

An int object specifying the number of repeats of the first user-defined PyRosetta protocol. The user can control the number of repeats of downstream PyRosetta protocols by returning multiple clones of any output decoys from any upstream PyRosetta protocols, or by cloning the input decoy multiple times inside a downstream PyRosetta protocol.

Default: 1

input_packed_pose: Pose | PackedPose | None

An input PackedPose object that is accessible via the first positional-or-keyword parameter of the first user-defined PyRosetta protocol.

Default: None

seeds: list[int] | None

A list of int objects specifying the PyRosetta pseudorandom number generator (RNG) seeds to use for each user-defined PyRosetta protocol. The length of the keyword argument value provided must be equal to the number of input user-defined PyRosetta protocols. Seeds are used in the same order that the user-defined PyRosetta protocols are executed.

Default: None

decoy_ids: list[int] | None

A list of int objects specifying the decoy identification numbers to keep after executing the user-defined PyRosetta protocols. User-defined PyRosetta protocols may return an iterable of Pose and/or PackedPose objects, or yield Pose and/or PackedPose objects. To reproduce a particular decoy produced via the chain of user-provided PyRosetta protocols, the decoy number to keep for each protocol may be specified, where other decoys are discarded. Decoy numbers use zero-based indexing, so 0 is the first decoy generated from a particular user-defined PyRosetta protocol. The length of the keyword argument value provided must be equal to the number of input user-defined PyRosetta protocols, so that one decoy is kept for each user-defined PyRosetta protocol. Decoy identification numbers are applied in the same order that the user-defined PyRosetta protocols are executed.

Default: None
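
A sketch of the length constraint shared by the seeds and decoy_ids keyword arguments (one entry per user-defined PyRosetta protocol, applied in execution order); the protocol names and decoy values below are placeholders:

```python
protocols = ["protocol_1", "protocol_2", "protocol_3"]  # hypothetical protocols
seeds = [111111111, 222222222, 333333333]  # one RNG seed per protocol
decoy_ids = [0, 2, 0]                      # zero-based decoy to keep per protocol

# Both keyword argument values must match the number of protocols.
assert len(seeds) == len(protocols)
assert len(decoy_ids) == len(protocols)

# decoy_ids uses zero-based indexing: keep the third decoy of protocol_2,
# discarding the others.
decoys_from_protocol_2 = ["decoy_a", "decoy_b", "decoy_c"]
kept = decoys_from_protocol_2[decoy_ids[1]]
assert kept == "decoy_c"
```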

client: distributed.Client | None

An initialized Dask distributed.Client object to be used as the Dask client interface to the local or remote Dask cluster. If None, then PyRosettaCluster initializes its own Dask client based on the scheduler keyword argument value. Deprecated by the clients keyword argument, but supported for legacy purposes. Either or both of the client or clients keyword argument values must be None.

Default: None

clients: list[distributed.Client] | tuple[distributed.Client, …] | None

A list or tuple object of initialized Dask distributed.Client objects to be used as the Dask client interface(s) to the local or remote Dask cluster(s). If None, then PyRosettaCluster initializes its own Dask client based on the scheduler keyword argument value. Optionally used in combination with the PyRosettaCluster.distribute(clients_indices=…) keyword argument. Either or both of the client or clients keyword argument values must be None. See the PyRosettaCluster.distribute method docstring for usage examples.

Default: None

scheduler: str | None

A str object of either “sge” or “slurm”, or None. If “sge”, then PyRosettaCluster schedules jobs using a SGECluster instance from the dask-jobqueue package. If “slurm”, then PyRosettaCluster schedules jobs using a SLURMCluster instance from the dask-jobqueue package. If None, then PyRosettaCluster schedules jobs using a distributed.LocalCluster instance. If client or clients keyword argument values are not None, then this keyword argument is ignored.

Default: None

cores: int | None

An int object specifying the total number of cores per job, which is passed to the dask_jobqueue.SLURMCluster(cores=…) or the dask_jobqueue.SGECluster(cores=…) keyword argument depending on the scheduler keyword argument value.

Default: 1

processes: int | None

An int object specifying the total number of processes per job, which is passed to the dask_jobqueue.SLURMCluster(processes=…) or the dask_jobqueue.SGECluster(processes=…) keyword argument depending on the scheduler keyword argument value. This feature determines how many Python processes each Dask worker job will run.

Default: 1

memory: str | None

A str object specifying the total amount of memory per job, which is passed to the dask_jobqueue.SLURMCluster(memory=…) or the dask_jobqueue.SGECluster(memory=…) keyword argument depending on the scheduler keyword argument value.

Default: “4g”

scratch_dir: str | None

A str object specifying the absolute filesystem path to a scratch directory where temporary files may be written.

Default: “/temp” if it exists, otherwise the current working directory.

min_workers: int | None

An int object specifying the minimum number of workers to which to adapt during parallelization of user-defined PyRosetta protocols.

Default: 1

max_workers: int | None

An int object specifying the maximum number of workers to which to adapt during parallelization of user-defined PyRosetta protocols.

Default: 1000 if the number of user-defined task dictionaries passed to the tasks keyword argument value is <1000, otherwise the number of user-defined task dictionaries.
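The default described above amounts to taking the larger of 1000 and the number of task dictionaries; a one-line sketch:

```python
def default_max_workers(num_tasks):
    # 1000 if there are fewer than 1000 task dictionaries,
    # otherwise one worker per task dictionary.
    return num_tasks if num_tasks >= 1000 else 1000

assert default_max_workers(250) == 1000
assert default_max_workers(5000) == 5000
```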

dashboard_address: str | None

A str object specifying the port over which the Dask dashboard is forwarded. Particularly useful for diagnosing PyRosettaCluster performance in real-time.

Default: “:8787”

project_name: str | None

A str object specifying the project name for this simulation. This keyword argument value is just added to the full simulation record for accounting purposes.

Default: datetime.now().strftime(“%Y.%m.%d.%H.%M.%S.%f”) if not specified, else “PyRosettaCluster” if None.

simulation_name: str | None

A str object specifying the particular name of this simulation. This keyword argument value is just added to the full simulation record for accounting purposes.

Default: project_name if not specified, else “PyRosettaCluster” if None

output_path: str | None

A str object specifying the absolute path of the output directory where the results will be written to disk. The directory will be created if it does not exist.

Default: “./outputs”

output_decoy_types: Iterable[str] | None

An iterable of str objects representing the output decoy filetypes to save during the simulation. Available options are: “.pdb” for PDB files; “.pkl_pose” for pickled Pose files; “.b64_pose” for Base64-encoded pickled Pose files; and “.init” for PyRosetta initialization files, each caching: the Rosetta command-line options (and PyRosetta initialization input files, if any) initialized on the head node, the input_packed_pose keyword argument value (if any), and an output decoy. Because each PyRosetta initialization file contains a copy of the PyRosetta initialization input files and input PackedPose object (if any), unless these objects are relatively small in size or relatively few output decoys are expected, it is recommended to omit “.init” from this iterable and instead run the pyrosetta.distributed.cluster.export_init_file function on only the output decoys of interest after the simulation completes. If the compressed keyword argument value is set to True, then each output decoy file is further compressed by the bzip2 library, and “.bz2” is automatically appended to the filename.

Default: [“.pdb”,]

output_scorefile_types: Iterable[str] | None

An iterable of str objects representing the output scorefile filetypes to save during the simulation. Available options are: “.json” for a JSON Lines (JSONL)-formatted scorefile, and any filename extensions accepted by the pandas.DataFrame.to_pickle(compression=”infer”) method (including “.gz”, “.bz2”, and “.xz”) for pickled pandas.DataFrame objects of scorefile data that may be analyzed using pyrosetta.distributed.cluster.io.secure_read_pickle(compression=”infer”). Note that in order to write pickled pandas.DataFrame objects to disk, please ensure that pyrosetta.secure_unpickle.add_secure_package(“pandas”) has first been run. If using pandas version >=3.0.0, PyArrow-backed datatypes may be enabled by default; in this case, please ensure that pyrosetta.secure_unpickle.add_secure_package(“pyarrow”) has also first been run.

See https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html and https://pandas.pydata.org/pdeps/0014-string-dtype.html for more information.

Default: [“.json”,]

scorefile_name: str | None

A str object specifying the name of the output JSON Lines (JSONL)-formatted scorefile, which must end in “.json”. The scorefile location is always `output_path` "/" `scorefile_name`. If “.json” is not in the output_scorefile_types keyword argument value, the JSONL-formatted scorefile will not be output, but other scorefile types (if any) will still get the same filename stem (i.e., before the “.json” extension).

Default: “scores.json”
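
Because the scorefile is JSON Lines (JSONL)-formatted, i.e., one JSON object per line, it can be parsed with the standard library alone; the score records below are fabricated for illustration:

```python
import json
import os
import tempfile

output_path = tempfile.mkdtemp()
scorefile_name = "scores.json"
# The scorefile location is always output_path "/" scorefile_name:
scorefile = os.path.join(output_path, scorefile_name)

# Write two hypothetical curtailed score records, one JSON object per line.
records = [{"total_score": -310.5}, {"total_score": -298.2}]
with open(scorefile, "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

# Read the JSONL-formatted scorefile back, line by line.
with open(scorefile) as f:
    scores = [json.loads(line) for line in f]

assert scores == records
```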

simulation_records_in_scorefile: bool | None

A bool object specifying whether or not to write full simulation records to the output scorefile(s). If True, then write full simulation records to the output scorefile(s). This results in some redundant information in each entry, allowing downstream reproduction of a decoy of interest from the scorefile, but a larger scorefile storage footprint. If False, then write curtailed simulation records to the scorefile(s) containing only the Pose.cache dictionary data. This results in minimally redundant information in each entry, disallowing downstream reproduction of a decoy of interest from the output scorefile(s), but a smaller scorefile storage footprint. If False, also write the active Conda/Mamba/uv/Pixi environment configuration to a separate output file in the output_path keyword argument value for accounting purposes. Full simulation records are always written to the output decoy files, which may still be used to reproduce any decoy without the scorefile.

Default: False

decoy_dir_name: str | None

A str object specifying the directory name where the output decoy files will be saved. The directory location is always determined by: `output_path` "/" `decoy_dir_name`.

Default: “decoys”

logs_dir_name: str | None

A str object specifying the directory name where the output log files will be saved. The directory location is always determined by: `output_path` "/" `logs_dir_name`.

Default: “logs”

logging_level: str | None

A str object specifying the logging level of Python logging output to write to the log file. The available options are either: “NOTSET”, “DEBUG”, “INFO”, “WARNING”, “ERROR”, or “CRITICAL”. The output log file location is always determined by: `output_path` "/" `logs_dir_name` "/" `simulation_name`.

Default: “INFO”

logging_address: str

A str object specifying the socket endpoint for sending and receiving log messages across a network, so log messages from user-defined PyRosetta protocols may be written to a single log file on the head node. The str object must take the format of a socket address (i.e., “<host address>:<port number>”), where the <host address> is either an IP address, “localhost”, or a Domain Name System (DNS)-accessible domain name, and the <port number> is an integer greater than or equal to 0. If the port number is “0”, then the next free port number is selected.

Default: “localhost:0” if the scheduler keyword argument value is None or either the client or clients keyword argument values specify instances of distributed.LocalCluster, else “0.0.0.0:0”.
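The socket-address format and the port-0 behavior can be sketched with the standard library (binding to port 0 asks the operating system for the next free port):

```python
import socket

logging_address = "localhost:0"
host, port_str = logging_address.rsplit(":", 1)
port = int(port_str)
assert port >= 0

# Port 0 means "pick the next free port": the OS assigns one on bind.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.bind((host, port))
    assigned_port = sock.getsockname()[1]

assert assigned_port > 0  # a concrete free port was selected
```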

compressed: bool | None

A bool object specifying whether or not to compress the output decoy files and output PyRosetta initialization files using the bzip2 library, resulting in the appending of “.bz2” to output decoy files and PyRosetta initialization files. Also see the output_decoy_types and output_init_file keyword arguments.

Default: True

compression: str | bool | None

A str object of either “xz”, “zlib” or “bz2”, or a bool or None object representing the internal compression library for pickled Pose objects and user-defined task dictionaries. The default of True uses “xz” for compression if it is installed, otherwise resorts to “zlib” for compression.

Default: True
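
A sketch of the fallback described above, using the standard-library lzma (“xz”) and zlib modules on pickled bytes; the payload here is a stand-in for a pickled Pose object, not PyRosetta's actual internals:

```python
import pickle

payload = pickle.dumps({"pose": "stand-in for a pickled Pose object"})

# Prefer "xz" if it is installed, otherwise fall back to "zlib".
try:
    import lzma as compressor  # "xz"
    method = "xz"
except ImportError:
    import zlib as compressor  # "zlib"
    method = "zlib"

compressed = compressor.compress(payload)

# Round-trip: decompression recovers the original pickled bytes.
assert compressor.decompress(compressed) == payload
```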

sha1: str | None

A str or None object specifying the Git commit SHA-1 hash string of the local Git repository defining the simulation. If a non-empty str object is provided, then it is validated to match the Git commit SHA-1 hash string of the most recent commit in the local Git repository checked out in the current working directory, and then it is added to the simulation record for accounting. If an empty string is provided, then ensure that everything in the current working directory is committed to the local Git repository. If None is provided, then bypass SHA-1 hash string validation and set this value to an empty string.

Default: “”

ignore_errors: bool | None

A bool object specifying whether or not to ignore raised Python exceptions and thrown Rosetta segmentation faults in the user-defined PyRosetta protocols. This comes in handy when well-defined errors are sparse and sporadic (such as rare Rosetta segmentation faults), and the user would like PyRosettaCluster to continue running without otherwise raising a WorkerError exception.

Default: False

timeout: float | int | None

A float or int object specifying how many seconds PyRosettaCluster waits between check-ins on the running user-defined PyRosetta protocols. If each PyRosetta protocol is expected to run quickly, then 0.1 seconds seems reasonable. If each PyRosetta protocol is expected to run slowly, then >1 second seems reasonable.

Default: 0.5

max_delay_time: float | int | None

A float or int object specifying the maximum number of seconds to sleep before returning the result(s) from each user-defined PyRosetta protocol back to the Dask client on the head node. If a Dask worker returns the result(s) from a PyRosetta protocol too quickly, the Dask scheduler needs to first register that the task is processing before it completes. In practice, in each PyRosetta protocol the runtime is subtracted from the max_delay_time keyword argument value, and the Dask worker sleeps for the remainder of the time (if any) before returning the result(s). It is recommended to set this option to at least 1 second, but longer times may be used as a safety throttle in cases of overwhelmed Dask scheduler processes. Because spawning a billiard subprocess for PyRosetta protocol execution may take ~3–5 seconds already before the PyRosetta protocol executes, this feature usually does not have an effect with the default value.

Default: 3.0
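
The throttling arithmetic described above amounts to sleeping for whatever remains of max_delay_time after the protocol's runtime; a minimal sketch:

```python
def remaining_delay(max_delay_time, protocol_runtime):
    # The protocol runtime is subtracted from max_delay_time; the Dask
    # worker sleeps for the remainder (if any) before returning results.
    return max(0.0, max_delay_time - protocol_runtime)

assert remaining_delay(3.0, 0.5) == 2.5   # fast protocol: sleep 2.5 s more
assert remaining_delay(3.0, 7.5) == 0.0   # slow protocol: no extra sleep
```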

filter_results: bool | None

A bool object specifying whether or not to filter out empty PackedPose objects between user-defined PyRosetta protocols. When a PyRosetta protocol returns or yields None, PyRosettaCluster converts it to an empty PackedPose object that gets bound to the first positional-or-keyword parameter of the next PyRosetta protocol. If True, then filter out any empty PackedPose objects where there are no residues in the conformation as given by PackedPose.empty(). If False, then continue to pass the empty PackedPose objects to the next PyRosetta protocol. This is used for filtering out decoys mid-trajectory in-between PyRosetta protocols if PyRosetta protocols return or yield any None, empty Pose, or empty PackedPose objects.

Default: True

save_all: bool | None

A bool object specifying whether or not to save all of the returned non-empty PackedPose objects from all user-defined PyRosetta protocols. This option may be used to checkpoint decoy trajectories after each PyRosetta protocol.

Default: False

dry_run: bool | None

A bool object specifying whether or not to save output decoy files to disk. If True, then do not write output decoy files to disk. This feature may be useful for debugging.

Default: False

norm_task_options: bool | None

A bool object specifying whether or not to normalize the task ‘options’ and ‘extra_options’ values after PyRosetta initialization on the remote compute cluster. If True, then this enables more facile simulation reproduction by the use of the ProtocolSettingsMetric SimpleMetric to normalize the PyRosetta initialization options and by relativization of any input files and directory paths to the current working directory from which the task is running.

Default: True

max_task_replicas: int | None

An int or None object specifying the replication factor of tasks on Dask workers within the network (only via Dask’s best effort). If an int object, the value must be greater than or equal to 0. If None, then attempt to replicate all tasks on each Dask worker. Tasks are automatically deleted from each Dask worker upon task completion. Task replication improves resilience of the simulation when compute resources executing tasks are preempted midway through a user-defined PyRosetta protocol (e.g., due to using cloud spot instances or cluster backfill queues), so scattered data can be recovered. If a Dask worker is preempted during task execution, then the number of task retries is controlled by the Dask configuration parameter distributed.scheduler.allowed-failures, which may be manually configured prior to the simulation. Dask worker memory limits may also need to be increased to achieve the desired replication factor (see memory keyword argument). Using task replicas requires that either Dask’s ReduceReplicas policy is disabled or that Dask’s entire Active Memory Manager (AMM) is disabled, since replicated tasks consume additional memory per Dask worker. Task size in memory is dominated by the input PackedPose object; a rough estimate of additional memory usage is ~1 MB/task for a 500 residue protein. Task retries are only appropriate when PyRosetta protocols are side effect-free upon preemption, wherein tasks can be restarted without producing inconsistent external states if preempted midway through a PyRosetta protocol.

See https://distributed.dask.org/en/stable/api.html#distributed.Client.replicate and https://docs.dask.org/en/stable/configuration.html for more information.

Default: 0

task_registry: str | None

A None object or str object of either “disk” or “memory”. If “disk” is provided, then write the task registry to disk. If “memory” is provided, then keep the task registry in memory on the head node process. Maintaining a task registry improves the resilience of the simulation when compute resources executing tasks are preempted midway through a user-defined PyRosetta protocol (e.g., due to using cloud spot instances or cluster backfill queues); if scattered data cannot be recovered (see max_task_replicas keyword argument), then the task will be automatically resubmitted using the task input arguments cached in the task registry. If “memory” is provided, then task input arguments consume memory on the head node process, which is appropriate with fewer tasks (e.g., debugging pipelines). If “disk” is provided, then task input arguments consume disk space (in the scratch_dir keyword argument value), which is appropriate for production simulations. Task size is dominated by the input PackedPose object; a rough estimate of additional disk or memory usage is ~1 MB/task for a 500 residue protein. Completed tasks are automatically deleted from the task registry upon task completion. If None is provided, then the task registry is not created, which is appropriate for non-preemptible compute resources. Task resubmissions are only appropriate when user-provided PyRosetta protocols are side effect-free upon preemption, wherein tasks can be restarted without producing inconsistent external states if preempted midway through a PyRosetta protocol.

Default: None

cooldown_time: float | int | None

A float or int object specifying how many seconds to sleep after the simulation is complete to allow loggers to flush. For very slow network filesystems, 2 or more seconds may be reasonable.

Default: 0.5

system_info: dict[Any, Any] | None

A dict or None object specifying the system information and/or extra simulation information required to reproduce the simulation. If None is provided, then PyRosettaCluster automatically detects the platform and sets this value as the dictionary {"sys.platform": `sys.platform`} (e.g., {“sys.platform”: “linux”}). If a dict object is provided, then validate that the “sys.platform” key has a value equal to the current sys.platform result, and log a warning message if not. System information such as Amazon Machine Image (AMI) identifier and compute fleet instance type identifier may be stored in this dictionary, but it is not automatically validated upon reproduction simulations. This extra simulation information is stored in the full simulation records for accounting.

Default: None
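
The default value and the validation step can be sketched as follows; the AMI identifier key is a hypothetical example of extra accounting information:

```python
import sys

# Default: PyRosettaCluster detects the platform automatically.
default_system_info = {"sys.platform": sys.platform}

# A user-provided dict is validated against the current platform; extra
# keys (e.g., an AMI identifier) are stored but not validated.
system_info = {"sys.platform": sys.platform, "ami_id": "ami-0123456789abcdef0"}

def platform_matches(info):
    # Mirrors the documented check: compare against the current sys.platform.
    return info.get("sys.platform") == sys.platform

assert platform_matches(default_system_info)
assert platform_matches(system_info)
```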

pyrosetta_build: str | None

A str or None object specifying the PyRosetta build signature as output by pyrosetta._build_signature(). If None is provided, then PyRosettaCluster automatically detects the PyRosetta build signature and sets this keyword argument value. If a non-empty str object is provided, then validate that the input PyRosetta build signature is equal to the active PyRosetta build signature, and raise an exception if not. This validation process ensures that reproduction simulations use an identical PyRosetta build signature from the original simulation. To bypass PyRosetta build signature validation with a warning message, an empty string (‘’) may be provided but does not assure reproducibility.

Default: None

security: bool | distributed.Security | None

A bool object or instance of distributed.Security, only having an effect if both client=None and clients=None, that is passed to Dask if using scheduler=None or passed to Dask-Jobqueue if using scheduler=”slurm” or scheduler=”sge”. If True is provided, then invoke the cryptography package to generate a distributed.Security.temporary object through Dask or Dask-Jobqueue. If a Dask distributed.Security object is provided, then pass it to Dask with scheduler=None, or pass it to Dask-Jobqueue with scheduler=”slurm” or scheduler=”sge” (where the shared_temp_directory keyword argument value of SLURMCluster or SGECluster is set to the output_path keyword argument value of PyRosettaCluster). If False is provided, then Dask TLS security is disabled regardless of the scheduler keyword argument value, which is not recommended for remote Dask clusters unless they are behind a trusted private network segment (i.e., a firewall). If None is provided, then True is used by default. In order to generate a distributed.Security object with the OpenSSL command-line interface, the pyrosetta.distributed.cluster.generate_dask_tls_security function may also be used (see docstring for more information) instead of the cryptography package.

See https://distributed.dask.org/en/latest/tls.html#distributed.security.Security.temporary for more information.

Default: False if scheduler=None, else True

max_nonce: int

An int object greater than or equal to 1 specifying the maximum number of nonces to cache per process if Dask TLS security is disabled while using remote Dask clusters, which protects against replay attacks. If nonce caching is in use, each process (including the head node process and all Dask worker processes) caches nonces upon communication exchange over the network, which can increase memory usage in each process. A rough estimate of additional memory usage is ~0.2 KB per task per user-defined PyRosetta protocol per process. For example, submitting 1000 tasks with 2 PyRosetta protocols adds (~0.2 KB/task/protocol × 1000 tasks × 2 protocols) = ~0.4 MB of additional memory per process. If memory usage per process permits, it is recommended to set this value to at least the number of tasks times the number of protocols submitted, so that every nonce from every communication exchange over the network gets cached.

Default: 4096

environment: str | None

A None or str object specifying either the active Conda/Mamba environment YML file string, active uv project uv.lock file string, or active Pixi project pixi.lock file string. If None is provided, then generate an environment file string for the active Conda/Mamba/uv/Pixi environment and save it to the full simulation record. If a non-empty str object is provided, then validate it to match the active Conda/Mamba/uv/Pixi environment YML/lock file string and save it to the full simulation record. This ensures that reproduction simulations use an identical Conda/Mamba/uv/Pixi environment configuration to the original simulation. To bypass Conda/Mamba/uv/Pixi environment validation with a warning message, an empty string (‘’) may be provided, but does not assure reproducibility.

Default: None

author: str | None

A str object specifying the author(s) of the simulation that is written to the full simulation records and the output PyRosetta initialization file(s).

Default: “”

email: str | None

A str object specifying the email address(es) of the author(s) of the simulation that is written to the full simulation records and the output PyRosetta initialization file(s).

Default: “”

license: str | None

A str object specifying the license of the output data of the simulation that is written to the full simulation records and the output PyRosetta initialization file(s) (e.g., “ODC-ODbL”, “CC BY-ND”, “CDLA Permissive-2.0”, etc.).

Default: “”

output_init_file: str | None

A str object specifying the absolute path to the output PyRosetta initialization file that caches the input_packed_pose keyword argument value upon PyRosettaCluster instantiation. The file does not include any output decoys, and is optionally used for exporting PyRosetta initialization files with output decoys by the pyrosetta.distributed.cluster.export_init_file function after the simulation completes (see the output_decoy_types keyword argument). If None (or an empty str object (“”)) is provided, or dry_run keyword argument value is set to True, then skip writing an output “.init” file upon PyRosettaCluster instantiation. If skipped, it is recommended to run the pyrosetta.dump_init_file function before or after the simulation. If the compressed keyword argument value is set to True, then the output file is further compressed by the bzip2 library, and “.bz2” is automatically appended to the filename.

Default: `output_path` "/" `project_name` "_" `simulation_name` "_pyrosetta.init"

Returns:

A PyRosettaCluster instance.

class pyrosetta.distributed.cluster.core.PyRosettaCluster(*, tasks: Any = [{}], nstruct=1, input_packed_pose: Any = None, seeds: Optional[Any] = None, decoy_ids: Optional[Any] = None, client: Optional[Client] = None, clients: Optional[Union[List[Client], Tuple[Client, ...]]] = None, scheduler: Optional[str] = None, cores=1, processes=1, memory='4g', scratch_dir: Any = None, min_workers=1, max_workers=_Nothing.NOTHING, dashboard_address=':8787', project_name='2026.04.12.04.33.27.118564', simulation_name=_Nothing.NOTHING, output_path='./outputs', output_decoy_types: Any = None, output_scorefile_types: Any = None, scorefile_name='scores.json', simulation_records_in_scorefile=False, decoy_dir_name='decoys', logs_dir_name='logs', logging_level='INFO', logging_address: str = _Nothing.NOTHING, compressed=True, compression: Optional[Union[str, bool]] = True, sha1: Any = '', ignore_errors=False, timeout=0.5, max_delay_time=3.0, filter_results: Any = None, save_all=False, dry_run=False, norm_task_options: Any = None, max_task_replicas: Optional[int] = 0, task_registry: Optional[str] = None, cooldown_time=0.5, system_info: Any = None, pyrosetta_build: Any = None, security=_Nothing.NOTHING, max_nonce: int = 4096, environment: Any = None, author=None, email=None, license=None, output_init_file=_Nothing.NOTHING)

Bases: IO, LoggingSupport, SchedulerManager, SecurityIO, TaskBase

The PyRosettaCluster Python class facilitates scalable and reproducible job distribution of user-defined PyRosetta protocols efficiently parallelized on the user’s local workstation, high-performance computing (HPC) cluster, or elastic cloud computing infrastructure with available compute resources.

Warning: This class uses the pickle module to deserialize pickled Pose objects. Using the pickle module is not secure, so please only run with input files you trust. Learn more about the pickle module and its security here.

Args:
tasks: dict[str, Any] | Iterable[dict[str, Any]] | Callable[…, Iterable[dict[str, Any]]] | None

A list object of JSON-serializable dict objects, a callable returning an iterable of JSON-serializable dict objects, or a generator that yields JSON-serializable dict objects. During simulation execution, each task dictionary is automatically unpacked and collected by the variadic keyword parameter of each user-defined PyRosetta protocol; as a result, items may be accessed dynamically during PyRosetta protocol execution via the dictionary of keyword arguments, enabling dynamic control flow in PyRosetta protocols. In order to initialize PyRosetta with user-specified Rosetta command-line options at the start of each user-defined PyRosetta protocol, either or both of the “extra_options” and/or “options” keys may be defined in each task dictionary, where each value is either a str, list, or dict object representing the Rosetta command-line options.

Default: [{}]

nstruct: int

An int object specifying the number of repeats of the first user-defined PyRosetta protocol. The user can control the number of repeats of downstream PyRosetta protocols via returning multiple clones of any output decoys from any upstream PyRosetta protocols, or by cloning the input decoy multiple times inside a downstream PyRosetta protocol.

Default: 1

input_packed_pose: Pose | PackedPose | None

An input PackedPose object that is accessible via the first positional-or-keyword parameter of the first user-defined PyRosetta protocol.

Default: None

seeds: list[int] | None

A list of int objects specifying the PyRosetta pseudorandom number generator (RNG) seeds to use for each user-defined PyRosetta protocol. The length of the keyword argument value provided must be equal to the number of input user-defined PyRosetta protocols. Seeds are used in the same order that the user-defined PyRosetta protocols are executed.

Default: None

decoy_ids: list[int] | None

A list of int objects specifying the decoy identification numbers to keep after executing the user-defined PyRosetta protocols. User-defined PyRosetta protocols may return an iterable of Pose and/or PackedPose objects, or yield Pose and/or PackedPose objects. To reproduce a particular decoy produced via the chain of user-provided PyRosetta protocols, the decoy number to keep for each protocol may be specified, where other decoys are discarded. Decoy numbers use zero-based indexing, so 0 is the first decoy generated from a particular user-defined PyRosetta protocol. The length of the keyword argument value provided must be equal to the number of input user-defined PyRosetta protocols, so that one decoy is kept for each user-defined PyRosetta protocol. Decoy identification numbers are applied in the same order that the user-defined PyRosetta protocols are executed.

Default: None
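A sketch of the reproduction keyword arguments for a two-protocol simulation (the seed and decoy-id values are illustrative; in practice they are recovered from a prior simulation record):

```python
reproduce_kwargs = dict(
    seeds=[-111, 222],  # one RNG seed per protocol, applied in execution order
    decoy_ids=[0, 2],   # keep the 1st decoy of protocol 1, the 3rd of protocol 2
)
# PyRosettaCluster(tasks=[task], **reproduce_kwargs).distribute(protocol_1, protocol_2)
```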

client: distributed.Client | None

An initialized Dask distributed.Client object to be used as the Dask client interface to the local or remote Dask cluster. If None, then PyRosettaCluster initializes its own Dask client based on the scheduler keyword argument value. Deprecated in favor of the clients keyword argument, but supported for legacy purposes. Either or both of the client or clients keyword argument values must be None.

Default: None

clients: list[distributed.Client] | tuple[distributed.Client, …] | None

A list or tuple object of initialized Dask distributed.Client objects to be used as the Dask client interface(s) to the local or remote Dask cluster(s). If None, then PyRosettaCluster initializes its own Dask client based on the scheduler keyword argument value. Optionally used in combination with the PyRosettaCluster.distribute(clients_indices=…) keyword argument. Either or both of the client or clients keyword argument values must be None. See the PyRosettaCluster.distribute method docstring for usage examples.

Default: None

scheduler: str | None

A str object of either “sge” or “slurm”, or None. If “sge”, then PyRosettaCluster schedules jobs using a SGECluster instance from the dask-jobqueue package. If “slurm”, then PyRosettaCluster schedules jobs using a SLURMCluster instance from the dask-jobqueue package. If None, then PyRosettaCluster schedules jobs using a distributed.LocalCluster instance. If client or clients keyword argument values are not None, then this keyword argument is ignored.

Default: None

cores: int | None

An int object specifying the total number of cores per job, which is passed to the dask_jobqueue.SLURMCluster(cores=…) or the dask_jobqueue.SGECluster(cores=…) keyword argument depending on the scheduler keyword argument value.

Default: 1

processes: int | None

An int object specifying the total number of processes per job, which is passed to the dask_jobqueue.SLURMCluster(processes=…) or the dask_jobqueue.SGECluster(processes=…) keyword argument depending on the scheduler keyword argument value. This feature determines how many Python processes each Dask worker job will run.

Default: 1

memory: str | None

A str object specifying the total amount of memory per job, which is input to the dask_jobqueue.SLURMCluster(memory=…) or the dask_jobqueue.SGECluster(memory=…) keyword argument depending on the scheduler keyword argument value.

Default: “4g”
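The scheduler, cores, processes, and memory keyword arguments compose as in this sketch of a SLURM-backed configuration (the values are illustrative):

```python
# Illustrative SLURM configuration: PyRosettaCluster forwards cores, processes,
# and memory to dask_jobqueue.SLURMCluster when scheduler="slurm".
cluster_kwargs = dict(
    scheduler="slurm",
    cores=1,      # total cores per job
    processes=1,  # Python processes per Dask worker job
    memory="4g",  # total memory per job
)
# PyRosettaCluster(**cluster_kwargs).distribute(protocols=[my_protocol])
```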

scratch_dir: str | None

A str object specifying the absolute filesystem path to a scratch directory where temporary files may be written.

Default: “/temp” if it exists, otherwise the current working directory.

min_workers: int | None

An int object specifying the minimum number of workers to which to adapt during parallelization of user-defined PyRosetta protocols.

Default: 1

max_workers: int | None

An int object specifying the maximum number of workers to which to adapt during parallelization of user-defined PyRosetta protocols.

Default: 1000 if the number of user-defined task dictionaries passed to the tasks keyword argument value is <1000, otherwise the number of user-defined task dictionaries.

dashboard_address: str | None

A str object specifying the port over which the Dask dashboard is forwarded. Particularly useful for diagnosing PyRosettaCluster performance in real-time.

Default: “:8787”

project_name: str | None

A str object specifying the project name for this simulation. This keyword argument value is just added to the full simulation record for accounting purposes.

Default: datetime.now().strftime(“%Y.%m.%d.%H.%M.%S.%f”) if the keyword argument is not provided, or “PyRosettaCluster” if None is explicitly provided.

simulation_name: str | None

A str object specifying the particular name of this simulation. This keyword argument value is just added to the full simulation record for accounting purposes.

Default: project_name if the keyword argument is not provided, or “PyRosettaCluster” if None is explicitly provided.

output_path: str | None

A str object specifying the absolute path of the output directory where the results will be written to disk. The directory will be created if it does not exist.

Default: “./outputs”

output_decoy_types: Iterable[str] | None

An iterable of str objects representing the output decoy filetypes to save during the simulation. Available options are: “.pdb” for PDB files; “.pkl_pose” for pickled Pose files; “.b64_pose” for Base64-encoded pickled Pose files; and “.init” for PyRosetta initialization files, each caching: the Rosetta command-line options (and PyRosetta initialization input files, if any) initialized on the head node, the input_packed_pose keyword argument value (if any), and an output decoy. Because each PyRosetta initialization file contains a copy of the PyRosetta initialization input files and input PackedPose object (if any), unless these objects are relatively small in size or relatively few output decoys are expected, it is recommended to run the pyrosetta.distributed.cluster.export_init_file function on only output decoys of interest after the simulation completes, without specifying “.init” in this iterable. If the compressed keyword argument value is set to True, then each output decoy file is further compressed by the bzip2 library, and “.bz2” is automatically appended to the filename.

Default: [“.pdb”,]

output_scorefile_types: Iterable[str] | None

An iterable of str objects representing the output scorefile filetypes to save during the simulation. Available options are: “.json” for a JSON Lines (JSONL)-formatted scorefile, and any filename extensions accepted by the pandas.DataFrame.to_pickle(compression=”infer”) method (including “.gz”, “.bz2”, and “.xz”) for pickled pandas.DataFrame objects of scorefile data that may be analyzed using pyrosetta.distributed.cluster.io.secure_read_pickle(compression=”infer”). Note that in order to write pickled pandas.DataFrame objects to disk, please ensure that pyrosetta.secure_unpickle.add_secure_package(“pandas”) has first been run. If using pandas version >=3.0.0, PyArrow-backed datatypes may be enabled by default; in this case, please ensure that pyrosetta.secure_unpickle.add_secure_package(“pyarrow”) has also first been run.

See https://pandas.pydata.org/pdeps/0010-required-pyarrow-dependency.html and https://pandas.pydata.org/pdeps/0014-string-dtype.html for more information.

Default: [“.json”,]
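For example, to write a bzip2-compressed pickled DataFrame scorefile alongside the JSONL scorefile and read it back afterward, the flow might look like the sketch below (the path is illustrative; the read-back lines require PyRosetta and are shown commented out):

```python
scorefile_kwargs = dict(output_scorefile_types=[".json", ".bz2"])
# PyRosettaCluster(**scorefile_kwargs).distribute(my_protocol)

# After the simulation completes (requires PyRosetta):
# import pyrosetta
# pyrosetta.secure_unpickle.add_secure_package("pandas")
# from pyrosetta.distributed.cluster.io import secure_read_pickle
# df = secure_read_pickle("./outputs/scores.bz2", compression="infer")
```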

scorefile_name: str | None

A str object specifying the name of the output JSON Lines (JSONL)-formatted scorefile, which must end in “.json”. The scorefile location is always `output_path` "/" `scorefile_name`. If “.json” is not in the output_scorefile_types keyword argument value, the JSONL-formatted scorefile will not be output, but other scorefile types (if any) will still get the same filename stem (i.e., before the “.json” extension).

Default: “scores.json”

simulation_records_in_scorefile: bool | None

A bool object specifying whether or not to write full simulation records to the output scorefile(s). If True, then write full simulation records to the output scorefile(s). This results in some redundant information in each entry, allowing downstream reproduction of a decoy of interest from the scorefile, but a larger scorefile storage footprint. If False, then write curtailed simulation records to the scorefile(s) containing only the Pose.cache dictionary data. This results in minimally redundant information in each entry, disallowing downstream reproduction of a decoy of interest from the output scorefile(s), but a smaller scorefile storage footprint. If False, also write the active Conda/Mamba/uv/Pixi environment configuration to a separate output file in the output_path keyword argument value for accounting purposes. Full simulation records are always written to the output decoy files, which may still be used to reproduce any decoy without the scorefile.

Default: False

decoy_dir_name: str | None

A str object specifying the directory name where the output decoy files will be saved. The directory location is always determined by: `output_path` "/" `decoy_dir_name`.

Default: “decoys”

logs_dir_name: str | None

A str object specifying the directory name where the output log files will be saved. The directory location is always determined by: `output_path` "/" `logs_dir_name`.

Default: “logs”

logging_level: str | None

A str object specifying the logging level of Python logging output to write to the log file. The available options are either: “NOTSET”, “DEBUG”, “INFO”, “WARNING”, “ERROR”, or “CRITICAL”. The output log file location is always determined by: `output_path` "/" `logs_dir_name` "/" `simulation_name`.

Default: “INFO”

logging_address: str

A str object specifying the socket endpoint for sending and receiving log messages across a network, so log messages from user-defined PyRosetta protocols may be written to a single log file on the head node. The str object must take the format of a socket address (i.e., “<host address>:<port number>”) where the <host address> is either an IP address, “localhost”, or Domain Name System (DNS)-accessible domain name, and the <port number> is an integer greater than or equal to 0. If the port number is “0”, then the next free port number is selected.

Default: “localhost:0” if the scheduler keyword argument value is None or either the client or clients keyword argument values specify instances of distributed.LocalCluster, else “0.0.0.0:0”.
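A quick sketch of the expected address format, where splitting on the last “:” tolerates DNS names as well as IP addresses:

```python
# "<host address>:<port number>"; a port of "0" requests the next free port.
logging_address = "localhost:0"
host, port = logging_address.rsplit(":", 1)
```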

compressed: bool | None

A bool object specifying whether or not to compress the output decoy files and output PyRosetta initialization files using the bzip2 library, resulting in the appending of “.bz2” to output decoy files and PyRosetta initialization files. Also see the output_decoy_types and output_init_file keyword arguments.

Default: True

compression: str | bool | None

A str object of either “xz”, “zlib” or “bz2”, or a bool or None object representing the internal compression library for pickled Pose objects and user-defined task dictionaries. The default of True uses “xz” for compression if it is installed, otherwise resorts to “zlib” for compression.

Default: True
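The documented fallback for compression=True can be sketched in plain Python (this mirrors the described behavior, not the actual implementation):

```python
# Prefer "xz" when the lzma module is available, else fall back to "zlib".
try:
    import lzma  # noqa: F401
    compression = "xz"
except ImportError:
    compression = "zlib"
```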

sha1: str | None

A str or None object specifying the Git commit SHA-1 hash string of the local Git repository defining the simulation. If a non-empty str object is provided, then it is validated to match the Git commit SHA-1 hash string of the most recent commit in the local Git repository checked out in the current working directory, and then it is added to the simulation record for accounting. If an empty string is provided, then ensure that everything in the current working directory is committed to the local Git repository. If None is provided, then bypass SHA-1 hash string validation and set this value to an empty string.

Default: “”

ignore_errors: bool | None

A bool object specifying whether or not to ignore raised Python exceptions and thrown Rosetta segmentation faults in the user-defined PyRosetta protocols. This comes in handy when well-defined errors are sparse and sporadic (such as rare Rosetta segmentation faults), and the user would like PyRosettaCluster to continue running without otherwise raising a WorkerError exception.

Default: False

timeout: float | int | None

A float or int object specifying how many seconds PyRosettaCluster waits between check-ins on the running user-defined PyRosetta protocols. If each PyRosetta protocol is expected to run quickly, then 0.1 seconds seems reasonable. If each PyRosetta protocol is expected to run slowly, then >1 second seems reasonable.

Default: 0.5

max_delay_time: float | int | None

A float or int object specifying the maximum number of seconds to sleep before returning the result(s) from each user-defined PyRosetta protocol back to the Dask client on the head node. If a Dask worker returns the result(s) from a PyRosetta protocol too quickly, the Dask scheduler needs to first register that the task is processing before it completes. In practice, in each PyRosetta protocol the runtime is subtracted from the max_delay_time keyword argument value, and the Dask worker sleeps for the remainder of the time (if any) before returning the result(s). It is recommended to set this option to at least 1 second, but longer times may be used as a safety throttle in cases of overwhelmed Dask scheduler processes. Because spawning a billiard subprocess for PyRosetta protocol execution may take ~3–5 seconds already before the PyRosetta protocol executes, this feature usually does not have an effect with the default value.

Default: 3.0

filter_results: bool | None

A bool object specifying whether or not to filter out empty PackedPose objects between user-defined PyRosetta protocols. When a PyRosetta protocol returns or yields None, PyRosettaCluster converts it to an empty PackedPose object that gets bound to the first positional-or-keyword parameter of the next PyRosetta protocol. If True, then filter out any empty PackedPose objects where there are no residues in the conformation as given by PackedPose.empty(). If False, then continue to pass the empty PackedPose objects to the next PyRosetta protocol. This is used for filtering out decoys mid-trajectory in-between PyRosetta protocols if PyRosetta protocols return or yield any None, empty Pose, or empty PackedPose objects.

Default: True

save_all: bool | None

A bool object specifying whether or not to save all of the returned non-empty PackedPose objects from all user-defined PyRosetta protocols. This option may be used to checkpoint decoy trajectories after each PyRosetta protocol.

Default: False

dry_run: bool | None

A bool object specifying whether or not to save output decoy files to disk. If True, then do not write output decoy files to disk. This feature may be useful for debugging.

Default: False

norm_task_options: bool | None

A bool object specifying whether or not to normalize the task ‘options’ and ‘extra_options’ values after PyRosetta initialization on the remote compute cluster. If True, then this enables more facile simulation reproduction by the use of the ProtocolSettingsMetric SimpleMetric to normalize the PyRosetta initialization options and by relativization of any input files and directory paths to the current working directory from which the task is running.

Default: True

max_task_replicas: int | None

An int or None object specifying the replication factor of tasks on Dask workers within the network (only via Dask’s best effort). If an int object, the value must be greater than or equal to 0. If None, then attempt to replicate all tasks on each Dask worker. Tasks are automatically deleted from each Dask worker upon task completion. Task replication improves resilience of the simulation when compute resources executing tasks are preempted midway through a user-defined PyRosetta protocol (e.g., due to using cloud spot instances or cluster backfill queues), so scattered data can be recovered. If a Dask worker is preempted during task execution, then the number of task retries is controlled by the Dask configuration parameter distributed.scheduler.allowed-failures, which may be manually configured prior to the simulation. Dask worker memory limits may also need to be increased to achieve the desired replication factor (see memory keyword argument). Using task replicas requires that either Dask’s ReduceReplicas policy is disabled or that Dask’s entire Active Memory Manager (AMM) is disabled, since replicated tasks consume additional memory per Dask worker. Task size in memory is dominated by the input PackedPose object; a rough estimate of additional memory usage is ~1 MB/task for a 500 residue protein. Task retries are only appropriate when PyRosetta protocols are side effect-free upon preemption, wherein tasks can be restarted without producing inconsistent external states if preempted midway through a PyRosetta protocol.

See https://distributed.dask.org/en/stable/api.html#distributed.Client.replicate and https://docs.dask.org/en/stable/configuration.html for more information.

Default: 0
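A sketch of Dask configuration that pairs with task replication (the keys below are standard Dask settings, but confirm them against your installed Dask version before relying on them):

```python
# Build the configuration as a plain dict; apply it with dask.config.set(...)
# before constructing PyRosettaCluster.
dask_config = {
    "distributed.scheduler.allowed-failures": 10,  # retries after worker preemption
    # Disable the Active Memory Manager so its ReduceReplicas policy does not
    # drop replicated task data:
    "distributed.scheduler.active-memory-manager.start": False,
}
# import dask
# dask.config.set(dask_config)
# PyRosettaCluster(max_task_replicas=2, ...).distribute(...)
```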

task_registry: str | None

A None object or str object of either “disk” or “memory”. If “disk” is provided, then write the task registry to disk. If “memory” is provided, then keep the task registry in memory on the head node process. Maintaining a task registry improves the resilience of the simulation when compute resources executing tasks are preempted midway through a user-defined PyRosetta protocol (e.g., due to using cloud spot instances or cluster backfill queues); if scattered data cannot be recovered (see max_task_replicas keyword argument), then the task will be automatically resubmitted using the task input arguments cached in the task registry. If “memory” is provided, then task input arguments consume memory on the head node process, which is appropriate with fewer tasks (e.g., debugging pipelines). If “disk” is provided, then task input arguments consume disk space (in the scratch_dir keyword argument value), which is appropriate for production simulations. Task size is dominated by the input PackedPose object; a rough estimate of additional disk or memory usage is ~1 MB/task for a 500 residue protein. Completed tasks are automatically deleted from the task registry upon task completion. If None is provided, then the task registry is not created, which is appropriate for non-preemptible compute resources. Task resubmissions are only appropriate when user-provided PyRosetta protocols are side effect-free upon preemption, wherein tasks can be restarted without producing inconsistent external states if preempted midway through a PyRosetta protocol.

Default: None

cooldown_time: float | int | None

A float or int object specifying how many seconds to sleep after the simulation is complete to allow loggers to flush. For very slow network filesystems, 2 or more seconds may be reasonable.

Default: 0.5

system_info: dict[Any, Any] | None

A dict or None object specifying the system information and/or extra simulation information required to reproduce the simulation. If None is provided, then PyRosettaCluster automatically detects the platform and sets this value as the dictionary {"sys.platform": `sys.platform`} (e.g., {“sys.platform”: “linux”}). If a dict object is provided, then validate that the “sys.platform” key has a value equal to the current sys.platform result, and log a warning message if not. System information such as Amazon Machine Image (AMI) identifier and compute fleet instance type identifier may be stored in this dictionary, but it is not automatically validated upon reproduction simulations. This extra simulation information is stored in the full simulation records for accounting.

Default: None

pyrosetta_build: str | None

A str or None object specifying the PyRosetta build signature as output by pyrosetta._build_signature(). If None is provided, then PyRosettaCluster automatically detects the PyRosetta build signature and sets this keyword argument value. If a non-empty str object is provided, then validate that the input PyRosetta build signature is equal to the active PyRosetta build signature, and raise an exception if not. This validation process ensures that reproduction simulations use an identical PyRosetta build signature from the original simulation. To bypass PyRosetta build signature validation with a warning message, an empty string (‘’) may be provided but does not assure reproducibility.

Default: None

security: bool | distributed.Security | None

A bool object or instance of distributed.Security, only having an effect if both client=None and clients=None, that is passed to Dask if using scheduler=None or passed to Dask-Jobqueue if using scheduler=”slurm” or scheduler=”sge”. If True is provided, then invoke the cryptography package to generate a distributed.Security.temporary object through Dask or Dask-Jobqueue. If a Dask distributed.Security object is provided, then pass it to Dask with scheduler=None, or pass it to Dask-Jobqueue with scheduler=”slurm” or scheduler=”sge” (where the shared_temp_directory keyword argument value of SLURMCluster or SGECluster is set to the output_path keyword argument value of PyRosettaCluster). If False is provided, then Dask TLS security is disabled regardless of the scheduler keyword argument value, which is not recommended for remote Dask clusters unless they run within a trusted private network segment (i.e., behind a firewall). If None is provided, then True is used by default. In order to generate a distributed.Security object with the OpenSSL command-line interface, the pyrosetta.distributed.cluster.generate_dask_tls_security function may also be used (see docstring for more information) instead of the cryptography package.

See https://distributed.dask.org/en/latest/tls.html#distributed.security.Security.temporary for more information.

Default: False if scheduler=None, else True

max_nonce: int

An int object greater than or equal to 1 specifying the maximum number of nonces to cache per process if Dask TLS security is disabled while using remote Dask clusters, which protects against replay attacks. If nonce caching is in use, each process (including the head node process and all Dask worker processes) caches nonces upon communication exchange over the network, which can increase memory usage in each process. A rough estimate of additional memory usage is ~0.2 KB per task per user-defined PyRosetta protocol per process. For example, submitting 1000 tasks with 2 PyRosetta protocols adds (~0.2 KB/task/protocol × 1000 tasks × 2 protocols) = ~0.4 MB of additional memory per process. If memory usage per process permits, it is recommended to set this value to at least the number of tasks times the number of protocols submitted, so that every nonce from every communication exchange over the network gets cached.

Default: 4096
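The memory estimate above works out as follows (illustrative arithmetic using the documented ~0.2 KB per task per protocol per process):

```python
tasks, protocols = 1000, 2
extra_memory_kb = 0.2 * tasks * protocols    # 400 KB, i.e. ~0.4 MB per process
recommended_max_nonce = tasks * protocols    # 2000: cache every nonce
```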

environment: str | None

A None or str object specifying either the active Conda/Mamba environment YML file string, active uv project uv.lock file string, or active Pixi project pixi.lock file string. If None is provided, then generate an environment file string for the active Conda/Mamba/uv/Pixi environment and save it to the full simulation record. If a non-empty str object is provided, then validate it to match the active Conda/Mamba/uv/Pixi environment YML/lock file string and save it to the full simulation record. This ensures that reproduction simulations use an identical Conda/Mamba/uv/Pixi environment configuration to the original simulation. To bypass Conda/Mamba/uv/Pixi environment validation with a warning message, an empty string (‘’) may be provided, but does not assure reproducibility.

Default: None

author: str | None

A str object specifying the author(s) of the simulation that is written to the full simulation records and the output PyRosetta initialization file(s).

Default: “”

email: str | None

A str object specifying the email address(es) of the author(s) of the simulation that is written to the full simulation records and the output PyRosetta initialization file(s).

Default: “”

license: str | None

A str object specifying the license of the output data of the simulation that is written to the full simulation records and the output PyRosetta initialization file(s) (e.g., “ODC-ODbL”, “CC BY-ND”, “CDLA Permissive-2.0”, etc.).

Default: “”

output_init_file: str | None

A str object specifying the absolute path to the output PyRosetta initialization file that caches the input_packed_pose keyword argument value upon PyRosettaCluster instantiation. The file does not include any output decoys, and is optionally used for exporting PyRosetta initialization files with output decoys by the pyrosetta.distributed.cluster.export_init_file function after the simulation completes (see the output_decoy_types keyword argument). If None (or an empty str object (“”)) is provided, or dry_run keyword argument value is set to True, then skip writing an output “.init” file upon PyRosettaCluster instantiation. If skipped, it is recommended to run the pyrosetta.dump_init_file function before or after the simulation. If the compressed keyword argument value is set to True, then the output file is further compressed by the bzip2 library, and “.bz2” is automatically appended to the filename.

Default: `output_path` "/" `project_name` "_" `simulation_name` "_pyrosetta.init"
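The default location composes as in this sketch (the project and simulation names are illustrative):

```python
output_path, project_name, simulation_name = "./outputs", "my_project", "my_sim"
default_init_file = f"{output_path}/{project_name}_{simulation_name}_pyrosetta.init"
# With compressed=True, the written file additionally gains a ".bz2" suffix.
```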

Returns:

A PyRosettaCluster instance.

tasks: List[Dict[str, Any]]
nstruct: int
tasks_size: int
input_packed_pose: Optional[PackedPose]
seeds: Optional[List[str]]
decoy_ids: Optional[List[int]]
client: Optional[Client]
clients: Optional[Union[List[Client], Tuple[Client, ...]]]
scheduler: Optional[str]
cores: int
processes: int
memory: str
scratch_dir: str
adapt_threshold: int
min_workers: int
max_workers: int
dashboard_address: str
project_name: str
simulation_name: str
output_path: str
output_decoy_types: List[str]
output_scorefile_types: List[str]
scorefile_name: str
scorefile_path: str
simulation_records_in_scorefile: bool
decoy_dir_name: str
decoy_path: str
logs_dir_name: str
logs_path: str
logging_level: str
logging_file: str
logging_address: str
compressed: bool
compression: Optional[Union[str, bool]]
sha1: str
ignore_errors: bool
timeout: Union[float, int]
max_delay_time: Union[float, int]
filter_results: bool
save_all: bool
dry_run: bool
norm_task_options: bool
max_task_replicas: Optional[int]
task_registry: Optional[str]
yield_results: bool
cooldown_time: Union[float, int]
protocols_key: str
system_info: Dict[Any, Any]
pyrosetta_build: str
security: Union[bool, Security]
instance_id: str
max_nonce: int
environment: str
author: str
email: str
license: str
output_init_file: str
environment_manager: str
environment_file: str
task_registry_dir: Optional[str]
pyrosetta_init_args: List[str]
_get_submit_kwargs(resources: Optional[Dict[str, Union[float, int]]] = None, priority: Optional[int] = None, retries: Optional[int] = None) → Dict[str, Any]

Set up Client.submit keyword arguments.

_create_future(client: Client, clients_index: int, protocol_name: str, compressed_protocol: bytes, compressed_packed_pose: bytes, compressed_kwargs: bytes, pyrosetta_init_kwargs: Dict[str, Any], extra_args: ExtraArgs, passkey: bytes, resource: Optional[Dict[str, Union[float, int]]], priority: Optional[int], retry: Optional[int]) → Future

Scatter data and return submitted user_spawn_thread future.

_recreate_future(client: Client, clients_index: int, user_args: UserArgs, submit_kwargs: Dict[str, Any]) → Future

Re-scatter data and return submitted ‘user_spawn_thread’ future.

_run(*args: PyRosettaProtocol, protocols: Optional[Union[PyRosettaProtocol, List[PyRosettaProtocol], Tuple[PyRosettaProtocol, ...]]] = None, clients_indices: Optional[Union[List[int], Tuple[int, ...]]] = None, resources: Optional[Union[List[Optional[Dict[str, Union[float, int]]]], Tuple[Optional[Dict[str, Union[float, int]]], ...]]] = None, priorities: Optional[Union[List[int], Tuple[int, ...]]] = None, retries: Optional[Union[int, List[int], Tuple[int, ...]]] = None) → Generator[Tuple[Optional[PackedPose], Dict[str, Any]], None, None]

where PyRosettaProtocol is Callable[..., Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]]

Execute user-defined PyRosetta protocols on a local or remote compute cluster using the user-customized PyRosettaCluster instance. Either positional arguments or the protocols keyword argument specifying one or more user-defined PyRosetta protocols is required. If both are passed, then the protocols keyword argument value gets concatenated after the positional arguments.

Warning: This method uses the cloudpickle and pickle modules to serialize and deserialize Pose objects, arbitrary Python types in Pose.cache dictionaries, pandas.DataFrame objects (if configured), user-defined task dictionaries, user-defined PyRosetta protocols, and other user-provided data. Using the cloudpickle and pickle modules is not secure, so please only run this method with input data you fully understand and trust. Learn more about the cloudpickle and pickle modules and their security here and here.

Examples:

Basic usage:

>>> PyRosettaCluster().distribute(protocol_1)
>>> PyRosettaCluster().distribute(protocols=protocol_1)
>>> PyRosettaCluster().distribute(protocol_1, protocol_2, protocol_3)
>>> PyRosettaCluster().distribute(protocols=(protocol_1, protocol_2, protocol_3))
>>> PyRosettaCluster().distribute(protocol_1, protocol_2, protocols=[protocol_3, protocol_4])

Run with two Dask clients:

>>> # Run `protocol_1` on `client_1`,
>>> # then `protocol_2` on `client_2`,
>>> # then `protocol_3` on `client_1`,
>>> # then `protocol_4` on `client_2`:
>>> PyRosettaCluster(clients=[client_1, client_2]).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3, protocol_4],
...     clients_indices=[0, 1, 0, 1],
... )

Run with multiple Dask clients:

>>> # Run `protocol_1` on `client_2`,
>>> # then `protocol_2` on `client_3`,
>>> # then `protocol_3` on `client_1`:
>>> PyRosettaCluster(clients=[client_1, client_2, client_3]).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3],
...     clients_indices=[1, 2, 0],
... )

Run with one Dask client and compute resource constraints:

>>> # Run `protocol_1` on `client_1` with Dask worker resource constraints "GPU=2",
>>> # then `protocol_2` on `client_1` with Dask worker resource constraints "MEMORY=100e9",
>>> # then `protocol_3` on `client_1` without Dask worker resource constraints:
>>> PyRosettaCluster(client=client_1).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3],
...     resources=[{"GPU": 2}, {"MEMORY": 100e9}, None],
... )

Run with two Dask clients and compute resource constraints:

>>> # Run `protocol_1` on `client_1` with Dask worker resource constraints "GPU=2",
>>> # then `protocol_2` on `client_2` with Dask worker resource constraints "MEMORY=100e9":
>>> PyRosettaCluster(clients=[client_1, client_2]).distribute(
...     protocols=[protocol_1, protocol_2],
...     clients_indices=[0, 1],
...     resources=[{"GPU": 2}, {"MEMORY": 100e9}],
... )

Run with task priorities:

>>> # Run protocols with depth-first task execution:
>>> PyRosettaCluster().distribute(
...     protocols=[protocol_1, protocol_2, protocol_3, protocol_4],
...     priorities=[0, 10, 20, 30],
... )

Run with task retries:

>>> # Run protocols with up to three retries per failed task during `protocol_3` and `protocol_4`:
>>> PyRosettaCluster(ignore_errors=False).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3, protocol_4],
...     retries=[0, 0, 3, 3],
... )
Args:
*args: PyRosettaProtocol

Optional callables of type types.GeneratorType or types.FunctionType representing user-defined PyRosetta protocols in the order to be executed.

protocols: PyRosettaProtocol | list[PyRosettaProtocol] | tuple[PyRosettaProtocol, …] | None

An ordered iterable of extra callable user-defined PyRosetta protocols; i.e., an ordered iterable of objects of types.GeneratorType and/or types.FunctionType types, or a single callable of type types.GeneratorType or types.FunctionType.

Default: None

clients_indices: list[int] | tuple[int, …] | None

A list or tuple object of int objects, where each int object represents a zero-based index corresponding to the initialized Dask distributed.Client object(s) passed to the PyRosettaCluster(clients=…) keyword argument value. If not None, then the length of the clients_indices object must equal the number of protocols passed to the PyRosettaCluster.distribute method.

Default: None
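
As a minimal illustration of this indexing scheme (the counts below are placeholders, not part of the API), one simple way to spread protocols round-robin across the available Dask clients is to compute the indices from the protocol count:

```python
# Hypothetical setup: 4 protocols spread round-robin over 2 Dask clients.
num_clients = 2
num_protocols = 4

# Each protocol i runs on client i modulo the number of clients, so the
# resulting list has one zero-based client index per protocol.
clients_indices = [i % num_clients for i in range(num_protocols)]
print(clients_indices)  # [0, 1, 0, 1]
```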

resources: list[dict[str, float | int] | None] | tuple[dict[str, float | int] | None, …] | None

A list or tuple object of dict objects (or None), where each dict object specifies abstract, arbitrary resources constraining which Dask workers may execute the corresponding user-defined PyRosetta protocol. If None, then no resource constraints are imposed on any PyRosetta protocol. If not None, then the length of the resources object must equal the number of PyRosetta protocols passed to the PyRosettaCluster.distribute method, and each entry specifies the resource constraints for the protocol at the corresponding index. Note that this feature is only useful when one passes in their own instantiated Dask client(s) whose workers were set up with the relevant resource tags. If no Dask workers were instantiated to satisfy the specified resource constraints, the affected PyRosetta protocols will hang indefinitely by design, because the Dask scheduler waits for workers matching those constraints before it schedules the tasks.

See https://distributed.dask.org/en/stable/resources.html for more information.

Default: None
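
The alignment rules above can be checked up front. The following pre-flight helper is purely illustrative (it is not part of the PyRosettaCluster API): it verifies that a `resources` value matches the protocol count and that every constraint maps a resource tag to a numeric amount:

```python
# Illustrative pre-flight check (not part of the PyRosettaCluster API):
# verify that a `resources` value is aligned with the protocols and that
# every constraint maps a resource tag to a numeric amount.
def validate_resources(resources, n_protocols):
    if resources is None:
        return  # No constraints imposed on any protocol.
    if len(resources) != n_protocols:
        raise ValueError("`resources` length must equal the number of protocols.")
    for entry in resources:
        if entry is None:
            continue  # This protocol runs without constraints.
        for tag, amount in entry.items():
            if not isinstance(amount, (int, float)):
                raise TypeError(f"Resource {tag!r} must map to a numeric amount.")

# Mirrors the three-protocol example above: GPU, MEMORY, and no constraint.
validate_resources([{"GPU": 2}, {"MEMORY": 100e9}, None], n_protocols=3)
```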

priorities: list[int] | tuple[int, …] | None

A list or tuple object of int objects, where each int object sets the Dask scheduler priority for the corresponding user-defined PyRosetta protocol (i.e., indexed the same as the clients_indices keyword argument value). If None, then no explicit priorities are set. If not None, then the length of this value must equal the number of PyRosetta protocols passed to the PyRosettaCluster.distribute method, and each int value sets the Dask scheduler priority for the tasks applied to that PyRosetta protocol.

Breadth-first task execution (default):

When all user-defined PyRosetta protocols have an identical priority (e.g., [0] * len(protocols) or None), then all tasks enter the Dask scheduler’s queue with equal priority, and Dask schedules them in roughly first-in, first-out order. When Dask worker resources are saturated, all tasks submitted to upstream PyRosetta protocols therefore run to completion before tasks for downstream PyRosetta protocols are scheduled, producing breadth-first task execution across PyRosetta protocols.

Depth-first task execution:

To allow tasks to run through all user-defined PyRosetta protocols before all tasks applied to upstream PyRosetta protocols complete, assign increasing priorities to downstream protocols (e.g., list(range(0, len(protocols) * 10, 10))). Once a task completes an upstream PyRosetta protocol, it is applied to the next downstream PyRosetta protocol with a higher priority than tasks still queued for upstream PyRosetta protocols, so tasks may run through all user-defined PyRosetta protocols to completion as Dask worker resources become available. This produces a depth-first task execution behavior across PyRosetta protocols when Dask worker resources are saturated.

See https://distributed.dask.org/en/stable/priority.html for more information.

Default: None
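
The two priority schemes described above reduce to two short expressions (a sketch; the protocol count is a placeholder):

```python
# Building the two priority schemes described above (illustrative only).
n_protocols = 4

# Breadth-first: equal priority for every protocol (equivalent to None).
breadth_first = [0] * n_protocols

# Depth-first: strictly increasing priority for downstream protocols.
depth_first = list(range(0, n_protocols * 10, 10))
print(depth_first)  # [0, 10, 20, 30]
```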

retries: list[int] | tuple[int, …] | int | None

A list or tuple of int objects, where each int object (≥0) sets the number of allowed automatic retries of each failed task applied to the corresponding user-defined PyRosetta protocol (i.e., indexed the same as the clients_indices keyword argument value). If an int object (≥0) is provided, then that number of allowed automatic retries applies to all PyRosetta protocols. If None is provided, then no explicit retries are allowed. If not None and not an int object, then the length of this value must equal the number of PyRosetta protocols passed to the PyRosettaCluster.distribute method, and each int value sets the number of automatic retries the Dask scheduler allows for the tasks applied to that PyRosetta protocol. Allowing retries of failed tasks may be useful if the PyRosetta protocol raises a standard Python exception, or Rosetta throws a segmentation fault in the billiard subprocess, while the Dask worker remains alive and PyRosettaCluster(ignore_errors=False) is configured. If PyRosettaCluster(ignore_errors=True) is configured, then protocols failing due to standard Python exceptions or Rosetta segmentation faults are still considered successes, and this keyword argument has no effect since these PyRosetta protocol errors are ignored. Note that if a compute resource executing a PyRosetta protocol is preempted, then the Dask worker process does not remain alive and the Dask scheduler registers the failed task as incomplete or cancelled. In this case, the number of allowed task retries is controlled by the Dask configuration parameter distributed.scheduler.allowed-failures; please use the max_task_replicas and task_registry keyword arguments of PyRosettaCluster for further configuration of task retries after compute resource preemption.

See https://distributed.dask.org/en/latest/scheduling-state.html#task-state for more information.

Default: None
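
The normalization rules above (int broadcast to all protocols, None meaning zero retries, otherwise one entry per protocol) can be sketched as a small helper; this is illustrative, not the actual implementation:

```python
# Illustrative normalization of a `retries` value into a per-protocol list,
# mirroring the rules described above (not the actual implementation).
def normalize_retries(retries, n_protocols):
    if retries is None:
        return [0] * n_protocols  # No explicit retries allowed.
    if isinstance(retries, int):
        return [retries] * n_protocols  # Same retry count for every protocol.
    if len(retries) != n_protocols:
        raise ValueError("`retries` length must equal the number of protocols.")
    return list(retries)

print(normalize_retries(3, 2))             # [3, 3]
print(normalize_retries([0, 0, 3, 3], 4))  # [0, 0, 3, 3]
```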

generate(*args: Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]], protocols: Optional[Union[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]], List[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]]], Tuple[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]], ...]]] = None, clients_indices: Optional[Union[List[int], Tuple[int, ...]]] = None, resources: Optional[Union[List[Optional[Dict[str, Union[float, int]]]], Tuple[Optional[Dict[str, Union[float, int]]], ...]]] = None, priorities: Optional[Union[List[int], Tuple[int, ...]]] = None, retries: Optional[Union[int, List[int], Tuple[int, ...]]] = None) → Generator[Tuple[Optional[PackedPose], Dict[str, Any]], None, None]

Execute user-defined PyRosetta protocols on a local or remote compute cluster using the user-customized PyRosettaCluster instance. One or more user-defined PyRosetta protocols must be specified, either as positional arguments or via the protocols keyword argument. If both are passed, then the protocols keyword argument value is concatenated after the positional arguments.

Warning: This method uses the cloudpickle and pickle modules to serialize and deserialize Pose objects, arbitrary Python types in Pose.cache dictionaries, pandas.DataFrame objects (if configured), user-defined task dictionaries, user-defined PyRosetta protocols, and other user-provided data. Using the cloudpickle and pickle modules is not secure, so please only run this method with input data you fully understand and trust. Learn more about the cloudpickle and pickle modules and their security in their respective documentation.

Examples:

Basic usage:

>>> PyRosettaCluster().distribute(protocol_1)
>>> PyRosettaCluster().distribute(protocols=protocol_1)
>>> PyRosettaCluster().distribute(protocol_1, protocol_2, protocol_3)
>>> PyRosettaCluster().distribute(protocols=(protocol_1, protocol_2, protocol_3))
>>> PyRosettaCluster().distribute(protocol_1, protocol_2, protocols=[protocol_3, protocol_4])

Run with two Dask clients:

>>> # Run `protocol_1` on `client_1`,
>>> # then `protocol_2` on `client_2`,
>>> # then `protocol_3` on `client_1`,
>>> # then `protocol_4` on `client_2`:
>>> PyRosettaCluster(clients=[client_1, client_2]).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3, protocol_4],
...     clients_indices=[0, 1, 0, 1],
... )

Run with multiple Dask clients:

>>> # Run `protocol_1` on `client_2`,
>>> # then `protocol_2` on `client_3`,
>>> # then `protocol_3` on `client_1`:
>>> PyRosettaCluster(clients=[client_1, client_2, client_3]).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3],
...     clients_indices=[1, 2, 0],
... )

Run with one Dask client and compute resource constraints:

>>> # Run `protocol_1` on `client_1` with Dask worker resource constraints "GPU=2",
>>> # then `protocol_2` on `client_1` with Dask worker resource constraints "MEMORY=100e9",
>>> # then `protocol_3` on `client_1` without Dask worker resource constraints:
>>> PyRosettaCluster(client=client_1).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3],
...     resources=[{"GPU": 2}, {"MEMORY": 100e9}, None],
... )

Run with two Dask clients and compute resource constraints:

>>> # Run `protocol_1` on `client_1` with Dask worker resource constraints "GPU=2",
>>> # then `protocol_2` on `client_2` with Dask worker resource constraints "MEMORY=100e9":
>>> PyRosettaCluster(clients=[client_1, client_2]).distribute(
...     protocols=[protocol_1, protocol_2],
...     clients_indices=[0, 1],
...     resources=[{"GPU": 2}, {"MEMORY": 100e9}],
... )

Run with task priorities:

>>> # Run protocols with depth-first task execution:
>>> PyRosettaCluster().distribute(
...     protocols=[protocol_1, protocol_2, protocol_3, protocol_4],
...     priorities=[0, 10, 20, 30],
... )

Run with task retries:

>>> # Run protocols with up to three retries per failed task during `protocol_3` and `protocol_4`:
>>> PyRosettaCluster(ignore_errors=False).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3, protocol_4],
...     retries=[0, 0, 3, 3],
... )

Args:
*args: PyRosettaProtocol

Optional callables of type types.GeneratorType or types.FunctionType representing user-defined PyRosetta protocols in the order to be executed.

protocols: PyRosettaProtocol | list[PyRosettaProtocol] | tuple[PyRosettaProtocol, …] | None

An ordered iterable of additional user-defined PyRosetta protocol callables (i.e., objects of types.GeneratorType and/or types.FunctionType), or a single such callable.

Default: None

clients_indices: list[int] | tuple[int, …] | None

A list or tuple object of int objects, where each int object represents a zero-based index corresponding to the initialized Dask distributed.Client object(s) passed to the PyRosettaCluster(clients=…) keyword argument value. If not None, then the length of the clients_indices object must equal the number of protocols passed to the PyRosettaCluster.distribute method.

Default: None

resources: list[dict[str, float | int] | None] | tuple[dict[str, float | int] | None, …] | None

A list or tuple object of dict objects (or None), where each dict object specifies abstract, arbitrary resources constraining which Dask workers may execute the corresponding user-defined PyRosetta protocol. If None, then no resource constraints are imposed on any PyRosetta protocol. If not None, then the length of the resources object must equal the number of PyRosetta protocols passed to the PyRosettaCluster.distribute method, and each entry specifies the resource constraints for the protocol at the corresponding index. Note that this feature is only useful when one passes in their own instantiated Dask client(s) whose workers were set up with the relevant resource tags. If no Dask workers were instantiated to satisfy the specified resource constraints, the affected PyRosetta protocols will hang indefinitely by design, because the Dask scheduler waits for workers matching those constraints before it schedules the tasks.

See https://distributed.dask.org/en/stable/resources.html for more information.

Default: None

priorities: list[int] | tuple[int, …] | None

A list or tuple object of int objects, where each int object sets the Dask scheduler priority for the corresponding user-defined PyRosetta protocol (i.e., indexed the same as the clients_indices keyword argument value). If None, then no explicit priorities are set. If not None, then the length of this value must equal the number of PyRosetta protocols passed to the PyRosettaCluster.distribute method, and each int value sets the Dask scheduler priority for the tasks applied to that PyRosetta protocol.

Breadth-first task execution (default):

When all user-defined PyRosetta protocols have an identical priority (e.g., [0] * len(protocols) or None), then all tasks enter the Dask scheduler’s queue with equal priority, and Dask schedules them in roughly first-in, first-out order. When Dask worker resources are saturated, all tasks submitted to upstream PyRosetta protocols therefore run to completion before tasks for downstream PyRosetta protocols are scheduled, producing breadth-first task execution across PyRosetta protocols.

Depth-first task execution:

To allow tasks to run through all user-defined PyRosetta protocols before all tasks applied to upstream PyRosetta protocols complete, assign increasing priorities to downstream protocols (e.g., list(range(0, len(protocols) * 10, 10))). Once a task completes an upstream PyRosetta protocol, it is applied to the next downstream PyRosetta protocol with a higher priority than tasks still queued for upstream PyRosetta protocols, so tasks may run through all user-defined PyRosetta protocols to completion as Dask worker resources become available. This produces a depth-first task execution behavior across PyRosetta protocols when Dask worker resources are saturated.

See https://distributed.dask.org/en/stable/priority.html for more information.

Default: None

retries: list[int] | tuple[int, …] | int | None

A list or tuple of int objects, where each int object (≥0) sets the number of allowed automatic retries of each failed task applied to the corresponding user-defined PyRosetta protocol (i.e., indexed the same as the clients_indices keyword argument value). If an int object (≥0) is provided, then that number of allowed automatic retries applies to all PyRosetta protocols. If None is provided, then no explicit retries are allowed. If not None and not an int object, then the length of this value must equal the number of PyRosetta protocols passed to the PyRosettaCluster.distribute method, and each int value sets the number of automatic retries the Dask scheduler allows for the tasks applied to that PyRosetta protocol. Allowing retries of failed tasks may be useful if the PyRosetta protocol raises a standard Python exception, or Rosetta throws a segmentation fault in the billiard subprocess, while the Dask worker remains alive and PyRosettaCluster(ignore_errors=False) is configured. If PyRosettaCluster(ignore_errors=True) is configured, then protocols failing due to standard Python exceptions or Rosetta segmentation faults are still considered successes, and this keyword argument has no effect since these PyRosetta protocol errors are ignored. Note that if a compute resource executing a PyRosetta protocol is preempted, then the Dask worker process does not remain alive and the Dask scheduler registers the failed task as incomplete or cancelled. In this case, the number of allowed task retries is controlled by the Dask configuration parameter distributed.scheduler.allowed-failures; please use the max_task_replicas and task_registry keyword arguments of PyRosettaCluster for further configuration of task retries after compute resource preemption.

See https://distributed.dask.org/en/latest/scheduling-state.html#task-state for more information.

Default: None
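
The retry semantics described above amount to re-running a failed task until it succeeds or the allowed count is exhausted. A minimal sketch of that behavior, independent of Dask and of the actual PyRosettaCluster implementation:

```python
# Minimal sketch of retry semantics: re-run a failing task until it
# succeeds or the allowed retry count is exhausted (independent of Dask).
def run_with_retries(task, retries):
    attempts = 0
    while True:
        attempts += 1
        try:
            return task(), attempts
        except Exception:
            if attempts > retries:
                raise  # Retries exhausted; surface the failure.

# A toy task that fails twice before succeeding on the third attempt.
state = {"calls": 0}
def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

result, attempts = run_with_retries(flaky, retries=3)
print(result, attempts)  # ok 3
```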

Extra information:

The PyRosettaCluster.generate method may be used for developing PyRosetta protocols on a local or remote compute cluster and optionally post-processing or visualizing output PackedPose objects in memory. Importantly, subsequent code run on the yielded results is not captured by PyRosettaCluster, and so use of this method does not ensure reproducibility of the simulation. Use the PyRosettaCluster.distribute method for reproducible simulations.

Each yielded result is a tuple object with a PackedPose object as the first element and a dict object as the second element. The PackedPose object represents a returned or yielded PackedPose (or Pose or NoneType) object from the most recently executed user-defined PyRosetta protocol. The dict object represents the optionally returned or yielded dictionary of keyword arguments from the same most recently executed PyRosetta protocol (see the protocols keyword argument). If PyRosettaCluster(save_all=True) is used, tuples are yielded after each PyRosetta protocol, otherwise tuples are yielded after the final PyRosetta protocol. Tuples are yielded in the order in which they arrive back to the Dask client(s) from the distributed cluster (which may differ from the order that tasks are submitted, due to tasks running asynchronously). If PyRosettaCluster(dry_run=True) is used, then tuples are still yielded but output decoy files are not written to disk.

See https://docs.dask.org/en/latest/futures.html#distributed.as_completed for more information.

Extra examples:

Iterate over results in real-time as they are yielded from the cluster:

>>> for packed_pose, kwargs in PyRosettaCluster().generate(protocols):
...     ...

Iterate over submissions to the same Dask client:

>>> client = Client()
>>> for packed_pose, kwargs in PyRosettaCluster(client=client).generate(protocols):
...     # Post-process results on head node asynchronously from results generation
...     prc = PyRosettaCluster(
...         input_packed_pose=packed_pose,
...         client=client,
...         logs_dir_name=f"logs_{uuid.uuid4().hex}", # Make sure to write new log files
...     )
...     for packed_pose, kwargs in prc.generate(other_protocols):
...         ...

Iterate over two PyRosettaCluster instances, each managing one Dask client, creating additional overhead:

>>> client_1 = Client()
>>> client_2 = Client()
>>> for packed_pose, kwargs in PyRosettaCluster(client=client_1).generate(protocols):
...     # Post-process results on head node asynchronously from results generation
...     prc = PyRosettaCluster(
...         input_packed_pose=packed_pose,
...         client=client_2,
...         logs_dir_name=f"logs_{uuid.uuid4().hex}", # Make sure to write new log files
...     )
...     for packed_pose, kwargs in prc.generate(other_protocols):
...         ...

Iterate over one PyRosettaCluster instance managing two Dask clients, reducing overhead:

>>> # Using multiple `distributed.as_completed` iterators on the head node creates additional
>>> # overhead. If post-processing on the head node is not required between user-defined PyRosetta
>>> # protocols, the preferred method is to distribute PyRosetta protocols within a single
>>> # `PyRosettaCluster.generate` method call using the `clients_indices` keyword argument:
>>> prc_generate = PyRosettaCluster(clients=[client_1, client_2]).generate(
...     protocols=[protocol_1, protocol_2],
...     clients_indices=[0, 1],
...     resources=[{"GPU": 1}, {"CPU": 1}],
... )
>>> for packed_pose, kwargs in prc_generate:
...     # Post-process results on head node asynchronously from results generation
...     ...

Yields:

(`PackedPose`, `dict`) tuples from the most recently executed user-defined PyRosetta protocol if PyRosettaCluster(save_all=True) is used, otherwise from the final PyRosetta protocol.

distribute(*args: Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]], protocols: Optional[Union[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]], List[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]]], Tuple[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]], ...]]] = None, clients_indices: Optional[Union[List[int], Tuple[int, ...]]] = None, resources: Optional[Union[List[Optional[Dict[str, Union[float, int]]]], Tuple[Optional[Dict[str, Union[float, int]]], ...]]] = None, priorities: Optional[Union[List[int], Tuple[int, ...]]] = None, retries: Optional[Union[int, List[int], Tuple[int, ...]]] = None) → None

Execute user-defined PyRosetta protocols on a local or remote compute cluster using the user-customized PyRosettaCluster instance. One or more user-defined PyRosetta protocols must be specified, either as positional arguments or via the protocols keyword argument. If both are passed, then the protocols keyword argument value is concatenated after the positional arguments.

Warning: This method uses the cloudpickle and pickle modules to serialize and deserialize Pose objects, arbitrary Python types in Pose.cache dictionaries, pandas.DataFrame objects (if configured), user-defined task dictionaries, user-defined PyRosetta protocols, and other user-provided data. Using the cloudpickle and pickle modules is not secure, so please only run this method with input data you fully understand and trust. Learn more about the cloudpickle and pickle modules and their security in their respective documentation.

Examples:

Basic usage:

>>> PyRosettaCluster().distribute(protocol_1)
>>> PyRosettaCluster().distribute(protocols=protocol_1)
>>> PyRosettaCluster().distribute(protocol_1, protocol_2, protocol_3)
>>> PyRosettaCluster().distribute(protocols=(protocol_1, protocol_2, protocol_3))
>>> PyRosettaCluster().distribute(protocol_1, protocol_2, protocols=[protocol_3, protocol_4])

Run with two Dask clients:

>>> # Run `protocol_1` on `client_1`,
>>> # then `protocol_2` on `client_2`,
>>> # then `protocol_3` on `client_1`,
>>> # then `protocol_4` on `client_2`:
>>> PyRosettaCluster(clients=[client_1, client_2]).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3, protocol_4],
...     clients_indices=[0, 1, 0, 1],
... )

Run with multiple Dask clients:

>>> # Run `protocol_1` on `client_2`,
>>> # then `protocol_2` on `client_3`,
>>> # then `protocol_3` on `client_1`:
>>> PyRosettaCluster(clients=[client_1, client_2, client_3]).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3],
...     clients_indices=[1, 2, 0],
... )

Run with one Dask client and compute resource constraints:

>>> # Run `protocol_1` on `client_1` with Dask worker resource constraints "GPU=2",
>>> # then `protocol_2` on `client_1` with Dask worker resource constraints "MEMORY=100e9",
>>> # then `protocol_3` on `client_1` without Dask worker resource constraints:
>>> PyRosettaCluster(client=client_1).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3],
...     resources=[{"GPU": 2}, {"MEMORY": 100e9}, None],
... )

Run with two Dask clients and compute resource constraints:

>>> # Run `protocol_1` on `client_1` with Dask worker resource constraints "GPU=2",
>>> # then `protocol_2` on `client_2` with Dask worker resource constraints "MEMORY=100e9":
>>> PyRosettaCluster(clients=[client_1, client_2]).distribute(
...     protocols=[protocol_1, protocol_2],
...     clients_indices=[0, 1],
...     resources=[{"GPU": 2}, {"MEMORY": 100e9}],
... )

Run with task priorities:

>>> # Run protocols with depth-first task execution:
>>> PyRosettaCluster().distribute(
...     protocols=[protocol_1, protocol_2, protocol_3, protocol_4],
...     priorities=[0, 10, 20, 30],
... )

Run with task retries:

>>> # Run protocols with up to three retries per failed task during `protocol_3` and `protocol_4`:
>>> PyRosettaCluster(ignore_errors=False).distribute(
...     protocols=[protocol_1, protocol_2, protocol_3, protocol_4],
...     retries=[0, 0, 3, 3],
... )

Args:
*args: PyRosettaProtocol

Optional callables of type types.GeneratorType or types.FunctionType representing user-defined PyRosetta protocols in the order to be executed.

protocols: PyRosettaProtocol | list[PyRosettaProtocol] | tuple[PyRosettaProtocol, …] | None

An ordered iterable of additional user-defined PyRosetta protocol callables (i.e., objects of types.GeneratorType and/or types.FunctionType), or a single such callable.

Default: None

clients_indices: list[int] | tuple[int, …] | None

A list or tuple object of int objects, where each int object represents a zero-based index corresponding to the initialized Dask distributed.Client object(s) passed to the PyRosettaCluster(clients=…) keyword argument value. If not None, then the length of the clients_indices object must equal the number of protocols passed to the PyRosettaCluster.distribute method.

Default: None

resources: list[dict[str, float | int] | None] | tuple[dict[str, float | int] | None, …] | None

A list or tuple object of dict objects (or None), where each dict object specifies abstract, arbitrary resources constraining which Dask workers may execute the corresponding user-defined PyRosetta protocol. If None, then no resource constraints are imposed on any PyRosetta protocol. If not None, then the length of the resources object must equal the number of PyRosetta protocols passed to the PyRosettaCluster.distribute method, and each entry specifies the resource constraints for the protocol at the corresponding index. Note that this feature is only useful when one passes in their own instantiated Dask client(s) whose workers were set up with the relevant resource tags. If no Dask workers were instantiated to satisfy the specified resource constraints, the affected PyRosetta protocols will hang indefinitely by design, because the Dask scheduler waits for workers matching those constraints before it schedules the tasks.

See https://distributed.dask.org/en/stable/resources.html for more information.

Default: None

priorities: list[int] | tuple[int, …] | None

A list or tuple object of int objects, where each int object sets the Dask scheduler priority for the corresponding user-defined PyRosetta protocol (i.e., indexed the same as the clients_indices keyword argument value). If None, then no explicit priorities are set. If not None, then the length of this value must equal the number of PyRosetta protocols passed to the PyRosettaCluster.distribute method, and each int value sets the Dask scheduler priority for the tasks applied to that PyRosetta protocol.

Breadth-first task execution (default):

When all user-defined PyRosetta protocols have an identical priority (e.g., [0] * len(protocols) or None), then all tasks enter the Dask scheduler’s queue with equal priority, and Dask schedules them in roughly first-in, first-out order. When Dask worker resources are saturated, all tasks submitted to upstream PyRosetta protocols therefore run to completion before tasks for downstream PyRosetta protocols are scheduled, producing breadth-first task execution across PyRosetta protocols.

Depth-first task execution:

To allow tasks to run through all user-defined PyRosetta protocols before all tasks applied to upstream PyRosetta protocols complete, assign increasing priorities to downstream protocols (e.g., list(range(0, len(protocols) * 10, 10))). Once a task completes an upstream PyRosetta protocol, it is applied to the next downstream PyRosetta protocol with a higher priority than tasks still queued for upstream PyRosetta protocols, so tasks may run through all user-defined PyRosetta protocols to completion as Dask worker resources become available. This produces a depth-first task execution behavior across PyRosetta protocols when Dask worker resources are saturated.

See https://distributed.dask.org/en/stable/priority.html for more information.

Default: None

retries: list[int] | tuple[int, …] | int | None

A list or tuple of int objects, where each int object (≥0) sets the number of allowed automatic retries of each failed task applied to the corresponding user-defined PyRosetta protocol (i.e., indexed the same as the clients_indices keyword argument value). If an int object (≥0) is provided, then that number of allowed automatic retries applies to all PyRosetta protocols. If None is provided, then no explicit retries are allowed. If not None and not an int object, then the length of this value must equal the number of PyRosetta protocols passed to the PyRosettaCluster.distribute method, and each int value sets the number of automatic retries the Dask scheduler allows for the tasks applied to that PyRosetta protocol. Allowing retries of failed tasks may be useful if the PyRosetta protocol raises a standard Python exception, or Rosetta throws a segmentation fault in the billiard subprocess, while the Dask worker remains alive and PyRosettaCluster(ignore_errors=False) is configured. If PyRosettaCluster(ignore_errors=True) is configured, then protocols failing due to standard Python exceptions or Rosetta segmentation faults are still considered successes, and this keyword argument has no effect since these PyRosetta protocol errors are ignored. Note that if a compute resource executing a PyRosetta protocol is preempted, then the Dask worker process does not remain alive and the Dask scheduler registers the failed task as incomplete or cancelled. In this case, the number of allowed task retries is controlled by the Dask configuration parameter distributed.scheduler.allowed-failures; please use the max_task_replicas and task_registry keyword arguments of PyRosettaCluster for further configuration of task retries after compute resource preemption.

See https://distributed.dask.org/en/latest/scheduling-state.html#task-state for more information.

Default: None

Returns:

None

DATETIME_FORMAT: str = '%Y-%m-%d %H:%M:%S.%f'
REMARK_FORMAT: str = 'REMARK PyRosettaCluster: '
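
These two constants can be exercised directly: a timestamp written with DATETIME_FORMAT round-trips through datetime.strptime, and remark lines carry the REMARK_FORMAT prefix (the JSON payload below is a made-up placeholder, not the actual record format):

```python
from datetime import datetime

# The class constants as documented above.
DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S.%f"
REMARK_FORMAT = "REMARK PyRosettaCluster: "

# Timestamps written with DATETIME_FORMAT round-trip through strptime.
moment = datetime(2024, 1, 2, 3, 4, 5, 678901)
stamp = moment.strftime(DATETIME_FORMAT)
assert datetime.strptime(stamp, DATETIME_FORMAT) == moment

# Remark lines are prefixed with REMARK_FORMAT, so they can be recovered
# from a PDB string by matching that prefix (placeholder payload).
line = REMARK_FORMAT + '{"example": 1}'
assert line.startswith("REMARK PyRosettaCluster: ")
```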
__init__(*, tasks: Any = [{}], nstruct=1, input_packed_pose: Any = None, seeds: Optional[Any] = None, decoy_ids: Optional[Any] = None, client: Optional[Client] = None, clients: Optional[Union[List[Client], Tuple[Client, ...]]] = None, scheduler: Optional[str] = None, cores=1, processes=1, memory='4g', scratch_dir: Any = None, min_workers=1, max_workers=_Nothing.NOTHING, dashboard_address=':8787', project_name='2026.04.12.04.33.27.118564', simulation_name=_Nothing.NOTHING, output_path='/home/benchmark/rosetta/source/build/PyRosetta/Linux-5.4.0-84-generic-x86_64-with-glibc2.27/clang-6.0.0/python-3.11/minsizerel.serialization.thread/documentation/outputs', output_decoy_types: Any = None, output_scorefile_types: Any = None, scorefile_name='scores.json', simulation_records_in_scorefile=False, decoy_dir_name='decoys', logs_dir_name='logs', logging_level='INFO', logging_address: str = _Nothing.NOTHING, compressed=True, compression: Optional[Union[str, bool]] = True, sha1: Any = '', ignore_errors=False, timeout=0.5, max_delay_time=3.0, filter_results: Any = None, save_all=False, dry_run=False, norm_task_options: Any = None, max_task_replicas: Optional[int] = 0, task_registry: Optional[str] = None, cooldown_time=0.5, system_info: Any = None, pyrosetta_build: Any = None, security=_Nothing.NOTHING, max_nonce: int = 4096, environment: Any = None, author=None, email=None, license=None, output_init_file=_Nothing.NOTHING) None

Method generated by attrs for class PyRosettaCluster.

static _add_pose_comment(packed_pose: PackedPose, pdbfile_data: str) PackedPose

Cache simulation data as a Pose comment.

Warning: This method uses the pickle module to deserialize pickled Pose objects. Using the pickle module is not secure, so please only run with input files you trust. Learn more about the pickle module and its security here.

_cache_toml() None

Cache the Pixi/uv TOML file string and TOML file format.

_clients_dict_has_security() bool

Test if the clients_dict attribute has security enabled on all Dask clients, excluding Dask clients with LocalCluster clusters.

_close_logger() None

Close the logger for the head node process.

_close_socket_listener(clients: Dict[int, Client]) None

Close logging socket listener.

_close_socket_logger_plugins(clients: Dict[int, Client]) None

Purge cached logging socket addresses on all Dask workers.

_cooldown() None

Sleep based on the cooldown_time instance attribute.

_dump_init_file(filename: str, input_packed_pose: Optional[PackedPose] = None, output_packed_pose: Optional[PackedPose] = None, verbose: bool = True) None

Dump compressed PyRosetta initialization input files and Pose or PackedPose objects to the input filename.

Warning: This method uses the pickle module to deserialize pickled Pose objects. Using the pickle module is not secure, so please only run with input files you trust. Learn more about the pickle module and its security here.

static _dump_json(data: Dict[str, Any]) str

Return JSON-serialized data.

static _filter_scores_dict(scores_dict: Dict[str, Any]) Dict[str, Any]

Filter for JSON-serializable scoring data.
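One way such a filter can work is to keep only the entries that round-trip through json.dumps. The function below is a sketch of that idea, not PyRosettaCluster's actual implementation:

```python
import json
from typing import Any, Dict

def filter_json_serializable(scores: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the key/value pairs whose values can be JSON-serialized.
    A sketch of a scores filter; not PyRosettaCluster's internal code."""
    out: Dict[str, Any] = {}
    for key, value in scores.items():
        try:
            json.dumps(value)
        except (TypeError, ValueError):
            continue  # drop non-serializable values (e.g., arbitrary objects)
        out[key] = value
    return out

scores = {"total_score": -123.4, "tag": "decoy_1", "pose_obj": object()}
# The `pose_obj` entry is dropped; the numeric and string scores survive.
assert filter_json_serializable(scores) == {"total_score": -123.4, "tag": "decoy_1"}
```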

_format_result(result: Union[Pose, PackedPose]) Tuple[PackedPose, str, Dict[str, Any], Dict[str, Any]]

Given a Pose or PackedPose object, return a tuple object containing the Pose or PackedPose object, and its PDB string, Pose.cache dictionary, and JSON-serializable Pose.cache dictionary.

Warning: This method uses the pickle module to deserialize pickled Pose objects and arbitrary Python types in Pose.cache dictionary. Using the pickle module is not secure, so please only run with input files you trust. Learn more about the pickle module and its security here.

_get_clients_index(clients_indices: Optional[Union[List[int], Tuple[int, ...]]], protocols: Sized) int

Return the clients index for the current PyRosetta protocol.

_get_cluster() Union[LocalCluster, SGECluster, SLURMCluster]

Given user input argument values, return the requested Dask cluster instance.

_get_init_file_json(packed_pose: PackedPose) str

Return a PyRosetta initialization file as a JSON-serialized string.

Warning: This method uses the pickle module to deserialize pickled Pose objects. Using the pickle module is not secure, so please only run with input files you trust. Learn more about the pickle module and its security here.

_get_instance_and_metadata(kwargs: Dict[str, Any]) Tuple[Dict[str, Any], Dict[str, Any]]

Get the current state of the PyRosettaCluster instance, and split the input keyword arguments into the PyRosettaCluster instance attributes and ancillary metadata.

_get_output_dir(decoy_dir: str) str

Get the output directory in which to write files to disk.

_get_priority(priorities: Optional[Union[List[int], Tuple[int, ...]]], protocols: Sized) Optional[int]

Return the priority for the current PyRosetta protocol.

_get_resource(resources: Optional[Union[List[Optional[Dict[str, Union[float, int]]]], Tuple[Optional[Dict[str, Union[float, int]]], ...]]], protocols: Sized) Optional[Dict[str, Union[float, int]]]

Return the resource for the current PyRosetta protocol.

_get_retry(retries: Optional[Union[int, List[int], Tuple[int, ...]]], protocols: Sized) Optional[int]

Return the number of task retries for the current PyRosetta protocol.

_get_seed(protocols: Sized) Optional[str]

Get the PyRosetta RNG seed for the input user-defined PyRosetta protocol.

_get_task_state(protocols: List[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]]]) Tuple[List[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]]], Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]], Optional[str]]

Given the current state of the PyRosetta protocols, return a tuple object of the updated state of the PyRosetta protocols, the current PyRosetta protocol, and the current PyRosetta RNG seed.

_maybe_adapt(adaptive: Optional[Adaptive]) None

Adjust the maximum number of Dask workers.

_maybe_teardown(clients: Dict[int, Client], cluster: Optional[Union[LocalCluster, SGECluster, SLURMCluster]]) None

Teardown the Dask client and cluster.

_parse_priorities(priorities: Any) Any

Parse the priorities keyword argument value of the PyRosettaCluster.distribute method.

_parse_resources(resources: Any) Any

Parse the resources keyword argument value of the PyRosettaCluster.distribute method.

_parse_results(results: Optional[Union[bytes, Pose, PackedPose, Iterable[Union[bytes, Pose, PackedPose]]]]) List[Tuple[str, Dict[str, Any]]]

Format output results from a Dask worker.

Warning: This method uses the pickle module to deserialize pickled Pose objects and arbitrary Python types in Pose.cache dictionary. Using the pickle module is not secure, so please only run with input files you trust. Learn more about the pickle module and its security here.

Args:
results: Pose | PackedPose | bytes | Iterable[Pose | PackedPose | bytes] | None

A Pose, PackedPose, bytes, or None object, or an iterable of Pose, PackedPose, or bytes objects.

Returns:

A list object of tuple objects, where each tuple object contains a PDB string, Pose.cache dictionary, and JSON-serializable Pose.cache dictionary.

_parse_retries(retries: Any) Any

Parse the retries keyword argument value of the PyRosettaCluster.distribute method.

_process_kwargs(kwargs: Dict[str, Any]) Dict[str, Any]

Parse a returned task dictionary.

_register_socket_logger_plugin(clients: Dict[int, Client]) None

Register SocketLoggerPlugin as a Dask worker plugin on Dask clients.

_register_task_security_plugin(clients: Dict[int, Client], prk: MaskedBytes) None

Register TaskSecurityPlugin as a Dask worker plugin on Dask clients.

_save_results(results: Optional[bytes], kwargs: Dict[str, Any]) None

Write output results to disk.

Warning: This method uses the pickle module to deserialize pickled Pose objects and arbitrary Python types in Pose.cache dictionary. Using the pickle module is not secure, so please only run with input files you trust. Learn more about the pickle module and its security here.

_setup_clients_cluster_adaptive() Tuple[Dict[int, Client], Optional[Union[LocalCluster, SGECluster, SLURMCluster]], Optional[Adaptive]]

Given user input arguments, return the requested Dask client, cluster, and adaptive instance.

_setup_clients_dict() Dict[int, Client]

Setup Dask clients dictionary for PyRosettaCluster.

_setup_initial_kwargs(protocols: List[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]]], seed: Optional[str], task: Dict[str, Any]) Tuple[bytes, Dict[str, Any]]

Setup the keyword arguments for the initial user-defined task dictionary.

_setup_kwargs(kwargs: Dict[str, Any], clients_indices: Optional[Union[List[int], Tuple[int, ...]]], resources: Optional[Union[List[Optional[Dict[str, Union[float, int]]]], Tuple[Optional[Dict[str, Union[float, int]]], ...]]], priorities: Optional[Union[List[int], Tuple[int, ...]]], retries: Optional[Union[int, List[int], Tuple[int, ...]]]) Tuple[bytes, Dict[str, Any], Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]], int, Optional[Dict[str, Union[float, int]]], Optional[int], Optional[int]]

Setup the keyword arguments for the subsequent user-defined task dictionary.

_setup_logger() None

Open the logger for the head node process.

_setup_protocols_protocol_seed(args: Tuple[Any, ...], protocols: Any, clients_indices: Any, resources: Any, priorities: Any, retries: Any) Tuple[List[Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]]], Callable[[...], Union[Pose, PackedPose, Dict[str, Any], None, List[Optional[Union[Pose, PackedPose, Dict[str, Any]]]], Tuple[Optional[Union[Pose, PackedPose, Dict[str, Any]]], ...], Generator[Optional[Union[Pose, PackedPose, Dict[str, Any]]], None, None]]], Optional[str], int, Optional[Dict[str, Union[float, int]]], Optional[int], Optional[int]]

Parse, validate, and setup the user-defined PyRosetta protocol(s).

_setup_pyrosetta_init_kwargs(kwargs: Dict[str, Any]) Dict[str, Any]

Setup the pyrosetta.init function keyword arguments.

_setup_seed(kwargs: Dict[str, Any], seed: Optional[str]) Dict[str, Any]

Update the value of the “options” or “extra_options” key of the user-defined task dictionary with the -run:jran Rosetta command-line option.
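The idea of injecting the seed into the task's command-line options can be illustrated with a plain-Python sketch. The inject_seed helper below is hypothetical and only handles a str-valued "extra_options" key; the real method also covers the "options" key and other value types.

```python
from typing import Any, Dict

def inject_seed(task: Dict[str, Any], seed: str) -> Dict[str, Any]:
    """Append the -run:jran seed flag to a task's 'extra_options' string.
    Hypothetical sketch of the seed-injection idea, not the actual method."""
    task = dict(task)  # avoid mutating the caller's task dictionary
    extra = task.get("extra_options", "")
    task["extra_options"] = f"{extra} -run:jran {seed}".strip()
    return task

task = inject_seed({"extra_options": "-ex1 -ex2"}, "1111111")
assert task["extra_options"] == "-ex1 -ex2 -run:jran 1111111"
```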

_setup_socket_listener(clients: Dict[int, Client]) Tuple[Tuple[str, int], bytes]

Setup logging socket listener.

_setup_task_security_plugin(clients: Dict[int, Client]) None

Setup task security worker plugin(s).

_setup_with_nonce() bool

Post-init hook to setup the PyRosettaCluster.with_nonce instance attribute.

_write_environment_file(filename: str) None

Write the Conda/Mamba YML or uv/Pixi lock file string to the input filename. If Pixi/uv is used as the environment manager, also write the TOML file string to a separate filename.

_write_init_file() None

Maybe dump a PyRosetta initialization file.

Warning: This method uses the pickle module to deserialize pickled Pose objects. Using the pickle module is not secure, so please only run with input files you trust. Learn more about the pickle module and its security here.