Orchestrator module

class cherrypick.orchestrator.Orchestrator(train: tuple[DataFrame, Series], test: tuple[DataFrame, Series], file_dir: str, problem_statement: Literal['regression', 'classification'], seed: int = 42, focus_classifier: Literal['recall', 'precision', 'f1score'] = 'f1score', focus_regressor: Literal['mse', 'mae', 'rmse'] = 'mse')

Bases: object

Orchestrator class to automate model training, evaluation, and selection for regression and classification tasks.

Parameters:

problem_statement (str) – Type of problem based on the dataset. Must be either 'regression' or 'classification'.
focus_regressor (str, default='mse') –
Metric used for selecting the best regression estimator.
- 'mse' - Mean Squared Error
- 'mae' - Mean Absolute Error
- 'rmse' - Root Mean Squared Error
focus_classifier (str, default='f1score') –
Metric used for selecting the best classification estimator.
- 'recall' - Recall score
- 'precision' - Precision score
- 'f1score' - F1 score
train (tuple) – Training data in the tuple format (X_train, y_train).
test (tuple) – Testing data in the tuple format (X_test, y_test).
file_dir (str) – Directory where the best estimator will be saved. For example, if a folder model/ exists, use file_dir='model'.
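
The metric names accepted by focus_regressor and focus_classifier follow the standard definitions. The sketch below spells them out in plain Python for reference only. The helper names are illustrative and not part of cherrypick, and the classification helper assumes binary labels with 1 as the positive class.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall and F1, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up values for illustration only.
reg_true, reg_pred = [3.0, 5.0, 2.0, 7.0], [2.0, 5.0, 4.0, 7.0]
clf_true, clf_pred = [1, 0, 1, 1, 0], [1, 1, 0, 1, 0]

print(mse(reg_true, reg_pred), mae(reg_true, reg_pred), rmse(reg_true, reg_pred))
print(precision_recall_f1(clf_true, clf_pred))
```

The three regression metrics are errors, so smaller values are better; the three classification metrics are scores, so larger values are better.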

Examples

>>> orch = Orchestrator(
...     train=train,
...     test=test,
...     problem_statement='classification',  ## for classification
...     focus_classifier='f1score',
...     file_dir='model'
... )

>>> orch = Orchestrator(
...     train=train,
...     test=test,
...     problem_statement='regression',  ## for regression
...     focus_regressor='mae',
...     file_dir='model'
... )
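
The examples above assume that train and test already exist. One way to build the (X, y) tuples with pandas is sketched below. The column names, the toy data, and the 80/20 split are hypothetical, and the Orchestrator call is left as a comment because it needs cherrypick installed and an existing model/ folder.

```python
import pandas as pd

# Hypothetical toy dataset: two features and a binary target.
df = pd.DataFrame({
    "feature_a": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.55, 0.05, 0.65],
    "feature_b": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1],
    "target": [0, 0, 0, 1, 1, 0, 1, 1, 0, 1],
})

# Hold out 20% of the rows for testing; the seed mirrors the class default of 42.
test_df = df.sample(frac=0.2, random_state=42)
train_df = df.drop(test_df.index)

# Each tuple is (features as a DataFrame, target as a Series).
train = (train_df.drop(columns="target"), train_df["target"])
test = (test_df.drop(columns="target"), test_df["target"])

# from cherrypick.orchestrator import Orchestrator
# orch = Orchestrator(train=train, test=test, file_dir="model",
#                     problem_statement="classification")
```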

property best_estimator

Returns the best-performing trained model.

Code

>>> orch.best_estimator

Returns:

The estimator with the highest performance based on the selected evaluation metric.

Return type:

object

orchestrate()

orchestrate() runs the model orchestration: it trains the models and cherry-picks the best estimator according to the Orchestrator configuration.

Code

>>> orch.orchestrate() ## runs the full model training and selects the best model based on the Orchestrator() configuration
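
Conceptually, picking the best estimator means comparing candidates on the configured focus metric. The sketch below is an assumption about that selection step, not cherrypick's implementation: the model names and scores are made up, and it assumes that 'mse', 'mae' and 'rmse' are minimized while 'recall', 'precision' and 'f1score' are maximized.

```python
ERROR_METRICS = {"mse", "mae", "rmse"}  # assumed: lower is better

def pick_best(scores, metric):
    """Return the name of the model with the best score for the given metric."""
    chooser = min if metric in ERROR_METRICS else max
    return chooser(scores, key=scores.get)

# Hypothetical scores for illustration only.
regression_scores = {"linear": 4.2, "forest": 3.1, "boosting": 3.4}
classification_scores = {"logistic": 0.81, "forest": 0.86, "svm": 0.88}

best_regressor = pick_best(regression_scores, "mae")
best_classifier = pick_best(classification_scores, "f1score")
print(best_regressor, best_classifier)
```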

cv(type_cv: str, param_grid: dict, scoring_type: str, n_jobs: int = -1, cv: int = 5)

Under development; not yet available.

critique(cv: int = 5, scoring: str = 'neg_mean_squared_error', topkmodel: int | None = None) → str

Evaluate model generalization and diagnose overfitting or underfitting (bias-variance tradeoff) using cross-validation.

Parameters:

cv (int, default=5) – Number of cross-validation folds.
scoring (str, default='neg_mean_squared_error') – Metric used to evaluate each cross-validation fold.
topkmodel (int, optional) – Index of the model to evaluate from the top-k selected models. If not provided, the best estimator is used by default.

Returns:

Alert message including the relative gap:

\[\text{Relative Gap} = \frac{\text{overfitting\_gap}}{\text{MSE (training)}}\]

Where:

overfitting_gap = mean cross-validation score − training MSE

Return type:

str

Notes

Helps identify whether the model is overfitting or underfitting.
High variance indicates overfitting.
High bias indicates underfitting.
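
The relative gap defined under Returns can be worked through by hand. The sketch below uses made-up numbers and assumes the cross-validation scores are expressed as positive MSE values. scikit-learn's 'neg_mean_squared_error' scorer reports negated MSE, so its sign would need flipping first; whether critique() does this internally is not stated here.

```python
def relative_gap(train_mse, cv_mse_scores):
    """Relative gap = (mean CV score - training MSE) / training MSE."""
    mean_cv = sum(cv_mse_scores) / len(cv_mse_scores)
    overfitting_gap = mean_cv - train_mse
    return overfitting_gap / train_mse

# Hypothetical values: training MSE and one positive MSE per fold (cv=5).
train_mse = 2.0
cv_mse_scores = [2.4, 2.6, 2.5, 2.3, 2.7]

gap = relative_gap(train_mse, cv_mse_scores)
print(f"relative gap: {gap:.2f}")
```

Under these assumptions, a gap near zero means the cross-validated error is close to the training error, while a large positive gap means the model does noticeably worse on held-out folds.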

Code

>>> orch.critique(cv=n, scoring='neg_mean_squared_error') ## checks the bias-variance trade-off for the best model
>>> orch.critique(cv=n, scoring='neg_mean_squared_error', topkmodel=model) ## checks the bias-variance trade-off for a model chosen by its top-k index

topkmodel(access_estimator: int | None = None, threshold: float | int | None = None) → DataFrame | None

Retrieve top-performing models or access a specific estimator.

Parameters:

access_estimator (int, optional) – Index of the estimator to access from the ranked model list (1st, 2nd, …, nth). If provided, returns the selected estimator.
threshold (float or int, optional) – Threshold value used to filter models based on the evaluation metric. Only models meeting the threshold criteria are returned.

Returns:

DataFrame containing ranked models and their evaluation metrics.

If access_estimator is provided, returns the selected estimator instead of a DataFrame.

Return type:

pandas.DataFrame or None

Notes

Models are ranked based on the selected evaluation metric.
If both parameters are None, the full ranked model table is returned.
If access_estimator is provided, threshold filtering is ignored.
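
The ranking rules in these notes can be mimicked on a plain DataFrame. This is a sketch under assumptions rather than the library's code: the leaderboard contents, the f1score column, the top_models helper, and the choice of >= for the threshold comparison are all hypothetical, and the helper returns a model name where topkmodel() would return the estimator itself.

```python
import pandas as pd

# Hypothetical leaderboard; real column names may differ.
leaderboard = pd.DataFrame({
    "model": ["logistic", "forest", "svm", "knn"],
    "f1score": [0.81, 0.86, 0.88, 0.74],
})

# Rank by the focus metric, best first, with 1-based ranks as the index.
ranked = leaderboard.sort_values("f1score", ascending=False).reset_index(drop=True)
ranked.index = ranked.index + 1

def top_models(ranked, access_estimator=None, threshold=None):
    """Illustrative stand-in for the precedence rules described above."""
    if access_estimator is not None:
        # A rank was requested, so threshold filtering is ignored.
        return ranked.loc[access_estimator, "model"]
    if threshold is not None:
        # Assumed direction: keep models scoring at or above the threshold.
        return ranked[ranked["f1score"] >= threshold]
    return ranked  # both None: full ranked table

print(top_models(ranked))
print(top_models(ranked, threshold=0.85))
print(top_models(ranked, access_estimator=1))
```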

Code

>>> orch.topkmodel() ## returns leaderboard of top K models
>>> orch.topkmodel(access_estimator=n) ## returns the chosen model at the nth rank (1st - nth)

auto_explain(n_classes: int | None = None, size: tuple | None = None, model: str = 'best')

Generate SHAP-based explanations for trained models.

Provides automatic model interpretability using SHAP (SHapley Additive Explanations) for both regression and classification tasks. Supports TreeExplainer and LinearExplainer, along with visualization tools such as summary plots and bar plots.

Parameters:

n_classes (int, optional) – Number of unique output classes. Required for classification tasks.
size (tuple, optional) – Figure size used for resizing Decision Tree visualizations.
model (str, default='best') –
Specifies which model to explain.
- 'best' : Uses the top-performing estimator.

Returns:

Generates SHAP visualizations such as summary plots and bar plots. For classification tasks, n_classes must be specified.

Return type:

None

Notes

Uses SHAP TreeExplainer for tree-based models.
Uses SHAP LinearExplainer for linear models.
Supports both regression and classification workflows.
Outputs include summary plots and feature importance visualizations.
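
For intuition about what the plots show: SHAP assigns each feature an additive contribution to a single prediction. For a linear model with features treated as independent, that contribution has the closed form w_i * (x_i - mean(x_i)), and the contributions plus the average prediction add up to the model output. The sketch below checks this by hand in plain Python. The weights and data are made up, and it neither calls the shap library nor reflects how auto_explain() is implemented.

```python
# Hypothetical linear model: prediction = bias + sum(w_i * x_i).
weights = [2.0, -1.0, 0.5]
bias = 1.0

# Hypothetical background data used to compute feature means.
background = [
    [1.0, 2.0, 0.0],
    [3.0, 0.0, 4.0],
    [2.0, 4.0, 2.0],
]

def predict(x):
    return bias + sum(w * v for w, v in zip(weights, x))

n_rows = len(background)
feature_means = [sum(row[j] for row in background) / n_rows
                 for j in range(len(weights))]

# Base value: the average prediction over the background data.
base_value = sum(predict(row) for row in background) / n_rows

def linear_shap(x):
    """Per-feature contributions for a linear model with independent features."""
    return [w * (v - m) for w, v, m in zip(weights, x, feature_means)]

sample = [3.0, 1.0, 2.0]
contributions = linear_shap(sample)

# Additivity: base value plus contributions reproduces the prediction.
print(contributions, base_value + sum(contributions), predict(sample))
```

A bar plot of mean absolute contributions per feature is what a SHAP bar plot summarizes; the summary plot shows the same contributions for every row.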

Code

>>> orch.auto_explain() ## explanation for the best model
>>> orch.auto_explain(n_classes=class_ids) ## explanation for the best model in a classification task
>>> orch.auto_explain(model=model) ## explanation for a custom model