Orchestrator module
- class cherrypick.orchestrator.Orchestrator(train: tuple[DataFrame, Series], test: tuple[DataFrame, Series], file_dir: str, problem_statement: Literal['regression', 'classification'], seed: int = 42, focus_classifier: Literal['recall', 'precision', 'f1score'] = 'f1score', focus_regressor: Literal['mse', 'mae', 'rmse'] = 'mse')
Bases: object
Orchestrator class to automate model training, evaluation, and selection for regression and classification tasks.
- Parameters:
problem_statement (str) – Type of problem based on the dataset. Must be either 'regression' or 'classification'.
focus_regressor (str, default='mse') –
Metric used for selecting the best regression estimator.
'mse' - Mean Squared Error
'mae' - Mean Absolute Error
'rmse' - Root Mean Squared Error
focus_classifier (str, default='f1score') –
Metric used for selecting the best classification estimator.
'recall' - Recall score
'precision' - Precision score
'f1score' - F1 score
train (tuple) – Training data in the tuple format (X_train, y_train).
test (tuple) – Testing data in the tuple format (X_test, y_test).
file_dir (str) – Directory where the best estimator will be saved. For example, if a folder model/ exists, use file_dir='model'.
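To make the focus metrics concrete, the sketch below computes each of them by hand on toy data using only the standard library. The helper names (`mse`, `mae`, `rmse`, `precision_recall_f1`) and the data are hypothetical, and the classification part assumes a binary problem with positive label 1; how Orchestrator itself computes or averages these scores is not specified here.

```python
import math


def mse(y_true, y_pred):
    """Mean Squared Error: average of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean Absolute Error: average of the absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))


def precision_recall_f1(y_true, y_pred, positive=1):
    """Binary precision, recall and F1 for the given positive label (assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy regression targets and predictions (made-up numbers).
y_true_reg = [3.0, 5.0, 7.0, 9.0]
y_pred_reg = [2.0, 5.0, 8.0, 11.0]
reg_scores = {
    "mse": mse(y_true_reg, y_pred_reg),
    "mae": mae(y_true_reg, y_pred_reg),
    "rmse": rmse(y_true_reg, y_pred_reg),
}
print(reg_scores)

# Toy binary labels and predictions (made-up numbers).
y_true_clf = [1, 0, 1, 1, 0, 1]
y_pred_clf = [1, 0, 0, 1, 1, 1]
precision, recall, f1 = precision_recall_f1(y_true_clf, y_pred_clf)
print(precision, recall, f1)
```

For the regression metrics lower is better, while for the classification metrics higher is better, so any ranking built on them has to respect the direction of the chosen metric.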
Examples
>>> orch = Orchestrator(
...     train=train,
...     test=test,
...     problem_statement='classification',  # for classification
...     focus_classifier='f1score',
...     file_dir='model',
... )
>>> orch = Orchestrator(
...     train=train,
...     test=test,
...     problem_statement='regression',
...     focus_regressor='mae',
...     file_dir='model',
... )
- property best_estimator
Returns the best-performing trained model.
Code
>>> orch.best_estimator
- Returns:
The estimator that performs best on the selected evaluation metric.
- Return type:
object
- orchestrate()
Runs the full workflow: trains the candidate models, evaluates them, and cherry-picks the best estimator according to the Orchestrator configuration.
Code
>>> orch.orchestrate()  # trains all models and selects the best one based on the Orchestrator() configuration
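The general train, evaluate, and select pattern can be sketched in plain Python. This is an illustration of the idea only, not the library's implementation: the two toy estimators, the `select_best` helper, and the use of plain lists instead of a DataFrame and Series are all assumptions made to keep the example self-contained.

```python
class MeanRegressor:
    """Hypothetical baseline: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class SimpleLinearRegressor:
    """Hypothetical one-feature least-squares line y = slope * x + intercept."""

    def fit(self, X, y):
        n = len(X)
        mean_x, mean_y = sum(X) / n, sum(y) / n
        sxx = sum((x - mean_x) ** 2 for x in X)
        sxy = sum((x - mean_x) * (t - mean_y) for x, t in zip(X, y))
        self.slope_ = sxy / sxx
        self.intercept_ = mean_y - self.slope_ * mean_x
        return self

    def predict(self, X):
        return [self.slope_ * x + self.intercept_ for x in X]


def mean_absolute_error(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def select_best(candidates, train, test, metric):
    """Fit every candidate on train, score it on test, return the lowest-error one."""
    X_train, y_train = train
    X_test, y_test = test
    fitted, scores = {}, {}
    for name, estimator in candidates.items():
        fitted[name] = estimator.fit(X_train, y_train)
        scores[name] = metric(y_test, fitted[name].predict(X_test))
    best_name = min(scores, key=scores.get)  # error metric: lower is better
    return best_name, fitted[best_name], scores


# Toy data in the same (X, y) tuple layout that Orchestrator expects.
train = ([1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 7.8])
test = ([5.0, 6.0], [10.1, 11.9])

best_name, best_model, scores = select_best(
    {"mean": MeanRegressor(), "linear": SimpleLinearRegressor()},
    train,
    test,
    mean_absolute_error,
)
print(best_name, scores)
```

With a score where higher is better, such as F1, the selection step would use `max` instead of `min`.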
- cv(type_cv: str, param_grid: dict, scoring_type: str, n_jobs: int = -1, cv: int = 5)
Under development; will be available soon.
- critique(cv: int = 5, scoring: str = 'neg_mean_squared_error', topkmodel: int | None = None) → str
Evaluate model generalization and diagnose overfitting or underfitting (bias-variance tradeoff) using cross-validation.
- Parameters:
cv (int, default=5) – Number of cross-validation folds.
scoring (str, default='neg_mean_squared_error') – Metric used to evaluate each cross-validation fold.
topkmodel (int, optional) – Index of the model to evaluate from the top-k selected models. If not provided, the best estimator is used by default.
- Returns:
Alert message including the relative gap:
\[\text{Relative Gap} = \frac{\text{overfitting\_gap}}{\text{MSE (training)}}\]
Where:
overfitting_gap = mean cross-validation score − training MSE
- Return type:
str
Notes
Helps identify whether the model is overfitting or underfitting.
High variance indicates overfitting.
High bias indicates underfitting.
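The relative gap from the Returns section can be worked through by hand. The sketch below is a minimal illustration with made-up fold scores; the helper name `relative_gap` and its `scores_are_negated` flag are hypothetical. It assumes that scores produced with scoring='neg_mean_squared_error' are negated MSE values (as scikit-learn reports them) and flips the sign before averaging; whether critique() does the same internally is an assumption.

```python
def relative_gap(train_mse, cv_scores, scores_are_negated=True):
    """Return (overfitting_gap, relative_gap) following the documented formula.

    overfitting_gap = mean cross-validation score - training MSE
    relative_gap    = overfitting_gap / training MSE
    """
    # Assumption: negated scores are converted back to positive MSE values.
    fold_mse = [-s for s in cv_scores] if scores_are_negated else list(cv_scores)
    mean_cv_mse = sum(fold_mse) / len(fold_mse)
    overfitting_gap = mean_cv_mse - train_mse
    return overfitting_gap, overfitting_gap / train_mse


# Made-up numbers: training MSE of 2.0 and five cross-validation folds.
cv_scores = [-2.0, -2.5, -3.0, -2.5, -2.5]
gap, rel = relative_gap(train_mse=2.0, cv_scores=cv_scores)
print(f"overfitting gap = {gap:.2f}, relative gap = {rel:.2f}")
# prints "overfitting gap = 0.50, relative gap = 0.25"
```

Here the cross-validated error is 25% higher than the training error. A gap near zero means the model performs similarly on seen and unseen data, while a large positive gap points toward overfitting.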
Code
>>> orch.critique(cv=n, scoring='neg_mean_squared_error')  # sanity check (bias-variance tradeoff) for the best model
>>> orch.critique(cv=n, scoring='neg_mean_squared_error', topkmodel=model)  # sanity check (bias-variance tradeoff) for a custom model
- topkmodel(access_estimator: int | None = None, threshold: float | int | None = None) → DataFrame | None
Retrieve top-performing models or access a specific estimator.
- Parameters:
access_estimator (int, optional) – Index of the estimator to access from the ranked model list (1st, 2nd, …, nth). If provided, returns the selected estimator.
threshold (float or int, optional) – Threshold value used to filter models based on the evaluation metric. Only models meeting the threshold criteria are returned.
- Returns:
DataFrame containing ranked models and their evaluation metrics.
If access_estimator is provided, returns the selected estimator instead of a DataFrame.
- Return type:
pandas.DataFrame or None
Notes
Models are ranked based on the selected evaluation metric.
If both parameters are None, the full ranked model table is returned.
If access_estimator is provided, threshold filtering is ignored.
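The interplay between the two parameters described in the notes above can be mimicked with a small stand-in. This is a sketch of the documented behaviour, not the library's code: `top_k_models` is a hypothetical helper, the leaderboard is a plain list of dicts rather than a DataFrame, and it assumes a higher-is-better score, a minimum-value threshold, and 1-based ranks.

```python
def top_k_models(leaderboard, access_estimator=None, threshold=None):
    """Rank models by score; return one estimator or the (filtered) ranking."""
    # Assumption: higher score is better, so sort in descending order.
    ranked = sorted(leaderboard, key=lambda row: row["score"], reverse=True)
    if access_estimator is not None:
        # Assumption: ranks are 1-based; the threshold is ignored in this branch.
        return ranked[access_estimator - 1]["estimator"]
    if threshold is not None:
        # Assumption: the threshold is a minimum acceptable score.
        ranked = [row for row in ranked if row["score"] >= threshold]
    return ranked


# Made-up leaderboard; the "estimator" strings stand in for fitted models.
leaderboard = [
    {"name": "tree", "score": 0.81, "estimator": "tree-model"},
    {"name": "logreg", "score": 0.86, "estimator": "logreg-model"},
    {"name": "knn", "score": 0.74, "estimator": "knn-model"},
]

full_table = top_k_models(leaderboard)
filtered = top_k_models(leaderboard, threshold=0.8)
second_best = top_k_models(leaderboard, access_estimator=2)
first_despite_threshold = top_k_models(leaderboard, access_estimator=1, threshold=0.99)

print([row["name"] for row in full_table])
print([row["name"] for row in filtered])
print(second_best, first_despite_threshold)
```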
Code
>>> orch.topkmodel()  # returns the leaderboard of the top-k models
>>> orch.topkmodel(access_estimator=n)  # returns the chosen model at rank n (1st to nth)
- auto_explain(n_classes: int | None = None, size: tuple | None = None, model: str = 'best')
Generate SHAP-based explanations for trained models.
Provides automatic model interpretability using SHAP (SHapley Additive Explanations) for both regression and classification tasks. Supports TreeExplainer and LinearExplainer, along with visualization tools such as summary plots and bar plots.
- Parameters:
n_classes (int) – Number of unique output classes. Required for classification tasks.
size (tuple, optional) – Figure size used for resizing Decision Tree visualizations.
model (str, default='best') –
Specifies which model to explain.
'best': Uses the top-performing estimator.
- Returns:
Generates SHAP visualizations such as summary plots and bar plots. For classification tasks, n_classes must be specified.
- Return type:
None
Notes
Uses SHAP TreeExplainer for tree-based models.
Uses SHAP LinearExplainer for linear models.
Supports both regression and classification workflows.
Outputs include summary plots and feature importance visualizations.
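To give a feel for what a SHAP value represents, the sketch below computes exact Shapley values for a toy linear model by brute force over all feature orderings. It does not use the shap library and says nothing about how auto_explain works internally; the helper `exact_shapley`, the weights, and the baseline point are all hypothetical. SHAP's TreeExplainer and LinearExplainer use much faster, model-specific algorithms.

```python
from itertools import permutations
from math import factorial


def exact_shapley(predict, x, baseline):
    """Exact Shapley values of predict at x, relative to a single baseline point.

    Features that are "absent" from a coalition take their baseline value.
    The cost grows factorially with the number of features, so this is only
    practical for tiny examples.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]              # switch feature i to its real value
            updated = predict(current)
            phi[i] += updated - previous   # marginal contribution of feature i
            previous = updated
    n_orders = factorial(n)
    return [total / n_orders for total in phi]


# Hypothetical linear model: f(x) = bias + sum(w_i * x_i).
weights = [2.0, -1.0, 0.5]
bias = 1.0


def linear_model(features):
    return bias + sum(w * f for w, f in zip(weights, features))


x = [3.0, 2.0, 4.0]          # instance to explain
baseline = [1.0, 1.0, 2.0]   # reference point, e.g. feature means

phi = exact_shapley(linear_model, x, baseline)
print(phi)
print(sum(phi), linear_model(x) - linear_model(baseline))
```

For a linear model and a single baseline point, each value reduces to w_i * (x_i - baseline_i), and the values always sum to the difference between the prediction and the baseline prediction. Summary and bar plots aggregate such per-feature contributions across many rows.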
Code
>>> orch.auto_explain()  # explanation for the best model
>>> orch.auto_explain(n_classes=class_ids)  # explanation for the best classification model
>>> orch.auto_explain(model=model)  # explanation for a custom model