Preprocessing module

class cherrypick.preprocessing.Preprocessor(df, duplicate: Literal['drop'] | None = None)

Bases: object

Split dataset into training and testing sets.

Parameters:

df (pandas.DataFrame) – Input dataset containing features and target variable.
duplicate ({'drop'} or None, default=None) – If 'drop', duplicate rows are removed; otherwise the dataset is left unchanged.

Examples

>>> preprocess = Preprocessor(df=df) ## Initialization of Preprocessor
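
As a rough illustration of what duplicate='drop' describes, the sketch below removes duplicate rows with plain pandas. It assumes the option behaves like DataFrame.drop_duplicates; the helper name drop_duplicate_rows is hypothetical and not part of cherrypick.

```python
import pandas as pd

# Hypothetical helper, not part of cherrypick: approximates duplicate='drop'
# under the assumption that it maps to DataFrame.drop_duplicates.
def drop_duplicate_rows(df, duplicate=None):
    if duplicate == "drop":
        # Keep the first occurrence of each fully identical row.
        return df.drop_duplicates().reset_index(drop=True)
    # Any other value leaves the dataset as it was passed in.
    return df

raw = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})
deduped = drop_duplicate_rows(raw, duplicate="drop")  # two rows remain
untouched = drop_duplicate_rows(raw)                  # all three rows remain
```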

fill_null(type: Literal['mean', 'median', 'mode'], columns: str) → DataFrame

Method providing several imputers for handling null values in a given dataset.

Parameters:

columns (str) – Name of the input column to impute.
type ({'mean', 'median', 'mode'}) –
Determines the imputation technique.
- mean - Replaces NaN values with the mean of all samples in the feature.
- median - Replaces NaN values with the median of all samples in the feature; robust to outliers.
- mode - Replaces NaN values with the most frequently occurring value in the feature; used for categorical features.

Examples

>>> preprocess.fill_null(type='mean', columns=column) ## Fill nulls by taking mean of all values in specified column
>>> preprocess.fill_null(type='median', columns=column) ## Fill nulls by taking median of all values in specified column
>>> preprocess.fill_null(type='mode', columns=column) ## Fill nulls by taking mode of all values in specified categorical column

>>> print(df.isna().sum()) ## returns 0 as all nulls are handled using imputers

Returns: Cleaned dataset with NaN values handled by mean, median, or mode imputation.
Return type: pd.DataFrame
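
The three imputation strategies can be sketched with plain pandas as follows. This is a minimal illustration under the assumption that each strategy reduces to a column statistic passed to fillna; the function fill_null_sketch and the sample data are hypothetical and do not reproduce cherrypick's internals.

```python
import pandas as pd

# Hypothetical stand-in for fill_null, not cherrypick's implementation.
def fill_null_sketch(df, column, strategy):
    """Return a copy of df with NaNs in `column` replaced by a column statistic."""
    out = df.copy()
    if strategy == "mean":
        value = out[column].mean()
    elif strategy == "median":
        value = out[column].median()
    elif strategy == "mode":
        # mode() ignores NaN and may return several values; take the first.
        value = out[column].mode().iloc[0]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out[column] = out[column].fillna(value)
    return out

sample = pd.DataFrame({
    "age": [20.0, None, 40.0, 100.0],         # numeric feature with an outlier
    "city": ["Pune", None, "Pune", "Delhi"],  # categorical feature
})
by_mean = fill_null_sketch(sample, "age", "mean")      # pulled up by the outlier
by_median = fill_null_sketch(sample, "age", "median")  # less affected by it
by_mode = fill_null_sketch(sample, "city", "mode")     # most frequent category
```

With the outlier value 100 present, the mean fill lands well above the median fill, which is why the median is the usual choice for skewed numeric columns.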

collinear(threshold: float, show: Literal[True, False] = False, method: Literal['spearman', 'pearson'] = 'pearson')

Handle multicollinearity by identifying highly correlated features.

This method detects correlated features based on a specified correlation threshold and statistical method.

Parameters:

threshold (float) – Correlation threshold above which features are considered collinear.
method ({'spearman', 'pearson'}, default='pearson') –
Type of correlation method used:
- spearman : Captures monotonic relationships using rank correlation
- pearson : Measures linear correlation between features
show ({True, False}, default=False) – If True, shows the correlation matrix along with the list of collinear features; otherwise only the list of collinear features is returned.

Returns:

List of correlated features recommended for removal.

Return type:

list

Notes

Helps reduce multicollinearity in datasets.
Improves model stability and interpretability.

Examples

>>> preprocess.collinear(threshold=0.85, method='pearson', show=True)
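
One common way to find features above a correlation threshold is to scan the upper triangle of the absolute correlation matrix. The sketch below shows that approach with pandas and NumPy; it is an assumption about how such a check can be done, not a description of how collinear is implemented, and collinear_sketch and the sample columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for collinear, not cherrypick's implementation.
def collinear_sketch(df, threshold, method="pearson"):
    """List columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = df.select_dtypes("number").corr(method=method).abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

features = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x2": [2, 4, 6, 8, 10.5],  # almost a multiple of x
    "noise": [5, 1, 4, 2, 3],  # weakly related to x
})
pearson_hits = collinear_sketch(features, threshold=0.85, method="pearson")
spearman_hits = collinear_sketch(features, threshold=0.85, method="spearman")
```

Because only the later column of each correlated pair is reported, dropping the returned columns keeps one representative of every correlated group.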

encoder(type: Literal['onehot', 'label'], train_data: tuple, test_data: tuple, column: str, encoder_dir: str)

Method providing OneHotEncoder and LabelEncoder for converting a categorical feature in a given dataset into numeric data.

Parameters:

column (str) – Name of the categorical column to encode.
type ({'onehot', 'label'}) –
Determines which encoder is applied.
- onehot - Uses one-hot encoding on a categorical column.
  Returns a sparse matrix containing 1’s and 0’s.
- label - Uses LabelEncoder on the target (output) column if it contains non-numeric categorical data.
  Returns integer labels based on the number of classes.
train_data (tuple) – Training data, i.e. X_train and y_train.
test_data (tuple) – Testing data, i.e. X_test and y_test.
encoder_dir (str) – Directory where the fitted encoder is persisted.

Examples

>>> preprocess.encoder(train_data = train, test_data = test, column = 'feature1', type = 'onehot', encoder_dir='encoder_') ## Onehot encodings for Input Features, returns Sparse matrix
>>> preprocess.encoder(train_data = train, test_data = test, column = 'target', type = 'label', encoder_dir='encoder_') ## label encodings for Output/target feature

Returns:

For OneHotEncoder returns encoded X_train and X_test.

For LabelEncoder returns encoded y_train and y_test.

Return type:

pd.DataFrame
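
The fit-on-train, apply-to-test pattern behind both encodings can be sketched as below. To stay dependency-light, plain pandas stands in for scikit-learn's OneHotEncoder and LabelEncoder, so the output is a dense DataFrame rather than a sparse matrix. All helper names, the pickle file name, and the sample data are hypothetical, and the on-disk format cherrypick uses under encoder_dir is not known from this page.

```python
import os
import pickle
import tempfile

import pandas as pd

# Hypothetical helpers, not part of cherrypick.
def fit_one_hot(train_col, prefix):
    """Encode the training column and remember its dummy column names."""
    encoded = pd.get_dummies(train_col, prefix=prefix, dtype=int)
    return encoded, list(encoded.columns)

def apply_one_hot(col, prefix, columns):
    """Encode with the training columns; unseen categories become all-zero rows."""
    encoded = pd.get_dummies(col, prefix=prefix, dtype=int)
    return encoded.reindex(columns=columns, fill_value=0)

def fit_labels(train_target):
    """Map each class seen in training to an integer, in sorted order."""
    classes = sorted(train_target.dropna().unique().tolist())
    return {cls: idx for idx, cls in enumerate(classes)}

X_train = pd.DataFrame({"feature1": ["red", "blue", "red"]})
X_test = pd.DataFrame({"feature1": ["blue", "green"]})  # 'green' is unseen
y_train = pd.Series(["yes", "no", "yes"])
y_test = pd.Series(["no", "yes"])

# One-hot encoding for an input feature, fitted on training data only.
X_train_enc, onehot_columns = fit_one_hot(X_train["feature1"], "feature1")
X_test_enc = apply_one_hot(X_test["feature1"], "feature1", onehot_columns)

# Label encoding for the target, reusing the training mapping on test data.
label_map = fit_labels(y_train)
y_train_enc = y_train.map(label_map)
y_test_enc = y_test.map(label_map)

# Persist what was learned so later data can be encoded the same way.
encoder_dir = tempfile.mkdtemp(prefix="encoder_")
encoder_path = os.path.join(encoder_dir, "feature1_encoder.pkl")
with open(encoder_path, "wb") as fh:
    pickle.dump({"columns": onehot_columns, "labels": label_map}, fh)
```

Fitting on the training split alone keeps test-set categories from leaking into the encoder, which is the usual reason for passing train and test data separately.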