Preprocessing module

class cherrypick.preprocessing.Preprocessor(df, duplicate: Literal['drop'] | None = None)

Bases: object

Split dataset into training and testing sets.

Parameters:
  • df (pandas.DataFrame) – Input dataset containing features and target variable.

  • duplicate ({'drop'} or None, default=None) – If 'drop', duplicate rows are removed; otherwise the dataset is left unchanged.
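
The effect of duplicate='drop' can be approximated with plain pandas. This is a minimal sketch, assuming the option behaves like DataFrame.drop_duplicates on whole rows; the toy data is hypothetical and the library's internals are not confirmed here.

```python
import pandas as pd

# Hypothetical toy data: the second and third rows are exact duplicates.
df = pd.DataFrame({"a": [1, 2, 2], "b": ["x", "y", "y"]})

# Assumption: duplicate='drop' is equivalent to dropping fully identical rows.
deduped = df.drop_duplicates()

print(len(df), len(deduped))
```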

Examples

>>> preprocess = Preprocessor(df=df) ## Initialization of Preprocessor

fill_null(type: Literal['mean', 'median', 'mode'], columns: str) → DataFrame

Provides several imputers for handling null values in a given dataset.

Parameters:
  • columns (str) – Name of the column to impute.

  • type ({'mean', 'median', 'mode'}) –

    Imputation technique to apply.

    • mean - Replaces each NaN with the mean of the feature's non-null values.

    • median - Replaces each NaN with the median of the feature's non-null values; robust to outliers.

    • mode - Replaces each NaN with the most frequently occurring value in the feature; suited to categorical features.

Examples

>>> preprocess.fill_null(type='mean', columns=column) ## Fill nulls by taking mean of all values in specified column
>>> preprocess.fill_null(type='median', columns=column) ## Fill nulls by taking median of all values in specified column
>>> preprocess.fill_null(type='mode', columns=column) ## Fill nulls by taking mode of all values in specified categorical column
>>> print(df.isna().sum()) ## returns 0 as all nulls are handled using Imputers

Returns:

Dataset with the NaN values in the specified column replaced by the chosen mean, median, or mode statistic.

Return type:

pd.DataFrame
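
The three imputation strategies described for fill_null can be illustrated with plain pandas. This is a minimal sketch under stated assumptions: impute_column is a hypothetical helper, not part of cherrypick, and the toy columns are invented for illustration.

```python
import numpy as np
import pandas as pd


def impute_column(df: pd.DataFrame, column: str, strategy: str) -> pd.DataFrame:
    """Hypothetical helper: fill NaNs in one column with its mean, median, or mode."""
    out = df.copy()
    if strategy == "mean":
        fill_value = out[column].mean()
    elif strategy == "median":
        fill_value = out[column].median()
    elif strategy == "mode":
        # mode() returns a Series because ties are possible; take the first entry.
        fill_value = out[column].mode().iloc[0]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    out[column] = out[column].fillna(fill_value)
    return out


df = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 100.0],        # numeric feature with one gap
    "city": ["Pune", "Pune", None, "Delhi"],   # categorical feature with one gap
})

by_mean = impute_column(df, "age", "mean")
by_median = impute_column(df, "age", "median")
by_mode = impute_column(df, "city", "mode")
```

With the outlier 100.0 in the column, the mean fill is pulled upward while the median fill stays at the middle value, which is why the median is usually preferred for skewed features.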

collinear(threshold: float, show: Literal[True, False] = False, method: Literal['spearman', 'pearson'] = 'pearson')

Handle multicollinearity by identifying highly correlated features.

This method detects correlated features based on a specified correlation threshold and statistical method.

Parameters:
  • threshold (float) – Correlation threshold above which features are considered collinear.

  • method ({'spearman', 'pearson'}, default='pearson') –

    Type of correlation method used:

    • spearman : Captures monotonic relationships using rank correlation

    • pearson : Measures linear correlation between features

  • show ({True, False}, default=False) – If True, displays the correlation matrix along with the list of collinear features; otherwise only the list is returned.

Returns:

List of correlated features recommended for removal.

Return type:

list

Notes

  • Helps reduce multicollinearity in datasets.

  • Improves model stability and interpretability.
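
One common way to detect collinear features against a threshold is to scan the upper triangle of the absolute correlation matrix. The sketch below assumes that approach; find_collinear is a hypothetical helper and is not confirmed to match how collinear is implemented.

```python
import numpy as np
import pandas as pd


def find_collinear(df: pd.DataFrame, threshold: float, method: str = "pearson") -> list:
    """Hypothetical helper: columns whose absolute correlation with an
    earlier column exceeds the threshold."""
    corr = df.corr(method=method).abs()
    # Keep only the strict upper triangle so each pair is inspected once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "x_doubled": [2.0, 4.0, 6.0, 8.0, 10.0],   # perfectly correlated with x
    "noise": [3.0, -1.0, 4.0, -1.0, 5.0],      # weakly correlated with x
})

to_drop = find_collinear(df, threshold=0.85)
print(to_drop)
```

Only the later column of each correlated pair is reported, so dropping the returned columns keeps one representative of every collinear group.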

Examples

>>> preprocess.collinear(threshold=0.85, method='pearson', show=True)

encoder(type: Literal['onehot', 'label'], train_data: tuple, test_data: tuple, column: str, encoder_dir: str)

Provides OneHotEncoder and LabelEncoder for converting a categorical feature in a given dataset into numeric data.

Parameters:
  • column (str) – Name of the column to encode.

  • type ({'onehot', 'label'}) –

    Type of encoder to apply.

    • onehot - Applies one-hot encoding to a categorical column.

      Returns a sparse matrix of 1s and 0s.

    • label - Applies LabelEncoder to the target (output) column when it contains non-numeric categorical data.

      Returns integer-encoded labels, one distinct integer per class.

  • train_data (tuple) – Training data, i.e. X_train and y_train.

  • test_data (tuple) – Testing data, i.e. X_test and y_test.

  • encoder_dir (str) – Directory in which the fitted encoder is saved.

Examples

>>> preprocess.encoder(train_data = train, test_data = test, column = 'feature1', type = 'onehot', encoder_dir='encoder_') ## One-hot encoding for input features, returns sparse matrix
>>> preprocess.encoder(train_data = train, test_data = test, column = 'target', type = 'label', encoder_dir='encoder_') ## Label encoding for output/target feature

Returns:

For 'onehot', the encoded X_train and X_test.

For 'label', the encoded y_train and y_test.

Return type:

pd.DataFrame
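
The fit-on-train, transform-on-test pattern that encoder describes can be sketched with scikit-learn directly. This assumes the method wraps scikit-learn's OneHotEncoder and LabelEncoder, which the source does not confirm; the toy data, the pickle-based persistence, and the file name onehot.pkl are all hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical toy data; "blue" appears only in the test split.
X_train = pd.DataFrame({"color": ["red", "green", "red"]})
X_test = pd.DataFrame({"color": ["green", "blue"]})
y_train = pd.Series(["cat", "dog", "cat"])
y_test = pd.Series(["dog", "cat"])

# One-hot: fit on the training column only, then reuse the fitted encoder on test.
# handle_unknown="ignore" encodes unseen test categories as an all-zero row.
onehot = OneHotEncoder(handle_unknown="ignore")
train_encoded = onehot.fit_transform(X_train[["color"]]).toarray()
test_encoded = onehot.transform(X_test[["color"]]).toarray()

# Label: the same pattern applied to the target column.
label = LabelEncoder()
y_train_encoded = label.fit_transform(y_train)
y_test_encoded = label.transform(y_test)

# Persist the fitted encoder so the same mapping can be reused at inference time.
encoder_dir = tempfile.mkdtemp()  # stand-in for the encoder_dir argument
encoder_path = os.path.join(encoder_dir, "onehot.pkl")
with open(encoder_path, "wb") as fh:
    pickle.dump(onehot, fh)
```

Fitting on the training split alone avoids leaking information from the test split into the encoding, and saving the fitted encoder keeps the category-to-column mapping stable across runs.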