Preprocessing module
- class cherrypick.preprocessing.Preprocessor(df, duplicate: Literal['drop'] | None = None)
Bases: object
Split dataset into training and testing sets.
- Parameters:
df (pandas.DataFrame) – Input dataset containing features and target variable.
duplicate ({'drop'} or None, default=None) – If 'drop', duplicate rows are removed; otherwise the dataset is left unchanged.
Examples
>>> preprocess = Preprocessor(df=df) ## Initialization of Preprocessor
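As a rough illustration of what duplicate='drop' is described as doing, the sketch below removes duplicate rows with pandas directly. It assumes the option behaves like DataFrame.drop_duplicates; the toy data is hypothetical and this is not the library's actual implementation.

```python
import pandas as pd

# Hypothetical toy data: the first two rows are exact duplicates.
df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# Row-wise de-duplication (assumed equivalent of duplicate='drop').
deduped = df.drop_duplicates().reset_index(drop=True)

print(len(df), len(deduped))
```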
- fill_null(type: Literal['mean', 'median', 'mode'], columns: str) DataFrame
Impute null values in a column of the dataset using one of several imputation strategies.
- Parameters:
columns (str) – Name of the column to impute.
type ({'mean', 'median', 'mode'}) –
Imputation technique to apply.
- mean - Replaces NaN values with the mean of all samples in the feature.
- median - Replaces NaN values with the median of all samples in the feature; robust to outliers.
- mode - Replaces NaN values with the most frequently occurring value in the feature; used for categorical features.
Examples
>>> preprocess.fill_null(type='mean', columns=column) ## Fill nulls with the mean of all values in the specified column
>>> preprocess.fill_null(type='median', columns=column) ## Fill nulls with the median of all values in the specified column
>>> preprocess.fill_null(type='mode', columns=column) ## Fill nulls with the mode of all values in the specified categorical column
>>> print(df.isna().sum()) ## 0 for every column once all nulls have been handled by the imputers
- Returns:
Cleaned dataset with NaN values replaced using the chosen imputation technique (mean, median, or mode).
- Return type:
pd.DataFrame
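The three strategies can be sketched with plain pandas. This is only an approximation of the documented behaviour: the helper name impute and the toy data are hypothetical, and the library's own implementation may differ.

```python
import pandas as pd

# Hypothetical toy data with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [20.0, None, 40.0, 100.0],
    "city": ["Pune", "Pune", None, "Delhi"],
})

def impute(frame, kind, column):
    """Return a copy of frame with NaNs in column replaced by the chosen statistic."""
    if kind == "mean":
        fill = frame[column].mean()
    elif kind == "median":
        fill = frame[column].median()
    elif kind == "mode":
        # mode() can return several values on ties; take the first.
        fill = frame[column].mode().iloc[0]
    else:
        raise ValueError(f"unknown imputation type: {kind!r}")
    out = frame.copy()
    out[column] = out[column].fillna(fill)
    return out

by_median = impute(df, "median", "age")   # the outlier 100.0 does not shift the fill value
by_mode = impute(by_median, "mode", "city")
print(by_mode.isna().sum())
```

In this sample the median fill for age is 40.0, while the mean would be pulled upward by the 100.0 entry, which is the sense in which the median is robust to outliers.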
- collinear(threshold: float, show: Literal[True, False] = False, method: Literal['spearman', 'pearson'] = 'pearson')
Handle multicollinearity by identifying highly correlated features.
This method detects correlated features based on a specified correlation threshold and statistical method.
- Parameters:
threshold (float) – Correlation threshold above which features are considered collinear.
method ({'spearman', 'pearson'}, default='pearson') –
Type of correlation method used:
- spearman - Captures monotonic relationships using rank correlation.
- pearson - Measures linear correlation between features.
show ({True, False}, default=False) – If True, shows the correlation matrix along with the list of collinear features; otherwise only the list of collinear features is returned.
- Returns:
List of correlated features recommended for removal.
- Return type:
list
Notes
Helps reduce multicollinearity in datasets.
Improves model stability and interpretability.
Examples
>>> preprocess.collinear(threshold=0.85, method='pearson', show=True)
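One common way to implement threshold-based collinearity detection is to scan the upper triangle of the absolute correlation matrix. The sketch below follows that approach; the helper name find_collinear, the use of absolute values, and the toy data are assumptions, not the library's confirmed implementation.

```python
import numpy as np
import pandas as pd

def find_collinear(frame, threshold, method="pearson"):
    """List columns whose absolute correlation with an earlier column exceeds threshold."""
    corr = frame.corr(method=method).abs()
    # Keep only the strict upper triangle so each pair is checked once
    # and a column is never compared with itself.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]

# Hypothetical data: x_scaled is an exact multiple of x, noise is weakly related.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "x_scaled": [2, 4, 6, 8, 10],
    "noise": [5, 1, 4, 2, 3],
})

print(find_collinear(df, threshold=0.85))  # ['x_scaled']
```

Only the later column of each correlated pair is reported, so dropping the returned columns keeps one representative of every correlated group.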
- encoder(type: Literal['onehot', 'label'], train_data: tuple, test_data: tuple, column: str, encoder_dir: str)
Encode a categorical feature in the dataset as numeric data using either OneHotEncoder or LabelEncoder.
- Parameters:
column (str) – Name of the categorical column to encode.
type ({'onehot', 'label'}) –
Type of encoder to apply.
- onehot - Uses one-hot encoding on a categorical column.
returns a sparse matrix containing 1s and 0s
- label - Uses LabelEncoder on the target (output) column if it contains non-numeric categorical data.
returns integer class labels, one distinct value per class
train_data (tuple) – Training data, i.e. X_train and y_train.
test_data (tuple) – Testing data, i.e. X_test and y_test.
encoder_dir (str) – Directory in which the fitted encoder is persisted.
Examples
>>> preprocess.encoder(train_data = train, test_data = test, column = 'feature1', type = 'onehot', encoder_dir='encoder_') ## Onehot encodings for Input Features, returns Sparse matrix
>>> preprocess.encoder(train_data = train, test_data = test, column = 'target', type = 'label', encoder_dir='encoder_') ## label encodings for Output/target feature
- Returns:
For OneHotEncoder returns encoded X_train and X_test.
For LabelEncoder returns encoded y_train and y_test.
- Return type:
pd.DataFrame
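The fit-on-train, transform-on-test pattern behind both encoder types can be sketched with scikit-learn directly. This assumes the method wraps sklearn.preprocessing.OneHotEncoder and LabelEncoder; the toy data, the pickle-based persistence, and the file names are hypothetical.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical train/test split.
X_train = pd.DataFrame({"feature1": ["red", "green", "red"]})
X_test = pd.DataFrame({"feature1": ["green", "blue"]})
y_train = pd.Series(["cat", "dog", "cat"])
y_test = pd.Series(["dog", "cat"])

# One-hot: fit on training data only; categories unseen in training
# ("blue" here) are encoded as all zeros instead of raising an error.
onehot = OneHotEncoder(handle_unknown="ignore")
X_train_enc = onehot.fit_transform(X_train[["feature1"]])  # SciPy sparse matrix
X_test_enc = onehot.transform(X_test[["feature1"]])

# Label: map each target class to an integer.
label = LabelEncoder()
y_train_enc = label.fit_transform(y_train)
y_test_enc = label.transform(y_test)

# Persist the fitted encoders so the same mapping can be reused at inference time.
encoder_dir = tempfile.mkdtemp()
for name, enc in (("onehot.pkl", onehot), ("label.pkl", label)):
    with open(os.path.join(encoder_dir, name), "wb") as fh:
        pickle.dump(enc, fh)

print(X_test_enc.toarray())
print(y_test_enc)
```

Note that scikit-learn's LabelEncoder assigns integers from 0 to n_classes - 1, in sorted order of the class names.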