Standardizing Formats
Ensure data follows a consistent format (e.g., dates as YYYY-MM-DD, text in lowercase) so that values can be compared and analyzed reliably.
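As a minimal sketch of this step, the helpers below convert a few date layouts to YYYY-MM-DD and normalize text casing and whitespace. The list of accepted input formats (including the day-first reading of slash dates) and the function names are assumptions made for illustration; adjust them to the formats your data actually contains.

```python
from datetime import datetime

# Hypothetical set of input layouts; day-first is assumed for slash dates.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")


def standardize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


def standardize_text(raw):
    """Trim, collapse repeated whitespace, and lowercase."""
    return " ".join(raw.split()).lower()


print(standardize_date("25/12/2023"))
print(standardize_text("  New   YORK "))
```

Returning None for unparseable dates, rather than guessing, keeps bad values visible so they can be handled in a later step.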
Identifying Outliers
Detect values that deviate significantly from the rest of the data. They may be errors, or rare but valid cases, so inspect them before removing anything.
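One common way to flag such values is the interquartile range (IQR) rule, sketched below. The 1.5 multiplier is a conventional default rather than something this text prescribes, and the function name is hypothetical.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))
```

The function only flags candidates; whether a flagged value is an error or a legitimate extreme still needs a domain check.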
Fixing Inconsistent Labels
Map variants of the same category (e.g., NY, New York, new york) to a single canonical label such as New York, and fix typos and case inconsistencies.
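A small sketch of label alignment, assuming a hand-maintained alias table. The table contents, the function name, and the 0.85 similarity cutoff for typo matching are illustrative assumptions; fuzzy matching can produce false positives, so a cutoff like this should be checked against real data.

```python
import difflib

# Hypothetical alias table: keys are lowercased, whitespace-normalized variants.
CANONICAL_LABELS = {
    "ny": "New York",
    "nyc": "New York",
    "new york": "New York",
}


def normalize_label(raw, aliases=CANONICAL_LABELS, cutoff=0.85):
    """Map a raw label to its canonical form, tolerating case and small typos."""
    key = " ".join(raw.split()).lower()
    if key in aliases:
        return aliases[key]
    # Fall back to the closest known variant, if one is similar enough.
    close = difflib.get_close_matches(key, list(aliases), n=1, cutoff=cutoff)
    if close:
        return aliases[close[0]]
    return raw.strip()  # unknown label: leave it for manual review


print([normalize_label(x) for x in ["NY", "new york", " New  York ", "New Yorkk"]])
```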
Parsing and Splitting Columns
Split full names into first and last names, and extract structured information from unstructured fields (e.g., a postal code from an address).
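The sketch below illustrates both operations under simplifying assumptions: the last whitespace-separated token is treated as the last name, and postal codes are assumed to be US ZIP codes (5 digits, optionally followed by a 4-digit extension). Real names and addresses are messier, so treat these helpers as hypothetical starting points.

```python
import re

# Assumes US-style ZIP codes; other countries need a different pattern.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def split_full_name(full_name):
    """Split into (first, last), treating the final token as the last name."""
    parts = full_name.split()
    if not parts:
        return ("", "")
    if len(parts) == 1:
        return (parts[0], "")
    return (" ".join(parts[:-1]), parts[-1])


def extract_zip(address):
    """Return the last ZIP-like token in the address, or None."""
    matches = ZIP_PATTERN.findall(address)
    # Take the last match so a 5-digit house number is not mistaken for the ZIP.
    return matches[-1] if matches else None


print(split_full_name("Ada Lovelace"))
print(extract_zip("742 Evergreen Terrace, Springfield, IL 62704"))
```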
Removing Duplicate Rows
Remove duplicate entries to avoid skewing analysis or double-counting data.
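A minimal sketch of deduplication over rows stored as dictionaries. Which fields define a duplicate is a judgment call; the field names here are hypothetical, and values are compared after trimming and lowercasing so that formatting differences do not hide duplicates.

```python
def dedupe_rows(rows, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        # Normalize key values so "A@X.COM " and "a@x.com" count as the same.
        key = tuple(str(row.get(f, "")).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"email": "a@x.com", "name": "Ann"},
    {"email": "A@X.COM ", "name": "Ann"},
    {"email": "b@x.com", "name": "Bob"},
]
print(dedupe_rows(rows, key_fields=["email"]))
```

Keeping the first occurrence is one possible policy; keeping the most recent or the most complete record are equally valid choices depending on the data.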
Noise Removal
Remove HTML tags, emojis, punctuation, and stop words from text, and smooth noisy time series.
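The following sketch covers both kinds of noise: a text cleaner and a simple trailing moving average for time series. The stop word list is a tiny hypothetical sample, the tag-stripping regex is a rough approximation rather than a real HTML parser, and the window size is an arbitrary default.

```python
import html
import re

# Hypothetical, deliberately tiny stop word list.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to"}


def clean_text(raw):
    """Strip tags, emojis, punctuation, and stop words; return lowercase text."""
    text = re.sub(r"<[^>]+>", " ", raw)   # drop anything that looks like a tag
    text = html.unescape(text).lower()    # decode entities such as &amp;
    # Keep only letters, digits, and whitespace (removes emojis and punctuation).
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return " ".join(t for t in text.split() if t not in STOP_WORDS)


def moving_average(series, window=3):
    """Trailing moving average; early points use the values available so far."""
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


print(clean_text("<p>The cat is <b>GREAT</b>!!! 😀</p>"))
print(moving_average([1, 2, 9, 2, 1], window=3))
```

Note that dropping every non-alphanumeric character also splits contractions such as "don't" into two tokens, which may or may not be acceptable for a given analysis.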
Imputation
Fill in missing values so you can work with a complete dataset.
- Mean/Median/Mode Imputation: Replace missing values with the average, median, or most frequent value.
- Constant Imputation: Use a fixed value like 0 or “unknown” to fill missing entries.
- KNN Imputation: Estimate missing values using similar records based on feature proximity.
- Regression Imputation: Predict missing values using a regression model trained on other features.
- MICE (Multiple Imputation by Chained Equations): Iteratively models each feature with missing values using the others.
- Forward/Backward Fill: Propagate the last known value forward or the next known value backward; commonly used in time series.
- Interpolation: Fill gaps by estimating values between known points (linear, spline, etc.).
- Kalman Filters / ARIMA: Model-based time series imputation, where a fitted state-space or autoregressive model estimates the missing points.
- MissForest: Random Forest-based imputation capturing nonlinear relationships.
- RNN / Autoencoder Imputation: Neural network-based methods for structured or sequential data.
- Indicator + Imputation: Add a new variable that flags which values were missing before imputing, so the fact of missingness is preserved as a signal.
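A few of the simpler methods listed above can be sketched with the standard library alone. In this illustration, missing values are represented as None, and the function names are hypothetical. The model-based methods (KNN, MICE, MissForest, and so on) need dedicated libraries and are not shown.

```python
import statistics


def impute_with_indicator(values, strategy="median"):
    """Mean/median/mode imputation that also returns a missingness flag."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    was_missing = [v is None for v in values]
    filled = [fill if v is None else v for v in values]
    return filled, was_missing


def forward_fill(values):
    """Carry the last known value forward; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def interpolate_linear(values):
    """Fill gaps between known points on a straight line; edge gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


print(impute_with_indicator([1, None, 3, 100]))
print(forward_fill([None, 5, None, None, 8]))
print(interpolate_linear([10, None, None, 40]))
```

The median is used as the default because it is less sensitive to extreme values than the mean; forward fill and interpolation assume the values are ordered, as in a time series.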