Standardizing Formats

Ensure data follows a consistent format (e.g., dates as YYYY-MM-DD, text in lowercase) so that values are easier to analyze and compare.
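As a minimal sketch of format standardization, the functions below convert dates to YYYY-MM-DD and lowercase free text. The function names and the list of accepted input formats are assumptions for illustration. The day-first reading of slash-separated dates in particular is a choice you would need to confirm against your own data.

```python
from datetime import datetime

# Assumed input formats; extend or reorder this tuple to match your data.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")


def standardize_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next candidate format
    return None


def standardize_text(raw):
    """Trim, collapse runs of whitespace, and lowercase."""
    return " ".join(raw.split()).lower()
```

Returning None for unparseable dates, rather than guessing, keeps bad values visible so they can be handled in the missing-value step.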

Identifying Outliers

Detect values that deviate significantly from the rest of the data — they might be errors or rare but valid cases.
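One common way to flag such values is the interquartile range (IQR) rule, sketched below. The function name and the multiplier k = 1.5 are conventional choices, not something this text prescribes. Flagged values are candidates for review, not automatic deletions.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))  # [95]
```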

Normalizing Categories

Map variant spellings of the same category to one canonical value (e.g., NY, New York, new york → New York), and fix typos and case inconsistencies.
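Category alignment can be sketched with an alias table plus fuzzy matching for typos. The alias table, the function name, and the 0.8 similarity cutoff below are illustrative assumptions. In practice the table would come from reviewing the distinct values in your data.

```python
import difflib

# Hypothetical alias table: lowercase variant -> canonical label.
ALIASES = {"ny": "New York", "new york": "New York"}


def normalize_category(raw, cutoff=0.8):
    """Map a raw label to its canonical form; return it unchanged if unknown."""
    key = " ".join(raw.split()).lower()
    if key in ALIASES:
        return ALIASES[key]
    # Fall back to the closest known alias to catch simple typos.
    close = difflib.get_close_matches(key, list(ALIASES), n=1, cutoff=cutoff)
    return ALIASES[close[0]] if close else raw


print(normalize_category("NY"))        # New York
print(normalize_category("New Yrok"))  # New York
print(normalize_category("Boston"))    # Boston
```

Fuzzy matching can merge labels that are actually distinct, so a high cutoff and a manual review of the resulting mappings are reasonable precautions.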

Parsing and Extracting Fields

Split compound fields into their parts (e.g., full names into first and last) and extract structured information from unstructured fields (e.g., a postal code from an address).
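Both operations can be sketched in a few lines. The helper names are hypothetical, and the sketch rests on two loud assumptions: the last whitespace-separated token of a name is the last name, and postal codes are US ZIP codes (five digits, optionally followed by a four-digit extension). Neither holds universally.

```python
import re

# Assumes US ZIP codes: 5 digits, optionally followed by "-" and 4 digits.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def split_full_name(full_name):
    """Split into (first, last), treating the final token as the last name."""
    parts = full_name.split()
    if not parts:
        return ("", "")
    if len(parts) == 1:
        return (parts[0], "")
    return (" ".join(parts[:-1]), parts[-1])


def extract_zip(address):
    """Return the last ZIP-like token in the address, or None if absent."""
    matches = ZIP_PATTERN.findall(address)
    # Take the last match, since a street number can also have five digits.
    return matches[-1] if matches else None


print(split_full_name("Mary Ann Evans"))                    # ('Mary Ann', 'Evans')
print(extract_zip("12345 Main St, Springfield, IL 62704"))  # 62704
```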

Removing Duplicates

Remove duplicate entries to avoid skewing analysis or double-counting data.
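A minimal deduplication sketch follows, assuming records are dictionaries and that two records are duplicates when a chosen set of key fields match after trimming and lowercasing. The function name and the sample field names are hypothetical.

```python
def drop_duplicates(records, key_fields):
    """Keep the first record seen for each normalized key; preserve order."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(str(record.get(f, "")).strip().lower() for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


customers = [
    {"email": "ada@example.com", "name": "Ada"},
    {"email": " ADA@example.com", "name": "Ada L."},
    {"email": "bob@example.com", "name": "Bob"},
]
print(len(drop_duplicates(customers, ["email"])))  # 2
```

Keeping the first occurrence is an arbitrary policy. Depending on the data, keeping the most recent or most complete record may be more appropriate.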

Removing Noise

Strip HTML, emojis, punctuation, and stop words from text, and smooth noisy time series.
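The sketch below illustrates both kinds of noise removal. It makes several simplifying assumptions: HTML tags are stripped with a regular expression rather than a real parser, every character that is neither alphanumeric nor whitespace (which covers emojis and punctuation) is dropped, the stop-word set is a tiny hypothetical sample, and smoothing is a centered moving average.

```python
import html
import re

# Tiny illustrative stop-word set; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}
TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_text(raw):
    """Return lowercase tokens with tags, symbols, and stop words removed."""
    text = html.unescape(TAG_PATTERN.sub(" ", raw))
    # Replace anything that is not a letter, digit, or whitespace with a space.
    text = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text)
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def moving_average(series, window=3):
    """Smooth with a centered window that shrinks at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        chunk = series[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


print(clean_text("<p>The cat &amp; the dog 😀!</p>"))  # ['cat', 'dog']
```

Dropping all non-alphanumeric characters also removes symbols that may carry meaning, such as currency signs, so the filter should be adapted to the analysis at hand.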

Handling Missing Values

Fill in missing values so you can work with a complete dataset. Common methods include:

Mean/Median/Mode Imputation: Replace missing values with the average, median, or most frequent value.
Constant Imputation: Use a fixed value like 0 or “unknown” to fill missing entries.
KNN Imputation: Estimate missing values using similar records based on feature proximity.
Regression Imputation: Predict missing values using a regression model trained on other features.
MICE (Multiple Imputation by Chained Equations): Iteratively models each feature with missing values using the others.
Forward/Backward Fill: Propagate the last known value forward or the next known value backward; commonly used in time series.
Interpolation: Fill gaps by estimating values between known points (linear, spline, etc.).
Kalman Filters / ARIMA: Advanced statistical models for time series imputation.
MissForest: Random Forest-based imputation capturing nonlinear relationships.
RNN / Autoencoder Imputation: Neural network-based methods for structured or sequential data.
Indicator + Imputation: Add a binary variable flagging which values were missing before imputing, so the fact of missingness is preserved as a signal.
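The simpler methods in the list above can be sketched in plain Python, with None standing in for a missing value. All function names are illustrative assumptions. The model-based methods (KNN, regression, MICE, MissForest, neural approaches) are not shown here, since they typically rely on a dedicated library.

```python
import statistics


def impute_statistic(values, stat=statistics.median):
    """Replace None with a statistic (mean, median, or mode) of observed values."""
    observed = [v for v in values if v is not None]
    fill = stat(observed)
    return [fill if v is None else v for v in values]


def forward_fill(values):
    """Propagate the last known value; leading gaps stay None."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def interpolate_linear(values):
    """Fill gaps between known points on a straight line; edge gaps stay None."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def with_indicator(values, imputer):
    """Return (imputed values, missingness flags) using the given imputer."""
    flags = [v is None for v in values]
    return imputer(values), flags


print(impute_statistic([1, None, 3, 100]))         # [1, 3, 3, 100]
print(forward_fill([None, 1, None, None, 4]))      # [None, 1, 1, 1, 4]
print(interpolate_linear([1, None, None, 4]))      # [1, 2.0, 3.0, 4]
print(with_indicator([1, None, 3], forward_fill))  # ([1, 1, 3], [False, True, False])
```

The first call uses the median, which the extreme value 100 does not distort. Passing stat=statistics.mean or stat=statistics.mode switches to the other variants.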