Standardizing Formats

Ensure each field follows one consistent format (e.g., dates as YYYY-MM-DD, text in lowercase) so that values can be compared, joined, and aggregated reliably.
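
As a minimal sketch of this step, assuming pandas and a small made-up table (the column names `signup_date` and `city` are illustrative, not from any real dataset), dates can be parsed with an explicit input format and re-rendered as YYYY-MM-DD, while text is trimmed and lowercased:

```python
import pandas as pd

# Hypothetical sample data; column names are illustrative only.
df = pd.DataFrame({
    "signup_date": ["03/15/2023", "12/01/2022", "07/04/2021"],
    "city": ["  Boston", "CHICAGO ", "denver"],
})

# Parse dates with an explicit input format, then render them as YYYY-MM-DD.
parsed = pd.to_datetime(df["signup_date"], format="%m/%d/%Y")
df["signup_date"] = parsed.dt.strftime("%Y-%m-%d")

# Trim surrounding whitespace and lowercase the text.
df["city"] = df["city"].str.strip().str.lower()

print(df)
```

Stating the input format explicitly avoids ambiguity between day-first and month-first dates; if the source mixes formats, each format likely needs its own parsing pass.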

Identifying Outliers

Detect values that deviate significantly from the rest of the data. They might be errors or rare but valid cases, so inspect them before deciding whether to remove them.
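
One common way to flag such values is the interquartile range (IQR) rule. The sketch below assumes pandas and a made-up series; the 1.5 multiplier is a widely used convention, not a requirement:

```python
import pandas as pd

# Hypothetical measurements with one suspicious value.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Compute the interquartile range.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Flag anything more than 1.5 * IQR outside the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

outliers = values[is_outlier]
print(outliers.tolist())  # [95]
```

The rule only flags candidates. Whether a flagged value is an error or a valid rare case still needs a domain check.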

Fixing Inconsistent Labels

Map variant spellings of the same category to one canonical label (e.g., NY, New York, new york → New York), and fix typos and case inconsistencies.
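
A minimal sketch of label alignment, assuming pandas and a hand-written lookup table (the `canonical` mapping is hypothetical and would normally be built from the distinct values actually present in the data):

```python
import pandas as pd

# Hypothetical raw labels with case and whitespace variants.
states = pd.Series(["NY", "New York", "new york", " new york ", "Boston"])

# Hypothetical mapping from normalized variants to the canonical label.
canonical = {"ny": "New York", "new york": "New York"}

# Normalize case and whitespace so that one mapping key covers many variants.
trimmed = states.str.strip()
normalized = trimmed.str.lower()

# Map known variants; keep the trimmed original when no mapping exists.
cleaned = normalized.map(canonical).fillna(trimmed)

print(cleaned.tolist())
```

Normalizing before mapping keeps the lookup table small. Genuine typos (for example a misspelled city name) would still need explicit entries or a fuzzy-matching step.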

Parsing and Splitting Columns

Split combined fields into separate columns (e.g., full names into first and last) and extract structured information from unstructured fields (e.g., a postal code from an address).
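
The sketch below illustrates both operations with pandas string methods. The sample rows are invented, and the regular expression assumes US-style ZIP codes at the end of the address; other postal formats would need a different pattern:

```python
import pandas as pd

# Hypothetical records; names and addresses are made up.
df = pd.DataFrame({
    "full_name": ["Ada Lovelace", "Alan Turing"],
    "address": [
        "12 Main St, Springfield, IL 62704",
        "PO Box 9, Albany, NY 12207-1234",
    ],
})

# Split on the first space only, so multi-word last names stay together.
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Extract a 5-digit ZIP (optionally ZIP+4) from the end of the address.
df["postal_code"] = df["address"].str.extract(
    r"(\d{5}(?:-\d{4})?)\s*$", expand=False
)

print(df[["first_name", "last_name", "postal_code"]])
```

Splitting on the first space is a simplification: names with middle names, prefixes, or a family-name-first order will not split correctly this way.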

Deduplicating Rows

Remove duplicate entries to avoid skewing analysis or double-counting data.
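
A short sketch with pandas, using an invented `orders` table, shows the difference between dropping exact duplicate rows and dropping rows that share a key column:

```python
import pandas as pd

# Hypothetical orders table in which order 2 was recorded twice.
orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "customer": ["ann", "bob", "bob", "ann"],
    "amount": [20.0, 35.0, 35.0, 20.0],
})

# Drop rows that are identical in every column.
deduped = orders.drop_duplicates()

# Alternatively, treat rows as duplicates when a key column repeats.
deduped_by_key = orders.drop_duplicates(subset=["order_id"], keep="first")

n_removed = len(orders) - len(deduped)
print(n_removed)                # 1
print(orders["amount"].sum())   # 110.0 (double-counted)
print(deduped["amount"].sum())  # 75.0
```

Which columns define a duplicate is a judgment call. Here orders 1 and 3 have the same customer and amount but are different orders, so deduplicating on those two columns alone would wrongly remove one of them.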

Noise Removal

Remove HTML tags, emojis, punctuation, and stop words from text, and smooth noisy time series.
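
As a rough sketch under stated assumptions, the text cleaner below uses regular expressions and a tiny hand-written stop-word list (`STOP_WORDS` is hypothetical; real projects usually take a list from an NLP library), and the time series is smoothed with a centered rolling median from pandas. The cleaner assumes English ASCII text, since it discards every character outside a-z and 0-9:

```python
import re

import pandas as pd

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "is", "and"}


def clean_text(text: str) -> str:
    """Strip HTML tags, punctuation, emojis, and stop words from text."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # drop punctuation and emojis
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)


cleaned = clean_text("<p>The service is GREAT!!! 😀 and fast.</p>")
print(cleaned)  # service great fast

# Smooth a noisy series with a centered rolling median (window of 3).
signal = pd.Series([10.0, 12.0, 30.0, 11.0, 13.0])
smoothed = signal.rolling(window=3, center=True, min_periods=1).median()
print(smoothed.tolist())  # [11.0, 12.0, 12.0, 13.0, 12.0]
```

A rolling median is used here because it resists single spikes better than a rolling mean; the window size trades smoothness against responsiveness and should be tuned to the data.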

Imputation

Fill in missing values so you can work with a complete dataset.

  • Mean/Median/Mode Imputation: Replace missing values with the average, median, or most frequent value.
  • Constant Imputation: Use a fixed value like 0 or “unknown” to fill missing entries.
  • KNN Imputation: Estimate missing values using similar records based on feature proximity.
  • Regression Imputation: Predict missing values using a regression model trained on other features.
  • MICE (Multiple Imputation by Chained Equations): Iteratively models each feature with missing values using the others.
  • Forward/Backward Fill: Propagate the last known value forward or the next known value backward; commonly used in time series.
  • Interpolation: Fill gaps by estimating values between known points (linear, spline, etc.).
  • Kalman Filters / ARIMA: Advanced statistical models for time series imputation.
  • MissForest: Random Forest-based imputation capturing nonlinear relationships.
  • RNN / Autoencoder Imputation: Neural network-based methods for structured or sequential data.
  • Indicator + Imputation: Add a binary variable that flags which values were missing before imputing, so the fact of missingness remains available as a signal.
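
Several of the simpler methods in the list above can be sketched with pandas alone. The table below is invented, and the choice of method per column is only illustrative: median imputation with a missingness indicator for `age`, constant imputation for `segment`, and forward fill versus linear interpolation for the time-ordered `temp` column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; rows are assumed to be in time order.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "segment": ["a", None, "b", "a"],
    "temp": [20.0, np.nan, np.nan, 26.0],
})

# Indicator + imputation: record missingness first, then fill with the median.
df["age_missing"] = df["age"].isna()
df["age"] = df["age"].fillna(df["age"].median())

# Constant imputation for a categorical column.
df["segment"] = df["segment"].fillna("unknown")

# Forward fill repeats the last known value.
temp_ffill = df["temp"].ffill()

# Linear interpolation estimates values between the known points.
temp_linear = df["temp"].interpolate(method="linear")

print(df[["age", "age_missing", "segment"]])
print(temp_ffill.tolist())   # [20.0, 20.0, 20.0, 26.0]
print(temp_linear.tolist())  # [20.0, 22.0, 24.0, 26.0]
```

The model-based methods need more machinery. For instance, scikit-learn provides `KNNImputer` and an experimental `IterativeImputer` (a MICE-style approach) in `sklearn.impute`; they are not shown here to keep the sketch dependency-light.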