When documents are described in the space of words, meaning that the words are the features of the document, we need a feature engineering mechanism to represent each document numerically.

The most common representations are:

  • Set of words (we lose both the order and the multiplicity of words);
  • Bag of Words (BoW) (we keep the multiplicity but lose the order);
  • Bag of N-grams (a generalization of bag-of-words that counts sequences of N consecutive words, so some local order is kept; N = 1 gives BoW);
  • Word2Vec, a more advanced representation that learns dense word vectors with a neural language model.
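
As a rough illustration of the first three representations, here is a minimal sketch in plain Python. The helper names (`tokenize`, `set_of_words`, `bag_of_words`, `bag_of_ngrams`) and the whitespace tokenizer are assumptions made for this example, not part of any library. Word2Vec is left out because it requires a trained model.

```python
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()


def set_of_words(tokens):
    # Keeps only which words occur: order and multiplicity are lost.
    return set(tokens)


def bag_of_words(tokens):
    # Keeps how many times each word occurs: order is lost.
    return Counter(tokens)


def bag_of_ngrams(tokens, n):
    # Counts tuples of n consecutive tokens; n = 1 reduces to bag-of-words
    # (with each word wrapped in a 1-tuple).
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)


tokens = tokenize("the cat sat on the mat")
print(set_of_words(tokens))
print(bag_of_words(tokens))
print(bag_of_ngrams(tokens, 2))
```

In this example the set of words records "the" once, the bag of words counts it twice, and the bag of bigrams distinguishes ("the", "cat") from ("the", "mat"), which is the local order that plain BoW discards.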
