When we work with documents in the space of words, meaning that words are the features of the document, we need a feature engineering mechanism to represent each document numerically.
The most common representations are:
- Set of words (we lose both the order and the multiplicity);
- Bag of Words (BoW) (we keep the multiplicity but lose the order);
- Bag of N-grams (a generalization of bag-of-words that counts sequences of N consecutive words, so some local order is kept; N = 1 gives plain BoW);
- Word2Vec, a more advanced representation that is learned with neural language models.
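
As a rough illustration of the first three representations, here is a minimal sketch using only the Python standard library. The helper names (`tokenize`, `set_of_words`, `bag_of_words`, `bag_of_ngrams`) and the naive lowercase, whitespace-based tokenizer are assumptions made for this example, not part of any particular library. Word2Vec is left out because it requires training a model on a corpus.

```python
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): lowercase, split on whitespace.
    return text.lower().split()


def set_of_words(tokens):
    # Keeps only which words occur: order and multiplicity are lost.
    return set(tokens)


def bag_of_words(tokens):
    # Keeps how many times each word occurs: order is lost.
    return Counter(tokens)


def bag_of_ngrams(tokens, n):
    # Counts every run of n consecutive tokens; n = 1 is bag-of-words
    # with each word wrapped in a 1-tuple.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


doc = "the cat sat on the mat"
tokens = tokenize(doc)

print(set_of_words(tokens))
print(bag_of_words(tokens))
print(bag_of_ngrams(tokens, 2))
```

In this example "the" appears twice, so the set of words has five entries while the bag of words records a count of 2 for "the". The bigram counts keep pairs such as `("the", "cat")` and `("the", "mat")` distinct, which is the local ordering information that plain BoW discards.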