Some words are frequent across all the documents in the collection, and for this reason should be weighted less; others are frequent only inside a certain document, and so should be considered more important. This weighting scheme implements this idea by multiplying the frequency of a word by an IDF factor:

$$\mathrm{idf}(w) = \log\frac{N}{1 + n_w}$$

where $N$ is the total number of documents and $n_w$ is the number of documents containing the word $w$. We can see why this factor is effective: if the word appears in most of the documents, then $n_w \approx N$ and so $\mathrm{idf}(w) \approx 0$. If the word appears in fewer documents, the factor is higher.
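As a minimal sketch of this factor, the snippet below assumes it has the form $\log(N / (1 + n_w))$ with a natural logarithm, where $N$ is the number of documents and $n_w$ is the number of documents containing the word. The function name `idf_factors` and the toy documents are illustrative assumptions, not part of the original scheme.

```python
import math

def idf_factors(documents):
    """Return {word: log(N / (1 + n_w))} for a list of tokenized documents.

    Hypothetical helper: N is the number of documents and n_w is the
    number of documents that contain the word at least once.
    """
    n_docs = len(documents)
    doc_counts = {}
    for doc in documents:
        for word in set(doc):  # count each word at most once per document
            doc_counts[word] = doc_counts.get(word, 0) + 1
    return {w: math.log(n_docs / (1 + n_w)) for w, n_w in doc_counts.items()}

# Toy collection (made up for illustration)
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "ran"],
    ["the", "cat", "ran"],
    ["the", "bird", "flew"],
]
idf = idf_factors(docs)

# "the" is in every document, "cat" in two, "bird" in one,
# so the factor grows as the word becomes rarer.
for word in ("the", "cat", "bird"):
    print(word, round(idf[word], 3))
```

With the added $1$, a word that appears in every document gets a factor slightly below zero rather than exactly zero, which is consistent with the approximation $\mathrm{idf}(w) \approx 0$.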
Note
The $1$ in the denominator is needed to avoid division by $0$. This cannot happen if the vocabulary is used with the documents from which it was built, but sometimes we build a vocabulary $V$ from one collection of documents and then reuse the same $V$ to create the embeddings for another collection. In that case there may be words in $V$ that do not appear in any document of the new collection, so for those words the document count is $n_w = 0$.
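The situation described in this note can be sketched as follows, again assuming a factor of the form $\log(N / (1 + n_w))$. The vocabulary, the new documents, and the helper name `idf_for_vocabulary` are all hypothetical and chosen only to show a vocabulary word with a document count of zero.

```python
import math

def idf_for_vocabulary(vocabulary, documents):
    """Compute log(N / (1 + n_w)) for every word of a fixed vocabulary.

    Hypothetical helper: the vocabulary may come from a different
    collection, so n_w can be 0 for some of its words.
    """
    n_docs = len(documents)
    doc_sets = [set(doc) for doc in documents]
    factors = {}
    for word in vocabulary:
        n_w = sum(word in doc_set for doc_set in doc_sets)
        # Without the +1, n_w == 0 would raise ZeroDivisionError here.
        factors[word] = math.log(n_docs / (1 + n_w))
    return factors

# Vocabulary built elsewhere; "bird" never occurs in the new collection.
vocabulary = ["cat", "dog", "bird"]
new_docs = [["cat", "sleeps"], ["cat", "eats"], ["dog", "barks"]]

factors = idf_for_vocabulary(vocabulary, new_docs)
print(factors)
```

Here the unseen word still receives a finite factor, and it is the largest of the three because no document in the new collection contains it.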