TF-IDF

Some words are frequent across all the documents in the collection and should therefore be weighted less; others are frequent only within a particular document and should be weighted more. This weighting scheme implements the idea by multiplying the term frequency by an $idf$ (inverse document frequency) factor:

$idf(w_{j}) = \log\left(\frac{N}{1 + n_{j}}\right)$

Here $N$ is the total number of documents in the collection and $n_{j}$ is the number of documents containing the word $w_{j}$. We can see why this factor is effective: if the word appears in most of the documents, then $n_{j} \approx N$, the ratio is close to $1$, and the factor is close to $\log(1) = 0$. If the word appears in fewer documents, the factor is higher.
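As a rough illustration of the formula, here is a minimal Python sketch that computes the factor on a toy collection. The function name `idf`, the sample documents, and the whitespace tokenization are assumptions made for this example only.

```python
import math

def idf(docs, vocabulary):
    """Return {word: log(N / (1 + n_j))} for every word in the vocabulary."""
    n_docs = len(docs)                      # N: total number of documents
    doc_sets = [set(doc) for doc in docs]   # presence matters, not repetitions
    return {
        w: math.log(n_docs / (1 + sum(w in s for s in doc_sets)))
        for w in vocabulary
    }

# Toy collection (hypothetical data), tokenized by whitespace.
docs = [
    "the cat sat".split(),
    "the dog barked".split(),
    "the bird sang".split(),
    "the cat slept".split(),
]
vocabulary = sorted({w for doc in docs for w in doc})
weights = idf(docs, vocabulary)

# "the" is in all 4 documents, "cat" in 2, "dog" in 1,
# so the weights increase in that order.
for word in ("the", "cat", "dog"):
    print(word, round(weights[word], 3))
```

Because of the $1$ in the denominator, a word that appears in every document gets $\log\left(\frac{N}{N + 1}\right)$, which is slightly below zero rather than exactly zero.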

Note

The $1$ in the denominator is needed to avoid division by $0$. This cannot happen if the vocabulary $V$ is used on the documents from which it was built, but sometimes we build $V$ from one collection of documents and then use the same $V$ to create the embedding for another collection. In that case there may be words in $V$ that don't appear in any document of the new collection, and so $n_{j} = 0$.
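The situation described in this note can be reproduced with a small sketch. The two collections and the helper `doc_count` are hypothetical, chosen so that some vocabulary words never occur in the second collection.

```python
import math

# Vocabulary V is built from one collection (hypothetical data)...
train_docs = [["cat", "sat"], ["dog", "barked"]]
vocabulary = sorted({w for doc in train_docs for w in doc})

# ...and then reused on a different collection.
new_docs = [["bird", "sang"], ["dog", "slept"], ["bird", "slept"]]
n_new = len(new_docs)

def doc_count(word, docs):
    """Number of documents in `docs` that contain `word` (n_j)."""
    return sum(word in doc for doc in docs)

counts = {w: doc_count(w, new_docs) for w in vocabulary}

# With the 1 in the denominator, every word in V gets a finite weight.
smoothed = {w: math.log(n_new / (1 + counts[w])) for w in vocabulary}

# Without it, a word with n_j = 0 triggers a division by zero.
try:
    math.log(n_new / counts["cat"])
    unsmoothed_failed = False
except ZeroDivisionError:
    unsmoothed_failed = True

print(counts)
print(unsmoothed_failed)
```

Here "cat" is in $V$ but in none of the new documents, so its smoothed weight is $\log(N)$ for the new collection, the largest value the factor can take.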
