In [1]:
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
import re
In [2]:
normalizer = WordNetLemmatizer()

treebank_wordnet_pos = {
    'J': 'a', # adjective
    'V': 'v', # verb
    'N': 'n', # noun
    'R': 'r', # adverb
}
def get_wordnet_pos(treebank_pos, default='n'):
    return treebank_wordnet_pos.get(treebank_pos[0], default)

def preprocess_text(txt):
    txt    = re.sub(r'\W+', ' ', txt).lower()
    tokens = word_tokenize(txt)

    return [
        normalizer.lemmatize(token[0], get_wordnet_pos(token[1]))
        for token in pos_tag(tokens)
    ]
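`pos_tag` returns Penn Treebank tags, while the lemmatizer expects WordNet part-of-speech codes. As a small illustration that needs no NLTK data, the mapping above can be exercised on a few tags. The tag list is hand-picked for this example and is not taken from the corpus:

```python
# Same mapping as above: first letter of the Treebank tag -> WordNet POS code
treebank_wordnet_pos = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

def get_wordnet_pos(treebank_pos, default='n'):
    return treebank_wordnet_pos.get(treebank_pos[0], default)

# Hand-picked Penn Treebank tags, for illustration only
sample_tags = ['JJ', 'VBZ', 'NNS', 'RB', 'DT', 'PRP$']
mapped = {tag: get_wordnet_pos(tag) for tag in sample_tags}
print(mapped)
# {'JJ': 'a', 'VBZ': 'v', 'NNS': 'n', 'RB': 'r', 'DT': 'n', 'PRP$': 'n'}
```

Tags outside the four mapped families, such as determiners and pronouns, fall back to the default noun code.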
In [3]:
corpus = [
    "This is a sample sentence!",
    "This is my second sentence.",
    "Is this my third sentence?"
]
corpus_cleaned = [" ".join(preprocess_text(doc)) for doc in corpus]
corpus_cleaned
Out[3]:
['this be a sample sentence',
 'this be my second sentence',
 'be this my third sentence']

From scratch

In [4]:
import pandas as pd
import numpy as np

Term frequencies

Term frequency indicates how often each word appears in the document. The intuition for including term frequency in the tf-idf calculation is that the more frequently a word appears in a single document, the more important that term is to the document.

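tf(t, d) = count of t in d / number of words in d

Note that the cells below follow scikit-learn and use the raw count of t in d as the term frequency, without dividing by the document length.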
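A minimal sketch of the length-normalized definition, assuming plain whitespace tokenization. The helper name `term_frequency` is made up for this example:

```python
from collections import Counter

def term_frequency(doc):
    """Relative frequency of each whitespace-separated token in doc."""
    tokens = doc.split()
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

# "rap" appears twice among six tokens, every other token once
tf = term_frequency("rap rap at my chamber door")
print(tf["rap"], tf["door"])
```

Here "rap" gets a term frequency of 2/6 and "door" gets 1/6.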
In [5]:
from sklearn.feature_extraction.text import CountVectorizer
In [6]:
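# CountVectorizer's default token pattern keeps only tokens of two or more
# characters, so the one-letter word "a" is dropped from the vocabulary
vectorizer       = CountVectorizer()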
term_frequencies = vectorizer.fit_transform(corpus_cleaned).toarray()
term_frequencies
Out[6]:
array([[1, 0, 1, 0, 1, 0, 1],
       [1, 1, 0, 1, 1, 0, 1],
       [1, 1, 0, 0, 1, 1, 1]])
In [7]:
# Visualize term_frequencies
df_tf = pd.DataFrame(
    term_frequencies.T,
    index=vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
    columns=corpus_cleaned
)
df_tf
Out[7]:
          this be a sample sentence  this be my second sentence  be this my third sentence
be                                1                           1                          1
my                                0                           1                          1
sample                            1                           0                          0
second                            0                           1                          0
sentence                          1                           1                          1
third                             0                           0                          1
this                              1                           1                          1

Inverse document frequencies

The inverse document frequency component of the tf-idf score penalizes terms that appear more frequently across a corpus. The intuition is that words that appear more frequently in the corpus give less insight into the topic or meaning of an individual document, and should thus be deprioritized.

We can calculate the inverse document frequency for some term t across a corpus using

idf(t) = log(n / df(t)) + 1

where n is the number of documents in the corpus and df(t) is the number of documents that contain t.

Smoothing idf: a term with df(t) = 0 would cause a division by zero, so the constant 1 is added to both the numerator and the denominator, as if the corpus contained one extra document in which every term appears exactly once:

smoothed_idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1

The important takeaway from the equation is that as the number of documents containing the term t increases, the inverse document frequency decreases: the ratio inside the logarithm shrinks, and the logarithm is an increasing function.
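As a quick numeric check of both formulas, the sketch below evaluates them for a three-document corpus using the natural logarithm. The function names `idf` and `smoothed_idf` are chosen for this example only:

```python
import math

def idf(df, n):
    """Unsmoothed inverse document frequency."""
    return math.log(n / df) + 1

def smoothed_idf(df, n):
    """Inverse document frequency with add-one smoothing."""
    return math.log((1 + n) / (1 + df)) + 1

n = 3  # number of documents in the corpus
for df in (1, 2, 3):
    print(df, round(idf(df, n), 6), round(smoothed_idf(df, n), 6))
# 1 2.098612 1.693147
# 2 1.405465 1.287682
# 3 1.0 1.0
```

A term that occurs in every document gets the minimum weight of 1, and the weight grows as the term becomes rarer.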

In [8]:
n_samples, n_features = term_frequencies.shape
# Count the documents that contain each term (not the total number of occurrences)
doc_frequency         = np.count_nonzero(term_frequencies, axis=0)

inverse_doc_frequency = np.log(
    (1 + n_samples) / (1 + doc_frequency)
) + 1

inverse_doc_frequency
Out[8]:
array([1.        , 1.28768207, 1.69314718, 1.69314718, 1.        ,
       1.69314718, 1.        ])
In [9]:
# Inverse Document Frequency using sklearn
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer(norm=None, smooth_idf=True)
transformer.fit(term_frequencies)
inverse_doc_frequency = transformer.idf_
inverse_doc_frequency
Out[9]:
array([1.        , 1.28768207, 1.69314718, 1.69314718, 1.        ,
       1.69314718, 1.        ])
In [10]:
# Visualize inverse_doc_frequency
df_itf = pd.DataFrame(
    inverse_doc_frequency,
    index=vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
    columns=['idf'])
df_itf
Out[10]:
idf
be 1.000000
my 1.287682
sample 1.693147
second 1.693147
sentence 1.000000
third 1.693147
this 1.000000

TF-IDF

tf-idf(t, d) = tf(t, d) * idf(t)
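The whole calculation can also be sketched with the standard library alone. This is an illustrative re-implementation, assuming raw counts as term frequency, the smoothed idf, and a tokenizer that skips one-character tokens to mirror `CountVectorizer`'s default behaviour:

```python
import math
from collections import Counter

docs = [
    "this be a sample sentence",
    "this be my second sentence",
    "be this my third sentence",
]
# Mirror CountVectorizer's default: ignore single-character tokens such as "a"
tokenized = [[tok for tok in doc.split() if len(tok) > 1] for doc in docs]

n = len(tokenized)
# Document frequency: number of documents containing each term
df = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

# tf-idf with raw counts as term frequency
tfidf = [
    {term: count * idf[term] for term, count in Counter(tokens).items()}
    for tokens in tokenized
]
print({term: round(score, 6) for term, score in sorted(tfidf[1].items())})
# {'be': 1.0, 'my': 1.287682, 'second': 1.693147, 'sentence': 1.0, 'this': 1.0}
```

The printed scores are those of the second document.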
In [11]:
df_tf * df_itf.values
Out[11]:
          this be a sample sentence  this be my second sentence  be this my third sentence
be                         1.000000                    1.000000                   1.000000
my                         0.000000                    1.287682                   1.287682
sample                     1.693147                    0.000000                   0.000000
second                     0.000000                    1.693147                   0.000000
sentence                   1.000000                    1.000000                   1.000000
third                      0.000000                    0.000000                   1.693147
this                       1.000000                    1.000000                   1.000000

TF-IDF using sklearn

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
In [13]:
vectorizer   = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(corpus_cleaned).toarray()
tfidf_scores
Out[13]:
array([[1.        , 0.        , 1.69314718, 0.        , 1.        ,
        0.        , 1.        ],
       [1.        , 1.28768207, 0.        , 1.69314718, 1.        ,
        0.        , 1.        ],
       [1.        , 1.28768207, 0.        , 0.        , 1.        ,
        1.69314718, 1.        ]])
In [14]:
pd.DataFrame(
    tfidf_scores.T,
    index=vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
    columns=corpus_cleaned
)
Out[14]:
          this be a sample sentence  this be my second sentence  be this my third sentence
be                         1.000000                    1.000000                   1.000000
my                         0.000000                    1.287682                   1.287682
sample                     1.693147                    0.000000                   0.000000
second                     0.000000                    1.693147                   0.000000
sentence                   1.000000                    1.000000                   1.000000
third                      0.000000                    0.000000                   1.693147
this                       1.000000                    1.000000                   1.000000

Search algorithm

In [15]:
the_raven = '''
Once upon a midnight dreary, while I pondered, weak and weary,
 Over many a quaint and curious volume of forgotten lore,
 While I nodded, nearly napping, suddenly there came a tapping,
 As of some one gently rapping, rapping at my chamber door.
 “‘Tis some visiter,” I muttered, “tapping at my chamber door—
                          Only this, and nothing more.”
 Ah, distinctly I remember it was in the bleak December,
 And each separate dying ember wrought its ghost upon the floor.
 Eagerly I wished the morrow;—vainly I had sought to borrow
 From my books surcease of sorrow—sorrow for the lost Lenore—
 For the rare and radiant maiden whom the angels name Lenore—
                          Nameless here for evermore.
 And the silken sad uncertain rustling of each purple curtain
 Thrilled me—filled me with fantastic terrors never felt before;
 So that now, to still the beating of my heart, I stood repeating
 “‘Tis some visiter entreating entrance at my chamber door—
 Some late visiter entreating entrance at my chamber door;—
                          This it is, and nothing more.”
 Presently my soul grew stronger; hesitating then no longer,
 “Sir,” said I, “or Madam, truly your forgiveness I implore;
 But the fact is I was napping, and so gently you came rapping,
 And so faintly you came tapping, tapping at my chamber door,
 That I scarce was sure I heard you”—here I opened wide the door;——
                          Darkness there and nothing more.
 Deep into that darkness peering, long I stood there wondering, fearing,
 Doubting, dreaming dreams no mortal ever dared to dream before;
 But the silence was unbroken, and the darkness gave no token,
 And the only word there spoken was the whispered word, “Lenore!”
  This I whispered, and an echo murmured back the word, “Lenore!”—
                          Merely this, and nothing more.
 Back into the chamber turning, all my soul within me burning,
 Soon I heard again a tapping somewhat louder than before.
 “Surely,” said I, “surely that is something at my window lattice;
 Let me see, then, what thereat is, and this mystery explore—
 Let my heart be still a moment and this mystery explore;—
                          ‘Tis the wind and nothing more!”
 Open here I flung the shutter, when, with many a flirt and flutter,
 In there stepped a stately raven of the saintly days of yore;
 Not the least obeisance made he; not an instant stopped or stayed he;
 But, with mien of lord or lady, perched above my chamber door—
 Perched upon a bust of Pallas just above my chamber door—
                          Perched, and sat, and nothing more.
 Then this ebony bird beguiling my sad fancy into smiling,
 By the grave and stern decorum of the countenance it wore,
 “Though thy crest be shorn and shaven, thou,” I said, “art sure no craven,
 Ghastly grim and ancient raven wandering from the Nightly shore—
 Tell me what thy lordly name is on the Night’s Plutonian shore!”
                          Quoth the raven “Nevermore.”
 Much I marvelled this ungainly fowl to hear discourse so plainly,
 Though its answer little meaning—little relevancy bore;
 For we cannot help agreeing that no living human being
 Ever yet was blessed with seeing bird above his chamber door—
 Bird or beast upon the sculptured bust above his chamber door,
                         With such name as “Nevermore.”
 But the raven, sitting lonely on the placid bust, spoke only
 That one word, as if his soul in that one word he did outpour.
 Nothing farther then he uttered—not a feather then he fluttered—
 Till I scarcely more than muttered “Other friends have flown before—
 On the morrow he will leave me, as my hopes have flown before.”
                          Then the bird said “Nevermore.”
 Startled at the stillness broken by reply so aptly spoken,
 “Doubtless,” said I, “what it utters is its only stock and store
 Caught from some unhappy master whom unmerciful Disaster
 Followed fast and followed faster till his songs one burden bore—
 Till the dirges of his Hope that melancholy burden bore
                         Of “Never—nevermore.”
 But the raven still beguiling all my sad soul into smiling,
 Straight I wheeled a cushioned seat in front of bird, and bust and door;
 Then, upon the velvet sinking, I betook myself to thinking
 Fancy unto fancy, thinking what this ominous bird of yore—
 What this grim, ungainly, ghastly, gaunt and ominous bird of yore
                         Meant in croaking “Nevermore.”
 This I sat engaged in guessing, but no syllable expressing
 To the fowl whose fiery eyes now burned into my bosom’s core;
 This and more I sat divining, with my head at ease reclining
 On the cushion’s velvet lining that the lamplght gloated o’er,
 But whose velvet violet lining with the lamplight gloating o’er,
                          She shall press, ah, nevermore!
 Then, methought, the air grew denser, perfumed from an unseen censer
 Swung by Angels whose faint foot-falls tinkled on the tufted floor.
 “Wretch,” I cried, “thy God hath lent thee—by these angels he hath sent thee
 Respite—respite and nepenthe from thy memories of Lenore;
 Quaff, oh quaff this kind nepenthe and forget this lost Lenore!”
                           Quoth the raven, “Nevermore.”
 “Prophet!” said I, “thing of evil!—prophet still, if bird or devil!—
 Whether Tempter sent, or whether tempest tossed thee here ashore,
 Desolate yet all undaunted, on this desert land enchanted—
 On this home by Horror haunted—tell me truly, I implore—
 Is there—is there balm in Gilead?—tell me—tell me, I implore!”
                           Quoth the raven, “Nevermore.”
 “Prophet!” said I, “thing of evil—prophet still, if bird or devil!
 By that Heaven that bends above us—by that God we both adore—
 Tell this soul with sorrow laden if, within the distant Aidenn,
 It shall clasp a sainted maiden whom the angels name Lenore—
 Clasp a rare and radiant maiden whom the angels name Lenore.”
                           Quoth the raven, “Nevermore.”
 “Be that word our sign of parting, bird or fiend!” I shrieked, upstarting—
 “Get thee back into the tempest and the Night’s Plutonian shore!
 Leave no black plume as a token of that lie thy soul hath spoken!
 Leave my loneliness unbroken!—quit the bust above my door!
 Take thy beak from out my heart, and take thy form from off my door!”
                          Quoth the raven, “Nevermore.”
 And the raven, never flitting, still is sitting, still is sitting
 On the pallid bust of Pallas just above my chamber door;
 And his eyes have all the seeming of a demon’s that is dreaming,
 And the lamp-light o’er him streaming throws his shadow on the floor;
 And my soul from out that shadow that lies floating on the floor
                          Shall be lifted—nevermore!
'''

raven = the_raven.split('.')
len(raven)
Out[15]:
23
In [16]:
raven[0]
Out[16]:
'\nOnce upon a midnight dreary, while I pondered, weak and weary,\n Over many a quaint and curious volume of forgotten lore,\n While I nodded, nearly napping, suddenly there came a tapping,\n As of some one gently rapping, rapping at my chamber door'
In [17]:
raven_cleaned = [" ".join(preprocess_text(txt)) for txt in raven]
raven_cleaned[0]
Out[17]:
'once upon a midnight dreary while i ponder weak and weary over many a quaint and curious volume of forgotten lore while i nod nearly nap suddenly there come a tapping a of some one gently rap rap at my chamber door'
In [18]:
# Build tf-idf lookup table
vectorizer   = TfidfVectorizer(norm=None)
tfidf_scores = vectorizer.fit_transform(raven_cleaned).toarray()

df_tfidf = pd.DataFrame(
    tfidf_scores.T,
    index=vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
)
df_tfidf
Out[18]:
0 1 2 3 4 5 6 7 8 9 ... 13 14 15 16 17 18 19 20 21 22
above 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 4.772589 0.0 ... 0.0 0.0 0.000000 0.000000 0.0 0.000000 2.386294 0.0 2.386294 2.386294
adore 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.000000 0.000000 0.0 0.000000 3.484907 0.0 0.000000 0.000000
again 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 3.484907 0.000000 0.0 ... 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000
agree 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000
ah 0.0 0.0 3.079442 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.000000 3.079442 0.0 0.000000 0.000000 0.0 0.000000 0.000000
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
wrought 0.0 0.0 3.484907 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000
yet 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.000000 0.000000 0.0 3.079442 0.000000 0.0 0.000000 0.000000
yore 0.0 0.0 0.000000 0.0 0.0 0.000000 0.0 0.000000 3.079442 0.0 ... 0.0 0.0 6.158883 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000
you 0.0 0.0 0.000000 0.0 0.0 10.454720 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000
your 0.0 0.0 0.000000 0.0 0.0 3.484907 0.0 0.000000 0.000000 0.0 ... 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.000000

423 rows × 23 columns

In [19]:
# Get most relevant docs for "the bird"
terms  = 'the bird'.split(' ')
search = None

for term in terms:
    if search is None:
        search  = df_tfidf.loc[term].copy()  # copy so that += cannot modify df_tfidf in place
    else:
        search += df_tfidf.loc[term]

search = search.sort_values(ascending=False)
search
Out[19]:
15    8.469860
9     7.533669
22    6.522068
16    6.522068
21    6.446658
6     5.435057
19    5.359646
10    5.284236
3     4.348046
8     4.348046
18    3.185624
13    3.185624
2     2.174023
4     2.174023
5     2.174023
11    2.174023
14    2.174023
7     1.087011
12    1.087011
17    1.087011
20    1.087011
1     0.000000
0     0.000000
Name: the, dtype: float64
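The lookup above assumes every query term is in the vocabulary; `df_tfidf.loc[term]` raises a `KeyError` otherwise. One possible way to handle unknown terms is sketched below in plain Python. The function `score_documents` and the toy score table are hypothetical and use made-up numbers, not values from the poem:

```python
def score_documents(tfidf_table, query_terms, n_docs):
    """Sum the tf-idf scores of the query terms for each document.

    tfidf_table maps term -> {doc_index: score}; terms missing from the
    table contribute nothing instead of raising an error.
    """
    scores = [0.0] * n_docs
    for term in query_terms:
        for doc_index, score in tfidf_table.get(term, {}).items():
            scores[doc_index] += score
    # Highest-scoring documents first, as (doc_index, score) pairs
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy table with made-up scores, for illustration only
toy_table = {
    'bird': {0: 2.0, 2: 4.0},
    'the':  {0: 1.0, 1: 1.0, 2: 1.0},
}
ranking = score_documents(toy_table, ['the', 'bird', 'albatross'], n_docs=3)
print(ranking)
# [(2, 5.0), (0, 3.0), (1, 1.0)]
```

The unknown term "albatross" is simply ignored, so the ranking is driven by the terms that do occur.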
In [20]:
from IPython.display import display, HTML
In [21]:
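# Match whole words only, ignoring case, so that "the" is not highlighted
# inside words such as "then" or "thee"
pattern = re.compile(r'\b(' + '|'.join(map(re.escape, terms)) + r')\b', re.IGNORECASE)

# Show the five best-matching passages with the query terms highlighted
for i, idx in enumerate(search.index[:5], start=1):
    html = raven[idx].replace('\n', '<br>')
    html = pattern.sub(r'<marked>\1</marked>', html)

    display(HTML('<style>marked{background:lightskyblue}</style>'
                 + '<h3>Result ' + str(i) + '</h3>'
                 + html))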

Result 1


But the raven still beguiling all my sad soul into smiling,
Straight I wheeled a cushioned seat in front of bird, and bust and door;
Then, upon the velvet sinking, I betook myself to thinking
Fancy unto fancy, thinking what this ominous bird of yore—
What this grim, ungainly, ghastly, gaunt and ominous bird of yore
Meant in croaking “Nevermore

Result 2


Then this ebony bird beguiling my sad fancy into smiling,
By the grave and stern decorum of the countenance it wore,
“Though thy crest be shorn and shaven, thou,” I said, “art sure no craven,
Ghastly grim and ancient raven wandering from the Nightly shore—
Tell me what thy lordly name is on the Night’s Plutonian shore!”
Quoth the raven “Nevermore

Result 3


And the raven, never flitting, still is sitting, still is sitting
On the pallid bust of Pallas just above my chamber door;
And his eyes have all the seeming of a demon’s that is dreaming,
And the lamp-light o’er him streaming throws his shadow on the floor;
And my soul from out that shadow that lies floating on the floor
Shall be lifted—nevermore!

Result 4


This I sat engaged in guessing, but no syllable expressing
To the fowl whose fiery eyes now burned into my bosom’s core;
This and more I sat divining, with my head at ease reclining
On the cushion’s velvet lining that the lamplght gloated o’er,
But whose velvet violet lining with the lamplight gloating o’er,
She shall press, ah, nevermore!
Then, methought, the air grew denser, perfumed from an unseen censer
Swung by Angels whose faint foot-falls tinkled on the tufted floor

Result 5


“Be that word our sign of parting, bird or fiend!” I shrieked, upstarting—
“Get thee back into the tempest and the Night’s Plutonian shore!
Leave no black plume as a token of that lie thy soul hath spoken!
Leave my loneliness unbroken!—quit the bust above my door!
Take thy beak from out my heart, and take thy form from off my door!”
Quoth the raven, “Nevermore