How to extract the best illustrative sentences from documents for a given word?

Problem Detail: 

I want to extract the "best" sentences from a set of given (HTML) documents. Let me illustrate with an example. Take the word "determine". I have hundreds of HTML files with paragraphs and sentences that contain this word. Out of those sentences, I want to collect and score only the ones that describe or illustrate the meaning of "determine". To go further, here are some sentences I collected from the web (a Google search for the word, vocabulary.com, merriam-webster.com, yourdictionary.com) for the word "determine":

  1. It will be her mental attitude that determines her future.
  2. Officials are working with state police to determine the cause of a deadly bus crash.
  3. They are unable to accurately determine the ship's position at this time.
  4. The program officers were determined to do better on that front.
  5. And how did you determine that?

In my case, I would like my algorithm to filter out (4) and (5), since they read as incomplete sentences: they carry too little context to "illustrate" the word "determine". I would not keep (1) either, since it is not strong enough to help a reader visualize how "determine" is used, assuming the reader is seeing the word for the first time. For the remaining sentences, (2) and (3), I would like to rate them by their "usefulness".

This is certainly an NLP problem and would take extensive research to solve well. I am new to this domain, though, so it would be extremely helpful if anyone could point me towards related work. I believe this forum is a good place for computer science questions like this one.

Asked By : sangam

Answered By : John Frederick Chionglo

Luhn, H.P. (1958, April). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2(2), 159-165.

  • Describes the main elements of automatically creating literature abstracts. Significant sentences are candidates for inclusion into an abstract. The measure of sentence significance is based on significant words, and the proximity of significant words within sentences. Word significance is based on its frequency within the document; the word-frequency diagram depicts the set of words to consider as significant. Word significance also applies to the creation of indexes for search and retrieval.
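As a rough illustration of the sentence-significance idea summarized above, here is a minimal Python sketch. It is not code from Luhn's paper: the stop-word list, the `min_count` frequency cutoff, the `max_gap` of four non-significant words, and the function names are all assumptions chosen for this example. A sentence's score is taken as the best cluster of significant words, scored as the square of the number of significant words in the cluster divided by the cluster's length in words.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = frozenset({
    "the", "a", "an", "of", "to", "in", "on", "and", "is", "are",
    "that", "it", "with", "at", "this", "for", "be", "will", "her",
})


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def significant_words(sentences, min_count=2):
    """Treat non-stop words seen at least min_count times as significant.

    min_count is an assumed cutoff standing in for the word-frequency diagram.
    """
    counts = Counter(
        word
        for sentence in sentences
        for word in tokenize(sentence)
        if word not in STOP_WORDS
    )
    return {word for word, count in counts.items() if count >= min_count}


def luhn_score(sentence, significant, max_gap=4):
    """Score a sentence by its densest cluster of significant words.

    Significant words separated by at most max_gap other words belong to the
    same cluster. A cluster scores (significant words)^2 / (cluster length).
    """
    words = tokenize(sentence)
    positions = [i for i, word in enumerate(words) if word in significant]
    if not positions:
        return 0.0

    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1  # still inside the current cluster
        else:
            # Close the current cluster and start a new one at pos.
            best = max(best, count ** 2 / (prev - start + 1))
            start = pos
            count = 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))
```

To rank candidate sentences, one could compute `significant_words` over all collected sentences and sort them by `luhn_score`. Note that this measures how central a sentence is to its document, which is related to, but not the same as, how well it illustrates a single target word.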

Consider also the presentation I created when I took a graduate course: "Term Association and Thesaurus Construction"; you might find the references there useful too.

Question Source : http://cs.stackexchange.com/questions/51204
