
[Solved]: Huffman Code vs. Hu–Tucker Code

Problem Detail: 

Before I ask my question, let me start with my understanding of the definitions, both to prevent further confusion on my part and to give some background.

A Huffman code is the binary code induced from a binary tree constructed by Huffman's algorithm.
A Hu–Tucker code is the binary code induced from an alphabetical search tree.
According to Wikipedia (see the paragraph on Optimal alphabetic binary trees (Hu–Tucker coding)):

In the standard Huffman coding problem, it is assumed that any codeword can correspond to any input symbol. In the alphabetic version, the alphabetic order of inputs and outputs must be identical. Thus, for example, $A = \left\{a,b,c\right\}$ could not be assigned code $H\left(A,C\right) = \left\{00,1,01\right\}$, but instead should be assigned either $H\left(A,C\right) = \left\{00,01,1\right\}$ or $H\left(A,C\right) = \left\{0,10,11\right\}$. This is also known as the Hu–Tucker problem, after T. C. Hu and Alan Tucker, the authors of the paper presenting the first linearithmic solution to this optimal binary alphabetic problem. Their algorithm has some similarities to the Huffman algorithm, but is not a variation of it. These optimal alphabetic binary trees are often used as binary search trees.
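To make the ordering constraint concrete, here is a quick check (a minimal Python sketch, using the three code tables from the example above) that the alphabetic assignments keep the codewords in the same lexicographic order as the symbols, while the plain Huffman assignment does not:

    # Code tables from the example above: one valid Huffman assignment
    # that violates alphabetical order, and the two alphabetic alternatives.
    non_alphabetic = {"a": "00", "b": "1", "c": "01"}
    alphabetic_one = {"a": "00", "b": "01", "c": "1"}
    alphabetic_two = {"a": "0", "b": "10", "c": "11"}

    for table in (non_alphabetic, alphabetic_one, alphabetic_two):
        codewords = [table[s] for s in sorted(table)]  # symbols a, b, c in order
        # Order-preserving means the codewords sort the same way the symbols do.
        print(codewords, "order preserved:", codewords == sorted(codewords))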

My question is: what are the applications of such alphabetic binary trees?
I tried to search online, but couldn't find a satisfying answer.
I also read the introduction of Hu and Tucker's paper on the subject, Optimal Computer Search Trees and Variable-Length Alphabetical Codes, but I couldn't figure out from their example exactly what such a tree is used for.

I can very well understand the need for a compact, optimal prefix code induced by an optimal tree (i.e. a Huffman code); that can be used for compression. But what are alphabetic binary trees used for?

Asked By : so.very.tired

Answered By : Pseudonym

Let me give you a real-world example that's very similar to something I wrote once.

Let's say you're implementing a library catalogue system. A library catalogue is conceptually a collection of documents (perhaps in MARC format). A user of this system might enter a query, as in any search engine, and get a set of documents in return. The user would like to be able to sort the result set by some field (e.g. title or author), and display the result set a screenful at a time.

Sorting is a well-understood problem. However, suppose this is a big library, and a search returns 100,000 relevant documents. Clearly the user is not going to look through all of them! In fact, the user might only look at the first couple of screens of results (say, 50-100 documents) and realise that their query was too broad, and so refine it further.

Moreover, accessing the sort key for a document requires parsing the document. True, you could extract the possible sort keys into a form that doesn't require parsing MARC (or, even worse, SGML/XML), although that would duplicate data. And besides, these are strings we're talking about: they are variable-length, which makes memory and disk management difficult.

So you could try a fixed-size format. You could take, say, the first K characters of every title, for some predetermined K, and store them in an array on disk, indexed by document number. Then you could first sort the documents by those string prefixes (i.e. something like a bucket/radix sort), and any documents that fall into the same bucket could then be sorted by extracting the "real" sort key from the documents.
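As a rough illustration, here is a minimal Python sketch of that scheme. The names are assumptions for the example: parse_title stands in for the expensive step of parsing a document to extract its real sort key, and the prefix array plays the role of the fixed-size on-disk array.

    from itertools import groupby

    K = 8  # predetermined prefix length, in characters

    def make_prefix_array(titles):
        # Fixed-size keys: the first K characters of each title, padded so
        # every entry has the same length. In the real system this array
        # would live on disk, indexed by document number.
        return [t[:K].ljust(K) for t in titles]

    def sort_results(doc_ids, prefixes, parse_title):
        # Cheap pass: sort on the fixed-size prefixes alone.
        doc_ids = sorted(doc_ids, key=lambda d: prefixes[d])
        result = []
        # Expensive pass: only buckets of documents that share a prefix
        # need the "real" sort key extracted from the document itself.
        for _, bucket in groupby(doc_ids, key=lambda d: prefixes[d]):
            bucket = list(bucket)
            if len(bucket) > 1:
                bucket.sort(key=parse_title)  # hypothetical expensive parse
            result.extend(bucket)
        return result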

The nice thing about this is that you don't need to fully sort the result set. Because the user is paging through it, you only have to completely sort the first few screenfuls, and just retain enough bucket information to sort the rest if the user pages that far.
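In code, the "first screenful only" idea can be as simple as a bounded selection. This is again a sketch, reusing the hypothetical prefix array from above; heapq.nsmallest finds the first page without sorting the whole result set.

    import heapq

    PAGE_SIZE = 50  # one screenful of results

    def first_page(doc_ids, prefixes):
        # Select the PAGE_SIZE smallest keys in one pass; everything else
        # stays unsorted until the user actually pages that far.
        return heapq.nsmallest(PAGE_SIZE, doc_ids, key=lambda d: prefixes[d])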

So that's an improvement, but how do you set K? A lot of titles start with the letters "The ", and that's using 32 bits of information for very little discriminatory power. In fact, you'd probably be surprised how many periodicals are called "The International Journal of X", or similar boilerplate, and some searches are likely to return a lot of documents with similar titles just like that.

One possible solution is to use an order-preserving code. Compress all titles using that code, and store the first 64 bits (or some other fixed amount) of each compressed title in an on-disk array. This has quite a few practical advantages: parts of the title which have very little discriminatory power get very short codewords (so you don't waste space on irrelevant detail), you can sort by it because it's order-preserving, and the keys are fixed-length (so they're easy to manage efficiently).
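To sketch the idea in Python: the function below builds an order-preserving prefix code by recursively splitting the alphabetically sorted symbol list near its weight midpoint. This is a simplified stand-in, not the Hu–Tucker algorithm itself, and it is not guaranteed optimal; it just has the property that matters here, namely that codewords sort in the same order as the symbols, so truncated encodings can serve as fixed-length sort keys.

    def alphabetic_code(symbols, weights):
        # `symbols` must already be in alphabetical order; `weights` are the
        # corresponding frequencies. Splitting near the weight midpoint gives
        # frequent symbols shorter codewords while preserving symbol order.
        code = {}

        def split(lo, hi, prefix):
            if hi - lo == 1:
                code[symbols[lo]] = prefix or "0"
                return
            total, acc, cut = sum(weights[lo:hi]), 0, lo + 1
            for i in range(lo, hi - 1):
                acc += weights[i]
                if 2 * acc >= total:
                    cut = i + 1
                    break
            split(lo, cut, prefix + "0")  # left part of the alphabet -> 0...
            split(cut, hi, prefix + "1")  # right part -> 1...

        split(0, len(symbols), "")
        return code

    def key64(title, code):
        # Encode the title and keep only the first 64 bits, zero-padded, as a
        # fixed-size integer sort key. (A real system would need a codeword
        # for every character; unknown ones are skipped to keep this short.)
        bits = "".join(code.get(ch, "") for ch in title)
        return int((bits + "0" * 64)[:64], 2)

With character frequencies gathered from the catalogue's titles, sorting by key64 compares one machine word per document and agrees with a full string sort, up to ties among titles whose encodings share the same first 64 bits; the bucket trick above already handles those ties.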

Best Answer from Computer Science Stack Exchange

Question Source : http://cs.stackexchange.com/questions/49480
