I'm trying to implement the Pastry Distributed Hash Table, but some things are escaping my understanding. I was hoping someone could clarify.
Disclaimer: I'm not a computer science student. I've taken precisely two computer science courses in my life, and neither dealt with anything remotely complex. I've worked with software for years, so I feel I'm up to the implementation task, if I could just wrap my head around the ideas. So I may just be missing something obvious.
I've read the paper that the authors published [1], and I've made some good progress, but I keep getting hung up on this one particular point in how the routing table works:
The paper claims that
A node's routing table, $R$, is organized into $\lceil \log_{2^b} N\rceil$ rows with $2^b - 1$ entries each. The $2^b - 1$ entries at row $n$ of the routing table each refer to a node whose nodeId shares the present node's nodeId in the first n digits, but whose $n + 1$th digit has one of the $2^b - 1$ possible values other than the $n + 1$th digit in the present node's id.
The $b$ stands for an application-specific variable, usually $4$. Let's use $b=4$, for simplicity's sake. So the above is
A node's routing table, $R$, is organized into $\lceil \log_{16} N\rceil$ rows with $15$ entries each. The $15$ entries at row $n$ of the routing table each refer to a node whose nodeId shares the present node's nodeId in the first n digits, but whose $n + 1$th digit has one of the $2^b - 1$ possible values other than the $n + 1$th digit in the present node's id.
I understand that much. Further, $N$ is the number of servers in the cluster. I get that, too.
My question is, if the row an entry is placed into depends on the shared length of the key, why the seemingly random limit on the number of rows? Each nodeId has 32 digits, when $b=4$ (128 bit nodeIds divided into digits of b bits). So what happens when $N$ gets high enough that $\lceil\log_{16} N\rceil > 32$? I realise it would take 340,282,366,920,938,463,463,374,607,431,768,211,457 (if my math is right) servers to hit this scenario, but it just seems like an odd inclusion, and the correlation is never explained.
Furthermore, what happens if you have a small number of servers? If I have fewer than 16 servers, I only have one row in the table. Further, under no circumstances would every entry in the row have a corresponding server. Should entries be left empty? I realise that I'd be able to find the server in the leaf set no matter what, given that few servers, but the same quandary is raised for the second row--what if I don't have a server that has a nodeId such that I can fill every possible permutation of the nth digit? Finally, if I have, say, four servers, and I have two nodes that share, say, 20 of their 32 digits, by some random fluke... should I populate 20 rows of the table for that node, even though that is far more rows than I could even come close to filling?
Here's what I've come up with, trying to reason my way through this:
- Entries are to be set to a null value if there is not a node that matches that prefix precisely.
- Empty rows are to be added until enough rows exist to match the shared length of the nodeIds.
- If, and only if, there is no matching entry for a desired message ID, fall back on a search of the routing table for a nodeId whose shared length is greater than or equal to the current nodeId's and whose entry is mathematically closer than the current nodeId's to the desired ID.
- If no suitable node can be found in #3, assume this is the destination and deliver the message.
Do all four of these assumptions hold up? Is there somewhere else I should be looking for information on this?
- Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems by A. Rowstrong and P. Druschel (2001) -- download here
Asked By : Paddy Foran
Answered By : AJed
The idea of a routing table in Pastry (and all structured P2P networks) is to minimize its size, while guaranteeing a quicker routing.
The routing algorithm of Pastry goes as follows:
Step A. A node u searches for an object A by firstly looking it up in its leaf set. Step B. If it was not available, then the query is forwarded to a known node that shares "a number of prefixes with $A$ that is at least larger than what node u shares with A". Step C. If such a record is not found, then the query is forwarded to a node in the leaf set that is numerically closest to $A$.
This is why a node $u$ stores addresses of nodes organizes its table as follows:
1.Each record in row $i$ of the routing table of node $u$ is a node identifier that shares $i$ prefix bits with the identifier of $u$.
2.The $(i + 1)^{th}$ bit of the records of a row $i$ is unique and is taken from the set $\{0, …, 2^{b} – 1\}$.
Example in a typical scenario: if u address is 1111 and object $A$ has identifier 4324: then here is what will happen: (we assume that it is of the base of 4. (i.e. addresses are from [1-4][1-4][1-4][1-4]).
Node $u$ shares 0 prefix with object $A$. Therefore, it looks in row 0. According to rule 2 above, node $u$ stores addresses of nodes 1XXX, 2XXX, 3XXX, 4XXX, where X is a "dont-care" value. The closest among these nodes to $A$ is 4XXX. - Let's say this 4XXX is actually 4013. Then $u$ forwards to $u _1$ with address 4013. Now you are going to repeat the same thing again at node $u _1$ with address 4013.
To make it simpler, here is again an example of how it will go in 4013. $u _1$ will first look for the size common prefix between 4013 and 4324 which is 1. So it goes to row 1, which contain values such that 41XX, 42XX, 43XX, 44XX. The closes among them to $A$ is 43XX. - if this was 4331 then it will be forwards toward it.
The maximum number of hops here in is 4 hops (XXXX) ! in Pastry terms, it is $log_{2^b}$. So it is reduced as $b$ increases. But the size of the rows which are $2^b$ will increases ! -- so the authors said that $b$ = 4 is a good balance !
Practical scenarios are usually not typical as that. There may be situations in which there are not many nodes in the network. this is why we follow step C above. - However, what you need to guarantee to make this algorithm correct is that each node be connected to the closest two nodes to it (in term of identifiers). This will form a ring of ordered nodes [e.g. 1->3->4->9->10->11->1]
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/3138
0 comments:
Post a Comment
Let us know your responses and feedback