World's most popular travel blog for travel bloggers.

Doubts abouts join implementation

, , No Comments
Problem Detail: 

I'm studying advanced database and I can not understand some concepts related to the join. I hope I'm in the right place to post my doubts, otherwise I'm sorry.

I realized that to make the equijoin there are several families. In particular, the family that is based on the Cartesian product includes three main methods: nested loops, nested blocks and nested scans.

The nested loop method iterates on the outer relation R (tipically the smaller one) and for each tuple of R, it scans the internal relation S and applies the join condition. Then it just use the join definition: Cartesian product + selection. The cost of this process is:

C = |R| + |R| * |S|

You can get an improvement if instead of considering the tuples, consider the pages. In this case the cost is:

C = P_R + |R| * P_S

Why? I do not understand.

From what I remember of Operating Systems, the logical memory is divided into pages and the physical memory in frames, where the pages are of the same size of the frames. This is done to permit the non-contiguous allocation of files.

A tuple instead of what corresponds? I guess as a record (one line) of a table, but in this context what is it?

The nested blocks method is a further improvement because in main memory maintains a block of tuples of R. The improvement is obtained by scanning S only once for each group of tuples of R. A version of this charging algorithm in the tuples of R memory, then proceeds to S and examines the memory to find the S tuples that match any tuple of R. This reduces the number of necessary S scans. In this case the cost is:

C = P_R + P_R * P_S

I must admit that I did not understand very well this algorithm. Someone could explain it better?

Thus a block is a set of tuples? And a set of tuples form one page? I'm very confused.

Maybe some sketch could help me, I have not found that showed the tuples.

Thank you all

If I understand correctly, this is how I imagine the logical memory, physical memory, pages and frames: enter image description here

Nested loops (tuple)

Again, as I understand it: enter image description here

I apologize but I have not yet understood the nested loop (pages) and nested block :(

What do you mean with "We can achieve some improvement checking the tuples stored on one page of relation S for each record in relation R"?

And with "In Block nested approach each block in the inner relation S is read once for each block in the relation R which gives you the desired cost"?

I feel a bit stupid, sorry.

Asked By : Buster

Answered By : Rick Decker

First, let's establish some variables: we'll have $n_r$ tuples of relation $r$ and these tuples will be grouped into $b_r$ blocks; similarly the relation $s$ will have $n_s$ tuples, grouped into $b_s$ blocks. We'll assume that a single disk access can retrieve one block.

Nested-loop join. In this implementation we work on a per-tuple basis. The algorithm looks like this:

for each tuple t_r of r    for each tuple t_s of s       test the pair (t_r, t_s) to see if it meets the join condition       if it does, include the joined pair in the result 

To get the tuples $t_r$ in the outer loop, we get a block of tuples in one disk access, so the outer loop will require $b_r$ disk accesses to eventually get all the tuples of $r$. For each of the $n_r$ iterations of the outer loop, we need to compare a $t_r$ with all the $t_s$ tuples in $s$, so each iteration of the inner loop will take $b_s$ disk accesses. We have $n_r$ iterations of the inner loop for a total of $n_r\cdot b_s$ more disk accesses, so this algorithm will require $\mathbf{b_r+n_r\cdot b_s}$ disk accesses.

In your example, $n_r=5, b_r=b_s=2$, so we'd need $2+5\cdot 2= 12$ accesses.

Block nested-loop join. Here we do the same comparisons as before, only on a per-block basis, like this:

for each block B_r of r    for each block B_s of s       for each tuple t_r in block B_r          for each tuple t_s in block B_s             test the pair (t_r, t_s) to see if it meets the join condition             if it does, include the joined pair in the result 

Now the outer loops will require $b_r$ disk accesses, as before, and for the next loop we'll require $b_s$ accesses for each of the outer $b_r$ blocks, for a total of $b_r\cdot b_s$ accesses, giving us a total of $\mathbf{b_r+b_r\cdot b_s}$ accesses. Note that the two innermost loops will require no accesses, since all their actions can be done in memory.

With $n_r=5, b_r=b_s=2$, we'd need just $2+2\cdot 2= 6$ accesses.

It's worth mentioning that these are worst-case values, assuming that only one block of each relation can fit in the disk-to-memory buffer. Of course, if we had enough memory to store all of $r$ and $s$, then we'd only need $b_r+b_s$ disk accessses, to read both $r$ and $s$ into memory.

Best Answer from StackOverflow

Question Source :

3200 people like this

 Download Related Notes/Documents


Post a Comment

Let us know your responses and feedback