LOADING

加载过慢请开启缓存 浏览器默认开启

Externel Sorting

2024/12/27 ADS

文章目录

👇

External Sorting

Introduction

When data are too large to fit into internel memory, external sorting is a technique to sort them. It involves dividing the data into smaller chunks, sorting each chunk independently, and then merging the sorted chunks to obtain the final sorted result.

Examples

Defination
  • $B$ is the fixed-size of a block of data that can be read from the external storage.
  • $M$ is the number of blocks that the internal memory can solve once at a time.
  • $p$ is the number of passes needed to sort the data.
  • $b$ is the total I/O cost, which equals to the number of blocks.

Now we have a internal memory with $M=3,B=1$.
And the data with size $N=13$ is originally stored in a tape $T_1$ in the external storage.

2-way merge algorithm

  • step 1: We first load in the first three blocks of $T_1$ into the internal memory and sort them.
  • step 2: Then we write the sorted data (a run) back to a new tape $T_2$.
  • step 3: We load the second three blocks of $T_1$ into the internal memory and sort them.
  • We write the sorted data back to a new tape $T_3$.
  • step 4: We repeat step 2 and 3 for the remaining blocks of $T_1$ and then regard $T_1$ as empty.
  • step 5: We merge and sort the first pair of run in $T_2$ and $T_3$ and write it back to the empty tape $T_1$.
  • step 6: We merge and sort the second pair of run in $T_2$ and $T_3$ and write it back to the a new tape $T_4$.
  • step 7: We repeat step 5 and 6 for the remaining pairs of runs in $T_2$ and $T_3$ and then regard $T_2$ and $T_3$ as empty.
  • step 8: We repeat step 5 , 6 and 7 until all the data are sorted and stored in a single tape $T$.
    The number of runs after step 4 is $r=\lceil\frac{N}{M}\rceil=5$.
    The number of operations that merge two tapes is $t=\lceil\log_2\lceil\frac{N}{M}\rceil\rceil=3$.
    The number of passes is $p=1+t=4$.
    The cost of I/O is $b=2\frac{N}{B}\times p=104$.

What we concern are

  • seek time (number of passes)
  • time to merge $N$ records from input buffers to the output buffer
  • time to internally sort $M$ records
  • time to read or write one block of records
    So our goal is to
  • minimize the number of passes
  • minimize the number of run merges
  • minimize the number of run generation
  • minimize the number of buffer handling for parallel operation

Reduce number of passes

Use a $k$-way merge algorithm to merge the runs in parallel.

So the number of passes $p=1+\lceil\log_k\lceil\frac{N}{M}\rceil\rceil$.
It do decrease the number of passes, but since this algorithm requires $2k$ tapes, it is not good enough.

Reduce the number of tapes

Try to use 3 tapes to solve 2-way merge problem.

If we simply use the old way, we may stuck here.
So, if we are given two tapes with number of runs $r_2=5$ and $r_3=3$,

rather than put the even pairs and odd pairs into two tapes respectively, we can put the first 3 pairs into an empty tape $T_1$ and do nothing to the remained 2 runs.

With this strategy, we can reduce the number of tapes from $2k$ to $k+1$.
If we review the processure of 2-way merge in reverse order,we can find that the number of runs in the longest tape is:
$$F_0=0,F_1=1,F_2=1,F_3=2,F_4=3\cdots F_N=F_{N-1}+F_{n-2}$$
It is a Fibonacci sequence!!
So we can do our best to assign the number of runs in each tape to the Fibonacci sequence.
It also has a similar property in k-way merge algorithm,that is:
$$F_{N}^k=\begin{cases}0,\quad 0\leq N\leq k-2\\1,\quad N=k-1\\\sum_{i=1}^{k}F_{N-i}^k,\quad N\geq k\end{cases}$$

Handle the buffers for parallel operation

  • Buffer: When we are doing parallel writing and reading, we need to store both input and output data in the buffer.

When the $k$ rises, the number of input buffers will increase,the size of buffer will decrease,the size of block will decrease,which end up with increasing the seeking time.

In general, for a $k$-way merge we need $2k$ input buffers and $2$ output buffers for parallel operations.

Beyond a certain $k$ value, the I\O time would actually increase despite the decrease in the number of passes being made. The optimal value for $k$ clearly depends on disk parameters and the amount of internal memory available for buffers.

Can we get a longer run each time?

Yes,apart from selecting $M$ elements and sort then, we can do as follows:

  • step 1: Select first $M$ elements and sort them.
  • step 2: Write the lowest one to the tape.
  • step 3: Add the next element to the remained $M-1$ elements and sort them.
  • step 4: Write the lowest one of those who are not greater than the previous one to the tape.
  • step 5: Repeat until there is no element left or all the $M$ elements in the buffer are greater than the previous one.Then one tape is built.
  • step 6: Repeat step 1 to step 5 for the remaining elements to build more runs.

It is proved that the average length of a run $L_{avg}=2M$.

How to minimize merging time?

We can use a Huffman Tree to minimize the number of merging times.

For example,we are given 4 runs with length $l_1=2,l_2=4,l_3=5,l_4=15$.The best sequence to merge them is in the following order: