External Sort

Published: September 17, 2011
Sorting

CS 102
File Structures & File Organizations

Sorting – arranging the items in a list in ascending or descending order by a key value. Applicable to all file organizations, not just sequential.
Why sort? To make a report, to merge files in queries, to merge files in master file maintenance, to make searches easier, to prioritize, etc.

Chapter 05

External Sorting Algorithms

Internal vs External Sorts
Internal Sort – sorting items entirely in main memory (ICS 2, ICS 3, CS 101).
External Sort – sorting files in secondary storage using main memory (CS 102).
Why external sort? Some files may be too large to fit in main memory.

Some Terminologies
A Pass – an iteration that goes through the items (or records) of a list (or file) once, including reading them from the file, processing them in main memory, and writing them to a file.
A Run – a grouping of some items of a list. A run usually starts as a block of records and grows in size as runs are merged.
Size of a Run – the number of items in a run. Usually no less than the blocking factor.
A Merge – combining lists into one.

The Algorithms
Overview of the external sort algorithms:
  • 2-way Sort Merge
  • Balanced 2-way Sort Merge
  • Balanced k-way Sort Merge
  • Polyphase Sort Merge

2-Way Sort Merge
A simple 2-way Sort Merge repeatedly merges 2 smaller sorted components of a file into a bigger sorted component of the file. The algorithm has two phases:
Phase 1: The Sort Phase
Phase 2: The Merge Phase

The Sort Phase
Phase 1: The Sort Phase
Divide the records of a file into several runs, sort the records of each run with an internal sort, and distribute the runs “evenly” to two external files, file_1 and file_2.
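The sort phase can be sketched in a few lines of Python. This is only an illustration under stated assumptions: in-memory lists of runs stand in for the external files, the function name sort_phase is hypothetical, and "evenly" is taken to mean dealing the runs out alternately. The input is the 15-record sample file used in the simulation later in these notes.

```python
def sort_phase(records, run_size):
    """Split records into runs, sort each run, and deal the runs out alternately.

    Assumption: file_1 and file_2 are modelled as in-memory lists of runs.
    """
    file_1, file_2 = [], []
    for start in range(0, len(records), run_size):
        # Only one run is held "in main memory" at a time: internal sort.
        run = sorted(records[start:start + run_size])
        # Even-numbered runs go to file_1, odd-numbered runs to file_2.
        target = file_1 if (start // run_size) % 2 == 0 else file_2
        target.append(run)
    return file_1, file_2


records = [50, 110, 95, 10, 100, 36, 153, 40, 120, 60, 70, 130, 22, 140, 80]
file_1, file_2 = sort_phase(records, run_size=3)
print(file_1)  # [[50, 95, 110], [40, 120, 153], [22, 80, 140]]
print(file_2)  # [[10, 36, 100], [60, 70, 130]]
```

With an odd number of runs, file_1 ends up with one run more than file_2, which is why “evenly” is in quotes.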

The Merge Phase
Phase 2: The Merge Phase
For each pair of runs, one from file_1 and another from file_2, merge the pair into a longer run. Store the resulting run in a third external file, file_3.
Redistribute the runs in file_3 evenly to file_1 and file_2.
Repeat Phase 2 until all records are in one long run.
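The merge phase can be sketched as follows. As before, this is a simplified model and not a definitive implementation: lists of runs stand in for file_1, file_2 and file_3, the function names are hypothetical, and the standard library's heapq.merge does the 2-way merge of each pair. A run with no partner in the other file is copied to file_3 unchanged.

```python
from heapq import merge


def merge_pass(file_1, file_2):
    """Merge runs pairwise, one from each file, and store the results in file_3."""
    file_3 = []
    for i in range(max(len(file_1), len(file_2))):
        run_a = file_1[i] if i < len(file_1) else []  # no partner: empty run
        run_b = file_2[i] if i < len(file_2) else []
        file_3.append(list(merge(run_a, run_b)))
    return file_3


def redistribute(file_3):
    """Deal the runs of file_3 back out alternately to file_1 and file_2."""
    return file_3[0::2], file_3[1::2]


def merge_phase(file_1, file_2):
    """Repeat merge and redistribute until file_3 holds a single run."""
    while True:
        file_3 = merge_pass(file_1, file_2)
        if len(file_3) <= 1:
            return file_3
        file_1, file_2 = redistribute(file_3)


# Starting state: the runs produced by a sort phase with run size 3.
start_1 = [[50, 95, 110], [40, 120, 153], [22, 80, 140]]
start_2 = [[10, 36, 100], [60, 70, 130]]
result = merge_phase(start_1, start_2)
print(result)
```

Each merge pass roughly halves the number of runs, so the loop always ends with at most one run in file_3.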

Tips for Efficiency
  • As much of the sorting as possible should be done in main memory using an internal sort, because file accesses are slower than main memory accesses.
  • A run should be as large as the available space in main memory allows, limited by other data that must also be in main memory.
  • Each file should be on a separate device (such as a tape or disk) to allow easy access during the merge phase. The original file and file_3 may be assigned to the same device.
  • The output will be in file_3.

Algorithm Simulation (1)
File: 50 110 95 10 100 36 153 40 120 60 70 130 22 140 80
File Size = 15 records
Size of Run = 3 initially (usually a large number ≥ blocking factor)
Number of Runs = 5 initially
Sort Phase:
Pass 1: Group records into runs of the given size
50 110 95 - 10 100 36 - 153 40 120 - 60 70 130 - 22 140 80
3 records are in main memory at a time
Do an internal sort and distribute
File 1: 50 95 110 – 40 120 153 – 22 80 140
File 2: 10 36 100 – 60 70 130
File 3: empty
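The counts on this slide can be checked with a little arithmetic. The sketch below uses hypothetical helper names and assumes that each merge pass pairs up the runs, with an unpaired run carried over unchanged, so R runs become ceil(R / 2) runs.

```python
def initial_runs(file_size, run_size):
    """Number of runs after the sort phase: ceiling of file_size / run_size."""
    return -(-file_size // run_size)


def merge_passes(runs):
    """Number of merge passes needed to reduce the runs to a single run."""
    passes = 0
    while runs > 1:
        runs = (runs + 1) // 2  # pairs are merged, an odd run is carried over
        passes += 1
    return passes


print(initial_runs(15, 3))  # 5
print(merge_passes(5))      # 3  (5 runs -> 3 -> 2 -> 1)
```

In this simulation the three merges appear as Pass 2, Pass 4 and Pass 6, because each redistribution is counted as a pass of its own.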

Algorithm Simulation (2)
File 1: 50 95 110 – 40 120 153 – 22 80 140
File 2: 10 36 100 – 60 70 130
File 3: empty
Merge Phase:
Pass 2: Merge: 2 sets of 3 records are in main memory at a time
File 1: empty
File 2: empty
File 3: 10 36 50 95 100 110 – 40 60 70 120 130 153 – 22 80 140
Pass 3: Distribute: 3 records are in main memory at a time
File 1: 10 36 50 95 100 110 – 22 80 140
File 2: 40 60 70 120 130 153
File 3: empty

Algorithm Simulation (3)
File 1: 10 36 50 95 100 110 – 22 80 140
File 2: 40 60 70 120 130 153
File 3: empty
Pass 4: Merge: 2 sets of 3 records are in main memory at a time
File 1: empty
File 2: empty
File 3: 10 36 40 50 60 70 95 100 110 120 130 153 – 22 80 140
Pass 5: Distribute: 3 records are in main memory at a time
File 1: 10 36 40 50 60 70 95 100 110 120 130 153
File 2: 22 80 140
File 3: empty
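Pass 4 merges two runs of 6 records while only 2 sets of 3 records are in main memory. One way this could work is sketched below. The block size of 3, the function names and the in-memory output list (standing in for file_3) are assumptions made for the illustration; a real implementation would also write the output in blocks.

```python
def read_blocks(run, block_size):
    """Yield a run one block at a time, as if each block were one file read."""
    for start in range(0, len(run), block_size):
        yield run[start:start + block_size]


def buffered_merge(run_a, run_b, block_size=3):
    """Merge two sorted runs while holding only one block of each in memory."""
    blocks_a = read_blocks(run_a, block_size)
    blocks_b = read_blocks(run_b, block_size)
    buf_a, buf_b = next(blocks_a, []), next(blocks_b, [])
    i = j = 0
    output = []  # stands in for file_3
    while buf_a and buf_b:
        if buf_a[i] <= buf_b[j]:
            output.append(buf_a[i])
            i += 1
            if i == len(buf_a):  # buffer used up: read the next block
                buf_a, i = next(blocks_a, []), 0
        else:
            output.append(buf_b[j])
            j += 1
            if j == len(buf_b):
                buf_b, j = next(blocks_b, []), 0
    # One run is exhausted: copy the rest of the other run block by block.
    while buf_a:
        output.extend(buf_a[i:])
        buf_a, i = next(blocks_a, []), 0
    while buf_b:
        output.extend(buf_b[j:])
        buf_b, j = next(blocks_b, []), 0
    return output


merged = buffered_merge([10, 36, 50, 95, 100, 110], [40, 60, 70, 120, 130, 153])
print(merged)  # [10, 36, 40, 50, 60, 70, 95, 100, 110, 120, 130, 153]
```

The result matches the first run of File 3 after Pass 4.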

Algorithm Simulation (4)
File 1: 10 36 40 50 60 70 95 100 110 120 130 153
File 2: 22 80 140
File 3: empty
Pass 6: Merge: 2 sets of 3 records are in main memory at a time
File 1: empty
File 2: empty
File 3: 10 22 36 40 50 60...