What do you think of search algorithms

Algorithms. Search algorithms

Transcript

1 Algorithms Search algorithms Search in tables The standard case. As described in the introduction, the records to be searched are numbers. For example, an array could look like this: If I now search for the key 56, the search function must return 3 as index. Looking for 44 it could either return 5 or 6. Sometimes it is important to find the first occurrence of a key, in which case we would have to insist on 5. In addition, it is of course possible that a key cannot be found at all. The functions presented signal this by returning the index -1. There are two important algorithms, namely the linear and the binary search: Linear search If you do not know anything about the arrangement of the elements in the array - e.g. B. whether they might have been pre-sorted - you have no choice but to try the elements one by one until you have either found the element you are looking for, or until none are left. In order not to have to check every time whether you are already at the end of the array, we use a little trick: At the end of the array we add an element that corresponds exactly to the value we are looking for - this way we get a value every time found. Of course, you must then check whether the value found was a "real" hit or your own end mark. By the way, this end mark is called "Sentinel", which may bring back memories for some C64 veterans :-) To my embarrassment, I have to admit that this trick made me make an exception to the rule in the first program - it should remain the only one. Because the table variable is declared as follows: Table: Array [1..TableSize + 1] of Byte; The extension is of course used to accommodate the Sentinel. The LinSearch function receives the search key as a parameter and returns the index found: function LinSearch (Key: Byte): LongInt; Table [TableSize + 1]: = Key; Index: = 1; while Table [Index] <> Key do Inc (Index); if Index> TableSize then Index: = -1;

2 LinSearch: = index; As you can see, in the first step the sentinel is set, then the index is initialized to 1 and then increased until the character is found. Finally, it is checked whether the index is outside the valid range, which indicates that the Sentinel had to abort the search. In this case the index is corrected to -1 because, as agreed above, this is the value for "not found". It is obvious that the algorithm always finds the first occurrence of the key. If you don't like the Sentinel (as I said, there might be some C64 freaks), you can leave it out, but then have to expand the While line to check whether the array is already over. So this is linear searching (also called sequential searching). It doesn't have a very good reputation, but that doesn't change the fact that even the cleverest programmer has no choice if he doesn't have detailed information about the table. Binary search To speed up the search, it is advantageous if the elements in the table are sorted. This can be achieved either by one of the sorting methods described, or by inserting new elements in the right place. Both require additional effort, but save time searching. How do you use the order in the table? You just look for the middle element of the table and check whether it is larger or smaller than the element you are looking for. If it is larger or equal, you only need to search in the lower half of the table; if it is smaller, only in the upper half. At some point the size of the search area has shrunk to 1, and then either the element has been found or it does not exist. The following algorithm uses the variables L (inks) and R (real) to define the search area and Middle to mark the middle. If we want to continue searching above the middle, we simply drag L to Middle + 1, if we want to continue searching in the lower part, simply R to Middle: function BinSearch (Key: Byte): LongInt; L, R: LongInt; Middle: LongInt; L: = 1; R: = TableSize + 1; while L

3 Text Search This section is not about numbers, but about texts. The task is to find a shorter text (also called a "pattern") in a longer one, i.e. to determine its position. The text could e.g. For example: the farmer is building the farm If I am now looking for the pattern 'farm', position 20 must be returned; However, if I only search for 'construction', it will be found at position 1. If the pattern you are looking for does not occur at all, this must somehow be signaled as an error. Simple pattern search The first possibility is to implement the search very simply: You start at the first position in the text and see whether the first character matches the first character of the pattern. If so, continue with the second character and so on until either the end of the pattern is reached (i.e. it has been fully recognized) or there is a discrepancy between the pattern and the text. In this case, the same is repeated from position 2 and so on until the pattern is found or the text has been searched to the end without success. function Search (p, t: String): LongInt; Index, Sub Index: = 0; SubIndex: = 0; while (Index <= Length (t) - Length (p)) and (SubIndex

4 1. The character does not appear in the pattern. In this case, the pattern can be shifted to the right beyond this symbol. 2. The character is further to the front of the design (the last occurrence in the design is considered). In this case we can move the pattern to the right so that the character in the pattern is below the correct character in the text. 3. The symbol is further back in the design. In this case we can only move the pattern 1 to the right. Then the pattern is searched again from back to front. This is done until either the pattern has arrived at the end of the text (without being found), or the check has successfully passed through to the beginning of the pattern (i.e. it was found). The question now remains how to efficiently find the last position of a character in the pattern. This is usually done using a table in which the corresponding position is entered for each character and then only looked up later. The entry can be done in a simple pass, from front to back, whereby one occurrence of a character overwrites the previous occurrence. The position -1 stands for "not available". function BMSearch (Pat, Txt: String): LongInt; PosTable: Array [0..255] of LongInt; PatLen: LongInt; TxtLen: LongInt; Pat Position: LongInt; b: byte; if Pat = '' then BMSearch: = 0; exit; PatLen: = Length (Pat); TxtLen: = Length (Txt); for b: = 0 to 255 do PosTable [b]: = -1; for PatIndex: = 1 to PatLen do PosTable [Ord (Pat [PatIndex])]: = PatIndex; Index: = 0; while (PatIndex> 0) and (Index <= TxtLen - PatLen) do PatIndex: = PatLen; while (Pat [PatIndex] = Txt [Index + PatIndex]) and (PatIndex> 0) do Dec (PatIndex); if PatIndex> 0 then Position: = PosTable [Ord (Txt [Index + PatIndex])]; if Position = -1 then Inc (Index, PatIndex) else if Position> PatIndex then Inc (Index, 1) else Inc (Index, PatIndex - Position); if PatIndex = 0 then BMSearch: = Index + 1 else BMSearch: = 0;

5 Sorting algorithms The problem is simple and clearly defined: there is an array with elements that can be sorted according to their key, but whose current arrangement is not known (that of course also means that they could theoretically already be sorted or exactly in the reverse order). The sorting algorithm is supposed to put the elements in the correct order. One simple question, many answers: Bubblesort Bubblesort is one of the simplest sorting algorithms. The idea behind it is as follows: In the first pass, the array is run through from the end to the beginning and in each step the current element is compared with the next. If the rear one is smaller, the two are swapped, so that larger elements remain on the way and the smallest moves further forward. Once at the beginning, the smallest element of the array is at position 1, which is where it should go. In the next run, the second smallest element is searched for and passed through to the front. Logically, this run does not have to go all the way to the beginning, but only to index 2. In the third run, you only need to go to index 3 and so on until the process is completed. One can imagine that the smaller elements are "lighter" and therefore rise like bubbles - hence the name Bubblesort. Without further ado, here is the algorithm: const TableSize = 20; // array upper limit type TableArray: array [1..tablesize] of Byte; procedure BubSort (Table: TableArray); Run: LongInt; Index: LongInt; x: byte; for Run: = 2 to TableSize - 1 do for Index: = TableSize downto Run do if Table [Index]

6 Selection sort is also a very simple algorithm. The principle behind "sorting by selecting" is to go through the array from front to back and find the right element for each location. So first you look at the element at index 1. You then go through the array element by element and remember which is the smallest. This is then exchanged for the first one so that the smallest element is at the very front. In the next pass, one looks at index 2, again searches the array for the smallest element, whereby index 1 is of course left out, and puts the smallest at index 2. In the third pass, the first two elements are left out, and so on. It goes on and on until you have finally filled the penultimate element correctly. For obvious reasons the last element is already correct ... In Pascal it looks like this: procedure SelSort (Table: TableArray); Elem: LongInt; Smallest: byte; SmallInd: LongInt; for Elem: = 1 to TableSize - 1 do SmallInd: = Elem; Smallest: = Table [SmallInd]; for Index: = Elem + 1 to TableSize do if Table [Index] Elem then Table [SmallInd]: = Table [Elem]; Table [Elem]: = Smallest; In the Smallest variable, the algorithm notes the size of the smallest element so far, and its position in SmallInd. Both are first initialized on Elem, i. H. on the position to be filled. Then the rest of the array is searched, and if a smaller element comes up, the two are updated. Two peculiarities are to be considered with the swap process: On the one hand it is only carried out if a new position has actually been found; on the other hand, we do not need a buffer because the value of the new element is still in Smallest. Selection sort is faster than bubble sort, - in general one can assume a factor of 2 - but not particularly well suited for larger arrays. Insertionsort Insertionsort stands for "sorting by inserting". The underlying idea is simple: A part at the beginning of the array, it is assumed, is already sorted (at the beginning this part is of course 0 elements!). Now the first element is taken from the unsorted part and inserted in the correct place in the sorted part. The sorted part naturally grows by one element, but remains sorted, whereas the unsorted part shrinks by one element.

7 Our algorithm starts with the first element and then goes through to the last. For each element he now does the following: From his position he moves to the beginning as long as the elements - we are in the sorted part - are still larger than or equal to the element in question. In doing so, he pushes each element that he crosses one step back, so that he always pushes a gap in front of him. If he then finds a smaller candidate, he does not push the gap further, but puts his element into it. In this way, the algorithm combines the search for the correct position with moving the elements above. procedure InsSortLinear (Table: TableArray); Elem: LongInt; x: byte; for Elem: = 2 to TableSize do x: = Table [Elem]; Index: = elem; while (Index> 1) and (Table [Index - 1]> = x) do Table [Index]: = Table [Index - 1]; Dec (index); Table [Index]: = x; As you can see, the element is initially stored in x. Then move it until either an element is found that is smaller than x, or until you have reached index 1. In the latter case, you don't seem to have a smaller element in the sorted part and place the element in position 1. To find the correct position, this version of insertion sort uses a linear search. The idea is not far of speeding up the process by using a binary search. Moving the "underlying" elements must then be done separately. procedure InsSortBinary (Table: TableArray); Elem: LongInt; L, R: LongInt; Middle: LongInt; x: byte; for Elem: = 2 to TableSize do x: = Table [Elem]; L: = 1; R: = elem; while L x then R: = Middle else if Table [Middle]

8 Unfortunately, Insertionsort is not necessarily the high-flyer in this improved version either. Although it is the fastest of the "slow" methods, it clearly lags behind the faster algorithms. Shell sort For some, shell sort represents the ideal compromise between simplicity and speed, for others it is of no interest because faster processes have long been known. The only sorting algorithm named after its inventor still has a large following. The idea is as follows: The array is first divided into several groups, which are determined by the fact that there is the same distance between their members.Assuming you use the distance 3, then the elements 1, 4, 7, ... belong in a group, as well as 2, 5, 8, ... and 3, 6, 9, ... These groups are now individually sorted, then the distance is reduced and the procedure repeated. This is done until the distance is 1, so that in the last step there is no more subdivision. The question arises, according to which method the groups are sorted. The answer is not unlike Bubblesort, but a bit difficult to recognize. For each step, the variable Gap indicates the distance between the elements. The variable Index then runs from Gap + 1 to the end of the table. A sort run is started for each value of Index, with the index as the upper limit - which group is sorted depends on the index. The variable j is initialized to the next smallest element of the group (i.e. index gap), then it is compared with the next higher element and, if necessary, swapped. Then j goes one step down again (step size gap), and so on, until j <0. If Index has to hold the nth element of a group, all elements (of this group) are sorted below it. When the index has run through to the end, all groups are sorted. The only question left is which rules should be used to reduce the distance. In the example below it is always halved; However, measurements have shown that Shellsort is faster if the distances are always odd (heaven knows why), which is very easy to take care of. The step sizes would then be z. B. 63, 31, 15, 7, 3, 1. The Pascal source code itself is probably easier to understand than all the complicated explanations: procedure ShlSort (Table: TableArray); Gap: LongInt; x: LongInt; j: LongInt; Gap: = TableSize div GapDivisor; while Gap> 0 do for Index: = Gap + 1 to TableSize do j: = Index - Gap; while j> 0 do if Table [j] <= Table [j + Gap] then j: = 0 else x: = Table [j]; Table [j]: = Table [j + Gap];

9 Table [j + Gap]: = x; Dec (j, gap); Gap: = Gap div GapDivisor; Shellsort shows a remarkable speed, so that it could almost be counted among the "fast" algorithms. Unfortunately, it doesn't cope with very large arrays at all - these days you could say it scales poorly (be careful, bastard anglicism from hell!). Quicksort If a method is already called Quicksort, it arouses certain expectations, and you will not be disappointed: Quicksort is the fastest algorithm in most cases. What's behind it? The idea is quite simple: you pick any element of the array - for example the middle one - and divide the array into two groups: one with the elements that are larger than the selected one, and one with the smaller (or the same) elements. These two halves are then handed over to Quicksort in confidence. So it is a recursive function. Well, and at some point the arrays are only 1 element large and are trivially sorted. In order not to have to create new arrays every time, Quicksort works with a left and right border. To do this, of course, all elements that are larger than the one selected have to be moved to the lower part and the others to the upper part. This is what Quicksort actually does: procedure QSort (Left, Right: LongInt); i, j: LongInt; x, w: byte; i: = left; j: = right; x: = Table [(Left + Right) div 2]; repeat while Table [i] x do Dec (j); if i <= j then w: = Table [i]; Table [i]: = Table [j]; Table [j]: = w; Inc (i); Dec (j); until i> j; if Links i then QSort (i, Right); The selected element is first saved in x - in this case the middle. i and j are running indices that are initially positioned on the left and right. Then they march towards each other from both sides until they find an element that does not belong on their side. If you have not already overtaken yourself at this point (in which case it would be correct again), swap the two elements and continue. This repeats itself until the two have reached or overtaken each other. Then QSort is called recursively, once for the lower part, if one

10 such exist at all with more than one element, once for the upper part (ditto). Since the recursion does not take place in the innermost loop, it is relatively harmless, but of course it can be on the stack. Therefore, an iterative implementation is sometimes recommended, where you manage the stack yourself; Unfortunately, this is not particularly trivial. As mentioned at the beginning, Quicksort is usually the fastest search method. In the worst case, however, when the selected x is the largest (or smallest) element of the array every time, Quicksort unfortunately falls to the level of Bubblesort. In most situations this does not matter, but you have to be aware of it in real-time applications. Heapsort Heapsort is also a very fast algorithm, but on average not quite as fast as Quicksort. On the other hand, even in the worst case, it does not suffer from such a sag as this, so that Heapsort is particularly recommended for applications in which maximum times have to be adhered to. The way it works is unfortunately not entirely trivial (I think). You have to imagine that the elements of the array are actually stored in a binary tree (by the way, more information about trees can be found in the section on dynamic data structures). But do not worry, the pointer orgy that is common for binary trees is not coming, because there is a trick with which you can display (balanced) binary trees directly in the array. To do this, let's first imagine that element 1 of the array is the root of the tree. And further applies: Every element of the array, e.g. B. i, is a node of the tree, and its two successors are the elements 2 i and 2 i + 1. The following figure shows how the nodes are in the array: Such a tree is called a "heap" = Heap) if each node is larger than its successors. The rear half of the array - no matter what state it is in - is always a heap because the nodes no longer have any successors. If the back of the array (half or more) is a heap, I can expand it one element forward. So that the result is still a heap, I have to "sink" the element at the new node in the tree (to sift = seven), always swapping it with the larger of its successor nodes - until both successor nodes are smaller are.

11 This process is carried out by the Sift procedure; we will need it more often: procedure Sift (L, R: LongInt); i, j: LongInt; x: byte; i: = L; j: = 2 * L; x: = Table [i]; if (j Table [j]) then Inc (j); while (j <= R) and (Table [j]> x) do Table [i]: = Table [j]; i: = j; j: = j * 2; if (j Table [j]) then Inc (j); Table [i]: = x; As you can see, Sift gets a left border, where it starts, and a right border, where it ends - even if the array might go further (it's still a little unclear what this is good for: patience!). The value of the element to be "donated" is stored in x, i is the current node and j the successor node - either 2 i or, if greater, 2 i + 1. The first part of Heapsort is exactly as described by half starting to expand the heap forward until the whole array is a heap. The second part, the actual sorting, is actually no longer difficult to understand: The first element of the array is the largest element because it is the root of a heap. So it belongs at the end of the array. Let's just put it there by exchanging it with the element there. This is now at position 1. We let it sink in with the help of Sift - of course only up to the penultimate element, we want to leave the last one alone - and then have the maximum again at position 1 (because the nodes of the second level are also the Maxima of their Unterheaps are). We now swap this with the element in the penultimate position, let the new number 1 sink in, etc. etc. When we have occupied the whole array, we are done, and because the array has been filled from below with the largest elements in each case, it is sorted - voilà! The heapsort algorithm in Pascal, the two parts are separated by a blank line: procedure HSort; L, R: LongInt; x: byte; L: = (TableSize div 2) + 1; R: = TableSize; while (L> 1) do Dec (L); Pen (L, R); for Index: = R downto 1 do x: = Table [1]; Table [1]: = Table [Index]; Table [Index]: = x;

12 pen (1, index - 1); Now it is also clear why Sift is also aware of a right limit. One possible improvement is to write the Sift code directly into the heapsort algorithm. This shouldn't be a big problem, just be careful with the variable names. Quite a lot of effort for a sorting algorithm, you could say. On the other hand, once you understand it is not so bad. And the trick of putting binary trees into arrays is also something you can remember. Bucketsort Up until now it sounded as if Quicksort and Heapsort were the fastest search methods, it might seem surprising, but the rarely encountered bucket sort promises to be even faster. There is - of course - a catch: Bucketsort requires additional space. The greater the number of possible keys, the more so. Why this is so becomes immediately clear. The principle is simple: In an additional array called a bucket, there is a "compartment" for every possible key. Then the array table is traversed from front to back, and each element is placed in its compartment. Then you only need to copy the contents of the compartments one after the other back into the array; The subjects should of course be in the correct order. The array is then sorted. Unfortunately, a number of problems arise. B. Do I store multiple items in one field of the bucket array? These elements do not necessarily have to be the same, they just have the same sorting key! There are several solutions to this, but the most obvious one is to use a linked list. On the other hand, you often want to sort strings, but even strings of length 10 have 2610 = possibilities. I would rather not start with standard strings of length 255 ... Sorting LongInts with their 32-bit length should not be much fun either. One way out is to split the key into its individual elements and sort them several times; In the case of strings, the last character is sorted first, then the penultimate character, and so on. Numbers can be sorted according to individual bits. To clarify the algorithm, I want to take a much simpler example: Bytes are sorted, so the range within which they can take values ​​is 0 to 255. A strong simplification is that the numbers only consist of the key can, and two elements with the same key are indistinguishable. Therefore, they do not need to be stored in the compartments - the index of the compartment is already the value, and the compartment simply counts how often the relevant number has already occurred. This is the number of times it is repeated when it is written back to the array. And this is what the procedure looks like: procedure BuckSort; Buck for BuckIndex: = 0 to 255 do Bucket [BuckIndex]: = 0; for Index: = 1 to TableSize do Inc (Bucket [Table [Index]]); Index: = 1;

13 BuckIndex: = 0; while Index <= TableSize do while Bucket [BuckIndex] = 0 do Inc (BuckIndex); Table [Index]: = BuckIndex; Dec (Bucket [BuckIndex]); Inc (index); In this version, Bucketsort is pretty fast. Another application where it shines is e.g. B. Sorting strings by first letter. Overall, however, the speed advantage takes its toll in the form of all sorts of inconveniences that are spared with Quicksort and Heapsort. This is also the reason why Bucketsort is mostly only used in special cases. Code and texts were kindly made available to us by Sebastian Koppehel (. Delphi-Source.de Basics Copyright Martin Strohal and Johannes Tränkle