# How does the hash-based text comparison work

## Hash based files

Transcript

1 hash-based files

2 Hashing Hash methods make it possible to find the address of a data record based on the value of a field Idea: Use of a hash function that maps the value of a search key into a range of container numbers in order to find the page with the data record In the ideal case: the hash function directly calculates the address of the data set h: {s 1, S 2,, S n} A, h (si) = address of the ith data set. Such functions are difficult to find: all possible search key values ​​must be known from the start It is impossible to maintain bijectivity for large files

3 hashing containers (buckets) Solution: collisions are allowed h (si) = h (sj), ij, h hash function key value hash function h container number Applies to two keys S 1 and S 2, that h (s 1) = h ( s 2), S 1 and S 2 are called synonymously. Formally, a hash function is a mapping h: SB, where S is a key set and B is a numbering of the n containers.Normally the number of possible elements in the key set is much larger than the number the container (S> B) the hash function is not injective, but should distribute the elements of S evenly over B.

4 Problems that occur with hashing Distribution problem After we have selected the hashing algorithm, we have no control over the distribution of the data in the disk space clustering problem if the records are not distributed evenly (too many records in one container and very few in others) overflow problem if the containers are not large enough, overflow can occur

5 Hash function Requirements for a good hash function: Fast evaluation Minimizes the number of collisions (distributes the data records evenly in the container) Let us assume that we want to distribute data records in 41 containers The probability: to distribute the 1st data record in an empty container = 41 / 41 distribute the 2nd data record in an empty container = 40/41 distribute the 3rd data record in an empty container = 39/41 ... distribute the first 8 data records in different containers = (41/41) * (40 / 41) * (39/41) * (38/41) * * (34/41) = <50%

6 Choice of a hash function Methods that are used to define a hash function: division method, center square method, multiplicative method, etc. Typical hash functions take into account the bit representation of the search key in order to calculate the hash value, e.g. For a string search key, you can add the binary representations of all characters and choose the sum as a parameter for the hash function

7 Choosing a hash function Division method h (k) = k mod N, where N is the number of containers If you choose N = 2 d, the last d bits of k are ultimately considered as a hash value is not close to a power of two) to ensure a good spread (affects all bits) Middle-square method Calculate the square of the search key value and choose a few digits from the center of the square

8 Choosing a hash function Multiplicative method 1. The key value k is multiplied by a number A 2. The whole part of the result from step 1 is cut off the result is mapped into the interval [0,1] 3. The result from step 2 is displayed with multiplied by the number of containers m and rounded down. Formally, the following applies: h (k) = m (k A mod 1) = m (k A k A) A good choice: A = (5 1) / 2 = or A = ( 3 5) / 2 =

9 Hash function - example search key value Toyota We take the first two characters To and calculate the alphabetical position Hash functions: Division method with N = mod 97 = 75 Middle square method: = take two middle digits Multiplicative method: 99 (mod 1) = 32 Why don't we use it directly 2015 as hash value? 4 digits possible values ​​the table with the hash values ​​would be pretty empty. In the example above, if we need 100 hash values, an overflow can occur

10 Strategies for collision handling / overflow handling Using open addressing in the event of a collision according to a fixed rule, looking for alternative free space in the hash table Using linked lists, each container contains a pointer to the overflow list Using a second hash function (double hashing), the second hash function is applied to the result of the first for a new one To get an address Pointers instead of storing data records are stored in the hash address: All pointers to synonymous data records Containers of addresses Pointers to the first data record (which then contains a pointer to the next, etc.) Linked lists of addresses

11 Static hashing A container consists of a primary page and possibly one or more overflow pages The number of primary pages is fixed from the start and the pages are stored sequentially on the hard disk (and never released). Given N containers, ranging from 0 to N-1 are numbered, then k is assigned to the container h (k) mod N

12 Static hashing with independent lists All synonymous data records are stored in a linked list. The hash file contains a list of N data records; Each data record is the head of a list of synonyms. The order of the data records in the hash file can be as follows: The order of the insertions Ascending order of the search key values ​​Descending order of the search frequency

13 Static hashing with interleaved lists No overflow pages Insert a data record with key value k: If the slot at address h (k) is free, then save the data record. If the slot is not free, then: Search from bottom to top for the first free slot and save the data set Add the slot to the end of the list that contains the slot h (k) Example

14 Static hashing with interleaved lists Deleting a data record with key value k: If the slot at address h (k) is free. Error message If the slot is not free, then: 1. Find and delete the data record (using the pointer) 2. Search, with the help of the pointers, a data record r with h (kr) = h (k) If there is such a data record, then move it to the current slot 3. Repeat step 2 for the new empty slot or copy the pointer of the empty one Slots in the previous slot in the list (if there is one)

15 Static hashing with interleaved lists Example: delete the data record with key value 23

16 Static hashing with open addressing The hash file only contains data entries (no pointers to other pages) A ​​free entry is searched for in the table for colliding keys. Linear probing: h (k), h (k) +1, h (k) +2 ,, N-1, 0 ,, h (k) -1

17 Static hashing with open addressing Insertion of a data record with key value k: If the slot at address h (k) is free, then save the data record.If the slot is not free, then look for a free slot at the addresses: h (k) +1, h (k) +2 ,, N-1, 0 ,, h (k) -1 Good for 75% occupancy Example:

18 Static hashing with open addressing Deletion of a data record with key value k: Problem: If you delete e.g. h 0 from the sequence h 0, h 1, h 2, h 2 can no longer be found Solutions: A. Replace the entry to be deleted with a guard (special code character). All operations are then adjusted: the search looks over the guard as if there was a valid value there. With an insert operation, the guard can be replaced by a new entry.

19 Static hashing with open addressing B. Delete the record and move the other records. Let i, j and p be addresses such that: i is the address of the data set we want to delete. There are no free slots between i and j h (kj) = h (kp) the data set at address j should be at address p There are the following cases: i j 0 pjpip N-1 i

20 Static hashing - summary Hash function distributes the data records over N containers (number is fixed) Static hashing is not efficient for a real database.A hash table that has been created cannot be expanded efficiently.If many insert operations are expected, there are two options: From the start A lot of space is reserved for the table, a lot of free space is free, as the primary pages are never released.Overflow chains that grow longer and longer can only be eliminated by changing the hash function and time-consuming reorganization of the table Hashing

21 Extensible hashing problem: The containers (primary pages) are full Solution: The file is reorganized and the number of containers doubled. Reading and writing of all pages is expensive, but idea: Use a directory of containers A new data record would have to be entered in an already full container so it is split up, no changes are necessary for the other containers and no overflow page is necessary. The container directory is much smaller than the entire file, duplication is much cheaper

22 Expandable hashing It is important how the hash function is adapted.The value h (x) is represented in binary and only one prefix of this binary representation is taken into account h (x) = dp, where dp is the binary representation, divided into two d indicates the position of the container in Directory an (p is currently not used) The size of d is called the global depth t. The local depth t of a container indicates how many bits of the key are actually used for this container

23 Extensible hashing If a container is full and needs to be split up, it is split up on the basis of a further bit of the previously unused part p If the global depth is not sufficient to be able to enter the reference to the new container, the directory must be doubled The directory is doubled if, after a container has been divided, the local depth is greater than the global depth

24 Extensible hashing - example of data pages To find the container for x, consider the last t bits from h (x) t = t = 2 h (5) = 5 = 101b in the container refers from 01 Global depth Local depth Directory container A Tray B Tray C Tray D

25 Extensible hashing - example Insert k: h (k) = 20 = 10100b Container 00 Double directory Local depth Global depth Container A Local depth Global depth Container A Container B Container B Container C Container C Directory Container D Container D Container A2 The division of Tray A Directory of Tray A2 The division of Tray A

26 Extensible hashing - example When inserting h (k) = 20 = 10100b: The last 2 bits 00 tell us that k belongs in container A or A2 The last 3 bits tell us in which of the two containers it belongs Global depth t the Number of bits needed to locate the container before insertion t = 2, after insertion t = 3 Local depth t of a container the number of bits actually used in the example t = 2 or t = 3 Since after insertion t > t is duplication of directory

27 Expandable hashing If the directory fits in the main memory, then an equality query can be answered with hard disk access, since there are no overflow pages.Otherwise, the respective directory page has to be loaded from memory, and a total of two page accesses are then required.Many data records with the same hash value can cause problems cause If data is deleted, it is possible to merge containers again or even to halve the directory. Compared to static hashing, it saves storage space (adapts dynamically to the storage space requirement)

28 Dynamic hashing The idea is the same as for extensible hashing, but a different type of directory structure is used. Directory structure in extensible hashing an array with 2 d containers, where d is the global depth. Directory structure in dynamic hashing directory tree

29 Dynamic hashing

30 Linear hashing idea: allows a hash file to grow and shrink dynamically without using a directory structure. This scheme uses a family of hash functions h 0, h 1, the range of values ​​of a function h i + 1 is twice as large as the range of the Predecessor function hi if hi maps an index entry to one of N containers, then h i + 1 forms the entry to one of 2N containers from overflow pages are used Allows a certain degree of flexibility when deciding when a container is divided The transition from a hash function hi to h i + 1 corresponds to doubling the directory for expandable hashing

31 Linear hashing - example size of the container: 4 Level: 0 (number of doubling cycles) Next container to be doubled: 0 Insert the following values: 37 = h (100000) 44 (101100) 36 (100100) 01 9 (1001) 25 ( 11001) 5 (0101) (1110) 18 (10010) 10 (1010) 30 (11110) (11111) 35 (100011) 7 (0111) 11 (1011)

32 Linear hashing - example size of containers: 4 Level: 0 Next container to be doubled: 0 Insert the following values: 37 =, 43 = h (100000) 44 (101100) 36 (100100) 01 9 (1001) 25 (11001) 5 (0101) 37 (100101) (1110) 18 (10010) 10 (1010) 30 (11110) (11111) 35 (100011) 7 (0111) 11 (1011) 43 (101011)

33 Linear hashing - example size of the container: 4 Level: 0 Next container to be doubled: 1 Insert the following values: 29 = h 1 h (100000) 01 9 (1001) 25 (11001) 5 (0101) 37 (100101) ( 1110) 18 (10010) 10 (1010) 30 (11110) (11111) 35 (100011) 7 (0111) 11 (1011) 43 (101011) (101100) 36 (100100)

34 Linear hashing - example size of the container: 4 Level: 0 Next container to be doubled: 2 -> etc. Until there is no more overflow h 1 h (100000) (1001) 25 (11001) (1110) 18 (10010) 10 (1010) 30 (11110) (11111) 35 (100011) 7 (0111) 11 (1011) 43 (101011) (101100) 36 (100100) (0101) 37 (100101) 29 (11101)

35 Hash-based indexes Advantages: Unbeatable when it comes to equality queries SELECT * FROM R WHERE A = k Faster access to data if you already know certain information (search key value) Other query operations that perform a lot of equality checks benefit from hash indexes

36 Hash-based indexes Disadvantages: There can only be one hash index on a search key (you have to select a hash method) The sequential order of the data records in the storage space has no meaning There can be blocks of empty slots in a file Area inquiries No support for inquiries where you know the value of a field other than the search key. Not recommended if the search key values ​​change frequently

37 Hash-based indexes in SQL Server In SQL Server, hash-based indexes can only be created for memory-optimized tables (see) SQL Server has a hash function that is used for all hash indexes. The hash function is deterministic: if the input key value is the same, it consistently outputs the same bucket slot. The hash function is balanced: the distribution of the index key values ​​to hash buckets / containers corresponds to a Poisson distribution (no uniform distribution). Strategy for collision handling using linked lists. Each container contains pointers to overflow lists

38 Hash-based indexes in SQL Server - example CREATE TABLE SupportIncidentRating_Hash (SupportIncidentRatingId int not null identity (1,1) PRIMARY KEY NONCLUSTERED, RatingLevel int not null, SupportEngineerName nvarchar (16) not null, INDEX ix_hash_supportengineName HUCKerET_COUNT (BUCKerET_COUNT) (SupportEngine =)) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);

39 Hash-Based Indexes in SQL Server Too few bins has the following disadvantages: More hash conflicts of unique key values. Each unique value is forced to use the same container with a different unique value. The average chain length per container increases. The longer the container chain, the slower the search for equality in the index. Too many bins has the following disadvantages: If the number of bins is too high, there may be more empty bins. Empty bins affect the performance of full index scans. If these run regularly, consider a number of containers that approximates the number of unique index key values. Empty containers occupy memory, but each container only occupies 8 bytes.

40 Hash-Based Indexes in SQL Server Monitor statistics for chains and empty buckets: SELECT * FROM sys.dm_db_xtp_hash_index_stats; Empty buckets: 33% is a good target value. Chains in buckets: chain length 1 is ideal, up to 10 is still okay

41 Hash-Based Indexes in SQL Server The performance of a hash index is: Excellent if the WHERE clause gives an exact value for each column in the hash index key. Bad if the WHERE clause is looking for a range of values ​​in the index key.Bad if the WHERE clause gives a specific value for the first column of a two-column hash index key, but no value is given for the second column of the key. See for several examples.