Dictionary

Python Dictionary 的好用大家都知道,反過來,我覺得用 Python 實作一個 Dictionary 是個複習資料結構的好方法。

瞄一下

瞄一下cpython怎麼寫,這邊有詳細解釋。

Hash table 優點

  • O(1) insertion
  • O(1) get
  • O(1) delete

但是要在 laod factor 在一定範圍內才能給出時間複雜度 O(1)的但書。

Load Factor & dynamic resizing

https://en.wikipedia.org/wiki/Hash_table#Resizing_by_copying_all_entries

$$Load Factor = n/k $$

$$ k : bucket 數目,也叫 table size$$

$$ n : 資料筆數,k 通常略大於 n,留下(k-n)筆空白欄位。$$

實務上控制 Load factor 介於 0.3 至 0.7 ,時間 O(1)的同時 k 還小。

Dictionary Node

class DictionaryNode:
    """
    Node of a DictionaryLinkedList.
    """

    def __init__(self, key, value, me_hash):
        self.key = key
        self.value = value
        self.me_hash = me_hash
        self.collided = False

    def __repr__(self):
        return self.key.__repr__() + ": " + self.value.__repr__()

me_hash

key 的 hash 值,cpython get 的時候比對會用到

(ep->me_key == key ||
(ep->me_hash == hash && unicode_eq(ep->me_key, key)))

詳細原因可能要對編碼比較熟才看得懂,我存著 resize 就不需要重算一次。

collided

判斷是否 collide 過的 flag,若 delete 時 collided 為 True,告訴大家這 node 沒有值但需要往後尋找,若沒有 collide 過則代表沒有 collide 過,返回 indexError。

Dictionary Class

class Dictionary:
    """
    Dictionary are implemented by HashTable
    Using open addressing method.
    Hash using siphash. I can not find siphash
    """

    def __init__(self):
        self.used_entry = 0
        self.buckets = [None for x in range(0, 8)]

初始大小為 8 ,python 沒有辦法 override {} 的讀取, KVP 只能慢慢塞。

另外存取 used_entry 以免 len()需要跑 O(n)的 loop,insert/del 要記得 maintain。

Hash

仿照 cpython 用 siphash,據作者所述是一個快速的 hash ,用過 md5 也是可以的。

pip install siphash
from siphash import siphash24

Dictionary 的 me_hash()

def me_hash(self, key):
    """
    hash of key
    """
    return siphash24(b"0123456789ABCDEF", (str(key).encode("utf-8"))).hash()

Dynamic Resizing

bitmask (&) 運算

當 k (table size) 是 2 的 m 次方,則 hash(n) mod k 可以用 hash(n) & bitmask 取代,其中 bitmask 為 m 個 1 組成。

bitmask (&) in Python terminal

>>> def hash(key): return siphash24(b'0123456789ABCDEF',(str(key).encode('utf-8'))).hash()
...
>>> n = 'key_string'
>>>
>>> # m = 3, k = 8
...
>>> hash(n) & 0b111 == hash(n) & 7
True
>>> hash(n) & 0b111 == hash(n) % 8
True
>>>
>>> # m = 4, k = 16
...
>>> hash(n) & 0b1111 == hash(n) & 15
True
>>> hash(n) & 0b1111 == hash(n) % 16
True
>>>

Resizing 程式碼

def resize(self):
    """
    Should change the size if load_factor > 2/3 or load_factor < 2/3
    Should do nothing if load_factor between 1/3 and 2/3
    """

    def proper_size(n, k):
        pro_size = 8 if n <= 2 else 2 ** (int(n * 1.5)).bit_length()
        return pro_size

    used = self.used_entry
    old_size = len(self.buckets)
    new_size = proper_size(used, old_size)
    if old_size == new_size:
        return
    new_buckets = [None for x in range(0, new_size)]
    for i in range(old_size):
        if self.buckets[i] and (self.buckets[i].key is not None):
            key_me_hash = self.buckets[i].me_hash
            # & for bitwise operation (bitmask)
            entry = key_me_hash & (new_size - 1)
            while new_buckets[entry]:
                new_buckets[entry].collided = True
                entry = entry + 1 if entry < (new_size - 2) else 0
            new_buckets[entry] = self.buckets[i]
            new_buckets[entry].collided = False
    self.buckets = new_buckets

Magic Methods

實作 python 內建的 Magic Methods

# __setitem__
a[b] = c
# __getitem__
a[b]
# __len__
len(a)
# __delitem__
del a[b]

Open Addressing

這裡用的是最簡單的 open addressing,entry 被佔據就找下一個,比較簡單,但容易有連續一整攤都被 occupied 的情形發生,如果有興趣可以實作其他的 open addressing 算法。

Magic Methods 程式碼

    def __repr__(self):
        s = ''
        for i in range(len(self.buckets)):
            if self.buckets[i]:
                s = s + self.buckets[i].__repr__() + ', '
        return {} if not s else "{{{0}}}".format(s[:-3])

    def __getitem__(self, key):
        key_me_hash = self.me_hash(key)
        l = len(self.buckets)
        entry = key_me_hash & (l-1)
        while self.buckets[entry]:
            if self.buckets[entry].key == key:
                return self.buckets[entry].value
            if not self.buckets[entry].collided:
                raise KeyError(key)
            entry = entry+1 if entry < (l-2) else 0
        raise KeyError(key)

    def __setitem__(self, key, value):
        self.used_entry += 1
        self.resize()

        l = len(self.buckets)
        key_me_hash = self.me_hash(key)
        entry = key_me_hash & (l-1)
        while self.buckets[entry]:
            self.buckets[entry].collided = True
            entry = entry+1 if entry < (l-2) else 0
        self.buckets[entry] = DictionaryNode(key=key,
                                             value=value,
                                             me_hash=key_me_hash)

    def __delitem__(self, key):
        key_me_hash = self.me_hash(key)
        l = len(self.buckets)
        entry = key_me_hash & (l-1)
        while self.buckets[entry]:
            if self.buckets[entry].key == key:
                self.buckets[entry].key = None
                self.buckets[entry].value = None
                self.buckets[entry].me_hash = None
                return
            if not self.buckets[entry].collided:
                raise KeyError(key)
            entry = entry+1 if entry < (l-2) else 0
        raise KeyError(key)

    def __len__(self):
        counter = 0
        for i in range(len(self.buckets)):
            if self.buckets[i]:
                counter += 1
        return counter

完整程式碼

Github

結論

Python 自己內建的資料結構,很多都是直接用 c 寫的,速度上快很多,如果沒必要,還是直接用就好了…

References

$$ $$