Tries

How to efficiently store and represent strings using a trie data structure.

Jun 01, 2024

Hi Friends,

Welcome to the 74th issue of the Polymathic Engineer newsletter.

This week, we discuss tries, a data type that efficiently represents and stores a set of words. Thanks to their properties, they have many practical applications, such as autocomplete word systems, spell checkers, and word game solvers.

The outline is as follows:

Introduction to tries
Trie nodes
How to implement trie APIs
Comparison with other data structures

Introduction

A trie is a tree data structure whose nodes store the letters of an alphabet. The term "trie" comes from the word retrieval because you can retrieve words by traversing a tree path.

The pronunciation of tries is like "try" to distinguish it from other "tree" structures.

The basic properties of a trie are:

Each path from the root to the leaves forms a word.
All a node's descendants share a common prefix associated with that node.

Each node stores letters as links (or references) to other nodes. There is one link for each possible alphabetic value, so the number of child nodes in a trie depends entirely upon the size of the alphabet.

In the case of the English alphabet, the total number of children for each node is 26.

Since words share common prefixes, it is necessary to understand where a word in the trie ends. A standard solution is to associate a boolean field to each node, specifying whether the node corresponds to the end of a word.

Graphically, a trie can be conveniently represented as a tree where each edge is labeled with a single character.

The first letter of each word is represented by an edge connecting the root to one of its children.

The end of each word in the trie is represented by colored nodes, which are the endpoints of the edges labeled with their last character.

Trie Node

Each tree node contains two kinds of information: a boolean flag and a set of links or references to other nodes.

The flag marks whether a word ends in this node. Initializing the flag to false is convenient since most trie nodes do not represent a word's end.

The links can be commonly implemented using an array with as many elements as the alphabet letters. This solution is usually very fast but has the downside of taking up a lot of memory with empty links.

The alternative is to use a map with characters as the key and the link to the next node as the value. This is also the preferred way in dynamic programming languages like Python.

class TrieNode:
   def __init__(self):
       self.children = {}
       self.isEnd = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

APIs Implementation

A trie provides two basic operations: inserting a new word and searching for a given word.

To insert a new word, you must iterate simultaneously through its characters and the trie nodes. Starting from the root node and the first character of the new word c, you check whether the root contains a valid link corresponding to c.

If it is, you can directly move to the node pointed by the link and consider the next character of the new word as well. Otherwise, you must create a valid link to initiate a new node.

When the iteration through the characters of the given word ends, you also need to set the boolean flag of the current tree node to true to indicate that the word ends there.

def insert(self, word: str) -> None:
    root = self.root
    for c in word:
        if c not in root.children:
            root.children[c] = TrieNode()
        root = root.children[c]
    root.isEnd = True

To search a given word, you proceed similarly by iterating over its characters and moving along the nodes and edges of the trie.

The difference is that you don't create anything. If you try to move along an edge that doesn't exist, you return false.

When the iteration through the characters of the given word ends, you also need to check the boolean flag of the current node to see if an already inserted word ends there.

You can omit this last step if you only want to verify the existence of a given prefix instead of an entire word.

def search(self, word: str) -> bool:
    root = self.root
    for c in word:
        if c not in root.children:
            return False
        root = root.children[c]
    return root.isEnd

Regardless of the number of words stored in the trie, the time complexity for both the operation is always O(W) when W is the length of the input word.

Comparison with other data structures

Other data structures, like balanced trees and hash tables, allow you to store a dictionary of words and search for a word. So what are the advantages of using tries?

First of all, as we mentioned in the introduction, such data structures are inefficient at enumerating words in lexicographical order or working with common prefixes.

Instead, a trie allows you to answer the following questions efficiently:

Are there any words that start with a given prefix?
How many words begin with a given prefix?
What are all the words that start with a given prefix?

Another reason to use a trie instead of a hash table is potential memory advantages.

When several strings have a common prefix, a trie reuses the same nodes and edges for all of them, saving space relative to a hash table that continuously increases its size.

Moreover, a hash table needs to deal with hash collisions, and this could deteriorate its search time complexity. Of course, in the worst case, when all stored strings start with a different letter, there are no memory optimizations.

Food for thoughts

A things that should always be clarified during the preliminary calls with the recruiter or HR is the salary range. If the expectations were different, the whole interview process would be a waste of time.

LLMs, at first glance, do look like the AI from science fiction. You can have conversations with them, and they exhibit knowledge of many things. But the more you understand how they work, the more this illusion falls apart as their flaws become apparent. Link

The Polymathic Engineer

Discussion about this post

Ready for more?