Geospatial indexes
How to efficiently store and retrieve spatial data: Quadtrees data structures and Geohash encoding.
Hi Friends,
Welcome to the 59th issue of the Polymathic Engineer newsletter.
This week, we discuss specific data structures databases use to retrieve data efficiently based on their geographic location.
Such data structures are called geospatial indexes and are commonly used in applications that manage 2D spaces, like Google Maps or other mapping services.
The outline will be as follows:
what are geospatial indexes
quadtrees
geohash
posts that made me think
Geospatial Indexes
Indexing is a technique used extensively by databases to optimize data retrieval and searching. The benefit of using an index is that it can locate a specific piece of data without scanning through all the existing ones.
The problem is that the traditional indexing approach is not working well with data recorded with its geographical location. Such data is called geospatial data, and you can think of it as an ordinary database record with additional columns specifying the latitude and longitude of the place where this data is associated.
Since geospatial data has two dimensions, it’s tricky to implement a query to retrieve all the records close to a specific place efficiently. Indeed, building indexes on those columns would not significantly improve the performance of the query because the number of records returned for each dimension could still be huge.
Geospatial indexes are data structures that address this problem, enabling you to perform spatial queries efficiently, such as finding all records within a certain distance from a point or falling within a specific geographic area.
There are two main categories of geospatial indexes: tree-based and hash-based. We will check a specific example for each category in the following sections.
Quadtrees
A quadtree is a tree-based data structure that efficiently manages 2D spaces like maps. For example, Google Maps includes it as part of its SDK library to locate landmarks and points of interest in a given area.
Each node of a quadtree divides the 2D space into four quadrants. The root node represents the entire map. Then, the division in nodes continues recursively until each quadrant meets specific criteria.
A standard stopping criterion is that a quadrant contains a certain number of points of interest.
Finding points of interest becomes very efficient with quadtrees:
1. Build the quadtree in memory
2. Start searching from the root node, moving down to the relevant leaf node
3. Include the points of interest of the leaf node in the result
4. Expand the search to neighboring quadrants if the points are not enough
The footprint of each node is small. Nodes need only to store the pointers to the 4 children plus the top-left and bottom-right coordinates of the quadrant. This means that a quadtree can fit in the memory of a single server.
However, constructing a quadtree with a large dataset is an expensive operation that can take up to several minutes.
So, even if a quadtree would fit in the memory of a single server, it's a good idea to have different servers handle updates without downtime. Having different servers with copies of the quadtree allows you to update and restart them one at a time.
Some advantages of using quadtrees are:
Adaptive Resolution. Quadtrees divide the space into smaller regions based on the data distribution. This means that areas with higher data densities can have finer resolution without affecting sparser areas. This leads to more efficient use of space and potentially better query performance in those areas.
Range Queries. Quadtrees are particularly well-suited for range queries and nearest-neighbor searches because they can quickly eliminate large areas that do not intersect with the query region.
The main drawback of quadtrees is that implementing them can be quite complex due to the need to manage a hierarchical tree structure and potentially balance it for optimal performance.
Geohash
Geohash is a convenient way of expressing a location using a short alphanumeric string, with greater precision obtained with longer strings. The main idea is to encode 2D geographical locations into 1D strings of letters and digits.
The algorithm works by splitting the map into equal-sized units called cells. Each cell has a specific hash assigned to it and is recursively divided into smaller and smaller subcells.
At each recursion step, the hash assigned to each subcell is appended to the hash assigned to the parent cell containing it. The process stops when the size of a subcell reaches the precision desired.
A property of the Geohash algorithm is that nearby locations generally have similar prefixes. However, the other way around is not always true. Two places located at the boundary of two cells can be very close but have no shared prefix.
Some advantages of using geohash are:
Simplicity and Uniformity. Geohash encoding is straightforward, resulting in a single-string representation for each spatial data.
Compatibility with Existing Systems. Since geohashes are represented as strings, they can be easily stored and indexed using standard database and text processing tools.
Efficient updates. Updating the data by adding or removing locations is easy since it only requires adding or removing indexed records.
The main drawback of geohash is that the precision of each cell is static and cannot be dynamically adjusted.
Posts that made me think
People look for entertainment more than education. They prefer to watch someone on YouTube living their dream than learn how to achieve it.
I always suggest to junior or aspirant developers or people to spend most of their time learning programming fundamentals. Once you learn to program well, you'll have access to every programming language invented.
I always reserve time to upskill. But that would be less effective without the right mindset. Try to learn as much as you can from every task you work on. For example, understanding why things are implemented in a certain way and thinking about how to improve them.
You can't use good skills to cover up bad work habits. If you lose your dedication, you'll lose momentum. And people who would pay you for your skills will no longer trust you.