
Here is a video showing how the leaf blocks are visited to find all 2D (latitude/longitude) points inside the London, UK polygon based on the PlanetOSM data: In the 1D case, the query shape is simply a numeric range whereas in the 2D and 3D cases, it is a geo-spatial shape (circle, ring, rectangle, polygon, cube, etc.). There are k-d tree variants that can support removing values, and rebalancing, but Lucene does not need these operations because of its write-once per-segment design.Īt search time, the same recursion takes place, testing at each level whether the requested query shape intersects the left or right sub-tree of each dimensional split, and recursing if so.
#Improved overall apache lucene searching performance full
In the 1D case, this is simply a full sort of all values, divided into adjacent leaf blocks. However, unlike an ordinary k-d tree, a block k-d tree stops recursing once there are fewer than a pre-specified (1024 in our case, by default) number of points in the cell.Īt that point, all points within that cell are written into one leaf block on disk and the starting file-pointer for that block is saved into an in-heap binary tree structure. At index time, they are built by recursively partitioning the full space of N-dimensional points to be indexed into smaller and smaller rectangular cells, splitting equally along the widest ranging dimension at each step of the recursion.

This means you will finally be able to use Lucene to efficiently index and range-filter anything that can be encoded as fixed-length, ordered byte, such as IPv6 InetAddress, BigInteger, BigDecimal, etc., along with 2D and 3D (and higher!) geo-spatial indices, and times-series values.īlock k-d trees are a simple yet powerful data structure. The k-d tree variant we implemented is the block k-d tree which is specifically designed for efficient IO, such that most of the data structure resides in on-disk blocks, with a small in-heap binary tree index structure to locate the blocks at search time. This feature replaces the now deprecated numeric fields and numeric range query since it has better overall performance and is more general, allowing up to 8 dimensions (versus 1) and up to 16 bytes (versus the 8 byte limit today) per dimension.


As of this writing, Elasticsearch has not yet exposed points, but I expect that will change soon. Coming in Lucene's next major release (6.0) is a new feature called dimensional points, using the k-d tree geo-spatial data structure to offer fast single- and multi-dimensional numeric range and geo-spatial point-in-shape filtering.
