Spatial
Calcite is aiming to implement OpenGIS Simple Features Implementation Specification for SQL, version 1.2.1, a standard implemented by spatial databases such as PostGIS and H2GIS.
We also aim to add optimizer support for spatial indexes and other forms of query optimization.
Introduction
A spatial database is a database that is optimized for storing and query data that represents objects defined in a geometric space.
Calcite’s support for spatial data includes:
- A GEOMETRY data type and
sub-types including
POINT
,LINESTRING
andPOLYGON
-
Spatial functions (prefixed
ST_
; we have implemented about 35 of the 150 in the OpenGIS specification)
and will at some point also include query rewrites to use spatial indexes.
Enabling spatial support
Though the GEOMETRY
data type is built-in, the functions are not enabled by
default. You need to add fun=spatial
to the JDBC connect string to enable
the functions. For example, sqlline
:
$ ./sqlline
> !connect jdbc:calcite:fun=spatial "sa" ""
SELECT ST_PointFromText('POINT(-71.064544 42.28787)');
+-------------------------------+
| EXPR$0 |
+-------------------------------+
| {"x":-71.064544,"y":42.28787} |
+-------------------------------+
1 row selected (0.323 seconds)
Query rewrites
One class of rewrites uses
Hilbert space-filling curves.
Suppose that a table
has columns x
and y
denoting the position of a point and also a column h
denoting the distance of that point along a curve. Then a predicate involving
distance of (x, y) from a fixed point can be translated into a predicate
involving ranges of h.
Suppose we have a table with the locations of restaurants:
CREATE TABLE Restaurants (
INT id NOT NULL PRIMARY KEY,
VARCHAR(30) name,
VARCHAR(20) cuisine,
INT x NOT NULL,
INT y NOT NULL,
INT h NOT NULL DERIVED (ST_Hilbert(x, y)))
SORT KEY (h);
The optimizer requires that h
is the position on the Hilbert curve of
point (x
, y
), and also requires that the table is sorted on h
.
The DERIVED
and SORT KEY
clauses in the DDL syntax are invented for the
purposes of this example, but a clustered table with a CHECK
constraint
would work just as well.
The query
SELECT *
FROM Restaurants
WHERE ST_DWithin(ST_Point(x, y), ST_Point(10.0, 20.0), 6)
can be rewritten to
SELECT *
FROM Restaurants
WHERE (h BETWEEN 36496 AND 36520
OR h BETWEEN 36456 AND 36464
OR h BETWEEN 33252 AND 33254
OR h BETWEEN 33236 AND 33244
OR h BETWEEN 33164 AND 33176
OR h BETWEEN 33092 AND 33100
OR h BETWEEN 33055 AND 33080
OR h BETWEEN 33050 AND 33053
OR h BETWEEN 33033 AND 33035)
AND ST_DWithin(ST_Point(x, y), ST_Point(10.0, 20.0), 6)
The rewritten query contains a collection of ranges on h
followed by the
original ST_DWithin
predicate. The range predicates are evaluated first and
are very fast because the table is sorted on h
.
Here is the full set of transformations:
Description | Expression |
---|---|
Test whether a constant rectangle (X, X2, Y, Y2) contains a point (a, b) Rewrite to use Hilbert index |
ST_Contains(ST_Rectangle(X, X2, Y, Y2), ST_Point(a, b))) h BETWEEN C1 AND C2 OR … OR h BETWEEN C2k AND C2k+1 |
Test whether a constant geometry G contains a point (a, b) Rewrite to use bounding box of constant geometry, which is also constant, then rewrite to Hilbert range(s) as above |
ST_Contains(ST_Envelope(G), ST_Point(a, b)) ST_Contains(ST_Rectangle(X, X2, Y, Y2), ST_Point(a, b))) |
Test whether a point (a, b) is within a buffer around a constant point (X, Y) Special case of previous, because buffer is a constant geometry |
ST_Contains(ST_Buffer(ST_Point(a, b), D), ST_Point(X, Y)) |
Test whether a point (a, b) is within a constant distance D of a constant point (X, Y) First, convert to buffer, then use previous rewrite for constant geometry |
ST_DWithin(ST_Point(a, b), ST_Point(X, Y), D)) ST_Contains(ST_Buffer(ST_Point(X, Y), D), ST_Point(a, b)) |
Test whether a constant point (X, Y) is within a constant distance D of a point (a, b) Reverse arguments of call to ST_DWithin , then use previous rewrite |
ST_DWithin(ST_Point(X, Y), ST_Point(a, b), D)) ST_Contains(ST_Buffer(ST_Point(X, Y), D), ST_Point(a, b)) |
In the above, a
and b
are variables, X
, X2
, Y
, Y2
, D
and G
are
constants.
Many rewrites are inexact: there are some points where the predicate would return false but the rewritten predicate returns true. For example, a rewrite might convert a test whether a point is in a circle to a test for whether the point is in the circle’s bounding square. These rewrites are worth performing because they are much quicker to apply, and often allow range scans on the Hilbert index. But for safety, Calcite applies the original predicate, to remove false positives.
Acknowledgements
Calcite’s OpenGIS implementation uses the JTS Topology Suite. Thanks for the help we received from their community.
While developing this feature, we made extensive use of the PostGIS documentation and tests, and the H2GIS documentation, and consulted both as reference implementations when the specification wasn’t clear. Thank you to these awesome projects.