Motivation
Happy new year everyone! I took a little break and now I'm back at it. The goal of this post is to introduce PostGIS, the Postgres extension for spatial data, for those who are interested in learning more about geospatial data. We will be covering the following topics:
a crash course on spatial data for non-GIS folks like me
what type of field is used to store spatial data in a database setting?
What's the data in the GIS domain?
At the top level there are two types of data in GIS: vector data and raster data.
Raster data is a fancy way of saying image data (stored as an array of pixels), while vector data describes the features on the image with points, lines, and polygons (each a collection of coordinate tuples). For example,
a river can be represented as a line (assuming it's fairly straight)
a bunch of hospitals can be represented as points (from an airplane's point of view)
a city boundary or a lake can be represented as a polygon (multiple points connected into a closed shape)
You can describe the real-world with both vector data and raster data, as illustrated below,
However, vector data and raster data are not mutually exclusive: you can layer vector data on top of raster data. For example, you can take a satellite image of a city like Toronto and overlay points for the locations of shawarma shops and lines for super congested highways, as illustrated below,
Let's compare the two types of data in the GIS domain,
Feature | Vector Data | Raster Data |
Description | a series of (x, y) points, or collections of points, that represent points, lines, and polygons; data is represented with vertices | an image made of pixels; data is represented with a grid |
Storage data type | a collection of arrays of tuples | a 2D array |
Data size | small, just an array of points | large, since it's stored as a high-resolution array |
Just like we have .csv, .json, and .parquet to store data, in the GIS world we have .shp, .geojson, and .kml to store vector data, and .tif, .jpg, and .png to store raster data. These formats are just the tip of the iceberg; for a complete view of all formats, please see here for vector formats, and raster formats.
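To make the vector formats a bit more concrete: GeoJSON and KML are just text encodings of the same points, lines, and polygons. The sketch below assumes a database where PostGIS is already enabled (setup is covered in the next section) and uses the built-in ST_AsGeoJSON, ST_AsKML, and ST_GeomFromGeoJSON functions. The coordinates are only an illustrative point.

```sql
-- Export one point (longitude, latitude in SRID 4326) as GeoJSON and as KML.
SELECT ST_AsGeoJSON(geom) AS as_geojson,
       ST_AsKML(geom)     AS as_kml
FROM (
  SELECT ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326) AS geom
) AS p;

-- Go the other way: parse a GeoJSON geometry and print it as WKT.
SELECT ST_AsText(
         ST_GeomFromGeoJSON('{"type":"Point","coordinates":[-73.9857,40.7484]}')
       ) AS as_wkt;
```

Shapefiles (.shp) are binary, so they are usually loaded with an external tool rather than a SQL function.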
Now, as data engineers, we are interested in how to load data into the database, and how to transform and query that data. Before all of that, you need to understand the geo data formats the database supports.
As for raster data, although you could store it in a database, it's typically recommended to keep it in separate blob storage and store only a link to it in the database, which avoids the performance penalties. So we won't talk about raster data in the context of databases and will focus on vector data in this post.
How vector data is stored in Postgres + PostGIS
PostGIS is an extension built on top of Postgres that adds support for spatial data. You can play around with it by getting a Docker container, or just get a Postgres database and run the following commands to install it:
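-- Enable PostGIS in the current database
CREATE EXTENSION IF NOT EXISTS postgis;
-- Confirm the installed version
SELECT PostGIS_full_version();
-- List the PostGIS-related extensions available on this server
SELECT * FROM pg_available_extensions WHERE name LIKE 'postgis%';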
Just like SQL has standards such as SQL-86, SQL-92, and SQL:2016, geospatial data also has standards. The most popular one is Simple Feature Access (SFA) by the Open Geospatial Consortium (OGC), and PostGIS is compliant with it.
PostGIS offers the following spatial data types in the database,
geometry (the type defined by the SFA standard)
geography (a PostGIS type for geodetic coordinates, topic of another day)
We will be focusing on the geometry data type in this post. I have decomposed the geometry data type into the following categories as illustrated below,
Point, LineString, and Polygon, which we discussed earlier, along with LinearRing (a closed, non-self-intersecting LineString), are all atomic geometry types, and the rest are collection types. Collection types are just collections of atomic types: for example, a MultiPoint is a collection of points, and a MultiPolygon is a collection of polygons.
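As a rough sketch of how these types look in practice, the query below builds one geometry of each kind from well-known text (WKT) and asks PostGIS what it is. The labels and coordinates are made up for illustration; ST_GeomFromText, ST_GeometryType, and ST_NumGeometries are standard PostGIS functions.

```sql
-- Build a few geometries from WKT and inspect their types.
-- Labels and coordinates are hypothetical, roughly around Toronto.
SELECT label,
       ST_GeometryType(geom)  AS geometry_type,  -- names come back prefixed with 'ST_'
       ST_NumGeometries(geom) AS parts           -- 1 for atomic types, N for collections
FROM (VALUES
  ('one hospital',  ST_GeomFromText('POINT(-79.38 43.65)', 4326)),
  ('a river',       ST_GeomFromText('LINESTRING(-79.40 43.64, -79.38 43.66, -79.36 43.67)', 4326)),
  ('a lake',        ST_GeomFromText('POLYGON((-79.40 43.63, -79.36 43.63, -79.36 43.65, -79.40 43.65, -79.40 43.63))', 4326)),
  ('two hospitals', ST_GeomFromText('MULTIPOINT((-79.38 43.65), (-79.39 43.66))', 4326))
) AS t(label, geom);
```

Note that the polygon's ring repeats its first coordinate at the end; that is what makes it closed.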
For example, let's say we want to investigate the radiation intensity at some points in a city with a nuclear power plant. We can create a table with the following schema,
-- Create a table with a geometry point column and a radiation_intensity column
CREATE TABLE sample_points (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255),
    location GEOMETRY(Point, 4326),  -- longitude/latitude in WGS84 (SRID 4326)
    radiation_intensity FLOAT
);
-- Insert some sample data; WKT points are written as POINT(longitude latitude)
INSERT INTO sample_points (name, location, radiation_intensity)
VALUES
    ('Point A', ST_GeomFromText('POINT(-73.9857 40.7484)', 4326), 50.5),
    ('Point B', ST_GeomFromText('POINT(-74.006 40.7128)', 4326), 40.2),
    ('Point C', ST_GeomFromText('POINT(-73.9819 40.7674)', 4326), 60.8),
    ('Point D', ST_GeomFromText('POINT(-73.9804 40.7382)', 4326), 55.3);
Querying the table (with ST_AsText(location) AS coordinates, so the points come back as readable text) gives something like this,
id | name | coordinates | radiation_intensity |
1 | Point A | POINT(-73.9857 40.7484) | 50.5 |
2 | Point B | POINT(-74.006 40.7128) | 40.2 |
3 | Point C | POINT(-73.9819 40.7674) | 60.8 |
4 | Point D | POINT(-73.9804 40.7382) | 55.3 |
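Here is a minimal sketch of queries you might run against sample_points. The casts to ::geometry and ::geography are there so the queries behave the same whether location was declared as geometry or geography, and the 3 km radius is an arbitrary value chosen for illustration.

```sql
-- Readable coordinates plus the SRID stored with each point.
SELECT id,
       name,
       ST_AsText(location)         AS coordinates,
       ST_SRID(location::geometry) AS srid,
       radiation_intensity
FROM sample_points
ORDER BY id;

-- Other sample points within 3 km of Point A, with the distance in metres.
-- Casting to geography makes ST_Distance and ST_DWithin work in metres
-- instead of degrees.
SELECT b.name,
       ST_Distance(a.location::geography, b.location::geography) AS metres_from_a,
       b.radiation_intensity
FROM sample_points AS a
JOIN sample_points AS b ON b.id <> a.id
WHERE a.name = 'Point A'
  AND ST_DWithin(a.location::geography, b.location::geography, 3000)
ORDER BY metres_from_a;
```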
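Now, there is more to it. PostGIS keeps track of spatial columns through two relations called geometry_columns and spatial_ref_sys. geometry_columns holds the metadata of every geometry column. In PostGIS 2.0 and later it is a view built from the system catalogs, so a new geometry column shows up there automatically without any triggers (geography columns appear in the sibling geography_columns view). spatial_ref_sys is a table, populated when the extension is installed, that stores the spatial reference systems (SRS) a geometry column's SRID can point to.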
spatial_ref_sys table
What's a spatial reference system (SRS)? It's a fancier version of a Cartesian coordinate system. Since we use vectors to represent the world, we need a reference coordinate system. It gets complex because the world is 3D but we project it to 2D. There are many different coordinate systems, and you can see the list of coordinate systems here. The points we inserted earlier use the WGS84 coordinate system, which is the coordinate system commonly used in GPS. The spatial_ref_sys table has the following columns,
Column Name | Description |
srid | Spatial Reference Identifier (SRID). PK for the table |
auth_name | Name of the organization defining the SRID. |
auth_srid | The SRID used by the defining organization. |
srtext | Well-known text (WKT) representation of the spatial reference system; it starts with the type of system (e.g., 'GEOGCS' for a geographic coordinate system). |
proj4text | PROJ (formerly Proj4) projection string representing the spatial reference system. |
Unfortunately, many authorities each created their own coordinate systems, and it's a mess. Some popular authorities are,
auth_name | Authority |
EPSG | European Petroleum Survey Group (EPSG) |
ESRI | Environmental Systems Research Institute (ESRI) |
SR-ORG | Spatial Reference Organization (SR-ORG) |
NAVD88 | North American Vertical Datum of 1988 |
NAD83 | North American Datum 1983 |
CRS84 | Common identifier for WGS84 coordinate reference system |
IGNF | Institut National de l'Information Géographique et Forestière (IGNF) |
IAU2000 | International Astronomical Union (IAU) |
CUSTOM | Custom-defined spatial reference systems |
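To see which of these authorities your own installation actually ships with, you can group spatial_ref_sys by auth_name. This is only a quick sketch; the set of authorities and the counts depend on your PostGIS version, so a stock install may list just a few of the names above.

```sql
-- Count the spatial reference systems defined by each authority.
SELECT auth_name,
       count(*) AS systems
FROM spatial_ref_sys
GROUP BY auth_name
ORDER BY systems DESC;
```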
In order to pinpoint where your data is, you need to know the tuple (auth_name, auth_srid). The most common standard is WGS84 (World Geodetic System 1984), which corresponds to ('EPSG', 4326) as the pair of auth_name and auth_srid.
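As a small illustration, the first query looks up the WGS84 row by its (auth_name, auth_srid) pair, and the second reprojects one longitude/latitude point into another system. SRID 3857 (Web Mercator, used by most web maps) is my choice of target for the example, not something the table requires.

```sql
-- Look up WGS84 by its authority name and authority code.
SELECT srid, auth_name, auth_srid, proj4text
FROM spatial_ref_sys
WHERE auth_name = 'EPSG'
  AND auth_srid = 4326;

-- Reproject a WGS84 point (degrees) into Web Mercator (metres).
SELECT ST_AsText(
         ST_Transform(
           ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326),
           3857
         )
       ) AS web_mercator_xy;
```

ST_Transform relies on the source geometry carrying an SRID, which is why the point is tagged with ST_SetSRID first.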
geometry_columns table
It's just metadata about the geometry columns in your database. All of its columns are explained here,
Column Name | Description |
f_table_catalog | Catalog (database) name containing the spatial table. |
f_table_schema | Schema name containing the spatial table. |
f_table_name | Name of the table containing the spatial column. |
f_geometry_column | Name of the spatial column within the table. |
coord_dimension | The coordinate dimension (2 for 2D, 3 for 3D). |
srid | Spatial Reference Identifier (SRID) of the geometry; matches srid in the spatial_ref_sys table, which is how the two are connected |
type | Geometry type (e.g., 'POINT', 'LINESTRING'). |
It's not a big deal, but it's good to know that it exists.
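If you are curious what it holds for the sample_points table from earlier, a sketch like the following should do. It queries both geometry_columns and its geography counterpart, geography_columns, since the table will appear in only one of them depending on how the location column was declared.

```sql
-- Metadata for geometry columns, joined to the reference system they use.
SELECT g.f_table_schema,
       g.f_table_name,
       g.f_geometry_column,
       g.coord_dimension,
       g.type,
       g.srid,
       s.auth_name,
       s.auth_srid
FROM geometry_columns AS g
LEFT JOIN spatial_ref_sys AS s ON s.srid = g.srid
WHERE g.f_table_name = 'sample_points';

-- The same information for geography columns.
SELECT f_table_schema,
       f_table_name,
       f_geography_column,
       coord_dimension,
       srid,
       type
FROM geography_columns
WHERE f_table_name = 'sample_points';
```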
Summary
In this post, we covered
a crash course on GIS data
how vector data is stored in PostGIS
PostGIS is pretty much the go-to database for GIS data because of its open-source nature, its compliance with the SFA standard, and most importantly the rich support from the community, with a bunch of geoprocessing functions built in to make your life easier. If you are a full-stack developer, the default stack to go to is PostGIS + GeoDjango or GeoDjango REST + Leaflet or React Leaflet.
If you are a data engineer who wants to utilize the power of spatial data in your OLAP workloads, your best bet would be BigQuery. It makes sense because Google has to handle a lot of geospatial data for Google Maps, Google Earth, etc.