Testing Performance Of Materialized Views For Web Mapping From A Normalized Database
If you don’t know what materialized views are in PostgreSQL, you can read all about them in the official docs. For those of you who don’t want to read the docs, I’ll sum it up here.
Materialized views are essentially static snapshots of data created from a normal view-like query. Rather than a full view query, which usually involves multiple table joins for every SELECT statement accessing it, a materialized view results in a simple, single-table-like SELECT with none of the joins used to create the materialized view in the first place. It’s important to note that these are not tables… they are different. For example, unlike a normal table, you can refresh the contents of a materialized view with a simple command:
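A minimal sketch, assuming a materialized view named my_matview (the name is just a placeholder):

```sql
-- Re-run the stored query and replace the snapshot with fresh results.
REFRESH MATERIALIZED VIEW my_matview;
```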
It’s also important to note some basic pros, cons, and ideal use cases for materialized views before moving on.
Pros
They can be very fast compared to a normal view depending on how complex the original view is.
They don’t rely on indexes being current.
They can save on CPU and memory because basic access to the data no longer carries the complexity of the original query.
Cons
They can be slow to build initially depending on complexity and size of data.
Data captured in the materialized view can get out of date if not refreshed when appropriate for the applications using it.
They introduce a bit of data redundancy into the system, although it isn’t true redundancy because a materialized view uses a view-like query to extract a snapshot of a normalized database.
They result in a larger storage footprint because they store static snapshots of data.
Some ideal use cases
When you need to access data in a normalized database that requires lots of table joins over relatively large data sets, which would otherwise result in slow queries.
When the underlying data behind a view is updated at a rate that is significantly lower than the demand to do simple reads on that view.
The Scenario That Got Me Writing About This
At TerraFrame we build and maintain the open source data engine RunwaySDK, which utilizes an ontological approach to managing data (among other things). This basically boils down to managing data as objects with well-defined, human-like relationships between those objects. From a developer perspective, you can interact with location data in ways that make sense to a human while navigating a location relationship network. For example, you can easily determine what ancestors, children, or siblings a location might have… Oh, and you can do this even if your data doesn’t have any geometry.
So, to enable this level of flexibility we have a programmatic abstraction that necessitates a highly normalized database. When we build mapping-based apps on RunwaySDK we often access map layer information through database views. These views are very complex in order to ensure accuracy, keep the data current while it is being updated regularly through APIs or by other users, and maintain adherence to the ontological paradigm. The problem is that rendering map layers from these views can be slower than desired.
Given what I just said, why use views at all, you might ask?
Well, GeoPrism which is our primary open source application that utilizes RunwaySDK can have data of all types and sizes as well as multiple users modifying and/or viewing the data simultaneously. Additionally, users can create, filter, and style layers on the fly. Ensuring performant access to this constantly changing data can be a challenge and views offer the most obvious way to access regularly changing data in a normalized database. One potential opportunity for performance enhancement in this situation can be found by using materialized views. After a basic implementation we wanted to test some scenarios to determine their usefulness for improving performance of map layer tile caching as well as basic programmatic access in an app as dynamic as GeoPrism.
On To The Tests!
The use case I am testing for is pretty simple. We use GeoServer as a map server to render map layers from a database view (more details below). The query GeoServer makes to the database to generate the layer service is basically a simple SELECT statement. So for these tests we will be looking at two primary actions:
The performance of constructing the view and materialized view objects.
The performance of a basic SELECT statement used to access the view and materialized view.
A standard view in GeoPrism that would be used by GeoServer to create map layers looks like this:
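A heavily simplified, hypothetical sketch of that kind of query (all table and column names below are invented for illustration, not the actual GeoPrism schema):

```sql
-- Hypothetical layer view: join raw user data to geo-ontology locations,
-- localize labels, and aggregate values up to the requested hierarchy level.
CREATE OR REPLACE VIEW layer_view AS
SELECT ge.oid              AS location_id,
       ll.label            AS location_label,     -- localized display label
       SUM(ud.value)       AS aggregated_value,   -- user data rolled up to the location
       ge.geo_multipolygon AS geom                -- geometry GeoServer will render
FROM   geo_entity      ge
JOIN   located_in      li    ON li.parent_oid = ge.oid         -- ontology relationship
JOIN   geo_entity      child ON child.oid = li.child_oid
JOIN   user_data       ud    ON ud.geo_entity_oid = child.oid  -- raw user data
JOIN   localized_label ll    ON ll.key = ge.label_key
                            AND ll.locale = 'en_US'            -- language localization
GROUP  BY ge.oid, ll.label, ge.geo_multipolygon;
```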
As you can see the query powering a standard map layer is pretty complex. Keep in mind this is accessing raw user data, mapping it against locations stored as geo-ontologies, handling language localization for multi-language deployments, and aggregating user data up to whatever geographic hierarchy the user wants. So, there is a lot of value being provided through this complex query.
The query plan for this view is similarly complicated and reflects the process being run every time the view is accessed:
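If you want to see plans like this yourself, they come straight from PostgreSQL’s EXPLAIN, shown here against the hypothetical layer_view sketched above:

```sql
-- Estimated plan only:
EXPLAIN SELECT * FROM layer_view;

-- Or execute the query and report actual row counts and timings:
EXPLAIN ANALYZE SELECT * FROM layer_view;
```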
Let’s convert this standard view to a materialized view.
To convert this to a materialized view we simply modify the query declaration slightly:
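In other words, CREATE VIEW becomes CREATE MATERIALIZED VIEW while the SELECT body stays the same. A sketch using the hypothetical names from above:

```sql
-- Materialize the same query: the joins run now, and the result set is stored.
CREATE MATERIALIZED VIEW layer_matview AS
SELECT * FROM layer_view;   -- or repeat the full layer query body here
```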
This will create the materialized view, which is now a snapshot of the data. This means the above query is only run once unless the view is explicitly refreshed or re-created. Now every time we want access to this view the underlying query will be much simpler. Here is the query plan for accessing the new materialized view with “SELECT * FROM the_table”:
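Again, the plan can be inspected the same way, using the hypothetical names from above:

```sql
EXPLAIN SELECT * FROM layer_matview;  -- reduces to a simple scan over the stored snapshot
```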
If you haven’t used the query analysis tools in PostgreSQL to view query plans like this, it’s OK. I’ll spare you the technical details. The important thing to note is that the estimated cost of the query went from 10377.29..10636.76, which is pretty slow and has a lot of uncertainty, to 0.00..8.50, which is MUCH better.
OK… Now we have some confidence that the query is going to be faster when hit directly with a basic SELECT statement. This is a practical test for my use case because this is how GeoServer will be accessing the database to provide map services. After running the SELECT queries manually some interesting but not so surprising results emerged.
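For reference, one simple way to capture timings like those below is psql’s built-in \timing, again using the hypothetical names from the sketches above:

```sql
\timing on                       -- psql prints elapsed time after each statement

SELECT * FROM layer_view;        -- standard view: the joins run on every read
SELECT * FROM layer_matview;     -- materialized view: reads the stored snapshot
```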
Basic Test Results:
The Simple Query = SELECT * FROM the_table;
Resulting Row Count = 381 (pretty small)
Time to drop standard view = 14 ms
Time to create standard view = 15 ms
Time to SELECT results from standard view = 43 ms
Time to drop materialized view = 12 ms
Time to create materialized view = 51 ms
Time to SELECT results from materialized view = 31 ms
That’s about a 28% decrease in SELECT time
BUT
about a 240% increase in CREATE time (15 ms to 51 ms, roughly 3.4x)
This is a very basic test on a pretty small dataset, but I think it points to what we already suspected and hinted at above: to gain performance improvements from materialized views there is an up-front cost that must be paid. The question is whether that cost is worth the gain for the use case. In our case the answer is still maybe. More testing is needed. Some of the underlying tables queried are quite large but the resulting output is not. I plan to run this test again in the near future when I integrate some data that will test results in the tens to hundreds of thousands of records.