Why is spatial data so hard to deal with in a database, and what can be done about it?
Spatial data is, quite literally, everywhere. A sense of place is intrinsic to almost everything taking place across the planet. Despite this, being able to understand and contextualise spatial data at scale remains a huge challenge due to the demands it places on conventional database systems.
In this post I’ll give a brief overview of spatial data, what its main features and challenges are, and how (and why) GeoSpock built a database specifically to address those issues.
Everything happens somewhere
At the most basic level, nearly all forms of data recording physical world activities, operations and environmental conditions are spatial in nature. A sense of place is even implicit in contextualising data from completely static, unmoving sources. Think, for example, about a reading of 22°C recorded by a fixed position temperature sensor. 22°C could be:
- Completely unremarkable (inside an office or shopping centre);
- Alarmingly warm (at the South Pole); or
- Surprisingly cool (in the middle of the Sahara desert).
It all depends on where that reading was taken!
The relative position of the sensor compared to others is also important in understanding its significance – is the reading a local anomaly or part of a wider trend?
The context, interest and overall value of the data recorded by the temperature sensor are intrinsically linked to its position in space. The same is true for the vast majority of other activity or event based data recorded in the physical world today – if something happened, chances are it happened somewhere!
Sources of spatial data
In the last decade, the emergence of the Internet of Things has driven a new wave of data acquisition, much of it generated by automated sensors in the physical environment.
Like the temperature sensor in the example above, every sensor connected to the IoT has a physical location that is vital to placing the data it gathers in context. Historically, the development of the IoT has focussed on deploying and networking static sensors and on the largely time series data they create. Increasingly, however, huge volumes of spatial data are also being generated by dynamic, moving, connected devices.
Some of the most ubiquitous objects in the world produce spatial data at a vast scale. Phones, wearables and (increasingly) vehicles record and transmit spatial information in the form of GPS or similar signals for use in all manner of applications and analyses.
Data recorded by these mobile, connected devices are often incredibly granular in nature – sample intervals of a single second or less are not uncommon – and thus quickly accumulate huge volumes. At the top end of the scale, a single autonomous vehicle can produce up to 5 TB of data in a single hour of use!
These vast accumulations of high frequency spatial data are an incredibly rich resource, offering everything a typical time series dataset provides plus vital spatial context. However, this richness also makes spatial data at scale complex and challenging to work with.
The challenges of multidimensionality
The reason spatial data is both so useful and so complex lies in its multidimensional nature. To fully identify and contextualise a data point generated by a sensor or device in the physical environment requires knowledge of:
- The unique device identifier that created the data sample;
- The time that data sample was recorded; and
- The location of the device at the time the data point was recorded, in X, Y (and possibly Z) dimensions!
If any one of these is missing, it is impossible to fully contextualise the data in question.
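As a rough illustration, a single data point with these three dimensions could be represented as below. This is a minimal sketch only: the `Sample` class, its field names and the `is_fully_contextualised` helper are assumptions made for this example, not a GeoSpock schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Sample:
    """One spatial data point: which device, when, and where (hypothetical schema)."""
    device_id: str              # unique identifier of the device
    timestamp: datetime         # time the sample was recorded
    x: float                    # e.g. longitude
    y: float                    # e.g. latitude
    z: Optional[float] = None   # optional altitude


REQUIRED_FIELDS = ("device_id", "timestamp", "x", "y")


def is_fully_contextualised(record: dict) -> bool:
    """Return True only if the device, time and X/Y location are all present."""
    return all(record.get(field) is not None for field in REQUIRED_FIELDS)


pickup = Sample(
    device_id="taxi-42",
    timestamp=datetime(2020, 1, 1, 8, 30, tzinfo=timezone.utc),
    x=-73.985,
    y=40.748,
)
```

In the taxi analogy, a record that knows where but not when (or when but not which vehicle) would fail this check.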
To understand why, imagine you are waiting for a taxi. You’d have a hard time arranging the pickup if you knew where to wait but not when to be there, or when to be ready but not where you should wait! Equally, if you were in a busy area but you didn’t know the registration plate, or at least some other identifiable information, you might not know which taxi is assigned to you!
Of course, pickup is only the start of the story. Every journey also encompasses a trip route and drop off, and any other additional data that might be collected along the way. For a taxi that could be the number of passengers, information on the vehicle (type, colour, speed, brake force), or transaction details such as the fare and any time or location based surcharges. For other devices it might be environmental readings of pollution and temperature, or application information such as what app a person is using in a given location.
Data recording just a day’s worth of taxi journeys in NYC! (data and visualisation from https://kepler.gl/)
In the real world, there are tens of thousands of taxi rides made every day, millions of vehicle journeys taken, and billions of other activities performed. In all cases, knowledge of all three key dimensions listed above is vital to our full understanding of any related data produced.
A database is not just a data store
What does all this have to do with databases, I hear you cry? Well, the complex, multi-dimensional nature of spatial data places incredible demands on the techniques used within databases to manage and make data available for onward analysis. To understand why, it’s important to remember that an effective database isn’t just a store of data, but also makes that data accessible for use in other applications.
Querying makes the (data analytics) world go round
The act of interrogating a database to find subsets of data with specific characteristics is known as querying.
In some shape or form, querying is the critical starting point for nearly all data analytics: a question is posed to the database and the database ‘answers’ by returning data matching the criteria of the query. Along the way, how data is stored and signposted within the database plays a crucial role in governing how long it takes to return a result.
Searching a database in which data is jumbled and unordered takes a long time, because a query must scan the entire database to be sure it returns every matching data point. Imposing order on the data can dramatically cut query response times by reducing the amount of irrelevant data that must be searched. This becomes more important as data volumes rise and it becomes ever more inefficient to wade through reams of irrelevant data to find the answer.
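The difference between scanning unordered data and searching ordered data can be sketched in a few lines. This is an illustrative toy using a plain list of timestamps, not how any particular database works; the function names are invented for the example.

```python
from bisect import bisect_left, bisect_right


def scan_unordered(timestamps, start, end):
    """Unordered data: every value must be checked.

    Returns (matches, number of values examined).
    """
    matches = [t for t in timestamps if start <= t <= end]
    return matches, len(timestamps)


def search_sorted(sorted_timestamps, start, end):
    """Ordered data: binary search finds the edges of the matching range.

    Returns (matches, number of values in the matching slice). Locating the
    slice costs only about log2(n) extra comparisons.
    """
    lo = bisect_left(sorted_timestamps, start)
    hi = bisect_right(sorted_timestamps, end)
    return sorted_timestamps[lo:hi], hi - lo


readings = [5, 1, 9, 3, 7]
unordered_matches, examined = scan_unordered(readings, 3, 7)
ordered_matches, sliced = search_sorted(sorted(readings), 3, 7)
```

With five values the saving is trivial, but the unordered scan always examines the whole list, while the ordered search reads only the matching slice however large the list grows.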
A real world database example
To illustrate further, let’s take the real world analogy of a library. If books in the library were completely unordered, the task of finding anything specific would be extremely arduous. Even if you knew exactly what you were looking for, you would still have to search every single book, one by one, to find the one you wanted. That might be OK if the library only has 10 books, but what about ten thousand, or ten million, or 170 million like the Library of Congress?
Clearly, for the library to remain useful as an accessible store of data, a system is needed to keep search times manageable. A library can easily impose order on books based on their characteristics. Books can be ordered by the surname of the author, by genre, or both. In this way, if I wanted to find books by a certain author in a specific genre, I could go straight to the relevant portion of the relevant aisle, and would only have to search a fraction of the total number of books in the library. Even better, if the library provided handy maps or signposts to help guide me to the right point, I’d be able to get there even faster. My time spent ‘querying’ in the library would be vastly reduced.
Lisa knows the value of a good index! (from giphy.com)
The importance of indexes
The system providing structured signposting or maps of a database to improve data accessibility is known as an index, and is common in both physical and digital organisation systems. In reality, many libraries use the Dewey Decimal Classification to help book seekers find what they are looking for more quickly, and many textbooks include indexes to aid navigation within the pages of individual volumes. In the digital space, people make use of the indexes of web pages compiled by search engines every day to navigate their way around the world wide web.
Let’s consider indexing in the context of spatial data generated by connected devices. As discussed, spatial data is particularly challenging to deal with because precise definitions require knowledge across the multiple dimensions of space, time and device. This also makes spatial data highly voluminous (big!). Data extents can increase as the number of transmitting sensors grows, over time as new data samples accumulate, and over space as larger areas are encompassed by the dataset. That means it’s imperative to have an efficient way of searching the data, so as not to get lost in an ever expanding sea of data points.
Indexing across space, time and device
Whilst indexing is a widely employed concept within database design, when it comes to spatial data not all indexes are equal.
Indexes which only cover one or two data dimensions may be perfectly adequate for data like transaction records or purely time series events, but have limited utility on spatial data, where multiple dimensions are needed for precise characterisation. Queries that filter on a dimension the index does not cover cannot zoom straight in to the relevant subset, so large amounts of redundant data must be scanned in each and every query.
Just as in the library, these full table scans quickly become highly inefficient and costly as data volumes increase (and as already noted, spatial data is often very large in volume!). Therefore, to be able to search spatial data efficiently, I need a multidimensional index which matches the space, time and device properties of the data itself.
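To make the idea concrete, here is a toy index that buckets samples by grid cell and time window, so that a query only visits the buckets overlapping its area and period of interest. It is a simplified sketch for illustration and not a description of GeoSpock's index: the class name, the cell size and the bucket length are all assumptions, and in this toy the device is filtered after the bucket lookup rather than being part of the key.

```python
from collections import defaultdict
from math import floor

CELL_DEG = 0.01   # assumed grid cell size, in degrees
BUCKET_S = 3600   # assumed time bucket length, in seconds


class GridTimeIndex:
    """Toy space-time index: maps (cell_x, cell_y, time_bucket) to samples."""

    def __init__(self):
        self._buckets = defaultdict(list)

    @staticmethod
    def _key(x, y, t):
        return (floor(x / CELL_DEG), floor(y / CELL_DEG), int(t // BUCKET_S))

    def insert(self, device_id, t, x, y):
        self._buckets[self._key(x, y, t)].append((device_id, t, x, y))

    def query(self, x_min, x_max, y_min, y_max, t_min, t_max, device_id=None):
        """Visit only buckets overlapping the query box, then filter exactly."""
        cx0, cy0, ct0 = self._key(x_min, y_min, t_min)
        cx1, cy1, ct1 = self._key(x_max, y_max, t_max)
        results = []
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                for ct in range(ct0, ct1 + 1):
                    for rec in self._buckets.get((cx, cy, ct), ()):
                        d, t, x, y = rec
                        if (x_min <= x <= x_max and y_min <= y <= y_max
                                and t_min <= t <= t_max
                                and (device_id is None or d == device_id)):
                            results.append(rec)
        return results


index = GridTimeIndex()
index.insert("taxi-1", 1000, -73.985, 40.748)   # New York, first hour
index.insert("taxi-2", 1200, -73.984, 40.749)   # New York, first hour
index.insert("taxi-1", 90000, -73.985, 40.748)  # New York, a day later
index.insert("taxi-3", 1100, 2.35, 48.85)       # Paris, first hour

nearby = index.query(-73.99, -73.98, 40.74, 40.75, 0, 3600)
only_taxi_1 = index.query(-73.99, -73.98, 40.74, 40.75, 0, 3600, device_id="taxi-1")
```

The Paris sample and the sample from a day later sit in buckets the query never visits, which is the library's "go straight to the right aisle" behaviour applied to space and time together.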
How GeoSpock approaches indexing
At GeoSpock, our database is designed around the provision of this multidimensional index. When data is ingested into the database, a single index is created containing information on space, time and device characteristics. This provides the information needed to enable the database to facilitate high performance querying on complex spatial data.
Because the index lets a query bypass irrelevant portions of the data, the amount of data scanned for any given query is governed by the extent of the area of interest rather than by the total size of the dataset. Database performance is therefore retained even as overall volumes increase, and datasets with very large extents, for example encompassing nations or even the entire globe, can be efficiently queried for local insights without the need for partitioning, segregating or siloing data into smaller, isolated and discontinuous pieces.
Using parallelism to further accelerate querying
Whilst indexing across space, time and device is the primary driver of our ability to utilise spatial data at scale effectively at GeoSpock, there are also other complementary techniques which can be used to further improve performance.
The great promise of cloud computing has been the ability to parallelise workloads at scale. By harnessing the ‘power of the cloud’ it is possible to split the work of scanning the database into many chunks which can be run in parallel. This reduces the absolute time taken for a query (if not the number of CPU hours required), in the same way as getting a friend to help you out when looking for that library book. Additionally, a new wave of ‘accelerated analytics’ databases throws the power of GPUs and other exotic processing hardware at the problem, improving query response times by increasing the speed at which data is scanned.
Both of these techniques have their uses; GeoSpock already employs cloud based parallelisation to help speed up query performance even further. However, parallelisation and GPU acceleration remain essentially brute force approaches to the problem. While undeniably faster, both still suffer from the same inherent scalability issues as traditional systems – increasing the data volume housed within the database requires an increasing effort to be put into scanning that data. You may get to the end quicker, but you still did a lot of unnecessary (and often costly) work on the way. Whilst these approaches have their benefits, indexing is a much more elegant and efficient way of addressing the root cause of the problem in the first place.
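The point that parallelism shortens the wait without reducing the work can be seen in a small sketch. The function names here are invented for illustration, and a thread pool is used only to keep the example short and self-contained; for CPU-bound scans in CPython a real speed-up would need separate processes or distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_chunk(chunk, predicate):
    """Scan one chunk in full; returns (matches, rows examined)."""
    matches = [row for row in chunk if predicate(row)]
    return matches, len(chunk)


def parallel_scan(rows, predicate, workers=4):
    """Split rows into roughly equal chunks and scan them concurrently.

    Returns (matches, total rows examined across all workers).
    """
    size = max(1, -(-len(rows) // workers))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda c: scan_chunk(c, predicate), chunks))
    matches = [m for part, _ in parts for m in part]
    examined = sum(n for _, n in parts)
    return matches, examined


rows = list(range(100))
matches, examined = parallel_scan(rows, lambda r: r % 10 == 0, workers=4)
```

However many workers are used, `examined` still equals the size of the dataset: the work is shared out, not avoided.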
Impacts beyond the database
Because of the complexity of spatial data, many organisations are only just waking up to its full potential value.
The ones that have already done so often spend large amounts of time and energy performing data wrangling tasks simply to get their data into a usable state for further analysis. Even then they often put up with long query times and highly limited, rigid analytics which can’t be modified without redesigning the entire workflow or creating an additional bespoke database instance.
This rigidity is not just annoying and costly, it has a fundamentally negative impact on our relationship with data. Slow, expensive workflows make data exploration a chore, rather than a source of inspiration, and raise barriers to the innovation that comes from building new applications on the insights generated by swift data access. As more and more of the world gets connected, this problem will only become more acute.
Spatial data, as one of the most complex and rapidly scaling types of data, epitomises this issue, which is why it remains central to our motivation and philosophy at GeoSpock.
Mark Porter is Marketing Manager at GeoSpock