In 2015, GeoSpock was still a young startup. About four people were working out of the living room of Steve, the founder, and what would become our next-gen geospatial database was initially being used to power “GiFi”, a location-based dating app.
The database was built on Google Cloud Platform’s NoSQL “Datastore.”
The team had been using the free tier for a while, which gave us 50k read operations a day. This was great, as the company did not have the backers it has these days and any kind of spending was hard to justify.
The GiFi app — now sadly defunct, but the underlying technology lives on
This couldn’t last, though: the application was starting to receive more traffic and automated testing was being added. Pretty soon, 50k read operations per day would no longer cut it. Despite the increased traffic, the company was not pulling in money from the app, so something needed to be done to keep costs down.
Other databases were considered, including perhaps hosting our own on physical servers kept at Steve’s house.
The basement where it all began
However, with such a small engineering team, that would have been a large time investment, especially given how tightly the database was tied into Google’s AppEngine at that time. While reviewing Datastore’s pricing section further, someone noticed the following:
Small datastore operations include calls to allocate datastore ids or keys-only queries. These operations are free.
This, of course, caught our attention, as free was exactly what we wanted our datastore to be.
We first needed to decide if this was useful information. Could we use the database while ignoring values entirely and relying only on the keys? This led to further questions:
Can we store data in the key?
Looking into the documentation, we found that the maximum key size was 6KB. Not huge, but we weren’t storing a huge amount of data in each row – just a few fields and short values.
The key itself is just a string, and we were only using about 50 bytes for our geospatial indexing. This left plenty of space to append extra information. We could simply map the data as key-values and encode them into JSON to be appended to the end of the key.
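As a rough illustration of the idea (not the original GiFi code), here is a minimal Python sketch of appending a JSON payload to an index prefix. The `|` separator, the function names and the example fields are all hypothetical; the 6KB limit is the figure quoted above.

```python
import json

MAX_KEY_BYTES = 6 * 1024  # key size limit quoted in this post
SEPARATOR = "|"           # hypothetical delimiter between index prefix and payload


def pack_key(index_prefix, fields):
    """Append the row's fields, encoded as compact JSON, to the index prefix."""
    if SEPARATOR in index_prefix:
        raise ValueError("index prefix must not contain the separator")
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    key = index_prefix + SEPARATOR + payload
    if len(key.encode("utf-8")) > MAX_KEY_BYTES:
        raise ValueError("key exceeds the maximum key size")
    return key


def unpack_key(key):
    """Split a packed key back into its index prefix and decoded fields."""
    index_prefix, payload = key.split(SEPARATOR, 1)
    return index_prefix, json.loads(payload)


key = pack_key("0123abcd", {"name": "Ann", "age": 30})
prefix, fields = unpack_key(key)
```

Because the payload sits after the index prefix, a keys-only query returns everything needed to rebuild the row without ever reading a value.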
Can we still query the key?
Our custom indexing at that time encoded geographical location, time and other searchable dimensions as string combinations, using techniques such as bit interpolation and term-reordering. These keys were then queried using prefix searches to return all results matching a set of dimensions, including a geographical area. As this change only appended additional information to the end of the key, it did not change our query pattern at all.
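To make the prefix-search point concrete, here is a small Python sketch. It assumes the bit technique works like interleaving the bits of two coordinates (a Z-order style encoding), and it stands in for the datastore with a sorted in-memory list; the function names, the `|` separator and the sample data are hypothetical rather than the real GeoSpock index.

```python
from bisect import bisect_left


def interleave_bits(x, y, bits=16):
    """Return a '0'/'1' string alternating the bits of x and y, most significant first."""
    out = []
    for i in range(bits - 1, -1, -1):
        out.append(str((x >> i) & 1))
        out.append(str((y >> i) & 1))
    return "".join(out)


def prefix_scan(sorted_keys, prefix):
    """Return every key in a sorted list that starts with the given prefix."""
    start = bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # matching keys are contiguous in sorted order
        matches.append(key)
    return matches


# Keys are an index prefix followed by an appended payload.
keys = sorted([
    interleave_bits(3, 5, bits=3) + '|{"n":1}',
    interleave_bits(3, 4, bits=3) + '|{"n":2}',
    interleave_bits(7, 7, bits=3) + '|{"n":3}',
])
nearby = prefix_scan(keys, "0110")
```

The scan only ever compares the leading characters of each key, so whatever is appended after the index prefix has no effect on which rows a prefix search returns.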
Does it perform well?
Back in those days, our bottlenecks were around query patterns rather than the backing technology. We were still dealing with GB-scale databases, not the hundreds of TBs we now need to query. Because the number of rows queried and returned was the same, there was no noticeable performance difference with larger keys. It’s possible that if we went back, did the same now and benchmarked the performance, we might find the change unacceptable. There was no noticeable change to the user experience of the app though, which was our main concern at the time.
Great! What we now had was a key-value store which we could read from as much as we wanted while keeping our AppEngine (GAE) bill for that component at zero.
How far we’ve come since then…
We also upsized out of the basement — at least until 2020 happened!
We’re a long way from 2015 now – lots of things have changed, while others haven’t.
We moved away from using Google and AppEngine. We needed more performance and decided building on instances that we have full control over was a better choice. For a long time we were using Amazon’s EMR and HBase to make use of the S3 storage solutions they had for cost saving, but sadly we ran into limits there too (particularly for HBase and EMR).
More recently we have made the move to a fully custom and containerised solution, and have set our eyes on having another GCP offering in the future (along with other platforms such as Azure and even “on-prem”).
For all the changes though, we still like to engineer to make things both more performant and cheaper to run. We know data will only get bigger and costlier to handle over time so cost-performance remains one of our core considerations in building our systems, whatever platform we end up building on.
James Greeman is Lead Software Engineer at GeoSpock