Overview

Installation

pip install tablecache[extras,of,your,choice]

Extras

tablecache comes with a few optional extras that pull in additional dependencies. Without any, there are no concrete implementations of DB and storage access, so you probably want at least two of these: one for DB access (postgres) and one for storage (local or redis).

postgres: Adds submodule postgres which provides a DbAccess implementation for Postgres using asyncpg
local: Adds submodule local which provides a StorageTable implementation storing records in local data structures (i.e. in Python)
redis: Adds submodule redis which provides a StorageTable implementation storing records in Redis
prometheus: Writes metrics using prometheus_client
test: Extra dependencies for testing tablecache
dev: Extra dependencies for developing tablecache
docs: Extra dependencies to build this documentation

Purpose

Suppose you have a relational database that’s nice and normalized (many tables), and you need to access the result of a large join. The DB can’t combine its indexes on the individual tables in a way that makes querying the join fast. But there would be a performant way to access ranges of records, if only you could easily materialize the join and then index that directly.

tablecache can take your big query and put the denormalized results, or a subset of them, in faster storage. You can define one or more ways to index these records for fast access to whole ranges of them. The classic example is timestamped data, where you want to keep the past few days in cache, and be able to quickly get records from a time range.

The contents of the cache can be adjusted, i.e. old records can be expired and new ones loaded. Any records not cached will be transparently fetched from the DB, so manual queries against the DB after a cache miss are not necessary.

Cached records can also be invalidated. This needs to be done when the data in the DB changes in order for the cache to continue reflecting the DB state. The cache only needs to be told which records have changed; the actual refresh is done automatically.

Non-goals

tablecache is not a database, only a performance layer for difficult-to-index result sets. It reflects the underlying DB which is the single source of truth. The storage used for caching is meant to be disposable. You’re supposed to be able to delete it all and deploy a new instance which loads the cache fresh from the DB.

The DB is not actively watched for changes. Once loaded, without any further interaction the cache will continue to return the same records, even if they change in the DB. It will load new records and refresh changed ones, but it needs to be told which ones (not how they’ve changed, it will go to the DB for that).

tablecache doesn’t implement caching strategies, i.e. it doesn’t decide which records should be in cache. That decision needs to be made externally. But again, once it has been made, records are evicted and loaded automatically.

tablecache is not a high-level tool that figures out how to index and query your data based on a declarative description (like a relational DB), and it’s not a low-level tool that gives you full control over how data is managed (like a use-case-specific custom cache). It is intended to sit in the middle of the abstraction spectrum, requiring the user to decide how data is accessed and indexed (by providing an Indexes implementation), and which records should be kept in cache in the first place, while taking over some of the tedious tasks like transparently handling cache misses by querying the DB and keeping track of invalid records.

Limitations

Each record must be uniquely identifiable by a primary key. These can be anything hashable though and are calculated in the user-supplied implementation of Indexes.primary_key(), so multicolumn primary keys are possible by e.g. returning tuples.
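
As a minimal sketch of a multicolumn primary key, the function below returns a tuple built from two columns. The record layout (a dict with tenant_id and item_id keys) is an assumption made for illustration, and the function is written standalone rather than as a method of a real Indexes subclass.

```python
def primary_key(record):
    """Build a hashable, multicolumn primary key from a record.

    Assumes (hypothetically) that records are dicts with "tenant_id" and
    "item_id" columns that together identify a record uniquely.
    """
    return (record["tenant_id"], record["item_id"])


records = [
    {"tenant_id": 1, "item_id": 7, "name": "a"},
    {"tenant_id": 1, "item_id": 8, "name": "b"},
    {"tenant_id": 2, "item_id": 7, "name": "c"},
]

# Tuples are hashable, so they can be used directly as dictionary keys.
by_key = {primary_key(record): record for record in records}
```

Any other hashable value (a string, an integer, a frozenset) would work the same way, as long as it is unique per record.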

Indexing is done by associating each record with one or more scores (one per index) and storing them in a sorted data structure that makes it fast to access ranges of scores. Sets of records can be accessed quickly if they have similar scores, e.g. it’s possible to index a timestamp and then get all records in a time range. Indexing more than one attribute is also possible by interleaving scores, but requires some tweaking. However, only one index can be used per read operation; multiple indexes cannot be combined in a single read.
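
The idea of score-based range access can be illustrated with a toy sorted index built on the standard library's bisect module. This is only a sketch of the general technique, not how tablecache's storage implementations work internally; the ScoreIndex class and the ts_score helper are hypothetical names.

```python
import bisect
from datetime import datetime, timezone


class ScoreIndex:
    """Toy sorted index mapping scores to primary keys (illustration only)."""

    def __init__(self):
        self._entries = []  # kept sorted as (score, primary_key) tuples

    def add(self, score, primary_key):
        bisect.insort(self._entries, (score, primary_key))

    def range(self, low, high):
        """Return primary keys of all entries with low <= score < high."""
        # A 1-tuple (x,) sorts before every (x, primary_key), so bisect_left
        # finds the first entry whose score is >= x.
        start = bisect.bisect_left(self._entries, (low,))
        end = bisect.bisect_left(self._entries, (high,))
        return [primary_key for _, primary_key in self._entries[start:end]]


def ts_score(ts):
    """Use the POSIX timestamp of a datetime as its score."""
    return ts.timestamp()


def day(n):
    return datetime(2024, 1, n, tzinfo=timezone.utc)


index = ScoreIndex()
for primary_key, ts in [(1, day(1)), (2, day(2)), (3, day(3))]:
    index.add(ts_score(ts), primary_key)

# All records from noon on January 1st up to (excluding) January 3rd.
noon = datetime(2024, 1, 1, 12, tzinfo=timezone.utc)
hits = index.range(ts_score(noon), ts_score(day(3)))
```

Because the entries are sorted by score, a range lookup only needs two binary searches and a slice, which is why records with similar scores are cheap to fetch together.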

Currently, everything tablecache does is single-threaded and not thread-safe. This implies that each instance of CachedTable owns its storage exclusively, and multiple instances will each need their own copy of the data. A feature where one instance manages the storage while others access it read-only as read replicas is feasible, but not supported.

tablecache is designed with asyncio in mind. Using traditional blocking IO libraries may not work well.

Usage

In order to use tablecache, you need to

create a DbAccess, e.g. a PostgresAccess
create a StorageTable, e.g. a LocalStorageTable
implement Indexes
put them all together in a CachedTable

Then you can CachedTable.load() your table, CachedTable.get_records() from it using the indexes you defined, CachedTable.adjust() it to change which records are cached, and CachedTable.invalidate_records() to inform the cache of changes in the underlying data.

See also the examples for a guide on how to put this all together.

Indexes

Your Indexes implementation is where you define your indexes and tie them all together. An instance of the class is also used to keep track of the records that are currently in cache.

You need to define

Indexes.IndexSpec: A specification of how to query a particular one of your indexes. Must be a subclass of Indexes.IndexSpec and an inner class of your Indexes. You need to add all the data required for a query.
Indexes.index_names: A property returning a set of available index names.
Indexes.score(): A method that calculates the score of a record for a given index.
Indexes.primary_key(): A method that extracts the primary key from a record.
Indexes.storage_records_spec(): A method that takes an IndexSpec and returns a StorageRecordsSpec, a way to specify a set of records in storage.
Indexes.db_records_spec(): Like Indexes.storage_records_spec(), but returning a DbRecordsSpec, a way to specify the same set of records in the DB.
Indexes.prepare_adjustment(): A method that takes an IndexSpec and returns an Adjustment, which contains information on which records to expire from and load into the cache in order to attain the state specified in the IndexSpec. You can return your own Adjustment subclass to include extra data.
Indexes.commit_adjustment(): A method that takes an Adjustment previously returned by prepare_adjustment and commits the changes. The Adjustment will have had callbacks called for expired and new records, through which you can track which records are now cached.
Indexes.covers(): A method that takes an IndexSpec and returns whether all the records specified are currently available in cache.
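
To make the division of responsibilities more concrete, here is a standalone sketch of the bookkeeping such a class might do for a single time index. It deliberately does not subclass the real Indexes, and the method signatures, the Adjustment stand-in, and the record layout are all simplifying assumptions; storage_records_spec() and db_records_spec() are left out because they return library types. Consult the API reference and the examples for the actual interface.

```python
from dataclasses import dataclass


class TimeRangeIndexesSketch:
    """Standalone illustration; does NOT subclass the real tablecache.Indexes."""

    @dataclass(frozen=True)
    class IndexSpec:
        index_name: str
        ge: float  # inclusive lower bound on the score
        lt: float  # exclusive upper bound on the score

    @dataclass(frozen=True)
    class Adjustment:
        # Hypothetical, simplified stand-in for the library's Adjustment.
        new_ge: float
        new_lt: float

    def __init__(self):
        # Empty range: nothing is covered until an adjustment is committed.
        self._ge = 0.0
        self._lt = 0.0

    @property
    def index_names(self):
        return frozenset(["time"])

    def primary_key(self, record):
        return record["id"]

    def score(self, index_name, record):
        if index_name != "time":
            raise ValueError(f"Unknown index {index_name!r}.")
        return record["timestamp"]

    def prepare_adjustment(self, spec):
        # Only describes the target state; nothing changes yet.
        return self.Adjustment(spec.ge, spec.lt)

    def commit_adjustment(self, adjustment):
        # From now on, covers() reflects the new state.
        self._ge, self._lt = adjustment.new_ge, adjustment.new_lt

    def covers(self, spec):
        return self._ge <= spec.ge and spec.lt <= self._lt


indexes = TimeRangeIndexesSketch()
wanted = TimeRangeIndexesSketch.IndexSpec("time", 100.0, 200.0)
covered_before = indexes.covers(wanted)
indexes.commit_adjustment(indexes.prepare_adjustment(wanted))
covered_after = indexes.covers(wanted)
wider = TimeRangeIndexesSketch.IndexSpec("time", 50.0, 200.0)
covers_wider = indexes.covers(wider)
```

The split between preparing and committing mirrors the description above: the prepared Adjustment only describes the target state, and covers() changes its answer only once the adjustment has been committed.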

Accessing and updating data

Before anything can happen, you need to load your CachedTable using CachedTable.load(). This method, along with get_records, get_first_record and adjust, takes one of your IndexSpecs as an argument. As a convenience, these methods also take arbitrary args and kwargs, which will be passed to the IndexSpec constructor to create one.

Records can be fetched with CachedTable.get_records() or CachedTable.get_first_record() (which is just a convenience wrapper around the former). Whenever the specified records are available in cache (according to Indexes.covers()) and haven’t been invalidated, they are fetched from cache. Otherwise, they are fetched from the DB.

CachedTable.adjust() can be used to change the set of records that are kept in storage. This will internally call Indexes.prepare_adjustment() for specs on the records to expire and load, perform the changes while calling the Adjustment’s observe_expired and observe_loaded callbacks, and then commit them using Indexes.commit_adjustment(). Afterwards, Indexes.covers() should reflect the new state. Adjustments are done without blocking read operations, while providing a consistent view of the data.

Note

CachedTable.load() also uses the adjustment mechanism (Indexes.prepare_adjustment() etc.), so that the Indexes can observe all the records that are initially loaded.

CachedTable.invalidate_records() can be used to inform the cache that the data in the underlying DB has changed. By default, records that are invalidated are guaranteed to be fetched from the DB before they are returned the next time. This refresh is done lazily, i.e. only when a request comes in for a record that has been invalidated (requests for records that are still valid are served without refresh). This default behavior can be overridden by passing force_refresh_on_next_read=False, in which case fetches will continue to serve the invalid records until a manual refresh is triggered. This can be useful when immediate correctness isn’t crucial, but avoiding refreshes blocking read operations is. Invalid records can be refreshed manually using CachedTable.refresh_invalid(). This is also possible when force_refresh_on_next_read=True.
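
The lazy refresh behavior can be pictured with a toy, synchronous cache. This is a sketch of the general idea only: LazyRefreshCache and its methods are hypothetical, it tracks single keys rather than IndexSpecs, and it is not how CachedTable is implemented.

```python
class LazyRefreshCache:
    """Toy cache illustrating lazy refresh; not tablecache's implementation."""

    def __init__(self, fetch_from_db):
        self._fetch_from_db = fetch_from_db  # callable: key -> record
        self._records = {}
        self._invalid = set()
        self.db_reads = 0

    def load(self, keys):
        for key in keys:
            self._records[key] = self._fetch(key)

    def invalidate(self, key):
        # Only mark the record; the refresh happens on the next read of it.
        self._invalid.add(key)

    def get(self, key):
        if key in self._invalid:
            self._records[key] = self._fetch(key)
            self._invalid.discard(key)
        return self._records[key]

    def _fetch(self, key):
        self.db_reads += 1
        return self._fetch_from_db(key)


db = {1: "a", 2: "b"}  # stand-in for the underlying database
cache = LazyRefreshCache(db.__getitem__)
cache.load([1, 2])

db[1] = "a2"              # the DB changes, but the cache hasn't been told
stale = cache.get(1)      # still served from cache
cache.invalidate(1)
untouched = cache.get(2)  # valid records are served without a refresh
fresh = cache.get(1)      # the invalidated record is refreshed on this read
```

Only the read of the invalidated record goes back to the data source; reads of records that are still valid are unaffected.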

CachedTable.invalidate_records() is a bit more complex since it takes more than one IndexSpec. That’s because updates to records may change them in a way that also changes their scores in a particular index. So you have to specify both how to find the old records currently in the cache and how to find the new records in the DB (these IndexSpecs may be the same, but don’t have to be). Additionally, you may specify multiple IndexSpecs each for old and new records, one for each of your indexes. This tells the cache, for each index, whether a requested record is invalid and needs a refresh before a read is served. Without this, everything still works correctly, but indexes without information are marked as dirty and will unconditionally trigger a refresh on the next read against them.

Note

All records specified in CachedTable.invalidate_records() must be present in cache (according to Indexes.covers()), or a ValueError is raised. invalidate_records is only meant to update records that were within the range of the indexes and have changed, not add completely new ones. While it will work when the new records’ scores are all in the covered range, the adjustment mechanism (i.e. Indexes.prepare_adjustment() etc.) is not used and the Indexes will not be informed of the new records. This may or may not be ok, depending on your implementation.

Available implementations

These implementations of DbAccess and StorageTable are available as submodules when selecting the appropriate extras:

`DbAccess`: `tablecache.postgres`

Simple Postgres access available with the postgres extra. Uses asyncpg and specifies records via a query string and an args tuple.

`StorageTable`: `tablecache.local`

A StorageTable implementation storing records in local Python data structures, available with the local extra. Uses sortedcontainers for indexes.

This implementation is probably a better choice than the Redis implementation. Having the data in Python makes it possible to just put references into the index lists, meaning fewer indirections. This implementation also supports any kind of number as scores, including arbitrarily large integers.

`StorageTable`: `tablecache.redis`

A StorageTable implementation storing records in a Redis instance, available with the redis extra. Uses redis.asyncio.

Scores must be representable as 64-bit floats (Redis’ sorted set is used). Records are stored in Redis as byte strings, which means they must be encoded using Codecs that come with the module. The Redis instance backing the cache must be configured to not expire keys (this is the default), or data will be lost.
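
The 64-bit float restriction matters mostly for large integer scores: integers above 2**53 can no longer all be represented exactly. The helper below is a hypothetical check, not part of tablecache, that can be used to see whether a given integer score survives the conversion.

```python
def is_exact_float_score(score: int) -> bool:
    """Return whether an integer score survives conversion to a 64-bit float."""
    try:
        return int(float(score)) == score
    except OverflowError:
        # The integer is too large to be converted to a float at all.
        return False


fits = is_exact_float_score(2**53)          # exactly representable
rounded = is_exact_float_score(2**53 + 1)   # rounds to a neighboring float
too_large = is_exact_float_score(10**400)   # exceeds the float range
```

If scores fail this check, distinct records may end up with the same score in Redis, so range boundaries near those values become imprecise.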

Logging

The library logs messages with logger names in the tablecache namespace (i.e. with logger names matching tablecache.*). These inform mostly about a table being loaded, adjusted, or refreshed.
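
Since all loggers live under the tablecache name, their verbosity can be controlled in one place with the standard logging module. The sketch below assumes nothing beyond the namespace described above; the chosen levels are just an example.

```python
import logging

# Keep the rest of the application at WARNING.
logging.basicConfig(level=logging.WARNING)

# Loggers named tablecache.* inherit their level from this parent logger
# unless they set their own.
tablecache_logger = logging.getLogger("tablecache")
tablecache_logger.setLevel(logging.DEBUG)
```

Records from the tablecache loggers still propagate to the handlers configured on the root logger, so no separate handler is needed.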

Metrics

If the prometheus extra is selected, metrics are written using the prometheus_client library. A server to make them accessible has to be started outside the library using e.g.

prometheus_client.start_http_server(your_prometheus_port)

Metric names are in the tablecache namespace, i.e. all match tablecache_*. The CachedTable writes the following metrics:

tablecache_cached_table_reads_total: Total number of reads on the table. Labels:
- table_name
- type: One of cache_miss (read from DB), cache_hit (read from storage), or cache_hit_with_refresh (read from storage, but a refresh from DB was triggered before)
tablecache_cached_table_refreshes_total: Total number of refreshes on the table. Labels: table_name.
tablecache_cached_table_adjustments_total: Total number of adjustments on the table. Labels: table_name.
tablecache_cached_table_adjustment_expired_total: Total number of records that were expired during adjustments on the table. Labels: table_name.
tablecache_cached_table_adjustment_loaded_total: Total number of records that were loaded during adjustments on the table. Labels: table_name.

The LocalStorageTable writes the following metrics:

tablecache_local_table_records_total: The number of records currently in storage. Labels:
- table_name
- type: One of regular (normal records that are accessible), scratch (scratch records that are not merged yet), or scratch_delete (records that exist but have been marked for deletion in scratch space).