Overview
Installation
pip install tablecache[extras,of,your,choice]
Extras
tablecache comes with a few optional extras that pull in additional dependencies. Without any, there are no concrete implementations of DB and storage access, so you probably want at least two of these: one for DB access and one for storage.
postgres
: Adds submodule postgres, which provides a DbAccess implementation for Postgres using asyncpg.
local
: Adds submodule local, which provides a StorageTable implementation storing records in local data structures (i.e. in Python).
redis
: Adds submodule redis, which provides a StorageTable implementation storing records in Redis.
prometheus
: Writes metrics using prometheus_client.
test
: Extra dependencies for testing tablecache.
dev
: Extra dependencies for developing tablecache.
docs
: Extra dependencies to build this documentation.
Purpose
Suppose you have a relational database that’s nice and normalized (many tables), and you need to access the result of a large join. The DB can’t combine its indexes on the individual tables in a way that makes querying the join fast. But there would be a performant way to access ranges of records, if only you could easily materialize the join and then index that directly.
tablecache can take your big query and put the denormalized results, or a subset of them, in faster storage. You can define one or more ways to index these records for fast access to whole ranges of them. The classic example is timestamped data, where you want to keep the past few days in cache, and be able to quickly get records from a time range.
The contents of the cache can be adjusted, i.e. old records expired and new ones loaded. Any records not cached will be transparently fetched from the DB, so manual queries against the DB after a cache miss are not necessary.
Cached records can also be invalidated. This needs to be done when the data in the DB changes in order for the cache to continue reflecting the DB state. But the cache only needs to be told which records have changed; the actual refresh is done automatically.
Non-goals
tablecache is not a database, only a performance layer for difficult-to-index result sets. It reflects the underlying DB, which is the single source of truth. The storage used for caching is meant to be disposable. You’re supposed to be able to delete it all and deploy a new instance which loads the cache fresh from the DB.
The DB is not actively watched for changes. Once loaded, without any further interaction the cache will continue to return the same records, even if they change in the DB. It will load new records and refresh changed ones, but it needs to be told which ones (not how they’ve changed, it will go to the DB for that).
tablecache doesn’t implement caching strategies, i.e. it doesn’t decide which records should be in cache; this needs to be done externally. But again, once that decision has been made, records are evicted and loaded automatically.
tablecache is not a high-level tool that figures out how to index and query your data based on a declarative description (like a relational DB), and it’s not a low-level tool that gives you full control over how data is managed (like a use-case-specific custom cache). It is intended to sit in the middle of the abstraction spectrum, requiring the user to decide how data is accessed and indexed (by providing an Indexes implementation), and which records should be kept in cache in the first place, while taking over some of the tedious tasks like transparently handling cache misses by querying the DB and keeping track of invalid records.
Limitations
Each record must be uniquely identifiable by a primary key. These can be anything hashable though and are calculated in the user-supplied implementation of Indexes.primary_key(), so multicolumn primary keys are possible by e.g. returning tuples.
Indexing is done by associating each record with one or more scores (one per index) and storing them in a sorted data structure that makes it fast to access ranges of scores. Sets of records can be accessed quickly if they have similar scores, e.g. it’s possible to index a timestamp and then get all records in a time range. Indexing more than one attribute is also possible by interleaving scores, but requires some tweaking. However, only one index can be used per read operation; multiple indexes cannot be combined.
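The score-based range lookup described above can be illustrated with a minimal sketch. This is not tablecache's implementation: the class ScoreIndex and its methods are hypothetical names, and the standard library's bisect module stands in for whatever sorted structure a real StorageTable uses. It only shows why records with similar scores are cheap to fetch together.

```python
import bisect


class ScoreIndex:
    """Toy sorted index mapping scores to primary keys (illustration only)."""

    def __init__(self):
        # Sorted list of (score, primary_key) tuples.
        self._entries = []

    def add(self, score, primary_key):
        # Keep the list sorted on insertion.
        bisect.insort(self._entries, (score, primary_key))

    def keys_in_range(self, ge, lt):
        """Return primary keys with ge <= score < lt, in score order."""
        # A 1-tuple (x,) sorts before every (x, key), so bisect_left finds
        # the first entry whose score is >= x.
        lo = bisect.bisect_left(self._entries, (ge,))
        hi = bisect.bisect_left(self._entries, (lt,))
        return [pk for _, pk in self._entries[lo:hi]]


# Hypothetical records with a timestamp used as the score.
records = [
    {"id": 1, "ts": 100.0},
    {"id": 2, "ts": 250.0},
    {"id": 3, "ts": 175.0},
    {"id": 4, "ts": 300.0},
]
index = ScoreIndex()
for record in records:
    index.add(record["ts"], record["id"])

in_range = index.keys_in_range(150.0, 300.0)
```

Both range bounds are found by binary search, so the cost of a lookup depends on the size of the result rather than on the number of indexed records.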
Currently, everything tablecache does is single-threaded and not thread-safe. This implies that each instance of CachedTable owns its storage exclusively, and multiple instances will each need their own copy of the data. A feature where one instance manages the storage while others access it read-only as read replicas is feasible, but not supported.
tablecache is designed with asyncio in mind. Using traditional blocking IO libraries may not work well.
Usage
In order to use tablecache, you need to:
create a DbAccess, e.g. a PostgresAccess
create a StorageTable, e.g. a LocalStorageTable
implement Indexes
put them all together in a CachedTable
Then you can CachedTable.load() your table, CachedTable.get_records() from it using the indexes you defined, CachedTable.adjust() it to change which records are cached, and CachedTable.invalidate_records() to inform the cache of changes in the underlying data.
See also the examples for a guide on how to put this all together.
Indexes
Your Indexes implementation is where you define your indexes and tie them all together. An instance of the class is also used to keep track of the records that are currently in cache.
You need to define:
Indexes.IndexSpec
: A specification of how to query a particular one of your indexes. Must be a subclass of Indexes.IndexSpec and an inner class of your Indexes. You need to add all the data required for a query.
Indexes.index_names
: A property returning a set of available index names.
Indexes.score()
: A method that calculates the score of a record for a given index.
Indexes.primary_key()
: A method that extracts the primary key from a record.
Indexes.storage_records_spec()
: A method that takes an IndexSpec and returns a StorageRecordsSpec, a way to specify a set of records in storage.
Indexes.db_records_spec()
: Like Indexes.storage_records_spec(), but returning a DbRecordsSpec, a way to specify the same set of records in the DB.
Indexes.prepare_adjustment()
: A method that takes an IndexSpec and returns an Adjustment, which contains information on which records to expire from and load into the cache in order to attain the state specified in the IndexSpec. You can return your own Adjustment subclass to include extra data.
Indexes.commit_adjustment()
: A method that takes an Adjustment previously returned by prepare_adjustment and commits the changes. The Adjustment will have had callbacks called for expired and new records, through which you can track which records are now cached.
Indexes.covers()
: A method that takes an IndexSpec and returns whether all the records specified are currently available in cache.
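To make the division of responsibilities more concrete, here is a standalone sketch that mirrors some of the method names listed above for a single timestamp index. It deliberately does not subclass the real Indexes class: TimeRangeSpec, RangeAdjustment and TimeRangeIndexes are hypothetical stand-ins, the argument lists are assumptions, and storage_records_spec() and db_records_spec() are omitted because they depend on the chosen StorageTable and DbAccess. Check the API reference for the actual signatures.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class TimeRangeSpec:
    """Hypothetical stand-in for an IndexSpec: a half-open time range."""
    index_name: str
    ge: float
    lt: float


@dataclasses.dataclass
class RangeAdjustment:
    """Hypothetical stand-in for an Adjustment carrying the target range."""
    new_range: tuple
    expired: list = dataclasses.field(default_factory=list)
    loaded: list = dataclasses.field(default_factory=list)

    def observe_expired(self, record):
        self.expired.append(record)

    def observe_loaded(self, record):
        self.loaded.append(record)


class TimeRangeIndexes:
    """Toy class using the method names above (not a real Indexes subclass)."""

    def __init__(self):
        self._covered = None  # (ge, lt) currently in cache, or None

    @property
    def index_names(self):
        return frozenset(["time"])

    def score(self, index_name, record):
        if index_name != "time":
            raise ValueError(f"Unknown index {index_name}.")
        return record["ts"]

    def primary_key(self, record):
        # A multicolumn primary key expressed as a tuple.
        return (record["device"], record["seq"])

    def prepare_adjustment(self, spec):
        return RangeAdjustment(new_range=(spec.ge, spec.lt))

    def commit_adjustment(self, adjustment):
        # Only after the commit does covers() reflect the new range.
        self._covered = adjustment.new_range

    def covers(self, spec):
        if self._covered is None:
            return False
        ge, lt = self._covered
        return ge <= spec.ge and spec.lt <= lt


indexes = TimeRangeIndexes()
example_record = {"device": "a", "seq": 1, "ts": 150.0}
adjustment = indexes.prepare_adjustment(TimeRangeSpec("time", 100.0, 200.0))
adjustment.observe_loaded(example_record)  # normally called by the cache
indexes.commit_adjustment(adjustment)
```

In this sketch, tracking the covered range in commit_adjustment() is what lets covers() answer whether a later read can be served from cache.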
Accessing and updating data
Before anything can happen, you need to load your CachedTable. This method, along with get_records, get_first_record and adjust, takes one of your IndexSpecs as an argument. As a convenience, these methods also take arbitrary args and kwargs, which will be passed to the IndexSpec constructor to create one.
Records can be fetched with CachedTable.get_records() or CachedTable.get_first_record() (which is just a convenience wrapper around the former). Whenever the specified records are available in cache (according to Indexes.covers()) and haven’t been invalidated, they are fetched from cache. Otherwise, they are fetched from the DB.
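The hit-or-miss decision can be sketched in a few lines. This is a simplified, synchronous toy rather than the library's code (the real methods are meant for asyncio): ToyReadThrough is a hypothetical class, plain dicts stand in for the DB and the storage, and the dict key doubles as the score.

```python
class ToyReadThrough:
    """Illustrates the hit/miss decision only; storage and DB are dicts."""

    def __init__(self, db):
        self._db = db            # key -> record, the source of truth
        self._storage = {}       # key -> cached copy
        self._covered = (0, 0)   # half-open range of cached scores
        self.db_reads = 0

    def load(self, ge, lt):
        self._storage = {k: v for k, v in self._db.items() if ge <= k < lt}
        self._covered = (ge, lt)

    def covers(self, ge, lt):
        return self._covered[0] <= ge and lt <= self._covered[1]

    def get_records(self, ge, lt):
        if self.covers(ge, lt):
            source = self._storage   # cache hit
        else:
            self.db_reads += 1       # cache miss: go to the DB instead
            source = self._db
        return [source[k] for k in sorted(source) if ge <= k < lt]


db = {k: f"record-{k}" for k in range(10)}
cache = ToyReadThrough(db)
cache.load(3, 7)
hit = cache.get_records(4, 6)    # fully inside the cached range
miss = cache.get_records(5, 9)   # partly outside, so served from the DB
```

Note that a request that is only partly covered is treated as a miss in this sketch; it is served entirely from the DB rather than stitched together from both sources.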
CachedTable.adjust() can be used to change the set of records that are kept in storage. This will internally call Indexes.prepare_adjustment() for specs on the records to expire and load, perform the changes while calling the Adjustment’s observe_expired and observe_loaded callbacks, and then commit them using Indexes.commit_adjustment(). Afterwards, Indexes.covers() should reflect the new state. Adjustments are done without blocking read operations, while providing a consistent view of the data.
Note
CachedTable.load() also uses the adjustment mechanism (Indexes.prepare_adjustment() etc.), so that the Indexes can observe all the records that are initially loaded.
CachedTable.invalidate_records() can be used to inform the cache that the data in the underlying DB has changed. By default, records that are invalidated are guaranteed to be fetched from the DB before they are returned the next time. This refresh is done lazily, i.e. only when a request comes in for a record that has been invalidated (requests for records that are still valid are served without refresh). This default behavior can be overridden by passing force_refresh_on_next_read=False, in which case fetches will continue to serve the invalid records until a manual refresh is triggered. This can be useful when immediate correctness isn’t crucial, but avoiding refreshes blocking read operations is. Invalid records can be refreshed manually using CachedTable.refresh_invalid(). This is also possible when force_refresh_on_next_read=True.
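The difference between the lazy default and force_refresh_on_next_read=False can be shown with a small toy model. ToyInvalidatingCache is a hypothetical, synchronous class written for this illustration; it looks records up by key instead of by IndexSpec and keeps a single global flag, which is simpler than what the library tracks.

```python
class ToyInvalidatingCache:
    """Toy model of lazy refresh after invalidation (illustration only)."""

    def __init__(self, db):
        self._db = db
        self._storage = dict(db)   # pretend everything has been loaded
        self._invalid = set()
        self._force_refresh = False
        self.refreshes = 0

    def invalidate(self, keys, force_refresh_on_next_read=True):
        self._invalid.update(keys)
        if force_refresh_on_next_read:
            self._force_refresh = True

    def refresh_invalid(self):
        for key in self._invalid:
            self._storage[key] = self._db[key]
        if self._invalid:
            self.refreshes += 1
        self._invalid.clear()
        self._force_refresh = False

    def get(self, key):
        # Refresh lazily: only when an invalid record is actually requested.
        if self._force_refresh and key in self._invalid:
            self.refresh_invalid()
        return self._storage[key]


db = {1: "a", 2: "b"}
cache = ToyInvalidatingCache(db)

db[1] = "a2"
stale = cache.get(1)             # the cache has not been told yet
cache.invalidate([1])
valid_read = cache.get(2)        # still valid, so no refresh happens
refreshes_after_valid_read = cache.refreshes
fresh = cache.get(1)             # invalid, so refreshed before returning

db[2] = "b2"
cache.invalidate([2], force_refresh_on_next_read=False)
still_stale = cache.get(2)       # served as-is until a manual refresh
cache.refresh_invalid()
after_manual = cache.get(2)
```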
CachedTable.invalidate_records() is a bit more complex since it takes more than one IndexSpec. That’s because updates to records may change them in a way that also changes their scores in a particular index. So you have to actually specify how to find the old records currently in the cache, and how to find the new records in the DB (these IndexSpecs may be the same, but don’t have to be). Additionally, you may specify multiple IndexSpecs each for old and new records, one for each of your indexes. This gives the cache the information whether a record that is requested is invalid and a refresh is necessary before serving a read. Without this, everything still works correctly, but indexes without information are marked as dirty and will unconditionally trigger a refresh on the next read against them.
Note
All records specified in CachedTable.invalidate_records() must be present in cache (according to Indexes.covers()), or a ValueError is raised. invalidate_records is only meant to update records that were within the range of the indexes and have changed, not add completely new ones. While it will work when the new records’ scores are all in the covered range, the adjustment mechanism (i.e. Indexes.prepare_adjustment() etc.) is not used and the Indexes will not be informed of the new records. This may or may not be ok, depending on your implementation.
Available implementations
These implementations of DbAccess and StorageTable are available as submodules when selecting the appropriate extras:
DbAccess: tablecache.postgres
Simple Postgres access available with the postgres extra. Uses asyncpg and specifies records via a query string and an args tuple.
StorageTable: tablecache.local
A StorageTable implementation storing records in local Python data structures, available with the local extra. Uses sortedcontainers for indexes.
This implementation is probably the better choice over the Redis implementation. Having the data in Python makes it possible to just put references into the index lists, meaning fewer indirections. This implementation also supports any kind of number as scores, including arbitrarily large integers.
StorageTable: tablecache.redis
A StorageTable implementation storing records in a Redis instance, available with the redis extra. Uses redis.asyncio.
Scores must be representable as 64-bit floats (Redis’ sorted set is used). Records are stored in Redis as byte strings, which means they must be encoded using Codecs that come with the module. The Redis instance backing the cache must be configured to not expire keys (this is the default), or data will be lost.
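The 64-bit float restriction on scores matters mostly for large integers, such as nanosecond timestamps. The helper below is a hypothetical check, not part of tablecache; it relies on Python's float being an IEEE 754 double, which holds on common platforms.

```python
def fits_in_float64(score: int) -> bool:
    """Return True if the integer survives a round trip through a float."""
    try:
        return int(float(score)) == score
    except OverflowError:
        # The integer is too large to be converted to a float at all.
        return False


exact = fits_in_float64(2**53)          # still exactly representable
inexact = fits_in_float64(2**53 + 1)    # rounded when converted to float
# Hypothetical nanosecond timestamp, far above 2**53.
nanos = fits_in_float64(1_700_000_000_123_456_789)
```

Scores that fail such a check would be rounded in a sorted set, so neighboring records could end up with identical scores. Coarser units (e.g. seconds or milliseconds) avoid this.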
Logging
The library logs messages with logger names in the tablecache namespace (i.e. with logger names matching tablecache.*). These inform mostly about a table being loaded, adjusted, or refreshed.
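Because all loggers live under the tablecache namespace, their verbosity can be set in one place with the standard logging module. The levels chosen here are only an example, and which level the library uses for a given message is not stated above.

```python
import logging

# Application-wide default: only warnings and above.
logging.basicConfig(level=logging.WARNING)

# Loggers named tablecache.* inherit their level from this parent logger.
tablecache_logger = logging.getLogger("tablecache")
tablecache_logger.setLevel(logging.INFO)
```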
Metrics
If the prometheus extra is selected, metrics are written using the prometheus_client library. A server to make them accessible has to be started outside the library using e.g.
prometheus_client.start_http_server(your_prometheus_port)
Metric names are in the tablecache namespace, i.e. all match tablecache_*.
The CachedTable writes the following metrics:
tablecache_cached_table_reads_total
: Total number of reads on the table. Labels: table_name and type, where type is one of cache_miss (read from DB), cache_hit (read from storage), or cache_hit_with_refresh (read from storage, but a refresh from DB was triggered before).
tablecache_cached_table_refreshes_total
: Total number of refreshes on the table. Labels: table_name.
tablecache_cached_table_adjustments_total
: Total number of adjustments on the table. Labels: table_name.
tablecache_cached_table_adjustment_expired_total
: Total number of records that were expired during adjustments on the table. Labels: table_name.
tablecache_cached_table_adjustment_loaded_total
: Total number of records that were loaded during adjustments on the table. Labels: table_name.
The LocalStorageTable writes the following metrics:
tablecache_local_table_records_total
: The number of records currently in storage. Labels: table_name and type, where type is one of regular (normal records that are accessible), scratch (scratch records that are not merged yet), or scratch_delete (records that exist but have been marked for deletion in scratch space).