Tracking Time Series with Python and Redis
How I built this blog in one night
As you might have noticed, I’ve built this blog with Jekyll and hosted it on GitHub Pages. So far I think Jekyll is a great tool: I spent only a few hours installing it on my laptop and customizing the Solar Theme, et voilà! I finally have a blog engine which fits my needs quite well, and best of all, it’s open source software: no database, no hosting costs, and no dealing with the pain of customizing proprietary blog engines.
Tracking counts with Python and Redis
Redis is an open source, high performance key-value in-memory store, similar to memcached. There are two major differences between them: first, redis can be persisted to disk, whereas memcached is volatile. Second, unlike memcached and other traditional key-value stores, which only allow storing string values associated with string keys, redis is actually a data structures server, with support for different kinds of values besides plain strings, such as lists, sets, sorted sets and hashes.
In this post I’m going to show how to use sorted sets to track counts of a list of events (visits per page, messages sent per user, page views per day, etc.) in near real time. A typical use case would be tracking the number of visits to every page in a site, which could later be used to build a list of the most popular articles. As redis executes commands atomically, it can be used in a distributed environment as well, i.e. even with a cluster of web servers, we could keep all page views in a single redis counter.
The pages in the website will be the members of our sorted set, and the number of visits will be the score of every member. For every page hit we simply store a page identifier and increment its number of visits. This problem could also be solved with a hash structure, but the sorted set suits us better in this case because it’s pre-sorted, i.e. the members are kept ordered by their score, and it’s possible to retrieve a range of elements (for example, the top 5).
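Before wrapping this in a class, here is a minimal sketch of the idea using raw redis-py calls (assuming a redis server running locally on the default port; the key and page names are just for illustration):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

# every page hit increments that page's score in the 'hits' sorted set
r.zincrby('hits', '/posts/redis-counters')
r.zincrby('hits', '/posts/redis-counters')
r.zincrby('hits', '/about')

# top 2 pages, highest score first, as (member, score) tuples
r.zrevrange('hits', 0, 1, withscores=True)
# => [('/posts/redis-counters', 2.0), ('/about', 1.0)]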
RedisHelper
I created the RedisHelper class (redishelper.py) with some functions to handle redis connections. Using a connection pool allows reusing connections rather than creating a new one every time a command is sent to redis. The pool is created the first time a connection is requested, and maintained as a class attribute. The get_pipe method returns a redis pipeline, which can be used to send a batch of commands in a single request. This can dramatically improve the performance of groups of commands, since the traffic between the client and the redis server is reduced.
import redis


class RedisHelper(object):

    server = {}

    @classmethod
    def set_server(cls, server):
        cls.server = server

    @classmethod
    def get_pool(cls):
        try:
            pool = cls.pool
        except AttributeError:
            # create the pool on first use and keep it as a class attribute
            pool = cls.pool = redis.ConnectionPool(host=cls.server['host'],
                                                   port=cls.server['port'],
                                                   db=cls.server['db'])
        return pool

    @classmethod
    def get_connection(cls):
        return redis.Redis(connection_pool=cls.get_pool())

    @classmethod
    def get_pipe(cls):
        return cls.get_connection().pipeline()

    @classmethod
    def flushdb(cls):
        cls.get_connection().flushdb()
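As a quick sketch of how get_pipe might be used (the commands are buffered locally and sent in a single round trip when execute is called; the server settings and key names here are just placeholders):

RedisHelper.set_server({'host': 'localhost', 'port': 6379, 'db': 0})

pipe = RedisHelper.get_pipe()
pipe.zincrby('hits:22-06-2015', 'page_1')
pipe.zincrby('hits:22-06-2015', 'page_2')
pipe.zincrby('hits:22-06-2015', 'page_1')
results = pipe.execute()  # one round trip, one result per command
# results => [1.0, 1.0, 2.0]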
RedisZSetCounter
Now let’s see the implementation of the RedisZSetCounter class. In the constructor, we set the config for the redis server and for the counter. The prefix is intended to identify a group of counters that measure the same thing. For every counter in a group we’ll have a counter_id. For example, if we create a group of counters to measure page hits per day, the keys would have the following format: hits:21-06-2015, hits:22-06-2015, etc., where ‘hits’ would be the prefix and the date would be the counter_id.
class RedisZSetCounter(object):
    """
    Counter that uses redis to store data
    """

    def __init__(self, server, config):
        RedisHelper.set_server(server)
        self.redis = None  # connection is created lazily
        self.prefix = config['prefix']
        self.ttl = config['ttl']
The keys for every counter are built by concatenating the prefix and the counter identifier.
    def get_key(self, counter_id):
        """
        Build the key that identifies a counter
        :return: counter key (string)
        """
        return "%s:%s" % (self.prefix, str(counter_id))
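For example, a counter for daily page hits would build its keys like this (the server settings and dates are made up for the illustration):

server = {'host': 'localhost', 'port': 6379, 'db': 0}
counter = RedisZSetCounter(server, {'prefix': 'hits', 'ttl': 3600})

counter.get_key('22-06-2015')  # => 'hits:22-06-2015'
counter.get_key(42)            # => 'hits:42'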
Using the RedisHelper class, we get a redis connection, loaded lazily on first use.
    def get_redis(self):
        """
        Get redis connection, lazy loading
        :return: redis object
        """
        if self.redis is None:
            self.redis = RedisHelper.get_connection()
        return self.redis
To increase the count of an entry, we use the redis zincrby command. By default, we increase the score by one, but it’s possible to pass an arbitrary value.
    def incr(self, counter_id, entry, count=1):
        """
        Increments the count of entry in this counter by count
        Supports decreasing, when count < 0
        :param counter_id: identifier for this counter
        :param entry: identifier of the member / item / whatever is being counted
        :param count: increase count by this amount
        :return: the new count for this entry
        """
        key = self.get_key(counter_id)
        value = self.get_redis().zincrby(key, entry, count)
        # refresh the expiration on every increment
        self.get_redis().expire(key, self.ttl)
        return value
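Continuing the hypothetical counter from above (and assuming a redis server is running), counting a few page hits would look like this. Note that the expiration is refreshed on every increment, so a counter only disappears after ttl seconds without activity:

counter.incr('22-06-2015', 'page_1')     # => 1.0
counter.incr('22-06-2015', 'page_1')     # => 2.0
counter.incr('22-06-2015', 'page_2', 5)  # => 5.0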
It’s also possible to pass a negative value to zincrby. In that case, the score is decreased by the specified amount.
    def decr(self, counter_id, entry, count=1):
        """
        Decrease counter, by calling incr with negative count
        :param counter_id: suffix identifier for this counter
        :param entry: identifier of the member / item / whatever is being counted
        :param count: decrease count by this amount
        :return: the new count of the entry
        """
        return self.incr(counter_id, entry, -count)
The last N items can be retrieved using zrange. We need to specify the parameter desc=False, because we want the entries in ascending order of score, not descending. Since dictionaries in python don’t preserve the order of the keys, we use a list of tuples to handle the key-value pairs returned by the counter.
    def last_n(self, counter_id, how_many):
        """
        Get the last N entries in this counter
        :param counter_id: identifier for this counter
        :param how_many: how many items to return (last N)
        :return: list of tuples (id, count)
        """
        key = self.get_key(counter_id)
        return self.get_redis().zrange(key, 0, how_many - 1, desc=False, withscores=True)
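With the counts from the hypothetical example above, the entries come back with the lowest scores first:

counter.last_n('22-06-2015', 2)
# => [('page_1', 2.0), ('page_2', 5.0)]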
Using the zrevrange command, we retrieve the top N items with the highest scores. Note that zrange with desc=True would also do the job here.
    def top_n(self, counter_id, how_many):
        """
        Get the top N entries in the counter
        :param counter_id: identifier for this counter
        :param how_many: how many items to return (top N)
        :return: list of tuples (id, count)
        """
        key = self.get_key(counter_id)
        return self.get_redis().zrevrange(key, 0, how_many - 1, withscores=True)
How to use it
Finally, let’s see how to use RedisZSetCounter with a complete example:
from redishelper import RedisHelper
from redisZcounter import RedisZSetCounter

redis_server = {'host': 'localhost', 'port': 6379, 'db': 0}
params = {'prefix': "hits", 'ttl': 120}

zcount = RedisZSetCounter(redis_server, params)
RedisHelper.flushdb()  # start from a clean database

zcount.incr('test_id', 'page_1', 1)
zcount.incr('test_id', 'page_2', 3)
zcount.incr('test_id', 'page_3', 1)

best = zcount.top_n('test_id', 1)
# best => [('page_2', 3.0)]
Wrapping up
Even though the use case presented in this post is very simple, it shows some of the advantages of using a NoSQL database like redis over a relational database for real-time state management. Beyond its simplicity and performance, the main advantage is the power of its data structures. In my next posts I will extend the functionality of this counter and show how to use it to track event counts with time series.
For those too lazy to type, you can find the code in my github repository. To run the code, you’ll need to install the redis server and the redis client for python, if you haven’t already.
>>> print "Hello World!"
Hello World!
I hope this is the start of a long-lasting blogging adventure, where I will be writing about some of the projects I have worked on over the last few years, and also about new ones still to come. As for the topics you can expect to find here: as the header says, mostly Data Mining, Machine Learning, Python and R, but not limited to those.
The title ‘Ein teil von mir’, originally inspired by the lyrics of a Berlin techno anthem, means ‘A part of me’. And that is exactly the goal I had in mind when I decided to start this blog: sharing a part of me with the world. I hope the knowledge I share here is somehow helpful to some of you, and that many of you enjoy reading and learning something new.
def print_bye(name):
    print "See you soon %s!" % name

print_bye('World')
#=> prints 'See you soon World!' to STDOUT.