blaze_loader and lazy attributes

Over the last six months, I’ve become a huge fan of Blaze, a Python module that translates Pandas-like syntax into queries for a wide range of data storage technologies. In particular, I’ve found Blaze to be an elegant and simple way to in-line SQL queries in your Python code. Using Blaze feels like a nice balance between the ease of passing around strings of pure SQL and the robustness and flexibility of SQLAlchemy.

However, one hitch in using Blaze for data science was the need to create a Blaze Data object for every SQL table I wanted to query. Since I am often joining across many tables and pulling data from a number of entirely separate databases, I found myself copy-pasting 30+ lines of Data object initialization code to the beginning of every one of my 30+ frequently used IPython notebooks. Plus, when a table’s schema changed (as schemas sometimes do), I needed to go through all my files and update the datashape of the affected Blaze Data objects. Not cool!

But why let these little drawbacks detract from an otherwise great tool? I went ahead and wrote a simple wrapper class called blaze_loader that makes Blaze Data objects easy to store, import, and modify. You can find it on GitHub here.

To save connection information with blaze_loader, pass a name (anything you want) and target string to the save_blaze_info method. The target string can point towards any data storage backend that Blaze can handle.

You can also pass save_blaze_info the optional keyword arguments table= (specify a table to use in a SQL connection), schema= (specify a schema to use in a SQL connection), columns= (only create the Data object with a subset of the available columns in a SQL connection), and datashape= (define the datashape that Blaze will use to represent your data in its expression engine).

import blaze_loader

blaze_loader.save_blaze_info('users',
                             'postgresql://foo:bar@db.com:5432/data',
                             table='users', schema='funnel',
                             datashape='var * {name: string, log_ins: int64}')
blaze_loader.save_blaze_info('purchases', 'purchases.csv')

To load and use your Data objects, use the load method in blaze_loader.

db = blaze_loader.load()
db.users[db.users.name == 'theandycamps']
# To see what Data objects have been loaded, you can call print:
print(db)

You can read more about the full feature set in the project README on GitHub.

I want to take the rest of this post to touch on an interesting lazy class attribute pattern I learned from my Quantopian colleague (and Blaze core developer!) Joe Jevnik.

In blaze_loader, Blaze Data objects are lazily loaded attributes of the instance you create when you call blaze_loader.load(). This means that the Data object attributes are not actually created until you ask for them. This lazy loading is helpful when you have many Data objects saved by blaze_loader, as it removes the cost of initializing unused Data objects. (Initializing a Data object can be slow when Blaze isn’t passed a datashape and must reflect the requested SQL table’s schema.) Lazy loading also prevents our whole blaze_loader class from breaking when there is an error in the initialization of one Blaze Data object. You’ll only run into that error when you try to use the offending Data object.

So how does it work? We start by creating a wrapper class, LazyAttr, that implements Python’s descriptor protocol: its __get__ magic method calls a function, get, and caches the result for each instance. It is important to note that the passed get function is only called when the __get__ method of LazyAttr is called, that is, the first time the attribute is accessed on an instance.

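class LazyAttr(object):
    def __init__(self, get):
        self._get = get
        self._cache = {}

    def __get__(self, instance, owner):
        # Accessed on the class rather than an instance: return the descriptor.
        if instance is None:
            return self
        try:
            return self._cache[instance]
        except KeyError:
            # First access on this instance: compute the value and cache it.
            val = self._get(instance)
            self._cache[instance] = val
            return val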
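
To see the caching in action, here is a small sketch that repeats LazyAttr so it runs on its own. The Report class and its expensive attribute are hypothetical names made up for this illustration; the counter is only there to show that the wrapped function runs once, on first access.

```python
class LazyAttr(object):
    """Same descriptor as above, repeated so this snippet is self-contained."""
    def __init__(self, get):
        self._get = get
        self._cache = {}

    def __get__(self, instance, owner):
        if instance is None:
            return self
        try:
            return self._cache[instance]
        except KeyError:
            val = self._get(instance)
            self._cache[instance] = val
            return val


class Report(object):  # hypothetical class, for illustration only
    calls = 0  # counts how many times the wrapped function actually runs

    @LazyAttr
    def expensive(self):
        Report.calls += 1
        return [1, 2, 3]


report = Report()
calls_before = Report.calls   # creating the instance has not run anything
first = report.expensive      # first access runs the function
second = report.expensive     # second access is served from the cache
calls_after = Report.calls
```

One thing to keep in mind with this design: the cache is a dict keyed on the instance, so instances must be hashable, and the descriptor holds a reference to every instance it has served.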

Decorating with this LazyAttr class then allows us to attach functions to our blaze_loader class as attributes without calling for their return value in the process.

class blaze_loader(object):
    # target_info_path and config come from the enclosing scope (not shown).
    target_infos = _load_infos_from_json(
        target_info_path=target_info_path)
    # In a class body, locals() is the namespace that becomes the class dict.
    locals_ = locals()
    for name, target_info in target_infos.items():
        @LazyAttr
        def _lazyattr(self, _name=name, _target_info=target_info,
                      config=config):
            target = _target_info.get('target')
            table = _target_info.get('table')
            schema = _target_info.get('schema', None)
            datashape = _target_info.get('datashape', None)
            columns = _target_info.get('columns', None)
            bz_data = _add_blaze_data(_name,
                                      target,
                                      table=table,
                                      schema=schema,
                                      datashape=datashape,
                                      columns=columns,
                                      config=config)
            return bz_data

        # Attach the lazy attribute to the class under the saved name.
        locals_[name] = _lazyattr
    # Clean up loop leftovers so they don't end up as class attributes.
    del _lazyattr
    del name
    del target_info
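
The loop above leans on two details. Writing into locals() inside a class body adds class attributes (this works in CPython, where the class body namespace is a real dict), and the default arguments such as _name=name hand each function the values from its own loop iteration. Here is a self-contained sketch of the same pattern. The SOURCES dict, the fake_connect helper, and the Loader class are all hypothetical stand-ins for illustration, not part of blaze_loader.

```python
class LazyAttr(object):
    """Same descriptor as above, repeated so this snippet is self-contained."""
    def __init__(self, get):
        self._get = get
        self._cache = {}

    def __get__(self, instance, owner):
        if instance is None:
            return self
        try:
            return self._cache[instance]
        except KeyError:
            val = self._get(instance)
            self._cache[instance] = val
            return val


# Hypothetical saved targets; "broken" stands in for a bad connection.
SOURCES = {"users": "users.csv", "purchases": "purchases.csv", "broken": None}

created = []  # records which targets were actually initialized


def fake_connect(name, target):
    """Hypothetical stand-in for creating a Blaze Data object."""
    if target is None:
        raise ValueError("cannot connect to %r" % name)
    created.append(name)
    return "Data(%s)" % target


class Loader(object):
    locals_ = locals()
    for name, target in SOURCES.items():
        @LazyAttr
        def _lazyattr(self, _name=name, _target=target):
            # Defaults are evaluated at definition time, so each function
            # keeps its own iteration's values. Class-body names like `name`
            # are not visible inside the function when it is called later.
            return fake_connect(_name, _target)
        locals_[name] = _lazyattr
    # Remove loop leftovers so they do not become class attributes.
    del _lazyattr, name, target, locals_


db = Loader()            # no connections are made here
users = db.users         # only "users" is initialized
try:
    db.broken
    failed = False
except ValueError:
    failed = True        # the error surfaces only when the attribute is used
```

Note that a failed initialization is not cached, so accessing the broken attribute again retries the connection rather than replaying the old error.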

And there you have it! Lazy loading of attributes in blaze_loader. Please feel free to reach out with any issues or suggestions!