Author: Tony Yum
Python is an extremely dynamic and flexible language. In this demo, the goal is to show how we can use Python's import hook mechanism (sys.meta_path) to change imports from using the local filesystem to a remote source.
As a bonus, this article also demonstrates creating the necessary AWS resources via CloudFormation and how to remove them with a single command. There will also be some basic performance analysis of each remote source.
But hang on a second, why are we doing this? Is it a good idea to load sources remotely?
Yes, there are many reasons why it is a good idea.
- Rapid prototyping - server clusters can pick up code changes by simply bouncing the server, or even via importlib.reload (see the sketch after this list).
- Code sharing - a developer could create a simple script that reproduces an issue or demo and ask someone else to take a look by simply running against the repro staging area.
- Staging - a UAT staging area can contain only scripts that are ready to be shipped, and testing jobs can run against UAT backed by the production source.
- Hot fixes - the production source can have an empty hotfix DB sitting in front of it, so that when an urgent bug fix is required one only needs to move the fixed script into the hotfix source DB.
- Time travelling - we could add a source marker to define a version or point in time, such that all imports happen as if we had time travelled.
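To make the rapid-prototyping point concrete, here is a rough sketch of the idea (module names are illustrative; the remote-loading machinery itself is built later in this article): once imports come from a remote source, a long-running process can pick up a change without a redeploy.
import importlib
import foo.module1  # assume this is being served from the remote source

# ... a new version of foo/module1.py is published to the remote source ...

importlib.reload(foo.module1)  # the running process now executes the updated code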
Wouldn’t it be slow?
Yes it would be slower, but as the demo later shows, using an in-memory cache such as Redis yields great performance. We must also not forget that in comparison to having to build a new container or other deployment strategy, it is much faster.
How do you control production source?
Developers should not have write permissions to the prod source. There must be a testing and approval procedure, and an AWS Lambda with the necessary permissions would perform the release.
How do we roll back?
Time travel by changing the source marker. Or add a rollback button on the approval ticket.
How is this better than Git?
There is no competition. Developers would still use Git for source control. Production should contain only the head version, the UAT staging area should contain a particular branch from Git, and so on.
import sys
import importlib.abc
import types
import s3fs
import ast
import importlib
import boto3
import time
import redis
import os
from importlib.abc import MetaPathFinder
from importlib.machinery import ModuleSpec
from datetime import datetime
import matplotlib.pyplot as plt
%matplotlib inline
We will use CloudFormation to create the resources we need for this demo.
Specifically, we will create the following: an S3 bucket for the scripts, a DynamoDB table, and an ElastiCache Redis cluster.
STACK_DEF = '''AWSTemplateFormatVersion: "2010-09-09"
Resources:
'''
STACK_DEF += '''
s3:
Type: AWS::S3::Bucket
DeletionPolicy: Delete
Properties:
BucketName: kinyu-demo-src
'''
STACK_DEF += '''
dynamodb:
Type: AWS::DynamoDB::Table
Properties:
AttributeDefinitions:
-
AttributeName: "path"
AttributeType: "S"
KeySchema:
-
AttributeName: "path"
KeyType: "HASH"
ProvisionedThroughput:
ReadCapacityUnits: "5"
WriteCapacityUnits: "5"
TableName: kinyu-demo-src
'''
Note that VpcSecurityGroupIds is set to sg-0ce5339b18f038bbd simply because I already have that security group set up.
You could easily have created the security group as part of the CloudFormation YAML.
Just ensure that port 6379 is open for inbound traffic.
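For reference, a security group resource along those lines could be sketched like this; the CIDR range is purely illustrative, this snippet is not part of the stack we actually create in this demo, and you would point VpcSecurityGroupIds below at its GroupId.
# Hypothetical alternative: define the security group in the stack itself.
SECURITY_GROUP_DEF = '''
  redisSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Allow inbound Redis traffic on port 6379
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 6379
          ToPort: 6379
          CidrIp: 10.0.0.0/16  # illustrative; restrict to your own network
'''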
STACK_DEF += '''
redis:
Type: 'AWS::ElastiCache::CacheCluster'
Properties:
AutoMinorVersionUpgrade: 'true'
Engine: redis
CacheNodeType: cache.t2.micro
NumCacheNodes: '1'
VpcSecurityGroupIds: [sg-0ce5339b18f038bbd]
PreferredAvailabilityZone: eu-west-1c
'''
stack_name = 'kinyu-demo'
client = boto3.client('cloudformation')
res = client.create_stack(StackName=stack_name, TemplateBody=STACK_DEF)
stack_id = res['StackId']
res
Now we wait for CloudFormation to set up the various resources that we need. In particular, ElastiCache takes a little while to provision. Let's use the code below to poll for readiness.
for i in range(60):
res = client.describe_stack_events(StackName=stack_name)
status = sorted(res['StackEvents'], key=lambda x: x['Timestamp'])[-1]['ResourceStatus']
print((i, datetime.now().strftime('%H:%M:%S'), status))
if status == 'CREATE_COMPLETE':
break
if status == 'ROLLBACK_COMPLETE':
raise Exception('Failed to create stack: ' + stack_name)
time.sleep(30)
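As an aside, boto3 also ships a built-in waiter that does essentially the same polling, which is handy outside of a notebook:
# Equivalent to the polling loop above: block until the stack is ready
# (raises if creation fails).
waiter = client.get_waiter('stack_create_complete')
waiter.wait(StackName=stack_name)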
We now define a handful of demo scripts that will be placed at a remote location.
folders = ['foo', 'foo/bar']
scripts = [
('foo/__init__.py', ''),
('foo/module1.py', '''
def add(x, y):
return x + y
'''),
('foo/module2.py', '''
def subtract(x, y):
return x - y
'''),
('foo/bar/__init__.py', ''),
('foo/bar/module3.py', '''
def multiply(x, y):
return x * y
'''),
('foo/bar/module4.py', '''
def divide(x, y):
return x / y
'''),
('foo/bar/chunky.py', '''
# Just a variable with lots of 'a' in it to test loading a bigger file
lots_of_as = '{}'
'''.format('a' * 10000))
]
Let's first push the scripts defined above into an S3 bucket.
source_base = 's3://kinyu-demo-src/home/tony.yum/'
s3 = s3fs.S3FileSystem()
for path, code in scripts:
fullpath = source_base + path
print('Writing to: ' + fullpath)
with s3.open(fullpath, 'w') as f:
f.write(code)
All the setup is done. Time to see how Python allows us to add a hook so that imports are loaded from the S3 bucket instead of the local file system.
More information can be found here: https://docs.python.org/3/reference/import.html#import-hooks
This is just a light wrapper around s3fs such that asking .exists() actually loads the script and caches it.
This way, after the finder has asked whether a script exists, the loader can ask for the script content without hitting S3 a second time.
class S3CachedReader:
script_cache = {}
@classmethod
def exists(cls, fullname):
try:
cls.get_code(fullname)
return True
except FileNotFoundError:
return False
@classmethod
def get_code(cls, fullname):
code = cls.script_cache.get(fullname)
        if code is None:
s3 = s3fs.S3FileSystem()
with s3.open(fullname, 'r') as f:
code = f.read()
cls.script_cache[fullname] = code
return code
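A quick illustration of the caching behaviour, reusing source_base from above: the first call fetches from S3 and caches the text, so the second call is served from memory.
path = source_base + 'foo/module1.py'
S3CachedReader.exists(path)         # downloads foo/module1.py and caches it
S3CachedReader.get_code(path)[:30]  # served from the in-memory cache, no second S3 call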
A Python finder is nothing more than a class that implements the method find_spec.
Per the description in PEP 302 (which defines the original find_module protocol; the modern find_spec plays the same role):
This method will be called with the fully qualified name of the module. If the finder is installed on sys.meta_path, it will receive a second argument, which is None for a top-level module, or package.path for submodules or subpackages [5]. It should return a loader object if the module was found, or None if it wasn't. If find_module() raises an exception, it will be propagated to the caller, aborting the import.
class S3Finder(MetaPathFinder):
def __init__(self, base_path: str):
self.base_path = base_path
def find_spec(self, fullname, path, target=None):
if not (fullname.startswith('foo') or fullname.startswith('django')):
return
s3_path = '{}{}.py'.format(self.base_path, fullname.replace('.', '/'))
s3 = s3fs.S3FileSystem()
if S3CachedReader.exists(s3_path):
return ModuleSpec(fullname, S3Loader(), origin=s3_path)
s3_path = '{}{}/__init__.py'.format(self.base_path, fullname.replace('.', '/'))
        if S3CachedReader.exists(s3_path):
            return ModuleSpec(fullname, S3Loader(), origin=s3_path)
A Loader class implements create_module and exec_module. It is important to note that all modules are first created, and only executed afterwards, which avoids issues with circular imports.
From PEP-302 we see that:
This method returns the loaded module or raises an exception, preferably ImportError if an existing exception is not being propagated. If load_module() is asked to load a module that it cannot, ImportError is to be raised.
There is more to read than the above, but I won't paste the rest here. It's all on the PEP 302 page.
class S3Loader:
    @classmethod
    def create_module(cls, spec):
        # Create an empty module object; imp.new_module is deprecated,
        # so use types.ModuleType instead.
        module = types.ModuleType(spec.name)
        # Setting __path__ marks the module as a package so that submodule
        # imports (e.g. foo.bar.module3) are routed back through our finder.
        module.__path__ = spec.name.split('.')[:-1]
        module.__file__ = spec.origin
        return module

    @classmethod
    def exec_module(cls, module):
        # Fetch the source (already cached by the finder's exists() call)
        # and execute it in the module's namespace.
        code = S3CachedReader.get_code(module.__file__)
        exec(code, module.__dict__)
        return module
Let's first see what happens if we try to import a module that is not in our PYTHONPATH
from foo.module1 import add
As expected, we get ModuleNotFoundError. Now let's install our S3Finder in sys.meta_path
finder = S3Finder(source_base)
sys.meta_path.append(finder)
from foo.module1 import add
add(3, 5)
Wow! It worked. We just imported a module from S3.
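To convince ourselves the module really came from the bucket, we can check the __file__ that our loader set from spec.origin; it should point at the S3 path rather than anywhere on the local filesystem.
# 'from foo.module1 import add' does not bind the name foo locally,
# so look the module up in sys.modules instead.
print(sys.modules['foo.module1'].__file__)
# expected output along the lines of:
# s3://kinyu-demo-src/home/tony.yum/foo/module1.py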
Let's try other forms of imports.
import foo.module2
from foo.bar.module3 import *
foo.module2.subtract(6, 2), multiply(3, 3)
Excellent. It is working as expected. But what about performance?
time_taken = {}
before = datetime.now()
for path, code in scripts:
fullpath = source_base + path
with s3.open(fullpath, 'r') as f:
f.read()
time_taken['s3'] = (datetime.now() - before).total_seconds() * 1_000_000
print("Time taken: {:.1f} ms".format(time_taken['s3'] / 1000))
So it took 645ms. Is that a problem? Probably not for this demo example.
But what if a real system has 100 times more scripts? That'll take a minute to load all the scripts.
Let's see how other remote sources perform.
import boto3
from boto3.dynamodb.conditions import Key
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('kinyu-demo-src')

dynamo_base_path = '/homedirs/tony.yum/'

# Push the scripts into the DynamoDB table.
for path, code in scripts:
    fullpath = dynamo_base_path + path
    table.put_item(Item={
        'path': fullpath,
        'code': code
    })

# Time how long it takes to read them all back.
before = datetime.now()
for path, code in scripts:
    fullpath = dynamo_base_path + path
    table.query(KeyConditionExpression=Key('path').eq(fullpath))
time_taken['dynamodb'] = (datetime.now() - before).total_seconds() * 1_000_000
print("Time taken: {:.1f} ms".format(time_taken['dynamodb'] / 1000))
DynamoDB took only 39ms. That looks much better! Can we improve this further?
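Incidentally, rather than hardcoding the Redis endpoint used below, you could look it up from the stack we created. A sketch, assuming the cluster is reachable from wherever this notebook runs (the article simply continues with the hardcoded value):
# Find the physical id of the 'redis' resource in our stack, then ask
# ElastiCache for its endpoint.
cluster_id = client.describe_stack_resource(
    StackName=stack_name, LogicalResourceId='redis'
)['StackResourceDetail']['PhysicalResourceId']

node = boto3.client('elasticache').describe_cache_clusters(
    CacheClusterId=cluster_id, ShowCacheNodeInfo=True
)['CacheClusters'][0]['CacheNodes'][0]
redis_host, redis_port = node['Endpoint']['Address'], node['Endpoint']['Port']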
r = redis.Redis(
host='kin-re-108bjvy3vtdll.bqvjwk.0001.euw1.cache.amazonaws.com',
port='6379'
)
redis_base_path = 'scripts/python/homedirs/tony.yum/'
for path, code in scripts:
fullpath = redis_base_path + path
print('Writing to: ' + fullpath)
r.set(fullpath, code)
before = datetime.now()
for path, code in scripts:
fullpath = redis_base_path + path
r.get(fullpath)
time_taken['redis'] = (datetime.now() - before).total_seconds() * 1_000_000
print("Time taken: {:.1f} ms".format(time_taken['redis'] / 1000))
Sweet! That took only 5.2ms. Now loading 100 times more scripts should take only half a second.
print('\n'.join(f'{k}: {v / 1000}ms' for k, v in time_taken.items()))
plt.style.use('ggplot')
x = list(time_taken.keys())
y = [v / 1000 for v in time_taken.values()]
x_pos = list(range(len(x)))
plt.bar(x_pos, y, color='green')
plt.xlabel("Remote Script Source")
plt.ylabel("Time Taken (ms)")
plt.title("Script Load Performance")
plt.xticks(x_pos, x)
plt.show()
As we can see, S3 is the slowest option while Redis is the fastest.
However, speed may not be the only consideration. There are cost, features and resilience to data loss to think about.
Comparing the merits of these choices in full is beyond the scope of this article.
However, I would say this: even though Redis is the fastest, it is after all an in-memory DB, which means that even with its various backup strategies there is no guarantee that committed data will not be lost (unlike S3 or DynamoDB). Realistically, the system would have to be set up so that writes go to both Redis and a persistent DB, and only once the latter has completed would the write be considered done. Reads would simply come from Redis.
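A minimal sketch of that write path, reusing the table and r handles from above (names and error handling are illustrative):
def write_script(path, code):
    # The write only counts once the persistent store has accepted it...
    table.put_item(Item={'path': path, 'code': code})
    # ...then the cache is populated for fast reads.
    r.set(path, code)

def read_script(path):
    cached = r.get(path)
    if cached is not None:
        return cached.decode()
    # Cache miss: fall back to the persistent store and re-populate Redis.
    item = table.get_item(Key={'path': path}).get('Item')
    if item is None:
        raise FileNotFoundError(path)
    r.set(path, item['code'])
    return item['code']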
So far I have shown how you would install an S3Finder into sys.meta_path.
The obvious extension would be to create a generic finder and loader for any remote source, or even a union of sources.
KYDB is just the API for the job.
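To give a flavour of what that could look like, here is a sketch of a finder parameterised by any reader object exposing exists(path) and get_code(path), tried in order so several sources can be layered. This is an invented interface for illustration, not KYDB's actual API.
from importlib.abc import MetaPathFinder
from importlib.machinery import ModuleSpec
import types


class RemoteLoader:
    def __init__(self, reader):
        self.reader = reader

    def create_module(self, spec):
        module = types.ModuleType(spec.name)
        module.__path__ = []          # allow submodule imports via this finder
        module.__file__ = spec.origin
        return module

    def exec_module(self, module):
        exec(self.reader.get_code(module.__file__), module.__dict__)


class RemoteFinder(MetaPathFinder):
    def __init__(self, readers, base_path, prefixes=('foo',)):
        self.readers = readers        # e.g. [redis_reader, dynamo_reader, s3_reader]
        self.base_path = base_path
        self.prefixes = prefixes

    def find_spec(self, fullname, path, target=None):
        if not fullname.startswith(self.prefixes):
            return None
        rel = fullname.replace('.', '/')
        for candidate in (rel + '.py', rel + '/__init__.py'):
            full = self.base_path + candidate
            for reader in self.readers:   # first source that has the script wins
                if reader.exists(full):
                    return ModuleSpec(fullname, RemoteLoader(reader), origin=full)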
In the future, I will write about a library that leverages KYDB to allow a Docker cluster to load code remotely.
The cool thing is that we can experiment with ideas on this cluster without redeploying any containers.
Finally, once we are done with our resources, we can delete them all with a single command.
client.delete_stack(StackName=stack_name)