PacMan: From ad-hoc Python Notebooks to an Internal Tools Platform

Karnika Arora
Published in FiscalNoteworthy
7 min read · Oct 12, 2022

Automation As a Service

At FiscalNote, we’re in the business of connecting people to their governments — and we do this by providing our users with high-quality data, both sourced via web scrapers and generated by internal experts in the field. As we’ve grown in size over the last few years, we’ve seen a need for “automation as a service” — ways to automate the repetitive tasks our customer experience and professional services teams work on to generate the content and analysis that our customers value.

The solutions engineering team started off with the successes of a few scrappy self-taught engineers looking to automate the boring stuff as members of these teams. We created solutions for clients, internal and external, delivering automation as a service by writing Python scripts. The focus of the Solutions Engineering team has evolved over the years, from ad hoc Python scripts and notebooks to a scalable web application framework that allows us to quickly deploy new automation-as-a-service tools for our internal teams. At its inception, the SE team was designated as a sort of starting ground for people making a career switch into engineering or new engineers generally. That context informed several of our early decisions around architecture and hiring, but as our scope has grown into being a more traditional engineering team, our engineers have grown with us. We now work with clients on data integrations, build internal tools and data pipelines, and develop and maintain the FiscalNote API.

Everybody Loves Internal Tools

Our Professional Services team delivers critical content and analysis to our clients, powered by the wealth of data and technology we have to offer. Alongside our SaaS platform, the analysts on Professional Services add content analysis and custom reporting on top of our data — which can call for a lot of customization for each specific client. We set out to try and automate the generation of these custom reports so our teams could work more efficiently, and the success of the initial ad hoc scripts led to our quest for an internal application to solve these problems at a larger scale. Our collection of tools for the Professional Services team started interestingly — with a host of command-line Python tooling designed to:

1) help our analysts access data from mysterious and complex data pipelines, and

2) automate repetitive data labeling, reformatting, and report generation tasks.

We were quickly confronted with the very predictable problem that our end users, though internal, were not engineers. The ad hoc command line tools were a nifty solution to the automation problem, but they were not user-friendly — and as the demand for more tools rolled in, we realized that the lack of a proper structure was leading to a duplication of effort as we recreated functionality across scripts. There were also several other problems with our lack of structure:

  • Poor code shareability made the scripts brittle and difficult to maintain,
  • The lack of a testing and CI workflow surfaced avoidable problems as we tried to scale,
  • Poor documentation meant that the learning curve was high for new developers, and
  • The lack of a user interface meant that the learning curve for these tools was quite high for users as well.

Our efforts were recognized for their big internal time and cost savings, which, coupled with the host of problems we faced at scale, led us to look for a web application framework. The goal was to make our tools more user-friendly and accessible for internal teams. It was also important to us that we could continue to deliver new and innovative tools to our internal customers quickly — which is what led us to choose the Python Flask framework for our internal web application, PacMan.

The Flask Blueprint

PacMan was our solution to the automation-as-a-service problem — a play on it eating the ghosts of projects past and neatly organizing them into a modular Flask application. Flask’s blueprint model made it easy for us to share application configs, templates, etc., across our micro applications or ‘Modules.’ Additionally, it helped that Flask was an easy framework for our pioneering-but-not-senior engineers to learn and deploy — and our primary goal here was to make it easy for new hires to contribute to our ecosystem of internal tools. As a bonus, we had support from other teams that were already using a Flask blueprint for an internal application, making it the easy choice.

PacMan instantiates a single Flask application object, and each module is registered to it as a blueprint. Each of our ad hoc tools now becomes a module within the PacMan application — sharing HTML templates (using Jinja2 syntax), static files, as well as injected resources like database connections.

├── README.md
├── acceptance_tests
│ └── healthcheck.py
├── ansible
│ ├── hosts
│ └── playbooks
├── config_templates
│ ├── config.yaml
├── environment.yml
├── example
│ ├── README.md
│ └── example_module
├── logging.yaml
├── modules
│ ├── __init__.py
│ ├── injector_keys.py
│ ├── pacman
│ ├── shared
│ ├── sql_queries.py
│ ├── validation_schemas.py
├── pacman_service.conf
├── pacman_utils
├── run_background_tasks.sh
├── run_server.sh
├── setup_env_variables.py
├── start.py
├── static
├── templates

In this structure, new modules would be added to the modules directory, with routes defined in config.yaml. The config file would be populated during deploys with environment-specific variables encrypted via Ansible Vault.
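As an illustration, a new module's app.py might look like the following sketch — the module name, template path, and route are hypothetical, and the real module API may differ:

```python
# Hypothetical modules/example_module/app.py -- names are illustrative
from flask import Blueprint, render_template

# The main application looks up this 'blueprint' attribute and
# registers it under the route configured in config.yaml
blueprint = Blueprint(
    "example_module",
    __name__,
    template_folder="templates",
)

@blueprint.route("/")
def home():
    # Shared Jinja2 templates at the app root are also available here
    return render_template("example_module/home.html")
```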

reference_resolvers:
  - type: env_reference_resolver
    identifier: env

app:
  env: "{{ENV}}"
  flask_debug_mode: "{{FLASK_DEBUG_MODE}}"
  SEND_SLACK_DEPLOY_MESSAGE: "{{SEND_SLACK_DEPLOY_MESSAGE}}"

# Add new modules here
module_configs:
  pacman:
    route: ""
    active: True
    display: False
    module_name: "pacman"
    display_title: "PacMan"
    description: "PacMan: Technical Services Playground"
  healthcheck:
    route: "healthcheck"
    active: True
    display: False
    module_name: "healthcheck"
  example_module:
    route: "example-module"
    active: True
    display: True  # This creates a card on the app home page
    module_name: "example-module"
    display_title: "Example Module"
    description: "Example Module"
    thumbnail_img: "example_module.png"
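Reading a file like this before registration could be sketched with PyYAML — the function name and default path here are illustrative, and PacMan's actual loader may differ:

```python
import yaml  # PyYAML; assumed here for the sketch

def load_active_modules(path="config_templates/config.yaml"):
    """Return the configs of modules that should be registered."""
    with open(path) as f:
        config = yaml.safe_load(f)
    # Inactive modules stay in the repo but are never routed
    return [
        cfg for cfg in config["module_configs"].values() if cfg["active"]
    ]
```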

The main Flask application would then register the route for each configured module:

module = imp.load_source(
    "modules.{}.app".format(module_config["module_name"]),
    "modules/{}/app.py".format(module_config["module_name"]))
# Load the 'blueprint' object from the module's Flask app file
flask_blueprint = getattr(module, "blueprint")

if module_config["module_name"] == "pacman":
    flask_app.register_blueprint(flask_blueprint, url_prefix="/")
else:
    flask_app.register_blueprint(flask_blueprint, url_prefix="/{}".format(module_config["route"]))
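Note that the imp module is deprecated (and removed in Python 3.12); an equivalent loader using importlib might look like this sketch:

```python
import importlib.util

def load_module_blueprint(module_name):
    """Load modules/<module_name>/app.py and return its 'blueprint'.

    importlib replacement for the deprecated imp.load_source call.
    """
    spec = importlib.util.spec_from_file_location(
        "modules.{}.app".format(module_name),
        "modules/{}/app.py".format(module_name),
    )
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, "blueprint")
```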

Lessons Learned at Scale

Who gets access?

Building any tooling that writes to production systems brings with it concerns about security and a need for controlled authorization. Since we were building tools that would be used by several different teams with different levels of access to data, it was important that we found a solution that allowed us to segment permissions based on group membership. We decided to extend our organization’s usage of Okta for this purpose.

Module-specific access in PacMan is controlled using our shared configuration file. Each group with access to PacMan is listed in this file, along with the modules that group should be able to access. The Flask blueprint’s before_request hook allows us to check for an auth token stored in the Flask session object before any registered endpoint is accessed. An unauthenticated user is redirected to a login page, and a user that is not authorized for a module is redirected to the home page with a message saying they are unauthorized. The PacMan homepage is also rendered based on access — a user who doesn’t have access to a module won’t see that module’s card on the homepage.

Here’s how we extended Flask’s before_request hook for authentication using OIDC — “groups” here are defined in our shared configuration template.

@flask_app.route("/login")
@oidc.require_login
def login():
    return redirect(url_for('pacman.home'))

@flask_app.route("/logout")
def logout():
    oidc.logout()
    return redirect(url_for('login'))

def before_request():
    """
    Log params and handle login
    """
    if not oidc.user_loggedin and (request.endpoint != 'login'):
        return redirect(url_for('login'), code=302)
    else:
        # Fetch the user's groups for logging and the access check
        groups = oidc.user_getinfo(['groups'])
        if not has_route_access(oidc):
            flash("You do not have access to that module")
            return redirect(url_for('pacman.home'), code=403)

flask_app.before_request(before_request)
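The has_route_access check itself isn't shown above; here is a simplified stand-in that takes the user's groups and the target module explicitly. The group and module names are made up — in PacMan the real mapping lives in the shared configuration file:

```python
# Illustrative group -> module mapping; the names below are hypothetical
GROUP_MODULE_ACCESS = {
    "pacman-admins": {"example-module", "healthcheck"},
    "professional-services": {"example-module"},
}

def has_module_access(user_groups, module_name):
    """Return True if any of the user's Okta groups grants the module."""
    return any(
        module_name in GROUP_MODULE_ACCESS.get(group, set())
        for group in user_groups
    )
```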

Validating user input

PacMan brought some challenges as we went from being unable to create client-facing data at scale to having tools that could do so easily. With this in mind, we built PacMan around the Cerberus data validation framework in Python. Each module that accepts data as a file attachment (rather than as a form field) validates it against a Cerberus schema. While WTForms provides built-in validation for forms, Cerberus allows us to validate file inputs against a pre-defined schema. This lets us give our users actionable feedback while maintaining data integrity, instead of outright rejecting inputs based on form-level validation.

Here’s an example of how we extend Cerberus validators to validate Zip Codes:

class AddressValidator(Validator):

    def _validate_zipcode(self, zipcode, field, value):
        """ {'type': 'boolean'} """
        if zipcode:
            pattern = re.compile(r"^[0-9]{5}$")
            try:
                if not re.search(pattern, value):
                    self._error(field, "Zipcode must be 5 digits")
            except TypeError:
                # Non-string values also fail validation
                self._error(field, "Zipcode must be 5 digits")

The validation schema for this file then uses the custom validator by setting "zipcode": True in the field's rules: "zipcode": {"required": True, "zipcode": True}

Bringing it all together

Take a sneak peek at what PacMan looks like below — and yes, the game works! It’s opened up a lot of room for innovation and we’re constantly thinking of new ways to experiment by creating internal tools. In addition to automation-as-a-service, we’re on a mission to make PacMan our experimentation platform for data science as well as innovations in our data pipeline.

If this piece has piqued your interest, check out our careers page for open roles and our technical blog “F(N)novate.”
