Fn Graph — Lightweight pipelines in Python


Today we are releasing our new Python modelling pipeline (or function graph) library, fn_graph. We have been building and operationalising various types of computational models for the last decade, whether they be machine learning models, financial models or general data pipelines. This library is the result of the lessons we have learned along the way.

What is fn_graph?

fn_graph is a lightweight library that lets you easily build and visualise data-flow style models in Python, and then easily move them to your production environment or embed them in your model-backed products.

fn_graph is built to improve the development, production, and maintenance phases of a model’s life cycle.

Wait, what do we mean by model?

Model is a very overloaded term, but here we mean it in the holistic sense. This includes statistical and machine learning models, the data preparation logic that precedes them, as well as more classic quantitative models, such as financial models. Associated with the models is all the logic to do testing, evaluation and configuration.

What’s “wrong” with the standard approach?

The standard approach to developing models and integrating them into either production systems or products has a number of real irritations and inefficiencies. The standard model development life cycle can be (very coarsely) split into three phases:

Model Development

Model development entails the investigation of any underlying data, design of the solution, preparation of the data, training of any machine learning models, and the testing of any results. This generally happens in a Jupyter notebook, and often on an analyst’s laptop (or personal cloud instance). During this phase the ability to rapidly iterate, try new ideas and share those results with relevant stakeholders is paramount.

Notebooks have been a huge advance for the data community, but they also have some really undesirable qualities. The main problem is that they are built for experimentation rather than for building something reusable. This is fine in the academic world, where you primarily want to show something can be done or prove a specific result, with associated lovely markup, charts and a well-written literate-programming style narrative. Unfortunately, and much to the disappointment of many novice data-scientists, this is not what industry wants. Industry wants to be able to run something more than once, in order to hopefully turn a profit.

The common reality is even bleaker: most of the time you don’t get a beautiful Donald Knuth style notebook. Instead, because notebooks militate against modularisation, you get an unsightly mess of unstructured, un-encapsulated code that often won’t run unless the cells are executed in a magical order unknown even to the author. For a suitably complex domain the notebook can become extremely long and very unwieldy, because it is not easy to break things up into modules, and when you do, you lose the ability to easily look inside the intermediate results.

These considerations extend further into the model life cycle, which we will get to, but also extend horizontally into aspects like extensibility, reusability, and the ease of maintaining technical and institutional knowledge across teams.

Model productionisation

Once the analyst/data-scientist/modeller has finished their model, the results have been verified and all the numbers look good, the next phase is to put the model into production. This can mean different things for different projects, but it could be moving it to work off production data sources as some sort of scheduled task, or wrapping it up into an API, or integrating it more deeply into the code base of an existing product.

Whatever the case, the requirements are very different from those of the notebook-based model development phase. The model development phase prioritises being able to quickly try new things and being able to deeply inspect all the steps and inner workings of a model. Conversely, the production phase prioritises having a clean encapsulation of the model which is simple to configure and repeatedly run end-to-end.

So what often happens is the notebook gets thrown over the fence to a production engineer who, wanting to write modular, reasonably reusable code, now has to pull it apart into various functions, classes and so on. This inevitably produces subtle errors that take a long time to iron out, especially in statistical models, where testing is much more difficult than just checking that the end results are equal.

If the model does not get rewritten, it gets wrapped in one big horrible function that makes debugging, testing and maintenance extremely difficult, while often being very inefficient.

Whatever happens, a version of the model is ready to be put into production… which it is.

Model maintenance

After the model has been in production for a while, it is time to make a change. This could be because the requirements have changed slightly, or because, with more data and usage, the results are not giving the behaviour that was initially desired.

Making and validating these changes requires the skills of an analyst/data-scientist/modeller, not a production engineer. Now remember, we essentially have two versions of the code: the analyst’s notebook and the engineer’s module. These probably have a few differences in behaviour as well.

The analyst cannot just use the engineer’s production code, because it is nicely encapsulated, which makes it difficult to get at the intermediate steps, and those are probably exactly what you need to investigate. So either the original notebook has to be patched and updated to accommodate any differences, or the production code has to be turned inside out and flattened into another notebook.

The changes can then be made, and the previous productionisation process has to be repeated. This continues for the effective life of the product. It is not that fun.

How does fn_graph help?

fn_graph lets you build composable data-flow style pipelines (well, really graphs) out of normal Python functions, with minimal boilerplate. This structure allows the details of a model to be explored and interrogated during model development, and the full pipeline to be easily imported into production code without any changes.

The central trick of the library is that it matches function argument names to function names to build up the graph. This leaves a very clean implementation where everything is just a function, but because the library knows the structure of the function graph, it can interrogate it and access the intermediate results, which makes inspection very easy. Because each function can and should be pure, meaning it has no side effects, the code is very reliable and easy to reason about. Once this function graph, which we call a Composer, has been completed, it is just a normal Python object that can be imported into production code and queried for results. This is much easier to see in an example (taken from the fn_graph credit model example).
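The credit model example is too big to reproduce here, but a minimal sketch shows the idea. This follows the pattern from the library’s README; the `Composer().update(...)` and `calculate(...)` calls reflect our reading of the documented API:

```python
from fn_graph import Composer

# Each function becomes a node in the graph. Dependencies are wired
# up by name: b depends on a because its argument is called a.
def a():
    return 5

def b(a):
    return a * 5

def c(a, b):
    return a * b

# Compose the plain functions into a graph.
composer = Composer().update(a, b, c)

# Ask for any nodes; intermediate results are just other nodes.
print(composer.calculate(["b", "c"]))  # {'b': 25, 'c': 125}
```

Note that `a`, `b` and `c` remain ordinary functions: they can be unit tested, reused and inspected on their own, while the composer handles the wiring.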

Caching

Something that analysts and data-scientists repetitively re-implement in their notebooks is caching. When working with even slightly larger datasets, iteration becomes drastically slower (or completely prohibitive) if the entire model has to be rerun for each change. This often results in if statements sprinkled over a notebook that control whether to calculate a value or just pull it from a previously saved file. Along with some deft out-of-order cell execution, this sort of works, but it leaves the notebook messy and very difficult to reproduce. Worse, changes in the logic may be unwittingly ignored (cache invalidation errors), leading to all sorts of errors and wasted time.

However, since fn_graph has the dependency graph of the functions being called, caching and cache invalidation become a simple exercise that can be uniformly and automatically applied. fn_graph ships with multiple cache backends, including a development_cache which will intelligently cache to disk and invalidate the cache when a function changes. This makes the caching completely transparent to the user.
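As a sketch (assuming the `development_cache` method described in the documentation, keyed by a name used for the on-disk cache), enabling it on the composer from the earlier example is a one-liner:

```python
# Returns a new composer whose results are cached to disk and
# invalidated automatically when a function's code changes.
cached_composer = composer.development_cache(__name__)

# The first call computes everything; subsequent calls reuse the
# cached intermediate results until a function in the graph is edited.
print(cached_composer.calculate(["c"]))
```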

It’s just plain Python functions

The biggest advantage of fn_graph is that it really is just plain Python functions. There is no heavyweight object model that has to be learnt, and there is no complicated runtime.

It makes no restrictions on what toolkits you can use, since it really just orchestrates function calls. You can use it with your favourite machine learning library, with pandas, with plain Python data structures, or with whatever niche library you need. You can integrate it into any web server or task system you may have; as long as it can call a Python function, there is no restriction.
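For instance, serving a model over HTTP is just an import and a function call. Here is a hedged sketch using Flask, where the `credit_model` module exporting a composer is hypothetical, standing in for the earlier example:

```python
from flask import Flask, jsonify

# Hypothetical module that builds and exports the Composer from the
# earlier example; a composer is just a normal Python object.
from credit_model import composer

app = Flask(__name__)

@app.route("/score")
def score():
    # Running the model end-to-end is just asking the graph for
    # the outputs you need.
    return jsonify(composer.calculate(["c"]))

if __name__ == "__main__":
    app.run()
```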

Sound interesting?

You can find a live gallery site at https://fn_graph.businessoptics.biz and the documentation at https://fn-graph.readthedocs.io/en/latest/, and check out the GitHub repositories at: