Accelerating modelling with Frea:
Paidy's in-house feature store with Prefect and PostgreSQL.
“Frea Spinning the Clouds” by John Charles Dollman (1909)
We’ve recently partnered with Prefect to make it easy to build scalable data science solutions at Paidy that fit our team and our ways of working. This blog post covers our in-house feature store, known as Frea, which is built on top of the Prefect platform and which reduced our model training pipeline time from weeks to hours.
Risk modelling is a cornerstone of Paidy’s business. It enables us to give our consumers the best possible shopping experience and to build a solid foundation for the company’s future.
As the diversity of merchants, and therefore diversity of risk, increases, so do the requirements on modelling activities. This means more model training and validation, an ever growing list of model features, and additional data scientists dedicated to the effort.
Until recently, feature generation was carried out through semi-standardized Jupyter notebooks and SQL queries executed by data scientists. The results were stored in any number of tables that were versioned manually by the data scientist. This worked fine at the beginning, but as the team grew from 3 to 15 people building and maintaining multiple models, we encountered a few challenges:
How do we maintain feature definition consistency between models and between modellers?
How do we record and track changes to the features for compliance, auditing, and performance monitoring?
How can we reduce the time-to-market for new or updated features?
How can we reduce the operational and technological burden of feature generation as the business scales into more volume or new areas of modelling?
With an ever-increasing catalog of payments, how do we keep the feature generation time to a reasonable length?
The general approach to addressing these concerns is to:
Separate the feature generation from the model creation
Separate experimental, ad-hoc feature exploration from operational feature generation for modelling
Standardize the feature definition process to enable consistent tracking
Automate all aspects of feature generation, including tracking changes to features, and backfilling of data for analysis and training.
Standardize feature retrieval for modelling and analysis.
This approach mitigates the challenges we were facing, reduces the duplication of sensitive data, and enables a host of future projects that improve the modelling experience.
Most of the development should be (and has been) focused on a backend that includes:
A storage layer to store the features,
A metadata layer to store information about the features (arguably the most important to get right) such as feature code, version information, documentation and more,
And feature generation jobs that use the data in the metadata layer to compute the new features and store them in the storage layer.
Our department uses a PostgreSQL data store, known as Mimir, for all analytical work -- this makes it a natural location for the storage layer. It would also host the metadata layer as a simple table.
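To make the metadata layer concrete, here is a sketch of the kind of record that "simple table" might hold for each feature. The field names and types below are an illustrative assumption, not Frea's actual schema:

```python
from dataclasses import dataclass

@dataclass
class FeatureMetadata:
    # Hypothetical per-feature record for the metadata layer.
    name: str              # unique feature name, doubling as the storage column name
    description: str       # human-readable explanation of the feature
    column_type: str       # PostgreSQL column type, e.g. "numeric"
    version: int           # bumped whenever the definition changes
    initial_backfill: str  # window of history to fill on first registration
    definition: str        # the SQL query that computes the feature

example = FeatureMetadata(
    name="payment_amount",
    description="The amount of the payment",
    column_type="numeric",
    version=1,
    initial_backfill="2 years",
    definition="SELECT ...",
)
```

Keeping this alongside the feature values in the same PostgreSQL instance means feature code, versions, and documentation can be queried with plain SQL.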
In keeping with our philosophy of working with data scientists, we decided to keep the feature definitions in SQL, but augment them with some YAML for feature configuration and standard metadata.
As you can see, for most of the layers we have something to build upon: PostgreSQL for storage and metadata, and SQL for feature definitions. The remaining question was how to actually orchestrate all the jobs that calculate features, backfill data, register new features, and so on. The idea of using cron and implementing a lot of important basic functionality ourselves, like error recovery, concurrency, logging, and ad-hoc task execution, was very daunting, so we decided instead to look for external partners that could make that possible.
Prefect makes it very easy to build, execute, and monitor complicated workflows with a really nice (and open-source) interface that lets you build parameterizable, dynamic, scheduled (and ad-hoc!) jobs, and that’s the tool that filled the orchestration piece in our custom solution.
Frea & Prefect
The Feature Store, codenamed Frea, leverages the capabilities that come with Prefect to keep track of feature definition changes, carry out feature backfilling, and update the features when new payments come in.
Prefect is split into two parts:
Prefect Core: An open-source Python API with which you build your Flows.
Prefect Cloud: The monitoring and orchestration component of the system. It lets you monitor the health and progress of your Flows and even carry out ad-hoc execution.
Next we’ll take a look at the basic structure of Frea, starting with feature definition and registration.
Feature definition and registration
As mentioned above, all our features are defined as SQL queries. In order to support adding crucial metadata, we decided to represent feature definitions as YAML files with a standardized structure that could be parsed and handled by a flow.
Below is an example of a new feature definition. The definition includes:
A description of the feature
The feature type
The window of data to backfill upon first registering the feature
The definition of the feature
This structure is quite flexible and lets us add more metadata in the future, e.g. adding an “owner” field with the Slack handle of the person to automatically notify if the backfill failed.
```yaml
description: The amount of the payment
type: numeric # PostgreSQL column type
initial_backfill: 2 years
definition: |
  ...
  JOIN $payments_source_table -- will be explained below
    ON payments_data.id = $payments_source_table.$source_id
  GROUP BY 1
```
All the YAML files are under version control, stored inside a private git repo, so we can run PR reviews and keep track of changes for sanity checks, debugging, and auditing.
Now that we have the file structure figured out, how do we read the metadata and store it in a specific Postgres table? This is where the Metadata registration flow comes in:
It clones the GitHub repository with the feature definitions.
Using Prefect’s mapping capabilities, the definition files are parsed. We can provide a runtime-configurable Parameter that represents a list of filenames to read. All the definitions are processed in parallel with no extra work by the developer.
The metadata is pushed to the metadata layer in Mimir.
Mapping over all the different inputs is super simple:
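In Prefect 1.x terms, the registration flow might look roughly like this. Note that the task bodies, names, and the `definition_files` parameter are illustrative assumptions, not Frea's actual code:

```python
from prefect import Flow, Parameter, task

@task
def parse_definition(path):
    # Read one YAML feature definition and return its metadata.
    ...

@task
def push_metadata(metadata):
    # Upsert one feature's metadata into the Mimir metadata table.
    ...

with Flow("metadata-registration") as flow:
    # A runtime-configurable list of definition files to (re)register.
    definition_files = Parameter("definition_files")
    # .map() runs parse_definition once per file; push_metadata is then
    # mapped over the results, one task run per parsed definition.
    parsed = parse_definition.map(definition_files)
    push_metadata.map(parsed)
```

Because the mapping is declared on the task, parallelism is a property of the executor rather than of the flow code itself.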
With some CI/CD scripting, the Metadata registration flow is automatically triggered when a new merge to master happens in the repository that stores the feature definitions. We’ll have more information on this topic and other details of our CI/CD flow deployment pipeline in a future post.
Backfilling
The Backfilling flow is responsible for calculating and storing the feature values for all the relevant payments. This process is executed overnight to minimize the impact on the database and on the database administrators ;-).
At its core the backfilling logic is simple:
Add columns to the storage layer for any new features or for updated existing features,
Calculate the new value of a given feature for the payments we are backfilling,
Update the storage layer based on the feature definition.
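The first step might translate into SQL along these lines. The storage table name and the versioned-column naming scheme here are assumptions for illustration, not Frea's actual conventions:

```python
def add_feature_column_sql(feature_name, column_type, version):
    # New features get a fresh column; updated features get a versioned
    # one so that data used for earlier model training stays reproducible.
    column = f"{feature_name}_v{version}"
    return (
        f"ALTER TABLE feature_store "
        f"ADD COLUMN IF NOT EXISTS {column} {column_type}"
    )

sql = add_feature_column_sql("payment_amount", "numeric", 2)
```

`ADD COLUMN IF NOT EXISTS` keeps the step idempotent, so re-running the flow after a partial failure is safe.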
The last two steps are practically one, since we can embed a subquery (the actual feature definition) directly in the UPDATE ... SET upsert query against the storage layer.
The data is backfilled based on the timespan chosen by the data scientist for their respective analysis. As you might recall, the feature definition contains a variable $timespans that Frea can replace with the appropriate timespan:
Either covering the time since the last fill, or
The desired timespan for initial backfilling
As with the YAML definitions, we make use of other template placeholders to further customize feature definitions at runtime. Those abstractions allow us to modify Frea’s backend without impacting the user experience.
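A minimal sketch of this substitution, using Python's `string.Template`. The feature definition, the `feature_store` table, and the column names are hypothetical; only the `$payments_source_table` and `$timespans` placeholders come from the definitions above:

```python
from string import Template

# An illustrative feature definition with runtime placeholders.
definition = Template(
    "SELECT id, amount FROM $payments_source_table "
    "WHERE created_at >= now() - interval '$timespans'"
)

def render_backfill_query(feature_name, definition, source_table, timespans):
    # Fill in the placeholders, then wrap the rendered definition as the
    # subquery of an UPDATE against the (hypothetical) storage table.
    subquery = definition.substitute(
        payments_source_table=source_table,
        timespans=timespans,
    )
    return (
        f"UPDATE feature_store SET {feature_name} = f.amount "
        f"FROM ({subquery}) AS f WHERE feature_store.id = f.id"
    )

query = render_backfill_query("payment_amount", definition, "payments", "2 years")
```

Because the placeholders are resolved just before execution, the same definition serves both initial backfills and incremental nightly fills.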
Speeding up backfill
At first, the backfilling was many times slower than the existing manual backfill because, by default, Prefect tasks run sequentially. As a result, only one feature was being backfilled at a time, underusing the computational power of the database.
Prefect provides an out-of-the-box solution for parallelization: the Dask-based executor. With very minimal changes to the flow configuration, Prefect can automatically spawn a Dask cluster to parallelize a mapped task. The only configuration required is to install Dask itself and to specify the relevant executor. The same DaskExecutor gives us yet another speed bonus by introducing depth-first execution at no extra cost!
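Switching a flow over is roughly a one-liner. The import path and keyword arguments below match Prefect 0.x/1.x releases and may differ in your version; the worker count is just an example:

```python
from prefect.engine.executors import DaskExecutor

# Run the flow's mapped tasks on a local Dask cluster;
# cluster_kwargs is passed through to the Dask cluster constructor.
flow.run(executor=DaskExecutor(cluster_kwargs={"n_workers": 8}))
```

The flow code itself is unchanged; only the executor handed to the run differs between sequential and parallel execution.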
By introducing a Dask cluster with around 8 workers, running tasks asynchronously and in parallel, we cut backfill time by 10-100x. We will discuss our Prefect environment and how we deploy flows in a future post.
Coming up with features, codifying them, and backfilling are often among the most time-consuming steps in any modelling workflow, made worse by a lack of conventions and of a streamlined process between Data Science team members.
By introducing Frea, we have:
Cut down our modelling workflow turn-around from a few weeks of dedicated work to hours of mostly hands-off monitoring.
Built a lightweight system that does exactly what we need and nothing more, without expensive data infrastructure changes, and minimal boiler-plate thanks to Prefect.
Established a solid base for other modelling pipeline integrations, such as automated hyperparameter tuning and model performance tests, all managed from a central location using Prefect Cloud.
Want to know more?
In the next article we will share the results of our efforts to automate Prefect flow deployment, along with other interesting technical tidbits.