Using FalkorDB to Give Engineers X-Ray Vision Into Their Data Pipelines


The Problem Nobody Talks About

At a large retail enterprise, data engineers make changes to pipelines every day: a new transformation, a refactored join, a column renamed or dropped. Each of those changes has downstream consequences: other pipelines that depend on those tables, dashboards that read those columns, models trained on that data.

There was no reliable way to know what those consequences were before pushing to production.

The process was manual and social: ask around, check Confluence pages that hadn’t been updated in months, hope someone who worked on the original pipeline was still on the team. Inevitably things broke in prod that nobody had anticipated. Impact was discovered after the fact, not before.

There was a second problem that turned out to be just as significant. Because there was no global view of the data ecosystem, engineers kept solving the same problems from scratch. New pipelines were built to do work that existing pipelines already did. Nobody knew, not because the information didn’t exist, but because it was invisible.

Our platform was built to make both of these problems solvable.

Why Data Lineage Is a Graph Problem

The instinct when building any data platform is to reach for a relational database. Build tables for workflows, data entities, and features, connect them with foreign keys, and join them when you need answers.

This approach works for one or two hops. It fails fast at enterprise scale.

The question “what is downstream of this feature?” requires following a chain: feature -> data entity -> workflow -> data entity -> feature -> workflow -> and so on, five or more levels deep across a large ecosystem. In a relational model, each level of that chain is a new join or a recursive CTE pass. Performance degrades with each added hop. At the depths our platform needs to query in production, these traversals become impractical.
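
To make the relational cost concrete, here is a minimal sketch of the recursive CTE approach, using SQLite from Python's standard library. The `lineage_edge` table, its columns, and the sample rows are hypothetical and exist only for illustration; they are not the platform's schema.

```python
import sqlite3

# Hypothetical edge table: each row says "src feeds dst".
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lineage_edge (src TEXT, dst TEXT)")
conn.executemany(
    "INSERT INTO lineage_edge VALUES (?, ?)",
    [
        ("sales.revenue", "agg_daily"),
        ("agg_daily", "daily.revenue"),
        ("daily.revenue", "report_job"),
        ("report_job", "report.total"),
    ],
)

# Every extra level of depth is another pass of the recursive member,
# i.e. another self-join of the edge table against the rows found so far.
DOWNSTREAM_SQL = """
WITH RECURSIVE downstream(node, depth) AS (
    SELECT dst, 1 FROM lineage_edge WHERE src = ?
    UNION
    SELECT e.dst, d.depth + 1
    FROM lineage_edge AS e
    JOIN downstream AS d ON e.src = d.node
    WHERE d.depth < ?
)
SELECT node, MIN(depth) AS min_depth
FROM downstream
GROUP BY node
ORDER BY min_depth, node
"""


def downstream(start: str, max_depth: int) -> list[tuple[str, int]]:
    """Return (node, depth) pairs reachable from start within max_depth hops."""
    return conn.execute(DOWNSTREAM_SQL, (start, max_depth)).fetchall()
```

Calling `downstream("sales.revenue", 5)` walks the whole chain. The query works, but the shape of the work is the point: one join pass per hop, over a table that in a real ecosystem holds every edge in the enterprise.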

The underlying issue is not indexing or query optimization. It is a mismatch between the data model and the question. Data lineage relationships are the primary structure – not metadata attached to records. The moment you treat them as first-class graph edges rather than foreign keys, the problem becomes tractable.

FalkorDB is the right fit because it is memory-native. Every traversal hop is a RAM operation. There is no I/O between the compute layer and a storage engine at each step of the traversal. For interactive lineage exploration, where an engineer is looking at a graph in the browser and expects it to respond, that matters directly. And for the local development workflow described below, it matters even more.

Why FalkorDB Over Neo4j

We evaluated FalkorDB and Neo4j before committing to either. The decision came down to five concrete factors.

Memory-native execution. FalkorDB holds the full graph in RAM and executes all traversals without touching a storage layer. Neo4j introduces I/O at each traversal step. This matters for our interactive query patterns: an engineer opening a Lineage Deep Dive triggers several graph queries in sequence, and the latency accumulates. The local development workflow described later in this post requires a database that starts instantly and responds immediately. A storage-backed database with a separate compute tier does not fit that model.

GraphBLAS traversal scaling. FalkorDB represents the graph as sparse adjacency matrices and executes traversal as matrix-vector multiplications via GraphBLAS. Adding hops to a query does not degrade performance the way pointer-following traversal does. Our deepest production queries, such as five-hop feature lineage chains across the full enterprise graph, complete at interactive latency. Neo4j’s pointer-based model performs well on small graphs but does not scale in the same way to deep traversals on large graphs.
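
The idea of traversal as matrix-vector multiplication can be sketched in a few lines of plain Python. This is only an illustration of the technique, not FalkorDB's or GraphBLAS's implementation: the sparse adjacency matrix is modelled as a dictionary of rows, the frontier as a sparse boolean vector (a set), and all names are invented for the example.

```python
def hop(adjacency: dict[str, set[str]], frontier: set[str]) -> set[str]:
    """One hop as a boolean vector-matrix product over the (OR, AND) semiring.

    The frontier is a sparse boolean vector; each adjacency row lists the
    non-zero columns for that node. The product is the union of the rows
    selected by the frontier.
    """
    result: set[str] = set()
    for node in frontier:
        result |= adjacency.get(node, set())
    return result


def reachable(adjacency: dict[str, set[str]], start: str, max_hops: int) -> dict[str, int]:
    """Return each node reachable within max_hops, with the hop at which it was first seen."""
    visited = {start}
    frontier = {start}
    levels: dict[str, int] = {}
    for depth in range(1, max_hops + 1):
        # Masking out visited nodes keeps each hop proportional to new work only.
        frontier = hop(adjacency, frontier) - visited
        if not frontier:
            break
        for node in frontier:
            levels[node] = depth
        visited |= frontier
    return levels
```

Each additional hop is one more product against the same matrix, applied to the whole frontier at once, rather than a separate pointer chase per path.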

Redis protocol simplicity. FalkorDB speaks the Redis protocol. Every service in the stack, from the web application to the ingestion pipelines and the local developer tooling, connects using a standard Redis client. No proprietary driver SDK, no separate authentication configuration, no special Kubernetes networking. The same Redis client that engineers already know works unchanged. In a microservices environment where multiple pipeline executors need to write lineage data to the graph, this operational simplicity is not a minor convenience.
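
As a rough sketch of what "a standard Redis client" means in practice: FalkorDB exposes Cypher through the `GRAPH.QUERY` and `GRAPH.RO_QUERY` commands, so any client that can send an arbitrary Redis command can query the graph. The helper name and the graph name `lineage` are assumptions made for this example.

```python
def run_graph_query(client, graph_name: str, cypher: str, read_only: bool = False):
    """Send a Cypher query over the Redis protocol.

    `client` is any object with an `execute_command(*args)` method,
    such as a redis-py `redis.Redis` instance.
    """
    command = "GRAPH.RO_QUERY" if read_only else "GRAPH.QUERY"
    return client.execute_command(command, graph_name, cypher)


# Usage with redis-py (requires a reachable FalkorDB instance):
#
#   import redis
#   client = redis.Redis(host="localhost", port=6379)
#   run_graph_query(client, "lineage",
#                   "MATCH (w:Workflow) RETURN count(w)", read_only=True)
```

The reply comes back as a raw nested list; a dedicated FalkorDB client library would parse it into rows, but nothing beyond the plain Redis client is needed to get an answer.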

Full open-source feature parity. Neo4j gates performance-critical capabilities behind its commercial license. FalkorDB’s full execution engine, including the GraphBLAS-accelerated query compiler, is available under an open-source license. Our platform runs in production with no enterprise license dependency on the database layer.

Operational simplicity: backups, snapshots, and graph portability. Because FalkorDB is built on Redis, its persistence model is Redis persistence – AOF (append-only file) for durability and RDB snapshots for point-in-time backups. This turns out to be remarkably powerful in practice. The production graph can be rebuilt from scratch on a weekly cadence, with AOF providing continuous durability in between. Creating a backup is as simple as capturing a snapshot and archiving it as a tar file. Loading it on any machine, including a developer’s laptop, is equally simple: start a new FalkorDB Docker container, mount the snapshot, and the full graph is available within seconds.

This is the same mechanism that powers the local blast radius analysis workflow described later in this post. Engineers take the same persistence files that the production container maintains, bundle them, and load them locally. No export/import tooling, no proprietary backup format, no cluster to provision. Neo4j is an excellent database, but its operational model, which includes separate backup utilities, proprietary storage formats, and a more complex cluster topology, would have been overkill for this use case. FalkorDB’s Redis foundation made the operational story as simple as it could possibly be.

How the Graph is Built

The solution does not require engineers to annotate their pipelines manually. The graph is built automatically by processing execution logs from the data processing frameworks already running in production, with Hive query logs and Spark execution logs as the primary sources.

Each log entry contains enough information to reconstruct a lineage event: which process ran, which data entities it read from, which data entities it wrote to, and in many cases which specific features were selected, transformed, or produced. A log ingestion pipeline parses these events, constructs the corresponding graph elements using Pydantic-validated models, and persists them to FalkorDB using batched Cypher MERGE statements. MERGE ensures the graph is additive. Re-ingesting the same pipeline run does not create duplicate nodes or edges.
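
A minimal sketch of the batching step might look like the following. It is not the platform's ingestion code: a standard-library dataclass stands in for the Pydantic models, and the `name` property on Workflow, the `$rows` parameter, the batch size, and the edge direction (Workflow to DataEntity for both Input and Output) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEvent:
    """One parsed pipeline run (a stand-in for a Pydantic model)."""
    workflow: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


# Relationship types cannot be passed as query parameters in Cypher,
# so each edge type gets its own statement. MERGE makes re-ingestion idempotent.
MERGE_INPUTS = """
UNWIND $rows AS row
MERGE (w:Workflow {name: row.workflow})
MERGE (d:DataEntity {fullName: row.entity})
MERGE (w)-[:Input]->(d)
"""

MERGE_OUTPUTS = """
UNWIND $rows AS row
MERGE (w:Workflow {name: row.workflow})
MERGE (d:DataEntity {fullName: row.entity})
MERGE (w)-[:Output]->(d)
"""


def to_batches(events, batch_size: int = 500):
    """Yield (query, params) pairs, each carrying at most batch_size rows."""
    for query, attr in ((MERGE_INPUTS, "inputs"), (MERGE_OUTPUTS, "outputs")):
        rows = [
            {"workflow": event.workflow, "entity": entity}
            for event in events
            for entity in getattr(event, attr)
        ]
        for i in range(0, len(rows), batch_size):
            yield query, {"rows": rows[i:i + batch_size]}
```

Each yielded pair can be handed to whatever client executes parameterised Cypher. Because every statement is a MERGE, replaying the same batch leaves the graph unchanged.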

The result is a graph that grows continuously and automatically as pipelines execute, without any manual lineage tagging burden on the engineering teams whose work it tracks.

The Graph Model

Direct & Derived Relationships

The graph in FalkorDB is built around three node types:

  • Workflow – a data pipeline, job, or transformation
  • DataEntity – a database table or storage object (GCS, S3, HDFS)
  • Feature – an individual column or field within a DataEntity

And six relationship types, split between direct and derived:

Direct relationships – recorded at ingestion time from execution logs:

  • Input – a Workflow reads from a DataEntity
  • Output – a Workflow writes to a DataEntity
  • Has_A – a Feature belongs to a DataEntity

Derived relationships – computed post-ingestion and stored as first-class edges:

  • Workflow_Derives – one Workflow’s output feeds into another Workflow’s input
  • DataEntity_Derives – one DataEntity’s data flows into another DataEntity through a Workflow
  • Feature_Derives – a Feature’s value is derived from another Feature through a transformation

The direct relationships (Input, Output, Has_A) are recorded as each pipeline run completes. The derived relationships (Workflow_Derives, DataEntity_Derives, Feature_Derives) are computed as a post-ingestion step and stored as first-class edges in the graph.

This distinction matters for query performance. Without derived edges, answering “what is downstream of DataEntity X?” requires traversing through intermediate Workflow nodes: following Input edges to find each Workflow that reads X, then Output edges to find what that Workflow produces, and repeating at every level. With DataEntity_Derives pre-materialised as a direct edge, each step downstream is a single hop between data entities, with no intermediate Workflow nodes to pass through. The same principle applies to Feature_Derives for column-level impact queries and Workflow_Derives for pipeline dependency analysis.

The trade-off is deliberate: we write slightly more at ingestion time in exchange for dramatically faster reads at query time – which is exactly the right trade-off for a platform where engineers are waiting on interactive results.
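
The post-ingestion derivation step can be sketched in plain Python. This is an illustration under assumptions, not the platform's code: the direct Input and Output relationships are represented as dictionaries mapping each workflow to the data entities it reads or writes, and the function names are invented.

```python
def derive_data_entity_edges(inputs: dict[str, set[str]],
                             outputs: dict[str, set[str]]) -> set[tuple[str, str]]:
    """DataEntity_Derives: (upstream, downstream) for every entity pair joined by one workflow."""
    derived = set()
    for workflow, written in outputs.items():
        for src in inputs.get(workflow, ()):
            for dst in written:
                if src != dst:
                    derived.add((src, dst))
    return derived


def derive_workflow_edges(inputs: dict[str, set[str]],
                          outputs: dict[str, set[str]]) -> set[tuple[str, str]]:
    """Workflow_Derives: (producer, consumer) when one workflow reads what another writes."""
    producers: dict[str, set[str]] = {}
    for workflow, written in outputs.items():
        for entity in written:
            producers.setdefault(entity, set()).add(workflow)

    derived = set()
    for consumer, read in inputs.items():
        for entity in read:
            for producer in producers.get(entity, ()):
                if producer != consumer:
                    derived.add((producer, consumer))
    return derived
```

The extra write cost is visible here: one derived edge per read/write pair. In exchange, a downstream query never has to visit the Workflow nodes in between.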

This schema is minimal but expressive. It can represent any data pipeline topology, any entity dependency pattern, and any feature-level transformation chain including multi-hop chains that cross five or more workflows and a dozen data entities before reaching a final output feature.

Queries against this graph use Cypher. A downstream feature impact query looks like:

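    // Follow Feature_Derives edges up to five hops downstream of one feature
    MATCH path = (src:Feature {fullName: "sales.revenue"})
                 -[:Feature_Derives*1..5]->(dst:Feature)
    // Expand each path into its individual edges
    UNWIND relationships(path) AS rel
    RETURN DISTINCT
        startNode(rel).fullName AS source,
        endNode(rel).fullName   AS target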

A workflow dependency traversal – “everything downstream of this workflow, five levels deep” – follows the same pattern over Workflow_Derives edges. Both return in milliseconds even as the graph grows.
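
Since all three derived edge types follow the same traversal pattern, the query can be generated from a node label and a depth. The sketch below is an assumption-laden illustration: the helper name, the `$name` parameter, the depth cap of 10, and the use of a `fullName` property on Workflow and DataEntity nodes are not taken from the platform (the post only shows `fullName` on Feature).

```python
DERIVED_EDGE_BY_LABEL = {
    "Workflow": "Workflow_Derives",
    "DataEntity": "DataEntity_Derives",
    "Feature": "Feature_Derives",
}


def downstream_query(label: str, max_depth: int = 5) -> str:
    """Build a downstream traversal query for one node type.

    Labels and hop ranges cannot be query parameters in Cypher, so they are
    validated against fixed values before being formatted into the string.
    """
    if label not in DERIVED_EDGE_BY_LABEL:
        raise ValueError(f"unknown node label: {label!r}")
    if not 1 <= max_depth <= 10:
        raise ValueError("max_depth must be between 1 and 10")

    edge = DERIVED_EDGE_BY_LABEL[label]
    return (
        f"MATCH path = (src:{label} {{fullName: $name}})"
        f"-[:{edge}*1..{max_depth}]->(dst:{label}) "
        "UNWIND relationships(path) AS rel "
        "RETURN DISTINCT startNode(rel).fullName AS source, "
        "endNode(rel).fullName AS target"
    )
```

`downstream_query("Workflow", 5)` yields the workflow dependency traversal; the starting node's name is supplied separately as the `$name` parameter rather than being formatted into the query text.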

This Changed How the Org Works

The platform is deployed to production across three GKE clusters. Here are the four ways it has had the most direct impact on how the organisation operates.

1. Blast Radius Analysis Before Merging Code

The platform lets engineers answer the question “what will break if I make this change?” before the change goes anywhere near production.

The workflow: an engineer downloads a snapshot of the production FalkorDB graph. They run a local FalkorDB instance, which takes seconds because FalkorDB is a single container that starts immediately. They add, remove, or modify the nodes and edges that represent their proposed change. Then they run lineage queries against the modified local graph and observe the blast radius: which downstream workflows are affected, which data entities will change, which features propagate the modification.

The output of this analysis goes into the pull request as evidence. Before, PRs for data pipeline changes were reviewed on the basis of what the diff showed (what the code changed). Now they include a lineage graph showing what the data ecosystem impact will be. Reviewers can see, concretely, whether a change is safe to merge.

This mirrors the ephemeral graph pattern used by security teams for penetration testing: a temporary, isolated graph built for the duration of a specific analysis, discarded when the work is done. No production data is at risk. The analysis is self-contained. And because FalkorDB is in-memory, the local graph initializes from the snapshot in seconds without any cluster provisioning.
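
The before-and-after comparison at the heart of this workflow can be sketched without a database at all. The following is a simplified illustration, not the platform's tooling: derived edges are plain `(source, target)` tuples, the proposed change is expressed as sets of removed and added edges, and every name is invented for the example.

```python
def downstream(edges, start: str, max_depth: int = 5) -> set[str]:
    """Nodes reachable from start within max_depth hops along (source, target) edges."""
    children: dict[str, set[str]] = {}
    for src, dst in edges:
        children.setdefault(src, set()).add(dst)

    seen: set[str] = set()
    frontier = {start}
    for _ in range(max_depth):
        frontier = {d for n in frontier for d in children.get(n, ())} - seen - {start}
        if not frontier:
            break
        seen |= frontier
    return seen


def blast_radius(edges, start: str, removed=frozenset(), added=frozenset(),
                 max_depth: int = 5) -> dict[str, list[str]]:
    """Compare what is downstream of start before and after a proposed change."""
    before = downstream(edges, start, max_depth)
    modified = (set(edges) - set(removed)) | set(added)  # the "local copy" of the graph
    after = downstream(modified, start, max_depth)
    return {
        "lost": sorted(before - after),
        "gained": sorted(after - before),
        "unchanged": sorted(before & after),
    }
```

The returned dictionary is the kind of evidence that can be pasted into a pull request: nodes that lose their lineage path, nodes that gain one, and nodes that are untouched. The original edge set is never mutated, which mirrors the ephemeral nature of the local graph.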

2. Cascading Failure Prevention and Faster Recovery

This is the reason the platform was built in the first place.


Before, when a pipeline failed in production, figuring out what was affected was itself a crisis-within-the-crisis. Data engineers had to manually trace dependencies, hunt through documentation, and pull in whoever had institutional knowledge of the affected systems. This was all happening while the outage was ongoing. Mitigating cascading failures took far too long, and too often the full blast radius of a failure was only discovered after further downstream systems had already been impacted.

End-to-end graph visibility changed this completely. When a failure occurs now, the affected component’s upstream and downstream graph is immediately queryable. This means the full cascade is visible in seconds, not hours. Engineers can triage with precision: which consumers need to be paused, which are isolated, what the recovery sequence should be.

More importantly, the same visibility that accelerates recovery also prevents most failures from reaching production at all. Proposed changes are validated against the full lineage graph before merge. Structural problems, such as a workflow being removed that still has active downstream consumers or a data entity schema change that would break dependent feature derivations, are caught at PR review time, not in production.
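
One of the structural checks named above, removing a workflow that still has active consumers, reduces to a simple test over Workflow_Derives edges. This is a hypothetical sketch of such a check, with invented names and edges modelled as `(producer, consumer)` tuples; it is not the platform's validation code.

```python
def removal_violations(workflow_derives, to_remove) -> dict[str, list[str]]:
    """Map each workflow slated for removal to the consumers that would be left stranded.

    A consumer that is itself being removed in the same change is not a violation.
    """
    to_remove = set(to_remove)
    violations: dict[str, set[str]] = {}
    for producer, consumer in workflow_derives:
        if producer in to_remove and consumer not in to_remove:
            violations.setdefault(producer, set()).add(consumer)
    return {wf: sorted(consumers) for wf, consumers in sorted(violations.items())}
```

An empty result means the removal is structurally safe with respect to direct consumers; a non-empty one names exactly who needs to be migrated first.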

The results are measurable: 70% reduction in mean time to recovery (MTTR) for data pipeline incidents, and 91% of potential cascading failures caught before reaching production.

3. The GraphRAG Knowledge Backend

As the graph grew to cover the full data ecosystem, it became a natural knowledge base for answering questions about that ecosystem in natural language.

The platform now serves as the backend for an internal GraphRAG system. Engineers and analysts ask conversational questions such as “what are the upstream dependencies for this data entity?”, “which workflows would be affected if this feature definition changed?”, or “show me everything that feeds into the revenue reporting pipeline”, and a RAG agent translates those questions into lineage queries against the REST API.

FalkorDB’s graph model is particularly well-suited to this pattern. Multi-hop questions that would require complex query construction against a relational database map directly and naturally to Cypher traversals over the lineage graph. The agent does not need to construct joins across multiple tables; it asks the graph for what it needs, and the graph answers. The fact that FalkorDB responds at interactive latency means the RAG agent’s round-trip time is dominated by the LLM, not the database.


4. Redundancy Detection Across the Entire Ecosystem


Before this platform existed, there was no global visibility into what the data ecosystem looked like as a whole. Engineers built pipelines in isolation, with knowledge of the immediate systems around them but not of the broader graph.

One of the first things that became visible once the production graph was populated: there were multiple workflows doing semantically identical work. The same business metric was computed twice, by different teams, writing to different data entities. The same join sequence was implemented three ways, each feeding a different downstream consumer, without any of the authors knowing the others existed.

Identifying these redundancies required nothing more than looking at the graph. Workflows with identical Input and Output patterns, producing output with matching Feature_Derives chains, showed up as structural duplicates once you could see them all together.
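
The same observation can be turned into a mechanical check by grouping workflows on a structural signature. The sketch below assumes a simplified view in which each workflow is summarised by the data entities it reads and the source features its outputs derive from; that summary format, the field names, and the function name are all hypothetical.

```python
from collections import defaultdict


def find_structural_duplicates(workflows: dict[str, dict]) -> list[list[str]]:
    """Group workflows whose inputs and feature derivation sources are identical.

    `workflows` maps a workflow name to a dict with two keys:
      "inputs"          - data entities the workflow reads
      "feature_sources" - upstream features its output features derive from
    """
    groups: dict[tuple, list[str]] = defaultdict(list)
    for name, summary in workflows.items():
        signature = (
            frozenset(summary["inputs"]),
            frozenset(summary["feature_sources"]),
        )
        groups[signature].append(name)
    # Only signatures shared by more than one workflow are candidates.
    return sorted(sorted(names) for names in groups.values() if len(names) > 1)
```

Matching signatures are candidates for review, not proof of redundancy: two workflows can read the same columns and still compute different things, so a human still confirms each group before any consolidation.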

The consolidation work that followed, decommissioning redundant workflows, routing downstream consumers to a single authoritative source, would have been nearly impossible to do safely without lineage data. With the platform, each consolidation change went through the blast radius analysis above. Engineers could verify that removing a redundant workflow would not break any consumer that had not already been migrated.

Production Deployment

One of the things that surprised us about adopting FalkorDB was how little friction there was getting it into production.

The application layer is a Python FastAPI service. Connecting it to FalkorDB required nothing beyond a standard Redis client – no proprietary SDK, no special driver, no additional authentication layer to configure. FastAPI’s async request handling pairs naturally with FalkorDB’s low-latency query model: a single user interaction triggers several graph traversals in sequence, and because each one returns quickly, the composed response time stays well within interactive range.

FalkorDB itself ships as a standard Docker container. On our cloud-native Kubernetes platform, that meant it slotted into the existing deployment model without modification – a StatefulSet with a persistent volume, a ClusterIP service for internal routing, and the same CI/CD pipeline that handles every other workload. There was no specialised infrastructure to provision, no managed graph database service to integrate, no networking exceptions to request. It was treated like any other containerised dependency.

The persistence configuration combines Redis AOF for durability with RDB snapshots at multiple intervals, tuned for graph write patterns where bulk ingestion from pipeline runs is followed by extended read-heavy query periods. A Kubernetes CronJob runs daily backups to GCP Cloud Storage with 90-day retention. Disaster recovery is a single command: restore from any backup timestamp, verify graph integrity with a post-restore Cypher query, mark ready.
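
As an illustration of what "AOF plus RDB snapshots at multiple intervals" can look like, the sketch below applies standard Redis persistence directives through a client's `config_set` method (as provided by redis-py). The specific values are placeholders, not the platform's tuned settings, and the helper name is invented.

```python
# Placeholder values for illustration only; real intervals depend on write volume.
PERSISTENCE_SETTINGS = {
    "appendonly": "yes",          # enable the append-only file for durability
    "appendfsync": "everysec",    # fsync the AOF once per second
    # RDB snapshot rules as "<seconds> <changes>" pairs: snapshot after
    # 3600s if >= 1 change, 300s if >= 100 changes, 60s if >= 10000 changes.
    "save": "3600 1 300 100 60 10000",
}


def apply_persistence(client, settings: dict[str, str] = PERSISTENCE_SETTINGS) -> dict[str, str]:
    """Apply each persistence directive via CONFIG SET and return what was applied.

    `client` is any object with a `config_set(name, value)` method,
    such as a redis-py `redis.Redis` instance.
    """
    for name, value in settings.items():
        client.config_set(name, value)
    return dict(settings)
```

The same directives could equally live in a configuration file baked into the container; setting them at runtime is just the shortest way to show them.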

The one production-specific tuning worth calling out is the health probe configuration. Standard short-window readiness probes fail for memory-native databases because loading a large graph from disk takes longer than a typical HTTP service startup. Our StatefulSet uses an extended startup period with a readiness probe that executes a test Cypher query directly. The pod is only marked ready when the graph engine is queryable, not just running.
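
A probe of this kind can be very small. The following is a sketch of the idea rather than the platform's actual probe: it issues a trivial read-only Cypher query and maps success or failure to an exit code. The graph name `lineage`, the `RETURN 1` test query, and the function names are assumptions.

```python
def graph_is_ready(client, graph_name: str) -> bool:
    """Return True only if the graph engine answers a trivial read-only query.

    `client` is any object with an `execute_command(*args)` method,
    such as a redis-py `redis.Redis` instance. Any error, including a
    refused connection or a server that is still loading data, counts
    as not ready.
    """
    try:
        client.execute_command("GRAPH.RO_QUERY", graph_name, "RETURN 1")
    except Exception:
        return False
    return True


def probe_exit_code(client, graph_name: str = "lineage") -> int:
    """Exit code for a probe script: 0 when ready, 1 otherwise."""
    return 0 if graph_is_ready(client, graph_name) else 1
```

A probe script would build the client, call `probe_exit_code`, and pass the result to `sys.exit`. The design choice is that a running process is not enough: the pod only joins the service once a query actually completes.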

A backend-triggered deployment pipeline handles the full release workflow: when the lineage ingestion library publishes a new version, a cross-repo trigger rebuilds the stage deployment, runs end-to-end tests, gates on human approval, and promotes to production automatically if approved.

What We Learned

Three takeaways from building and running this in production:

  • The local graph workflow is the most valuable thing we built. Giving engineers the ability to run a local FalkorDB instance, load a production graph snapshot, and query blast radius before opening a PR changed the review culture for data pipeline changes. The database’s in-memory startup model is what makes this practical – it has to start in seconds to fit into a development workflow.
  • Global visibility changes what questions you can ask. Redundancy detection was not a feature we designed. It emerged naturally once the full graph was populated. When you can see the entire data ecosystem as a single queryable structure, patterns that were invisible become obvious. We expect more of this as the graph grows.
  • The data model is the architecture. The six relationship types in our schema (Input, Output, Has_A, Workflow_Derives, DataEntity_Derives, and Feature_Derives) are not implementation details. They are the conceptual model of the data ecosystem. Every capability in the platform is a traversal query against those six relationships. Getting the schema right made everything else follow naturally.

What’s Next

The next major capability on the roadmap is AI-powered lineage inference: using generative AI to infer Feature_Derives relationships automatically from transformation code, without requiring explicit annotation from pipeline authors. The graph schema already supports this: inferred edges will carry a confidence property, and the UI will visually distinguish them from hand-annotated relationships.

We are also expanding the blast radius analysis tooling to generate structured reports suitable for automated PR checks, with a CI gate that flags high-impact changes for mandatory lineage review before merge.