In my previous article [1], I explored a range of ideas centered on building structured graphs, mainly for descriptive or unsupervised exploration of data. However, when we use graph features to improve our models, the temporal nature of the data must be taken into account. To avoid undesired effects, we must be careful not to leak future information into the training process. This means our graph (and the features derived from it) must be constructed in a time-aware, incremental way.
Data leakage is such a pervasive problem that a 2023 study by Sayash Kapoor and Arvind Narayanan [2] found that, up to that point, it had affected 294 research papers across 17 scientific fields. They classify the types of data leakage, ranging from textbook errors to open research problems.
The issue is that, during prototyping, results often look very promising when they really are not. Most of the time, nobody realizes this until the model is deployed in production and its performance falls short of expectations, for reasons no one understands, wasting the time and resources of an entire team. This issue can become the Achilles' heel that undermines a business's entire AI initiative.
…
ML-based leakage
Data leakage occurs when the training data contains information about the output that won't be available during inference. This causes overly optimistic evaluation metrics during development, creating misleading expectations. When the model is then deployed in a real-time system with the proper data flow, its predictions become untrustworthy because it learned from information that is not accessible at prediction time.
Ethically, we must strive to produce results that truly reflect the capabilities of our models, rather than sensational or misleading findings. A model that moves from prototyping to production should generalize properly; one that fails to generalize can exhibit significant problems during inference or deployment, compromising its practical value.
This is especially dangerous in sensitive contexts like fraud detection, which often involve imbalanced data scenarios (with fewer fraud cases than non-fraud). In these situations, the harm caused by data leakage is more pronounced because the model might overfit to leaked data related to the minority class, producing seemingly good results for the minority label, which is the hardest to predict. This can lead to missed fraud detections, resulting in serious practical consequences.
Data leakage examples can be categorized into textbook errors and open research problems [2] as follows:
Textbook Errors:
- Imputing missing values using the entire dataset instead of only the training set, causing information about the test data to leak into training.
- Duplicated or very similar instances appearing both in training and test sets, such as images of the same object taken from slightly different angles.
- Lack of clear separation between training and test datasets, or no test set at all, leading to models having access to test information before evaluation.
- Using proxies of outcome variables that indirectly reveal the target variable.
- Random data splitting in scenarios where multiple related records belong to a single entity, such as multiple claim status events from the same customer.
- Synthetic data augmentation performed over the whole dataset, instead of only on the training set.
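As an illustration of the first textbook error, here is a minimal sketch (pure NumPy, with hypothetical column values) of fitting the imputation statistic on the training split only and reusing it on the test split:

```python
import numpy as np

# Hypothetical feature column with missing values (NaN), already split in time.
train = np.array([1.0, 2.0, np.nan, 4.0])
test = np.array([np.nan, 10.0])

# Correct: compute the imputation statistic on the training split only...
train_mean = np.nanmean(train)  # uses no test information

# ...and apply that same statistic to both splits.
train_imputed = np.where(np.isnan(train), train_mean, train)
test_imputed = np.where(np.isnan(test), train_mean, test)

# Leaky variant (do NOT do this): pooling both splits lets the test value 10.0
# shift the statistic that is used to fill the training rows.
leaky_mean = np.nanmean(np.concatenate([train, test]))
```

The same fit-on-train, apply-everywhere discipline applies to scaling, encoding, and synthetic augmentation.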
Open problems for research:
- Temporal leakage occurs when future data unintentionally influences training. In such cases, strict separation is challenging because timestamps can be noisy or incomplete.
- Updating database records without lineage or audit trail, for example, changing fraud status without storing history, can cause models to train on future or altered data unintentionally.
- Complex real-world data integration and pipeline issues that introduce leakage through misconfiguration or lack of controls.
These cases are part of a broader taxonomy reported in machine learning research, highlighting data leakage as a critical and often underinvestigated risk for reliable modeling [3]. Such issues arise even with simple tabular data, and they can remain hidden in datasets with many features if each one is not individually checked.
Now, let’s consider what happens when we include nodes and edges in the equation…
…
Graph-based leakage
In the case of graph-based models, leakage can be sneakier than in traditional tabular settings. When features are derived from connected components or topological structures, using future nodes or edges can silently alter the graph’s structure. For example:
- Methodologies such as graph neural networks (GNNs) learn context not only from individual nodes but also from their neighbours, which can inadvertently introduce leakage if sensitive or future information is propagated across the graph structure during training.
- Overwriting or updating the graph structure without preserving past events means the model loses valuable context needed for accurate temporal analysis; it may again access information at the incorrect time, and we lose traceability of possible leakage or of problems in the data that originated the graph.
- Computing graph aggregations like degree, triangles, or PageRank on the entire graph without accounting for the temporal dimension (time-agnostic aggregation) uses all edges: past, present, and future. This causes data leakage because the features include information from future edges that would not be available at prediction time.
Graph temporal leakage occurs when features, edges, or node relationships from future time points are included during training in a way that violates the chronological order of events. This results in edges or training features that incorporate data from time steps that should be unknown.
…
How can this be fixed?
We can build a single graph that captures the entire history by assigning timestamps or time intervals to edges. To analyze the graph up to a specific point in time (t), we "look back in time" by filtering the graph to include only the events that occurred at or before that cutoff. This approach is ideal for preventing data leakage because it ensures that only past and present information is used for modeling. Additionally, it offers flexibility in defining different time windows for safe and accurate temporal analysis.
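To make the contrast with time-agnostic aggregation concrete, here is a small sketch (plain Python over a toy edge list) comparing a node's degree over the full history versus under a cutoff:

```python
# Edges as (src, dst, timestamp); claim "a" gains most of its links in the future.
edges = [("a", "b", 1), ("a", "c", 5), ("a", "d", 9), ("a", "e", 9)]

def degree_at(node, edges, cutoff=None):
    """Count edges touching `node`, optionally restricted to timestamp <= cutoff."""
    return sum(1 for u, v, t in edges
               if node in (u, v) and (cutoff is None or t <= cutoff))

full_degree = degree_at("a", edges)     # 4: includes future edges (leaky feature)
safe_degree = degree_at("a", edges, 2)  # 1: only what was known at t = 2
```

A feature computed the first way would make "a" look like a hub long before it actually became one.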
In this article, we build a temporal graph of insurance claims where the nodes represent individual claims, and temporal links are created when two claims share an entity (e.g., phone number, license plate, repair shop, etc.) to ensure the correct event order. Graph-based features are then computed to feed fraud prediction models, carefully avoiding the use of future information (no peeking).
The idea is simple: if two claims share a common entity and one occurs before the other, we connect them at the moment this connection becomes visible (figure 1). As we explained in the previous section, the way we model the data is crucial, not only to capture what we’re truly looking for, but also to enable the use of advanced methods such as Graph Neural Networks (GNNs).

In our graph model, we save the timestamp when an entity is first seen, capturing the moment it appears in the data. However, in many real-world scenarios, it is also useful to consider a time interval spanning the entity’s first and last appearances (for example, generated with another variable like plate or email). This interval can provide richer temporal context, reflecting the lifespan or active period of nodes and edges, which is valuable for dynamic temporal graph analyses and advanced model training.
Code
The code is available in this repository: Link to the repository
To run the experiments, set up a Python ≥3.11 environment with the required libraries (e.g., torch, torch-geometric, networkx, etc.). It is recommended to use a virtual environment (via venv or conda) to keep dependencies isolated.
Code Pipeline
The diagram in Figure 2 shows the end-to-end workflow for fraud detection with GraphSAGE. Step 1 loads the (simulated) raw claims data. Step 2 builds a time-stamped directed graph (entity→claim and older-claim→newer-claim). Step 3 performs temporal slicing to create train, validation, and test sets, then indexes nodes, builds features, and finally trains and validates the model.

Figure 2: Code pipeline for training and inference. Image by Author.

Step 1: Simulated Fraud Dataset
We first simulate a dataset of insurance claims. Each row in the dataset represents a claim and includes variables such as:
- Entities: insurer_license_plate, insurer_phone_number, insurer_email, insurer_address, repair_shop, bank_account, claim_location, third_party_license_plate
- Core information: claim_id, claim_date, type_of_claim, insurer_id, insurer_name
- Target: fraud (binary variable indicating whether the claim is fraudulent or not)
These entity attributes act as potential links between claims, allowing us to infer connections through shared values (e.g., two claims using the same repair shop or phone number). By modeling these implicit relationships as edges in a graph, we can build powerful topological representations that capture suspicious behavioral patterns and enable downstream tasks such as feature engineering or graph-based learning.
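A minimal sketch of such a simulation (pure Python; field values and pool sizes are illustrative, and the repository may generate the data differently):

```python
import random
from datetime import date, timedelta

random.seed(42)

def simulate_claims(n: int) -> list[dict]:
    """Generate n synthetic claim rows with shared entities and a fraud label."""
    start = date(2024, 1, 1)
    rows = []
    for i in range(n):
        rows.append({
            "claim_id": 20000000 + i,
            "claim_date": start + timedelta(days=random.randint(0, 365)),
            "type_of_claim": random.choice(["collision", "theft", "glass"]),
            "insurer_id": f"INS_{random.randint(1, 300)}",
            # Entities drawn from small pools so that claims end up sharing values.
            "insurer_phone_number": f"PHONE_{random.randint(1, 400)}",
            "repair_shop": f"SHOP_{random.randint(1, 50)}",
            "bank_account": f"ACC_{random.randint(1, 350)}",
            "fraud": int(random.random() < 0.05),  # ~5% fraud (imbalanced)
        })
    return rows

claims = simulate_claims(1000)
```

Drawing entities from small pools is what creates the shared-value collisions the graph is built from.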


Step 2: Graph Modeling
We use the NetworkX library to build our graph model. For small-scale examples, NetworkX is sufficient and effective. For more advanced graph processing, tools like Memgraph or Neo4j could be used. To model with NetworkX, we create nodes and edges representing entities and their relationships, enabling network analysis and visualization within Python.
So, we have:
- one node per claim, with node key equal to the claim_id and attributes node_type and claim_date
- one node per entity value (phone, plate, bank account, shop, etc.), with node key "{column_name}:{value}" and attributes node_type = <column_name> (e.g., "insurer_phone_number", "bank_account", "repair_shop") and label = <value> (the raw value without the prefix)
The graph includes these two types of edges:
- claim_id(t-1) → claim_id(t): when two claims share an entity (with edge_type='claim-claim')
- entity_value → claim_id: direct link to the shared entity (with edge_type='entity-claim')
These edges are annotated with:
- edge_type: to distinguish the relation (claim→claim vs entity→claim)
- entity_type: the column from which the value comes (like bank_account)
- shared_value: the actual value (like a phone number or license plate)
- timestamp: when the edge was added (based on the current claim's date)
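A condensed sketch of this construction (NetworkX; simplified relative to the repository code, with illustrative column names and a reduced entity set):

```python
import networkx as nx

ENTITY_COLS = ["insurer_phone_number", "repair_shop", "bank_account"]  # subset for brevity

def build_temporal_graph(claims: list[dict]) -> nx.DiGraph:
    """Add claims in chronological order; link each new claim to its entity
    nodes and to earlier claims that already used the same entity value."""
    G = nx.DiGraph()
    for row in sorted(claims, key=lambda r: r["claim_date"]):
        cid = row["claim_id"]
        G.add_node(cid, node_type="claim", claim_date=row["claim_date"])
        for col in ENTITY_COLS:
            ekey = f"{col}:{row[col]}"
            if ekey not in G:
                G.add_node(ekey, node_type=col, label=row[col])
            else:
                # Claims already attached to this entity are older -> claim-claim edges,
                # stamped with the *current* claim's date (when the link becomes visible).
                for older in list(G.successors(ekey)):
                    if G.nodes[older].get("node_type") == "claim":
                        G.add_edge(older, cid, edge_type="claim-claim",
                                   entity_type=col, shared_value=row[col],
                                   timestamp=row["claim_date"])
            G.add_edge(ekey, cid, edge_type="entity-claim",
                       entity_type=col, shared_value=row[col],
                       timestamp=row["claim_date"])
    return G

rows = [
    {"claim_id": 1, "claim_date": 1, "insurer_phone_number": "P1",
     "repair_shop": "S1", "bank_account": "A1"},
    {"claim_id": 2, "claim_date": 2, "insurer_phone_number": "P1",
     "repair_shop": "S2", "bank_account": "A2"},
]
G = build_temporal_graph(rows)  # claims 1 and 2 share phone P1 -> edge 1 -> 2
```

Note that claim-claim edges always point from older to newer, which is what later keeps message passing time-consistent.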
To interpret our simulation, we implemented a script that generates explanations for why a claim is flagged as fraud. In Figure 4, claim 20000695 is considered risky primarily because it is associated with repair shop SHOP_856, which acts as an active hub with multiple claims linked around similar dates, a pattern often seen in fraud “bursts.” Additionally, this claim shares a license plate and address with several other claims, creating dense connections to other suspicious cases.

This code saves the graph as a pickle file: temporal_graph_with_edge_attrs.gpickle.
Step 3: Graph Preparation & Training
Representation learning transforms complex, high-dimensional data (like text, images, or sensor readings) into simplified, structured formats (often called embeddings) that capture meaningful patterns and relationships. These learned representations improve model performance, interpretability, and the ability to transfer learning across different tasks.
We train a neural network to map each input to a vector in ℝᵈ that encodes what matters. In our pipeline, GraphSAGE does representation learning on the claim graph: it aggregates information from a node’s neighbours (shared phones, shops, plates, etc.) and mixes that with the node’s own attributes to produce a node embedding. Those embeddings are then fed to a small classifier head to predict fraud.
3.1. Temporal slicing
From the single full graph created in step 2, we extract three time-sliced subgraphs for train, validation, and test. For each split we choose a cutoff date and keep only (1) claim nodes with claim_date ≤ cutoff, and (2) edges whose timestamp ≤ cutoff. This produces a time-consistent subgraph for that split: no information from the future leaks into the past, matching how the model would run in production with only historical data available.
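A sketch of this slicing step, assuming the node/edge attributes from the graph model above (cutoff values and attribute names are illustrative):

```python
import networkx as nx

def time_slice(G: nx.DiGraph, cutoff) -> nx.DiGraph:
    """Keep claim nodes with claim_date <= cutoff, all entity nodes, and only
    edges with timestamp <= cutoff, so the slice holds no future information."""
    keep = [n for n, a in G.nodes(data=True)
            if a.get("node_type") != "claim" or a["claim_date"] <= cutoff]
    H = G.subgraph(keep).copy()
    H.remove_edges_from([(u, v) for u, v, a in H.edges(data=True)
                         if a["timestamp"] > cutoff])
    return H

# Toy graph: two claims, the second arriving after the train cutoff.
G = nx.DiGraph()
G.add_node(1, node_type="claim", claim_date=5)
G.add_node(2, node_type="claim", claim_date=20)
G.add_edge(1, 2, edge_type="claim-claim", timestamp=20)

train_g = time_slice(G, 10)  # claim 2 and its edge are excluded
test_g = time_slice(G, 30)   # full history visible
```

Running the same function with three different cutoffs yields the train, validation, and test subgraphs.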
3.2 Node indexing
Give every node in the sliced graph an integer index 0…N-1. This is just an ID mapping (like tokenization). We’ll use these indices to align features, labels, and edges in tensors.
3.3 Build node features
Create one feature row per node:
- Type one-hot (claim, phone, email, …).
- Degree stats computed within the sliced graph: normalized in-degree, out-degree, and undirected degree.
- Prior fraud from older neighbors (claims only): fraction of older connected claims (direct claim→claim predecessors) that are labeled fraud, considering only neighbors that existed before the current claim’s time.
We also set the label y (1/0) for claims and 0 for entities, and mark claims in claim_mask so loss/metrics are computed only on claims.
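The "prior fraud from older neighbors" feature can be sketched as follows (plain Python over claim→claim predecessor lists; the data structures here are illustrative stand-ins for the sliced graph):

```python
def prior_fraud_fraction(claim, predecessors, labels, dates):
    """Fraction of a claim's claim->claim predecessors labeled fraud,
    counting only neighbors strictly older than the current claim."""
    older = [p for p in predecessors.get(claim, []) if dates[p] < dates[claim]]
    if not older:
        return 0.0
    return sum(labels[p] for p in older) / len(older)

# Toy data: claim 3 has two older predecessors, one of them fraudulent.
predecessors = {3: [1, 2]}
labels = {1: 1, 2: 0, 3: 0}
dates = {1: 1, 2: 2, 3: 5}

x_prior = prior_fraud_fraction(3, predecessors, labels, dates)  # 0.5
```

Because only strictly older neighbors contribute, this feature is leak-safe by construction.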
3.4 Build PyG Data
Translate edges (u→v) into a 2×E integer tensor edge_index using the node indices and add self-loops so each node also retains its own features at every layer. Pack everything into a PyG Data(x, edge_index, y, claim_mask) object. Edges are directed, so message passing respects time (earlier→later).
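The index bookkeeping of steps 3.2 and 3.4 reduces to the following (shown in plain Python to stay dependency-free; in the pipeline the nested lists become torch tensors inside a PyG Data object):

```python
nodes = ["claim:1", "claim:2", "repair_shop:S1"]  # toy sliced graph
edges = [("repair_shop:S1", "claim:1"),
         ("repair_shop:S1", "claim:2"),
         ("claim:1", "claim:2")]                  # earlier -> later

# 3.2: integer index 0..N-1 per node (an ID mapping, like tokenization).
idx = {n: i for i, n in enumerate(nodes)}

# 3.4: 2 x E structure of [source_indices, target_indices]...
src = [idx[u] for u, v in edges]
dst = [idx[v] for u, v in edges]

# ...plus one self-loop per node so each node keeps its own features per layer.
src += list(range(len(nodes)))
dst += list(range(len(nodes)))

edge_index = [src, dst]  # in PyG: torch.tensor(edge_index, dtype=torch.long)
```

Keeping edges directed (older→newer) is what lets message passing respect time.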
3.5 GraphSAGE
We implement a GraphSAGE architecture in PyTorch Geometric with the SAGEConv layer: two GraphSAGE convolution layers (mean aggregation), ReLU, dropout, then a linear head to predict fraud vs non-fraud. We train full-batch (no neighbor sampling). The loss is weighted to handle class imbalance and is computed only on claim nodes via claim_mask. After each epoch we evaluate on the validation split and choose the decision threshold that maximizes F1; we keep the best model by validation F1 (early stopping).
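To show what a mean-aggregation SAGE layer actually computes, here is a minimal NumPy sketch of a single layer (a simplification of SAGEConv: random placeholder weights, no bias or normalization):

```python
import numpy as np

rng = np.random.default_rng(0)

def sage_layer(X, neighbors, W_self, W_neigh):
    """One GraphSAGE layer with mean aggregation:
    h_v = ReLU(W_self @ x_v + W_neigh @ mean(x_u for u in N(v)))."""
    out = np.zeros((X.shape[0], W_self.shape[0]))
    for v in range(X.shape[0]):
        nbrs = neighbors[v]
        agg = X[nbrs].mean(axis=0) if nbrs else np.zeros(X.shape[1])
        out[v] = W_self @ X[v] + W_neigh @ agg
    return np.maximum(out, 0.0)  # ReLU

# Toy directed graph: node 2 aggregates only from its past neighbors 0 and 1.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
neighbors = {0: [], 1: [], 2: [0, 1]}
d = 4  # hidden dimension (illustrative)
W_self = rng.normal(size=(d, 2))
W_neigh = rng.normal(size=(d, 2))

H = sage_layer(X, neighbors, W_self, W_neigh)  # embeddings, shape (3, 4)
```

Stacking two such layers gives each claim a view of its two-hop (strictly older) neighbourhood, which the linear head then scores.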

3.6 Inference results.
Evaluate the best model on the test split using the validation-chosen threshold. Report accuracy, precision, recall, F1, and the confusion matrix; produce a lift table/plot (how concentrated fraud is by score decile); and export a t-SNE plot of claim embeddings to visualize structure.

The lift chart evaluates how well the model ranks fraud: bars show lift by score decile and the line shows cumulative fraud capture. In the top 10–20% of claims (Deciles 1–2), the fraud rate is about 2–3× the average, suggesting that reviewing the top 20–30% of claims would capture a large share of fraud. The t-SNE plot shows several clusters where fraud concentrates, indicating the model learns meaningful relational patterns, while overlap with non-fraud points highlights remaining ambiguity and opportunities for feature or model tuning.
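The lift-by-decile computation behind such a chart can be sketched as follows (plain Python; scores and labels are toy values):

```python
def lift_table(scores, labels, n_bins=10):
    """Sort by descending score, split into equal bins, and report each bin's
    fraud rate relative to the overall rate (lift) plus cumulative capture."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    overall = sum(labels) / len(labels)
    size = len(scores) // n_bins
    rows, captured = [], 0
    for b in range(n_bins):
        bin_idx = order[b * size:(b + 1) * size]
        frauds = sum(labels[i] for i in bin_idx)
        captured += frauds
        rows.append({
            "decile": b + 1,
            "lift": (frauds / size) / overall,
            "cum_capture": captured / sum(labels),
        })
    return rows

# Toy scores where fraud concentrates at high scores.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
table = lift_table(scores, labels, n_bins=5)
```

In this toy case the top bin has lift 5 and already captures 100% of the fraud; real tables are flatter, which is exactly what the decile plot visualizes.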
…
Conclusion
Using a graph that only connects older claims to newer claims (past to future), without "leaking" future fraud information, the model successfully concentrates fraud cases in the top-scoring groups, achieving about 2–3 times better detection in the top 10–20%. This setup is reliable enough to deploy.
As a test, it is possible to try a version where the graph is two-way or undirected (connections in both directions) and compare it with the one-way version. If the two-way version gets significantly better results, the improvement is likely spurious, caused by temporal leakage: future information improperly influencing the model. This is a way to demonstrate why two-way connections should not be used in real use cases.
To avoid making the article too long, we will cover the experiments with and without leakage in a separate article. In this article, we focus on developing a model that meets production readiness.
There’s still room to improve with richer features, calibration, and small model tweaks, but our focus here is to explain a leak-safe temporal graph methodology that addresses data leakage.
References
[1] Gomes-Gonçalves, E. (2025, January 23). Applications and Opportunities of Graphs in Insurance. Medium. Retrieved September 11, 2025, from https://medium.com/@erikapatg/applications-and-opportunities-of-graphs-in-insurance-0078564271ab
[2] Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804.
[3] Guignard, F., Ginsbourger, D., Levy Häner, L., & Herrera, J. M. (2024). Some combinatorics of data leakage induced by clusters. Stochastic Environmental Research and Risk Assessment, 38(7), 2815–2828.
[4] Huang, S., et al. (2024). UTG: Towards a Unified View of Snapshot and Event Based Models for Temporal Graphs. arXiv preprint arXiv:2407.12269. https://arxiv.org/abs/2407.12269
[5] Labonne, M. (2022). GraphSAGE: Scaling up Graph Neural Networks. Towards Data Science. Retrieved from https://towardsdatascience.com/introduction-to-graphsage-in-python-a9e7f9ecf9d7/
[6] An Introduction to GraphSAGE. (2025). Weights & Biases. Retrieved from https://wandb.ai/graph-neural-networks/GraphSAGE/reports/An-Introduction-to-GraphSAGE–Vmlldzo1MTEwNzQ1






