No Peeking Ahead: Time-Aware Graph Fraud Detection

In my last article [1], I shared a number of ideas about building structured graphs, focused mainly on descriptive or unsupervised exploration of data through graph structures. However, when we use graph features to improve our models, the temporal nature of the data must be taken into account. To avoid undesired effects, we need to be careful not to leak future information into the training process. This means the graph (and the features derived from it) must be built in a time-aware, incremental way.

Data leakage is such a widespread problem that a 2023 study by Sayash Kapoor and Arvind Narayanan [2] found that, up to that point, it had affected 294 research papers across 17 scientific fields. They classify the types of data leakage, ranging from textbook errors to open research problems.

The issue is that during prototyping, results often look very promising when they are not. Most of the time, people don’t realize this until models are deployed in production, wasting the time and resources of an entire team. Performance then falls short of expectations without anyone understanding why. This issue can become the Achilles’ heel that undermines enterprise AI initiatives.

…

ML-based leakage

Data leakage occurs when the training data contains information about the output that won’t be available during inference. This produces overly optimistic evaluation metrics during development, creating misleading expectations. When the model is later deployed in real-time systems with the correct data flow, its predictions become untrustworthy because it learned from information that is not accessible at prediction time.

Ethically, we must strive to produce results that truly reflect the capabilities of our models, rather than sensational or misleading findings. When a model moves from prototyping to production, it should not fail to generalize; if it does, it can exhibit significant problems during inference or deployment, and its practical value is undermined.

This is especially dangerous in sensitive contexts like fraud detection, which often involve imbalanced data (far fewer fraud cases than non-fraud). In these situations, the harm caused by data leakage is more pronounced because the model might overfit to leaked information related to the minority class, producing seemingly good results for the minority label, which is the hardest to predict. This can lead to missed fraud in production, with serious practical consequences.

Data leakage examples can be categorized into textbook errors and open research problems [2] as follows:

Textbook errors:

Imputing missing values using the entire dataset instead of only the training set, causing information about the test data to leak into training.
Duplicated or very similar instances appearing in both training and test sets, such as images of the same object taken from slightly different angles.
Lack of clear separation between training and test datasets, or no test set at all, so that models access test information before evaluation.
Using proxies of outcome variables that indirectly reveal the target variable.
Random data splitting in scenarios where multiple related records belong to a single entity, such as several claim status events from the same customer.
Synthetic data augmentation performed over the whole dataset, instead of only on the training set.
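
To make the first item concrete, here is a minimal sketch of mean imputation where the fill value is learned from the training rows only. The claim amounts and the helper names `fit_mean_imputer` and `apply_imputer` are illustrative assumptions, not part of the article's pipeline.

```python
from statistics import mean


def fit_mean_imputer(train_values):
    """Learn the fill value from the given rows only (None marks a missing value)."""
    observed = [v for v in train_values if v is not None]
    return mean(observed)


def apply_imputer(values, fill_value):
    """Replace missing entries with a previously learned fill value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical claim amounts; the test rows happen to be much larger.
train_amounts = [100.0, None, 300.0]
test_amounts = [5000.0, None]

# Leak-safe: the statistic is computed on the training split alone
# and then reused, unchanged, on the test split.
safe_fill = fit_mean_imputer(train_amounts)
train_imputed = apply_imputer(train_amounts, safe_fill)
test_imputed = apply_imputer(test_amounts, safe_fill)

# Leaky variant, shown only for contrast: the mean also "sees" the test rows.
leaky_fill = fit_mean_imputer(train_amounts + test_amounts)
```

In this toy case the leaky fill value is pulled towards the test distribution, which is exactly the kind of information the model should not have during training.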

Open problems for research:

Temporal leakage occurs when future data unintentionally influences training. In such cases, strict separation is challenging because timestamps can be noisy or incomplete.
Updating database records without lineage or an audit trail (for example, changing fraud status without storing history) can cause models to train on future or altered data unintentionally.
Complex real-world data integration and pipeline issues that introduce leakage through misconfiguration or lack of controls.

These cases are part of a broader taxonomy reported in machine learning research, highlighting data leakage as a critical and often underinvestigated risk for reliable modeling [3]. Such issues arise even with simple tabular data, and they can remain hidden when working with many features if each one isn’t individually checked.

Now, let’s consider what happens when we include nodes and edges in the equation…

…

Graph-based leakage

In graph-based models, leakage can be sneakier than in traditional tabular settings. When features are derived from connected components or topological structures, using future nodes or edges can silently alter the graph’s structure. For example:

Methodologies such as graph neural networks (GNNs) learn context not only from individual nodes but also from their neighbours, which can inadvertently introduce leakage if sensitive or future information is propagated across the graph structure during training.
When the graph structure is overwritten or updated without preserving past events, the model loses valuable context needed for correct temporal analysis. It may again access information at the wrong time, and we lose traceability of potential leakage or of problems in the data from which the graph originates.
Computing graph aggregations like degree, triangles, or PageRank on the entire graph without accounting for the temporal dimension (time-agnostic aggregation) uses all edges: past, present, and future. This causes data leakage because features include information from future edges that wouldn’t be available at prediction time.

Graph temporal leakage occurs when features, edges, or node relationships from future time points are included during training in a way that violates the chronological order of events. This results in edges or training features that incorporate data from time steps that should be unknown.
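
The time-agnostic aggregation problem described above can be illustrated with a small, dependency-free sketch. The edge list, the node names, and the `degree` helper are hypothetical; the point is only the contrast between counting all edges and counting those visible at prediction time.

```python
from datetime import date

# Hypothetical claim-to-claim edges as (source, target, timestamp) tuples.
edges = [
    ("C1", "C2", date(2024, 1, 10)),
    ("C1", "C3", date(2024, 2, 5)),
    ("C2", "C3", date(2024, 2, 5)),
    ("C1", "C4", date(2024, 3, 20)),
]


def degree(node, edge_list, cutoff=None):
    """Count edges touching `node`; with a cutoff, ignore edges created after it."""
    return sum(
        1
        for src, dst, ts in edge_list
        if node in (src, dst) and (cutoff is None or ts <= cutoff)
    )


prediction_time = date(2024, 1, 31)

# Time-agnostic: uses past, present, and future edges (leaky).
leaky_degree = degree("C1", edges)

# Time-aware: only edges that already existed at prediction time.
safe_degree = degree("C1", edges, cutoff=prediction_time)
```

Here the leaky degree of `C1` is 3 while the time-aware degree is 1: two of the three connections did not exist yet on the prediction date.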

…

How can this be fixed?

We can build a single graph that captures the entire history by assigning timestamps or time intervals to edges. To analyze the graph up to a specific point in time (t), we “look back in time” by filtering the graph to include only the events that occurred before or at that cutoff. This approach is ideal for preventing data leakage because it ensures that only past and present information is used for modeling. Additionally, it offers flexibility in defining different time windows for safe and accurate temporal analysis.

In this article, we build a temporal graph of insurance claims where the nodes represent individual claims (plus the entities they reference), and temporal links are created when two claims share an entity (e.g., phone number, license plate, repair shop, etc.) to ensure the correct event order. Graph-based features are then computed to feed fraud prediction models, carefully avoiding the use of future information (no peeking).

The idea is simple: if two claims share a common entity and one occurs before the other, we connect them at the moment this connection becomes visible (Figure 1). As we explained in the previous section, the way we model the data is crucial, not only to capture what we’re really looking for, but also to enable the use of advanced methods such as Graph Neural Networks (GNNs).

**Figure 1:** Claims and entities (such as phone numbers) are added to the graph as they arrive over time. When a new claim (Claim_id2 at time t) shares a previously observed entity with an earlier claim (Claim_id1 at time t-1), a directed temporal edge (blue arrow) is created from the earlier to the later claim. This construction shows when relationships become visible and ensures causal, time-respecting connectivity in the graph. Image by Author.

In our graph model, we save the timestamp when an entity is first seen, capturing the moment it appears in the data. However, in many real-world scenarios, it is also useful to consider a time interval spanning the entity’s first and last appearances (for example, generated with another variable like plate or email). This interval can provide richer temporal context, reflecting the lifespan or active period of nodes and edges, which is valuable for dynamic temporal graph analyses and advanced model training.
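
As a rough illustration of the interval idea, the sketch below tracks the first and last date on which each entity value appears. The record layout (a dict with `claim_date` and an `entities` mapping) and the function name `entity_intervals` are assumptions made for this example only.

```python
from datetime import date


def entity_intervals(claims):
    """Map each (entity_type, value) pair to its (first_seen, last_seen) dates."""
    intervals = {}
    # Process claims chronologically so the last write is the latest appearance.
    for claim in sorted(claims, key=lambda c: c["claim_date"]):
        for entity_type, value in claim["entities"].items():
            key = (entity_type, value)
            first_seen, _ = intervals.get(key, (claim["claim_date"], None))
            intervals[key] = (first_seen, claim["claim_date"])
    return intervals


# Hypothetical claims sharing a phone number.
claims = [
    {
        "claim_id": "C1",
        "claim_date": date(2024, 1, 10),
        "entities": {"insurer_phone_number": "555-0101"},
    },
    {
        "claim_id": "C2",
        "claim_date": date(2024, 3, 2),
        "entities": {"insurer_phone_number": "555-0101", "repair_shop": "SHOP_1"},
    },
]

intervals = entity_intervals(claims)
```

Note that the `last_seen` value is itself future information for any claim dated before it, so in a leak-safe setting the interval would have to be recomputed on the data available up to each cutoff.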

Code

The code is available in this repository: Link to the repository

To run the experiments, set up a Python ≥3.11 environment with the required libraries (e.g., torch, torch-geometric, networkx, etc.). It is recommended to use a virtual environment (via venv or conda) to keep dependencies isolated.

Code Pipeline

The diagram in Figure 2 shows the end-to-end workflow for fraud detection with GraphSAGE. Step 1 loads the (simulated) raw claims data. Step 2 builds a time-stamped directed graph (entity→claim and older-claim→newer-claim). Step 3 performs temporal slicing to create train, validation, and test sets, then indexes nodes, builds features, and finally trains and validates the model.

Figure 2. End-to-end pipeline for temporal fraud modeling: **(I)** load data → **(II)** build and save a time-stamped graph and **(III)** prepare subgraphs (temporal slicing → node indexing → feature building → PyG `Data`) for training and inference. Image by Author.

Step 1: Simulated Fraud Dataset

We first simulate a dataset of insurance claims. Each row in the dataset represents a claim and includes variables such as:

Entities: insurer_license_plate, insurer_phone_number, insurer_email, insurer_address, repair_shop, bank_account, claim_location, third_party_license_plate
Core information: claim_id, claim_date, type_of_claim, insurer_id, insurer_name
Target: fraud (binary variable indicating whether the claim is fraudulent or not)

These entity attributes act as potential links between claims, allowing us to infer connections through shared values (e.g., two claims using the same repair shop or phone number). By modeling these implicit relationships as edges in a graph, we can build powerful topological representations that capture suspicious behavioral patterns and enable downstream tasks such as feature engineering or graph-based learning.

Table 1. Overview of the simulated insurance claims data, showing key entity fields and the fraud label for each record used in the exercise. Table by Author.

Figure 3. Distribution of fraud and non-fraud claims (left) and evolution of daily fraud rate with claim volume (right) in the simulated dataset. Fraud base rate of approximately 12.45%. Image by Author.

Step 2: Graph Modeling

We use the NetworkX library to build our graph model. For small-scale examples, NetworkX is sufficient and effective. For more advanced graph processing, tools like Memgraph or Neo4j could be used. To model with NetworkX, we create nodes and edges representing entities and their relationships, enabling network analysis and visualization within Python.

So, we have:

one node per claim, with node key equal to the claim_id and attributes node_type and claim_date
one node per entity value (phone, plate, bank account, shop, etc.). Node key: "{column_name}:{value}", with attributes node_type (e.g., "insurer_phone_number", "bank_account", "repair_shop") and label (the raw value without the prefix)

The graph includes these two types of edges:

claim_id(t-1) → claim_id(t): when two claims share an entity (with edge_type='claim-claim')
entity_value → claim_id: direct link to the shared entity (with edge_type='entity-claim')

These edges are annotated with:

edge_type: to distinguish the relation (claim→claim vs entity→claim)
entity_type: the column from which the value comes (like bank_account)
shared_value: the actual value (like a phone number or license plate)
timestamp: when the edge was added (based on the current claim’s date)
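
The node and edge conventions above could be sketched with NetworkX roughly as follows. This is not the repository's code: the function name `build_temporal_graph`, the reduced `ENTITY_COLUMNS` list, and the sample records are assumptions for illustration. Because a `DiGraph` holds a single edge per node pair, two claims that share several entities keep only the attributes of the last shared entity processed; a `MultiDiGraph` would be one way to keep them all.

```python
from datetime import date

import networkx as nx

# Subset of the entity columns, for brevity (assumption for this sketch).
ENTITY_COLUMNS = ["insurer_phone_number", "repair_shop"]


def build_temporal_graph(claims):
    """Build a directed, time-stamped claim/entity graph from claim records."""
    graph = nx.DiGraph()
    seen = {}  # entity node key -> claim ids that already used it, oldest first
    # Claims are processed in date order; ties keep their input order.
    for claim in sorted(claims, key=lambda c: c["claim_date"]):
        cid, when = claim["claim_id"], claim["claim_date"]
        graph.add_node(cid, node_type="claim", claim_date=when)
        for column in ENTITY_COLUMNS:
            value = claim.get(column)
            if value is None:
                continue
            key = f"{column}:{value}"
            if key not in graph:
                graph.add_node(key, node_type=column, label=value)
            attrs = dict(entity_type=column, shared_value=value, timestamp=when)
            # entity -> claim edge, stamped with the current claim's date.
            graph.add_edge(key, cid, edge_type="entity-claim", **attrs)
            # older claim -> newer claim edge for every earlier user of the value.
            for older in seen.setdefault(key, []):
                graph.add_edge(older, cid, edge_type="claim-claim", **attrs)
            seen[key].append(cid)
    return graph


# Hypothetical records: both claims share a phone number.
claims = [
    {"claim_id": 1, "claim_date": date(2024, 1, 10),
     "insurer_phone_number": "555-0101", "repair_shop": "SHOP_1"},
    {"claim_id": 2, "claim_date": date(2024, 2, 5),
     "insurer_phone_number": "555-0101", "repair_shop": "SHOP_2"},
]
graph = build_temporal_graph(claims)
```

The resulting graph links claim 1 to claim 2 (never the reverse), and the edge carries the date of the newer claim, which is the moment the relationship became visible.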

To interpret our simulation, we implemented a script that generates explanations for why a claim is flagged as fraud. In Figure 4, claim 20000695 is considered risky mainly because it is associated with repair shop SHOP_856, which acts as an active hub with multiple claims linked around similar dates, a pattern often seen in fraud “bursts.” Additionally, this claim shares a license plate and address with several other claims, creating dense connections to other suspicious cases.

Figure 4. Visual explanation for claim 20000695: the left panel shows an entity-claim network highlighting connections between the claim and key entities such as the repair shop, location, and license plate; the right panel displays the claim-claim network, revealing how this claim clusters with others via shared entities. The lower panel summarizes risk factors supporting the fraud label. Streamlit code. Image by Author.

This code saves the graph as a pickle file: temporal_graph_with_edge_attrs.gpickle.

Step 3: Graph Preparation & Training

Representation learning transforms complex, high-dimensional data (like text, images, or sensor readings) into simplified, structured formats (often called embeddings) that capture meaningful patterns and relationships. These learned representations improve model performance, interpretability, and the ability to transfer learning across different tasks.

We train a neural network to map each input to a vector in ℝᵈ that encodes what matters. In our pipeline, GraphSAGE does representation learning on the claim graph: it aggregates information from a node’s neighbours (shared phones, shops, plates, etc.) and mixes that with the node’s own attributes to produce a node embedding. These embeddings are then fed to a small classifier head to predict fraud.

3.1 Temporal slicing

From the single full graph we create in Step 2, we extract three time-sliced subgraphs for train, validation, and test. For each split we choose a cutoff date and keep only (1) claim nodes with claim_date ≤ cutoff, and (2) edges whose timestamp ≤ cutoff. This produces a time-consistent subgraph for that split: no information from the future leaks into the past, matching how the model would run in production with only historical data available.
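
A minimal sketch of this slicing rule with NetworkX is shown below. The function name `slice_graph`, the toy graph, and the three cutoff dates are hypothetical. The sketch assumes, as in the graph model above, that every edge timestamp equals the date of the newer claim it points to, so an edge that passes the cutoff never drags in a claim dated after it.

```python
from datetime import date

import networkx as nx


def slice_graph(graph, cutoff):
    """Return a new graph holding only what was visible at `cutoff`."""
    sliced = nx.DiGraph()
    # (2) keep edges whose timestamp is at or before the cutoff
    for u, v, attrs in graph.edges(data=True):
        if attrs["timestamp"] <= cutoff:
            sliced.add_edge(u, v, **attrs)
    for node, attrs in graph.nodes(data=True):
        is_claim = attrs.get("node_type") == "claim"
        if is_claim and attrs["claim_date"] <= cutoff:
            # (1) keep claims dated at or before the cutoff, even if isolated
            sliced.add_node(node, **attrs)
        elif node in sliced:
            # restore attributes of entity nodes that were added via kept edges
            sliced.add_node(node, **attrs)
    return sliced


# Toy graph following the article's node and edge conventions.
g = nx.DiGraph()
g.add_node("A", node_type="claim", claim_date=date(2024, 1, 10))
g.add_node("B", node_type="claim", claim_date=date(2024, 2, 5))
g.add_node("C", node_type="claim", claim_date=date(2024, 3, 20))
g.add_node("repair_shop:SHOP_1", node_type="repair_shop", label="SHOP_1")
g.add_edge("repair_shop:SHOP_1", "A", edge_type="entity-claim", timestamp=date(2024, 1, 10))
g.add_edge("repair_shop:SHOP_1", "B", edge_type="entity-claim", timestamp=date(2024, 2, 5))
g.add_edge("A", "B", edge_type="claim-claim", timestamp=date(2024, 2, 5))
g.add_edge("B", "C", edge_type="claim-claim", timestamp=date(2024, 3, 20))

# Hypothetical cutoffs for the three splits.
train_graph = slice_graph(g, date(2024, 1, 31))
val_graph = slice_graph(g, date(2024, 2, 28))
test_graph = slice_graph(g, date(2024, 3, 31))
```

Each later slice contains the earlier ones, so the validation and test graphs still give the model the historical context it would have in production.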

3.2 Node indexing

Give every node in the sliced graph an integer index 0…N-1. This is just an ID mapping (like tokenization). We’ll use these indices to align features, labels, and edges in tensors.
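
The mapping can be as simple as the sketch below. Sorting by the string form of the key is an assumption made here so that the indices are reproducible across runs; any fixed ordering would do.

```python
def index_nodes(nodes):
    """Assign each node key a stable integer index 0..N-1 (ordered by string form)."""
    ordered = sorted(nodes, key=str)
    return {node: i for i, node in enumerate(ordered)}


# Hypothetical mix of claim ids (integers) and entity keys (strings).
nodes = [20000695, "repair_shop:SHOP_856", 20000101]
node_to_idx = index_nodes(nodes)

# Reverse lookup, handy for mapping predictions back to claim ids.
idx_to_node = {i: node for node, i in node_to_idx.items()}
```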

3.3 Build node features

Create one feature row per node:

Type one-hot (claim, phone, email, …).
Degree stats computed within the sliced graph: normalized in-degree, out-degree, and undirected degree.
Prior fraud from older neighbors (claims only): fraction of older connected claims (direct claim→claim predecessors) that are labeled fraud, considering only neighbors that existed before the current claim’s time.
We also set the label y (1/0) for claims and 0 for entities, and mark claims in claim_mask so loss/metrics are computed only on claims.
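
The prior-fraud feature and the label/mask construction can be sketched without any library, as below. The dictionaries and the helper `prior_fraud_fraction` are hypothetical stand-ins for what would be read from the sliced graph.

```python
from datetime import date


def prior_fraud_fraction(claim, predecessors, claim_date, fraud):
    """Share of strictly older claim->claim predecessors that are labeled fraud."""
    older = [
        p for p in predecessors.get(claim, [])
        if claim_date[p] < claim_date[claim]
    ]
    if not older:
        return 0.0  # no history yet: fall back to a neutral value
    return sum(fraud[p] for p in older) / len(older)


# Hypothetical claims, labels, and direct claim->claim predecessors.
claim_date = {"A": date(2024, 1, 10), "B": date(2024, 1, 20), "C": date(2024, 2, 5)}
fraud = {"A": 1, "B": 0, "C": 0}
predecessors = {"B": ["A"], "C": ["A", "B"]}

prior = {c: prior_fraud_fraction(c, predecessors, claim_date, fraud) for c in claim_date}

# Labels and mask over all nodes: entity nodes get y = 0 and are masked out.
nodes = ["A", "B", "C", "repair_shop:SHOP_1"]
y = [fraud.get(n, 0) for n in nodes]
claim_mask = [n in fraud for n in nodes]
```

One assumption worth stating: this feature treats the fraud label of an older claim as already known when the newer claim arrives. If labels are only confirmed after an investigation delay, the feature would need a label-availability date as well.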

3.4 Build PyG Data

Translate edges (u→v) into a 2×E integer tensor edge_index using the node indices and add self-loops so each node also keeps its own features at every layer. Pack everything into a PyG Data(x, edge_index, y, claim_mask) object. Edges are directed, so message passing respects time (earlier→later).
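
The edge translation can be sketched with plain lists, which keeps the example free of dependencies; the helper name `build_edge_index` and the toy mapping are assumptions for illustration.

```python
def build_edge_index(edges, node_to_idx):
    """Return [sources, targets] (a 2 x E layout) with one self-loop per node appended."""
    sources = [node_to_idx[u] for u, v in edges]
    targets = [node_to_idx[v] for u, v in edges]
    # Self-loops let each node keep its own features during aggregation.
    for i in sorted(node_to_idx.values()):
        sources.append(i)
        targets.append(i)
    return [sources, targets]


# Hypothetical index mapping and directed edges (earlier -> later).
node_to_idx = {"repair_shop:SHOP_1": 0, "A": 1, "B": 2}
edges = [("repair_shop:SHOP_1", "A"), ("repair_shop:SHOP_1", "B"), ("A", "B")]

edge_index = build_edge_index(edges, node_to_idx)
```

With PyTorch installed, `torch.tensor(edge_index, dtype=torch.long)` would turn this nested list into the tensor that a PyG `Data` object expects. The direction is preserved: row 0 holds sources and row 1 holds targets, so no edge points from a later claim back to an earlier one.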

3.5 GraphSAGE

We implement a GraphSAGE architecture in PyTorch Geometric with the SAGEConv layer. We run two GraphSAGE convolution layers (mean aggregation), ReLU, dropout, then a linear head to predict fraud vs non-fraud. We train full-batch (no neighbor sampling). The loss is weighted to handle class imbalance and is computed only on claim nodes via claim_mask. After each epoch we evaluate on the validation split and choose the decision threshold that maximizes F1; we keep the best model by val-F1 (early stopping).
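
The threshold selection step can be sketched independently of the model. The validation labels, scores, and candidate grid below are made-up numbers, and `best_threshold` is a hypothetical helper rather than the repository's function.

```python
def f1_score(y_true, y_pred):
    """F1 for the positive (fraud) class from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true, scores, candidates):
    """Return the candidate threshold with the highest F1 (first one wins ties)."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(y_true, preds)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical validation labels and predicted fraud probabilities (claims only).
val_labels = [0, 0, 1, 0, 1, 1]
val_scores = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]

threshold, val_f1 = best_threshold(val_labels, val_scores, [0.2, 0.4, 0.6, 0.8])
```

The chosen threshold is then frozen and reused on the test split, so the test metrics are never used to tune the decision rule.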

Figure 5. PyTorch implementation of a GraphSAGE model architecture for node representation learning and prediction on graph data. Image by Author.

3.6 Inference results

Evaluate the best model on the test split using the validation-chosen threshold. Report accuracy, precision, recall, F1, and the confusion matrix. Produce a lift table/plot (how concentrated fraud is by score decile) and export a t-SNE plot of claim embeddings to visualize structure.

**Figure 6:** Model results. **Left:** Lift by decile on the test set; **Right:** t-SNE of claim embeddings (fraud vs. non-fraud). Image by Author.

The lift chart evaluates how well the model ranks fraud: bars show lift by score decile and the line shows cumulative fraud capture. In the top 10–20% of claims (Deciles 1–2), the fraud rate is about 2–3× the average, suggesting that reviewing the top 20–30% of claims would capture a large share of fraud. The t-SNE plot shows several clusters where fraud concentrates, indicating the model learns meaningful relational patterns, while overlap with non-fraud points highlights remaining ambiguity and opportunities for feature or model tuning.
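
For readers unfamiliar with lift, the sketch below shows one way to compute it per score decile: the fraud rate inside each bucket divided by the overall fraud rate. The scores and labels are toy values, not the article's results, and `lift_by_bucket` is a hypothetical helper that assumes at least one fraud case and at least as many claims as buckets.

```python
def lift_by_bucket(y_true, scores, n_buckets=10):
    """Fraud rate of each score bucket (highest scores first) over the base rate."""
    ranked = [
        label
        for _, label in sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    ]
    base_rate = sum(ranked) / len(ranked)
    size = len(ranked) // n_buckets
    lifts = []
    for b in range(n_buckets):
        start = b * size
        # The last bucket absorbs any remainder.
        end = (b + 1) * size if b < n_buckets - 1 else len(ranked)
        bucket = ranked[start:end]
        lifts.append((sum(bucket) / len(bucket)) / base_rate)
    return lifts


# Toy test set: 20 claims, 4 of them fraud (base rate 20%).
scores = [i / 20 for i in range(20, 0, -1)]   # 1.00, 0.95, ..., 0.05
labels = [1, 1, 1, 0, 0, 1] + [0] * 14

lifts = lift_by_bucket(labels, scores)
```

A lift of 1.0 means a bucket is no richer in fraud than a random sample; values well above 1.0 in the first deciles indicate that the model ranks fraudulent claims towards the top.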

…

Conclusion

Using a graph that only connects older claims to newer claims (past to future) without “leaking” future fraud information, the model successfully concentrates fraud cases in the top scoring groups, achieving about 2–3 times better detection in the top 10–20%. This setup is reliable enough to deploy.

As a test, it is possible to try a version where the graph is two-way or undirected (connections both ways) and compare the spurious improvement with the one-way version. If the two-way version gets significantly better results, it is likely because of “temporal leakage,” meaning future information is improperly influencing the model. This is a way to demonstrate why two-way connections should not be used in real use cases.
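
Why the two-way version leaks can be seen in a tiny sketch of one message-passing step. The claim names and the helper `message_sources` are hypothetical; the function simply lists which nodes send information to a given node.

```python
def message_sources(node, edges, directed=True):
    """Nodes whose information reaches `node` in one message-passing step."""
    sources = {u for u, v in edges if v == node}
    if not directed:
        # Undirected: information also flows backwards along each edge.
        sources |= {v for u, v in edges if u == node}
    return sources


# Edges point from the earlier claim to the later claim.
edges = [("claim_jan", "claim_feb"), ("claim_feb", "claim_mar")]

one_way = message_sources("claim_feb", edges)
two_way = message_sources("claim_feb", edges, directed=False)
```

In the one-way graph the February claim only hears from January; in the two-way graph it also hears from March, a claim that did not exist when the February claim would have been scored.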

To avoid making the article too long, we will cover the experiments with and without leakage in a separate article. In this article, we focus on developing a model that meets production readiness.

There is still room to improve with richer features, calibration, and small model tweaks, but our focus here is to explain a leak-safe temporal graph methodology that addresses data leakage.

References

[1] Gomes-Gonçalves, E. (2025, January 23). Applications and Opportunities of Graphs in Insurance. Medium. Retrieved September 11, 2025, from https://medium.com/@erikapatg/applications-and-opportunities-of-graphs-in-insurance-0078564271ab

[2] Kapoor, S., & Narayanan, A. (2023). Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 4(9), 100804. Link.

[3] Guignard, F., Ginsbourger, D., Levy Häner, L., & Herrera, J. M. (2024). Some combinatorics of data leakage induced by clusters. Stochastic Environmental Research and Risk Assessment, 38(7), 2815–2828.

[4] Huang, S., et al. (2024). UTG: Towards a Unified View of Snapshot and Event Based Models for Temporal Graphs. arXiv preprint arXiv:2407.12269. https://arxiv.org/abs/2407.12269

[5] Labonne, M. (2022). GraphSAGE: Scaling up Graph Neural Networks. Towards Data Science. Retrieved from https://towardsdatascience.com/introduction-to-graphsage-in-python-a9e7f9ecf9d7/

[6] An Introduction to GraphSAGE. (2025). Weights & Biases. Retrieved from https://wandb.ai/graph-neural-networks/GraphSAGE/reports/An-Introduction-to-GraphSAGE--Vmlldzo1MTEwNzQ1