
Elastic + PyTorch for Security – Part 3: Benchmarking Unsupervised Models


Introduction

In the previous posts, we built an unsupervised detection pipeline: behavioral features from Windows authentication telemetry, then an autoencoder that converts deviations into anomaly scores.

At that point, a natural question arises:

Is this the right model?

Answering that question is harder than it seems.

In supervised machine learning, model evaluation is relatively straightforward: you compare predictions against labeled ground truth and compute metrics such as accuracy, precision, or recall.

In security anomaly detection, things are different: reliable labels are scarce, anomalies are rare, and an anomaly is not necessarily an attack.

As a result, choosing a model based on intuition, popularity, or a single demo can be dangerously misleading.

This post focuses on the third stage of the pipeline: benchmarking unsupervised models in a way that reflects SOC constraints.

What This Post Is About

This post is not about proposing a new detection algorithm. Instead, it addresses a more fundamental question:

Given several unsupervised models, how do we decide which one is appropriate for a real security use case?

To answer that, we evaluate multiple models on the same feature space, inject controlled synthetic anomalies (proxy ground truth), and compare detectors using operational metrics under an alert budget.

The goal is comparative: understand trade-offs under realistic constraints, not crown a single “best” model.

Why Benchmarking Matters More Than the Model

In practice, many security ML failures come from evaluation and thresholding choices, not algorithms. Benchmarking forces the discussion into operational terms: alert volume, missed anomalies, and stability over time.

Scope and Continuity

As in the previous posts, the focus remains on authentication telemetry. This keeps the problem well-scoped and the results interpretable.

The same principles, however, apply to richer datasets: process execution, PowerShell activity, lateral movement signals, or network telemetry.

Authentication logs serve here as a controlled baseline for understanding how unsupervised models should be evaluated before being trusted.

What Comes Next

The rest of this post will:

  1. Explain why benchmarking unsupervised models in security is fundamentally hard.
  2. Introduce synthetic anomalies as a controlled evaluation tool.
  3. Compare several common unsupervised detectors under the same conditions.
  4. Interpret the results from a security and SOC perspective.

Only after that does it make sense to talk about deployment, orchestration, and production pipelines — which is where the next part of this series will go.


The Benchmarking Problem in Unsupervised Security ML

Benchmarking machine learning models usually assumes one thing:

We can compare model outputs to reliable ground truth labels.

In supervised learning, this assumption holds. You train a model, compare predictions against labeled data, and compute metrics such as accuracy, precision, recall, or ROC curves.

In security anomaly detection, that assumption breaks almost immediately.

Why Ground Truth Is Rare (or Unreliable)

In real authentication telemetry, reliable labels are the exception rather than the rule.

Just because an authentication event was never investigated does not mean it was benign. Likewise, many investigations end without a clear conclusion. As a result, historical labels — when they exist at all — are a poor foundation for evaluating unsupervised models.

Anomaly Does Not Mean Attack

In authentication telemetry, anomaly is a statement about deviation, not intent. Benign anomalies are common: logins at odd hours, a new IP address, or a temporary spike in activity can all look abnormal without being malicious.

Benchmarking must therefore focus on behavioral separation, not “attack detection”.

Why Traditional Metrics Are Misleading

Without reliable labels, common ML metrics such as precision, recall, and ROC curves lose meaning.

A model that flags more anomalies may look better on paper while being unusable in practice due to alert overload.

This is why unsupervised security ML requires different evaluation thinking.

Synthetic Anomalies as a Proxy Ground Truth

To evaluate models in a controlled way, we introduce synthetic anomalies.

These are not random perturbations. They are behaviorally implausible events constructed with intent.

Examples include logins at unusual but plausible hours, authentication from rarely used IP addresses, or higher-than-usual authentication volume.

The purpose of synthetic anomalies is not to simulate attacks. It is to create known deviations against which models can be compared.

This provides a limited but useful proxy for ground truth: we know which events are intentionally abnormal.

What Synthetic Anomalies Can (and Cannot) Tell Us

Synthetic anomalies allow us to answer questions such as: does a model rank known deviations above normal behavior, and how do models compare under the same alert budget?

They do not tell us whether a model detects real attacks, or what its false positive rate would be in production.

This distinction is important.

The goal is comparative evaluation, not absolute validation.

From “Best Model” to “Best Trade-Off”

Instead of asking:

Which model is best?

We ask:

Which model provides the best trade-off between sensitivity, stability, and operational cost for this use case?

Answering that question requires a shared feature space, a controlled source of known anomalies, and operational metrics evaluated under a fixed alert budget.

The next section introduces the models included in this benchmark and explains why each one was chosen.


Models Under Evaluation

To benchmark unsupervised anomaly detection meaningfully, models must be evaluated under the same conditions and with a clear understanding of what each one actually does.

All models in this post operate on the same feature space, use the same preprocessing, and are evaluated using the same alert budget and metrics whenever possible.

The goal is not to exhaustively cover every algorithm, but to compare representative families of unsupervised detectors commonly used in security and anomaly detection.

Isolation Forest (IF)

Isolation Forest is based on a simple idea:

Anomalies are easier to isolate than normal points.

The model builds an ensemble of random decision trees. Points that require fewer splits to isolate are considered more anomalous.

In security contexts, Isolation Forest is a common baseline: fast, robust, and often conservative, tending to favor strong global deviations over subtle local ones.

One-Class Support Vector Machine (OC-SVM)

One-Class SVM learns a boundary that encloses normal data.

Rather than modeling anomalies directly, it answers the question: is this point outside the learned “normal” region?

Key properties: it models a single “normal” class, is sensitive to the choice of kernel and the nu parameter, and depends heavily on feature scaling.

In security contexts, OC-SVM can be effective on authentication telemetry, but it requires careful preprocessing to avoid overfitting.

Local Outlier Factor (LOF)

Local Outlier Factor is a density-based method.

Instead of comparing points globally, it compares each point to its local neighborhood and measures how isolated it is relative to nearby points.

Key properties: each point is scored against the density of its nearest neighbors, so results depend on the choice of n_neighbors and on how well-defined local neighborhoods are.

In security contexts, LOF is particularly useful when anomalies are locally rare rather than globally extreme, for example behavior that is unusual for one user but common across the environment.

Autoencoder (Reconstruction-Based)

An autoencoder is a neural network trained to reconstruct its input. The intuition is straightforward: trained on predominantly normal behavior, it reconstructs normal events well and deviant events poorly, so reconstruction error can serve as an anomaly score.

Autoencoders require careful normalization and sensible capacity/regularization.

In security contexts, autoencoders are well suited to behavioral baselining and continuous anomaly scoring, which is how the model from Part 2 is reused later in this post.

DBSCAN (Density-Based Clustering)

DBSCAN is fundamentally different from the other models.

It is a clustering algorithm that labels points either as members of dense clusters or as noise points that belong to no cluster.

Key properties: it produces binary labels rather than a continuous score, and its behavior is governed by the eps and min_samples parameters instead of a tunable threshold.

In security contexts, noise points can be treated as candidate anomalies, but the absence of a ranking makes it hard to control alert volume.

DBSCAN is included deliberately to illustrate an important point: not all anomaly detectors are ranking-based, and evaluation strategies must adapt accordingly.

Why These Models?

These models were selected because they represent different detection philosophies:

| Family | Model |
| --- | --- |
| Tree-based isolation | Isolation Forest |
| Boundary learning | One-Class SVM |
| Local density | LOF |
| Reconstruction | Autoencoder |
| Density clustering | DBSCAN |

Together, they provide a broad view of how unsupervised models behave under the same behavioral data.

The next section introduces the evaluation framework used to compare these models fairly, focusing on alert budgets and operational metrics rather than abstract ML scores.


A Common Evaluation Framework

Comparing unsupervised models only makes sense if they are evaluated under a shared and realistic framework.

In security, a model is not judged solely by how well it separates data points mathematically, but by how usable its output is in practice.

This section defines the evaluation principles used throughout this benchmark.

What We Can (and Cannot) Measure

Because the problem is unsupervised, there are clear limits to what evaluation can tell us.

We cannot reliably measure whether flagged events correspond to real attacks, or what the true false-positive rate would be against real malicious activity.

We can measure how each model ranks known synthetic deviations, how many events it flags under a fixed alert budget, and how its behavior differs across anomaly types.

This benchmark focuses on what can be measured consistently.

Alert Budget: A Security-Driven Constraint

Detection systems do not operate with unlimited analyst capacity, so we evaluate models under an explicit alert budget.

We define an alert budget as:

The percentage of total events that a model is allowed to flag as anomalous.

Examples: a 0.5% budget means flagging at most 1 event in 200, a 1% budget at most 1 in 100, and a 2% budget at most 1 in 50.

By fixing the alert budget, we ensure that models are compared under equal operational cost, not equal mathematical thresholds.

Ranking-Based Evaluation

Most models in this benchmark produce a continuous anomaly score.

For these models:

  1. events are ranked by anomaly score,
  2. the top N% (defined by the alert budget) are flagged as anomalies,
  3. all other events are treated as normal.

Budget Filter

This mirrors real usage: analysts investigate the most unusual events first.
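
As a quick illustration (not code from the notebook), this budget filter can be written as a percentile cut over a higher-is-more-anomalous score vector; flag_top_budget is a hypothetical helper used only here:

import numpy as np

def flag_top_budget(scores, budget_pct):
    """Flag the top budget_pct% highest-scoring events as anomalies."""
    scores = np.asarray(scores)
    thr = np.percentile(scores, 100 - budget_pct)
    return (scores >= thr).astype(int)

# With a 1% budget, roughly 1 event in 100 is flagged:
# flags = flag_top_budget(scores, budget_pct=1.0)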

Noise-Based Models (DBSCAN)

DBSCAN does not produce a continuous score.

Instead, it directly labels events as cluster members or as noise, and noise points are treated as anomalies.

Because of this, DBSCAN cannot be tuned to a target alert budget by moving a threshold; its alert volume is an indirect consequence of eps and min_samples.

For DBSCAN, evaluation therefore focuses on how many events it labels as noise and how many synthetic anomalies fall inside that noise set.

This highlights a practical limitation of clustering-based detectors in operational settings.

Confusion Matrix in an Unsupervised Context

To quantify results, we use a confusion matrix adapted to synthetic anomalies: a true positive is a flagged synthetic event, a false positive a flagged non-synthetic event, a false negative a missed synthetic event, and a true negative an unflagged non-synthetic event.

This does not imply that non-synthetic events are benign; it only reflects whether an event was injected for evaluation.

Metrics Used

From the confusion matrix, we derive precision, recall, and false positive rate, together with the raw counts (TP, FP, TN, FN).

These metrics align directly with security concerns: missed anomalies, wasted analyst time, and operational cost.

Why Accuracy Is Not Used

Accuracy is excluded: in highly imbalanced settings, a model that flags nothing can look “accurate” while being operationally useless.

This benchmark prioritizes actionable signal over statistical comfort.

Outcome of the Framework

By fixing the feature space, the preprocessing, the alert budget, and the evaluation metrics,

we ensure that observed differences between models reflect behavioral differences, not evaluation artifacts.

With the framework defined, we can move to synthetic anomaly generation and the benchmark dataset.


Synthetic Anomalies and Experimental Setup

Without reliable ground truth, we introduce a controlled reference: synthetic anomalies (intentional behavioral deviations) used for benchmarking. The objective is not attack simulation, but reproducible stress tests for different detectors.

Why Synthetic Anomalies Are Needed

Synthetic anomalies provide a controlled proxy: we know what was altered, can quantify ranking behavior, and compare detectors under identical conditions. This enables relative benchmarking, not real-world attack validation.

Design Principles for Synthetic Anomalies

To avoid unrealistic “easy outliers”, the anomalies used here follow three principles:

  1. Plausibility: all feature values remain within realistic ranges.

  2. Subtlety: many anomalies are ambiguous, not obviously malicious.

  3. Diversity: different anomaly patterns stress different detection mechanisms.

The goal is to approximate plausible deviations seen in real authentication telemetry.

Anomaly Families

Four types of synthetic anomalies are generated, with increasing severity:

| Type | Description |
| --- | --- |
| off_hours | Logins at unusual but still plausible times |
| rare_ip | Authentication from infrequently used IPs |
| high_activity | Higher-than-usual authentication volume |
| combined | Multiple deviations in the same event |

Each anomaly type targets a different behavioral dimension: time, frequency, recurrence, or combinations thereof.

Generating Synthetic Anomalies

Synthetic anomalies are created by sampling existing normal events and applying controlled transformations.

The implementation below is used as-is in the notebook:

import numpy as np

def generate_synthetic_anomalies(
    X,
    n_anomalies,
    random_state=42,
):
    """
    Generate realistic authentication anomalies with different patterns
    and severities. All values remain within plausible ranges.
    """
    rng = np.random.default_rng(random_state)

    X_syn = X.sample(
        n=n_anomalies,
        replace=True,
        random_state=random_state
    ).copy()

    # Assign anomaly types
    anomaly_types = rng.choice(
        ["off_hours", "rare_ip", "high_activity", "combined"],
        size=n_anomalies,
        p=[0.35, 0.30, 0.20, 0.15],
    )
    X_syn["anomaly_type"] = anomaly_types

    # ---- Type A: off-hours login (subtle) ----
    mask = anomaly_types == "off_hours"
    X_syn.loc[mask, "hour"] = rng.choice([5, 6, 21, 22], size=mask.sum())

    # ---- Type B: rare IP usage (medium) ----
    mask = anomaly_types == "rare_ip"
    ip_low = float(X_syn["user_ip_count"].quantile(0.10))
    X_syn.loc[mask, "user_ip_count"] = ip_low * rng.uniform(0.8, 1.2, size=mask.sum())

    # ---- Type C: higher-than-usual activity (medium-high) ----
    mask = anomaly_types == "high_activity"
    mult = rng.uniform(1.15, 1.35, size=mask.sum())
    X_syn.loc[mask, "user_login_count"] = (
        X_syn.loc[mask, "user_login_count"].to_numpy() * mult
    )

    # ---- Type D: combined deviation (high severity) ----
    mask = anomaly_types == "combined"
    X_syn.loc[mask, "hour"] = rng.choice([2, 3, 4, 22, 23], size=mask.sum())

    ip_low = float(X_syn["user_ip_count"].quantile(0.10))
    X_syn.loc[mask, "user_ip_count"] = ip_low * rng.uniform(0.6, 1.0, size=mask.sum())

    mult = rng.uniform(1.30, 1.60, size=mask.sum())
    X_syn.loc[mask, "user_login_count"] = (
        X_syn.loc[mask, "user_login_count"].to_numpy() * mult
    )

    return X_syn

This function ensures that anomalies are plausible, subtle, and diverse, matching the design principles above.

Injecting Anomalies Into the Dataset

Synthetic anomalies are injected at a low rate (≈1%) to reflect real-world imbalance.

Normal events are labeled with is_synthetic = 0, anomalies with is_synthetic = 1.

import pandas as pd

X_base_labeled = X_base.copy()
X_base_labeled["is_synthetic"] = 0
X_base_labeled["anomaly_type"] = "normal"  # baseline rows carry an explicit label

X_syn = generate_synthetic_anomalies(
    X_base,
    n_anomalies=int(0.01 * len(X_base)),
)

X_syn_labeled = X_syn.copy()
X_syn_labeled["is_synthetic"] = 1

df_mix = pd.concat(
    [X_base_labeled, X_syn_labeled],
    ignore_index=True
)

A quick sanity check shows the resulting class balance and anomaly mix:

df_mix["is_synthetic"].value_counts()

Output:

is_synthetic
0    26417
1      264
Name: count, dtype: int64

df_mix["anomaly_type"].value_counts()

Output:

anomaly_type
normal           26417
off_hours           94
rare_ip             76
high_activity       61
combined            33
Name: count, dtype: int64

This distribution ensures that anomalies remain rare (roughly 1% of all events) and that each anomaly family appears often enough to measure per-type recall.

Preparing the Benchmark Dataset

The final benchmark dataset consists of the feature matrix X_mix (restricted to FEATURE_COLUMNS), the evaluation labels y_mix, and the scaled matrix X_mix_scaled produced by the scaler fitted earlier in the series:

X_mix = df_mix[FEATURE_COLUMNS].copy()
y_mix = df_mix["is_synthetic"].values

X_mix_scaled = scaler.transform(X_mix)

At this point, all models will see exactly the same data, and differences in performance can be attributed to model behavior rather than preprocessing artifacts.

What This Setup Enables

With this experimental setup, we can now answer how each model ranks known behavioral deviations, how the models compare under the same alert budget, and which anomaly families each model tends to miss.

What it does not answer is whether these models detect real attacks, or what their false positive rate would be in production.

Those questions require production feedback and long-term monitoring. Next, we train/score each model under identical conditions.


Training and Scoring the Models

With the benchmark dataset defined, we apply multiple unsupervised models to the same standardized feature space and compare how they score events.

At this stage, we are not deciding alerts. We only ask:

How does each model score authentication events relative to one another?

To keep the comparison fair, all models use the same feature matrix (X_mix_scaled) and preprocessing; synthetic labels are used only for evaluation.

A Note on Scoring Conventions

Unsupervised models do not agree on what an “anomaly score” is. For consistency, we enforce:

Higher score = more anomalous

When a model’s native output follows the opposite convention, it is inverted.

This keeps ranking and alert-budget evaluation consistent across models.
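
The snippets that follow also assume a results_mix DataFrame used to collect per-model outputs next to the evaluation labels. It is created in the notebook rather than shown in the post; a minimal sketch would be:

# Assumption: results_mix simply mirrors df_mix's label columns
results_mix = df_mix[["is_synthetic", "anomaly_type"]].copy()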


Isolation Forest

Isolation Forest assigns an anomaly score based on how quickly a point is isolated in random trees.

Scikit-learn returns higher scores for more normal points, so we invert the score.

from sklearn.ensemble import IsolationForest

iforest = IsolationForest(
    n_estimators=200,
    contamination="auto",
    random_state=42
)

iforest.fit(X_mix_scaled)

scores_iforest = -iforest.score_samples(X_mix_scaled)

results_mix["score_iforest"] = scores_iforest

To understand how well the model separates normal and synthetic events, we inspect the score distribution:

plot_score_distribution(
    scores_iforest,
    y_mix,
    "Isolation Forest — anomaly score distribution"
)

Isolation Forest Score Distribution

This visualization shows how much overlap exists between normal behavior and injected anomalies, which directly affects achievable recall under tight alert budgets.
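
plot_score_distribution is a small plotting helper from the notebook and is not shown in the post. A minimal sketch of such a helper, assuming numpy, matplotlib, and seaborn are imported as np, plt, and sns:

def plot_score_distribution(scores, y_true, title):
    """Overlay score histograms for normal vs synthetic events."""
    scores = np.asarray(scores)
    y_true = np.asarray(y_true).astype(int)

    plt.figure(figsize=(7, 4))
    sns.histplot(scores[y_true == 0], bins=50, stat="density",
                 label="normal", alpha=0.5)
    sns.histplot(scores[y_true == 1], bins=50, stat="density",
                 label="synthetic anomaly", alpha=0.5)
    plt.title(title)
    plt.xlabel("anomaly score")
    plt.ylabel("density")
    plt.legend()
    plt.show()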


One-Class SVM

One-Class SVM learns a boundary enclosing normal behavior.

Points far outside this boundary receive large negative decision values. We invert them so that higher values indicate greater deviation.

from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(
    kernel="rbf",
    gamma="scale",
    nu=0.01
)

ocsvm.fit(X_mix_scaled)

scores_ocsvm = -ocsvm.decision_function(X_mix_scaled)

results_mix["score_ocsvm"] = scores_ocsvm

Score distribution:

plot_score_distribution(
    scores_ocsvm,
    y_mix,
    "One-Class SVM — anomaly score distribution"
)

One-Class SVM Score Distribution

OC-SVM often produces strong separation when the feature space is well conditioned, but can also be sensitive to scaling and noise.


Local Outlier Factor (LOF)

LOF compares the local density of each point to its neighbors.

Scikit-learn's score_samples returns higher values for more normal points, so we negate it so that higher values indicate anomalies.

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor(
    n_neighbors=35,
    novelty=True
)

lof.fit(X_mix_scaled)

scores_lof = -lof.score_samples(X_mix_scaled)

results_mix["score_lof"] = scores_lof

Score distribution:

plot_score_distribution(
    scores_lof,
    y_mix,
    "Local Outlier Factor — anomaly score distribution"
)

Local Outlier Factor Score Distribution

LOF is particularly sensitive to local deviations, which makes it useful for certain anomaly types but unstable in others.


Autoencoder (Reconstruction Error)

The autoencoder trained in Part 2 is reused here without retraining.

The anomaly score is defined as the mean squared reconstruction error per event.

import torch

model.eval()

with torch.no_grad():
    X_t = torch.tensor(X_mix_scaled, dtype=torch.float32).to(device)
    X_hat = model(X_t).cpu().numpy()

scores_autoencoder = ((X_hat - X_mix_scaled) ** 2).mean(axis=1)

results_mix["score_autoencoder"] = scores_autoencoder

Score distribution:

plot_score_distribution(
    scores_autoencoder,
    y_mix,
    "Autoencoder — reconstruction error distribution"
)

Autoencoder Score Distribution

Reconstruction-based models tend to produce smooth score distributions, which can be advantageous for thresholding and alert prioritization.


DBSCAN (Noise-Based Detection)

DBSCAN does not produce a continuous anomaly score.

Instead, it labels each point as either a member of a dense cluster or as noise (label -1).

Noise points are treated as anomalies.

from sklearn.cluster import DBSCAN

dbscan = DBSCAN(
    eps=0.8,
    min_samples=25
)

labels = dbscan.fit_predict(X_mix_scaled)

results_mix["is_anomaly_dbscan"] = (labels == -1).astype(int)
results_mix["dbscan_cluster"] = labels

Because DBSCAN is not ranking-based, its alert volume cannot be adjusted to match a target budget, and flagged events cannot be prioritized by severity.

This limitation will be important when interpreting results.
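
Since the noise rate is determined by eps and min_samples rather than by a budget, a quick sanity check (an illustration, not notebook output) is to compare DBSCAN's implicit alert rate with the budgets used elsewhere:

noise_mask = results_mix["is_anomaly_dbscan"] == 1

# Fraction of all events labeled as noise: DBSCAN's implicit "alert rate"
noise_rate = noise_mask.mean()

# Synthetic anomalies that end up in the noise set
captured = ((results_mix["is_synthetic"] == 1) & noise_mask).sum()

print(f"noise rate: {noise_rate:.2%}, synthetic anomalies captured: {captured}")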

PCA is used here purely for visualization, projecting the high-dimensional feature space into two dimensions so that DBSCAN’s clustering and noise behavior can be inspected by a human.

from sklearn.decomposition import PCA

pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_mix_scaled)

plt.figure(figsize=(7, 5))
sns.scatterplot(
    x=X_2d[:, 0],
    y=X_2d[:, 1],
    hue=results_mix["is_anomaly_dbscan"],
    style=results_mix["is_synthetic"],
    alpha=0.7
)
plt.title("DBSCAN — noise vs clustered (PCA projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()

This projection shows how DBSCAN separates dense behavioral regions from sparse points: noise points lie outside well-defined clusters, but lack any ranking or notion of severity, making DBSCAN unsuitable for alert prioritization despite its ability to flag structural outliers.

DBSCAN Noise vs Clustered


Summary of Scoring Outputs

At the end of this step, results_mix contains a continuous anomaly score for each ranking-based model (score_iforest, score_ocsvm, score_lof, score_autoencoder) and the binary DBSCAN outputs (is_anomaly_dbscan, dbscan_cluster).

All subsequent evaluation uses only these outputs (no retraining). Next, we translate scores into decisions under an alert budget.


Alert Budgets and Recall-Based Evaluation

At this point, each model has produced either a continuous anomaly score or, in the case of DBSCAN, a binary noise label.

Raw scores alone are not actionable. The key operational constraint is:

How many alerts can realistically be investigated?

This is where alert budgets become central.

What Is an Alert Budget?

An alert budget defines the fraction of total events that a detection system is allowed to flag for investigation.

Examples: 0.5%, 1%, and 2% of all events, the budgets evaluated below.

Instead of comparing models at arbitrary thresholds, we compare them under the same operational cost.

This avoids a common pitfall in anomaly detection benchmarking: models that look “better” simply because they generate more alerts.

Recall at Fixed Budget

To evaluate models fairly, we use recall at budget:

Given a fixed alert budget, what fraction of injected anomalies are captured?

This directly answers: if I can only review X% of events, how many injected anomalies appear in that set?

Computing Recall at Budget

The following helper function implements this logic:

def recall_at_budget(scores, y_true, budget_pct):
    scores = np.asarray(scores)
    y_true = np.asarray(y_true).astype(int)

    thr = np.percentile(scores, 100 - budget_pct)
    y_pred = (scores >= thr).astype(int)

    captured = ((y_pred == 1) & (y_true == 1)).sum()
    total_anomalies = y_true.sum()

    recall = captured / (total_anomalies + 1e-12)

    return recall, y_pred.sum(), captured, thr

This function:

  1. ranks events by anomaly score,
  2. selects the top budget_pct percent,
  3. computes how many synthetic anomalies fall within that set.

Recall vs Alert Budget (Global)

We now evaluate all score-based models across multiple budgets:

budgets = [0.5, 1.0, 2.0]
rows = []

model_scores = {
    "iforest": results_mix["score_iforest"].values,
    "ocsvm": results_mix["score_ocsvm"].values,
    "lof": results_mix["score_lof"].values,
    "autoencoder": results_mix["score_autoencoder"].values,
}

for model, scores in model_scores.items():
    for b in budgets:
        r, selected, captured, thr = recall_at_budget(scores, y_mix, b)
        rows.append({
            "model": model,
            "budget_pct": b,
            "recall": r,
            "selected": selected,
            "captured": captured,
            "threshold": thr,
        })

df_budget = pd.DataFrame(rows)
df_budget

Visualization:

plt.figure(figsize=(7, 4))
sns.lineplot(
    data=df_budget,
    x="budget_pct",
    y="recall",
    hue="model",
    marker="o"
)
plt.title("Recall vs Alert Budget")
plt.xlabel("Alert budget (% of events reviewed)")
plt.ylabel("Recall of synthetic anomalies")
plt.ylim(0, 1.05)
plt.show()

Recall vs Alert Budget

How to Read This Plot

This plot reveals trade-offs: some models capture most synthetic anomalies even at tight budgets, while others need a much larger budget to reach comparable recall.

This is exactly the kind of information required to choose a detector for a real SOC.


Confusion-Matrix–Based Evaluation at a Fixed Alert Budget

To compare models operationally, we translate scores into binary decisions under a fixed alert budget and compute confusion-matrix metrics.

From Scores to Decisions Under an Alert Budget

For score-based models (Isolation Forest, OC-SVM, LOF, Autoencoder), evaluation is consistent: rank by score and flag only the top N% (the alert budget).

In this experiment, we fix the alert budget to 1%, meaning that at most 1% of all authentication events would be reviewed by analysts.

The following helper function implements this logic:

def confusion_metrics(scores, y_true, budget_pct):
    scores = np.asarray(scores)
    y_true = np.asarray(y_true).astype(int)

    thr = np.percentile(scores, 100 - budget_pct)
    y_pred = (scores >= thr).astype(int)

    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    tn = ((y_pred == 0) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()

    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    fpr = fp / (fp + tn + 1e-12)

    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "precision": precision,
        "recall": recall,
        "fpr": fpr,
        "threshold": thr
    }

This converts continuous anomaly scores into a binary alert decision, making models directly comparable.

Metrics Used

Each model is evaluated using precision, recall, and false positive rate at the fixed budget, together with the raw confusion counts.

These metrics are far more meaningful for security operations than accuracy or ROC curves.
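
The table below can be produced with a short loop over the score-based models; the exact notebook code is not shown, so this is a sketch (df_confusion is a hypothetical name), with the DBSCAN row computed separately from its noise labels:

rows = []
for model, scores in model_scores.items():
    metrics = confusion_metrics(scores, y_mix, budget_pct=1.0)
    metrics["model"] = model
    rows.append(metrics)

df_confusion = pd.DataFrame(rows)
df_confusion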

Results at 1% Alert Budget

At a fixed 1% alert budget, the results are:

| model | budget_pct | tp | fp | tn | fn | precision | recall | fpr | threshold |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ocsvm | 1.0 | 156 | 114 | 26303 | 108 | 0.5778 | 0.5909 | 0.0043 | 0.5079 |
| autoencoder | 1.0 | 135 | 142 | 26275 | 129 | 0.4874 | 0.5114 | 0.0054 | 0.4081 |
| lof | 1.0 | 59 | 216 | 26201 | 205 | 0.2145 | 0.2235 | 0.0082 | 1.07e9 |
| iforest | 1.0 | 11 | 262 | 26155 | 253 | 0.0403 | 0.0417 | 0.0099 | 0.6855 |
| dbscan | – | 151 | 32 | 26385 | 113 | 0.8251 | 0.5720 | 0.0012 | noise-based |

How to Interpret These Results

Several important insights emerge: OC-SVM and the autoencoder capture the majority of synthetic anomalies within the 1% budget, LOF captures far fewer, Isolation Forest ranks almost none of them in its top 1%, and DBSCAN is precise but offers no way to control its alert volume.

Model choice depends on whether you prioritize recall, precision, alert control, or stability over time.


Recall by Anomaly Type

So far, recall was computed globally, across all synthetic anomalies.

But not all anomalies are the same.

In this benchmark, injected anomalies are labeled by behavioral category: off_hours, rare_ip, high_activity, and combined.

A model that performs well globally may still fail completely on certain anomaly types.

Why Per-Type Recall Matters

Different models are sensitive to different kinds of deviations: global isolation methods favor extreme values, density-based methods favor locally rare patterns, and reconstruction-based models respond to unusual combinations of otherwise plausible features.

Evaluating recall by anomaly type reveals what a model is actually good at.

Computing Recall per Anomaly Type

rows = []

for model, scores in model_scores.items():
    for b in budgets:
        thr = np.percentile(scores, 100 - b)
        y_pred = (scores >= thr).astype(int)

        for atype in df_mix["anomaly_type"].unique():
            if atype == "normal":
                continue

            mask = (df_mix["anomaly_type"] == atype).values
            total = mask.sum()
            captured = ((y_pred == 1) & mask).sum()
            recall = captured / (total + 1e-12)

            rows.append({
                "model": model,
                "budget_pct": b,
                "anomaly_type": atype,
                "recall": recall,
                "total": total,
                "captured": captured,
            })

df_budget_by_type = pd.DataFrame(rows)
df_budget_by_type

Visualization as a heatmap (for a fixed budget):

budget = 1.0

pivot = (
    df_budget_by_type[df_budget_by_type["budget_pct"] == budget]
    .pivot(index="model", columns="anomaly_type", values="recall")
)

plt.figure(figsize=(7, 4))
sns.heatmap(
    pivot,
    annot=True,
    cmap="viridis",
    vmin=0,
    vmax=1
)
plt.title(f"Recall by Anomaly Type (Budget = {budget}%)")
plt.xlabel("Anomaly type")
plt.ylabel("Model")
plt.show()

Recall by Anomaly Type

Interpreting the Heatmap

This heatmap is often the most actionable result: it shows which anomaly families each model captures (or misses) under the same budget, exposing blind spots hidden by global averages.


What These Results Actually Tell Us

At this stage, the benchmark provides operational answers: which models surface anomalies early in the ranking, how much noise each one produces, how they specialize by anomaly type, and how robust they are under tight budgets. It does not establish real attack detection. Instead, it tells us:

How different unsupervised detectors behave when forced to operate like real security systems.


Transition to Conclusions

The final section ties these results together and gives a practical model-selection checklist based on analyst capacity and detection goals.


Conclusions: Choosing an Unsupervised Model in Practice

After benchmarking multiple unsupervised models under the same conditions, one conclusion becomes clear:

There is no universally “best” anomaly detection model.

What exists instead are trade-offs.

Each model encodes a different assumption about what “abnormal” means, and those assumptions interact directly with the feature space, the alert budget, and the anomaly types that matter for the use case.

This is precisely why benchmarking matters more than algorithm choice.

What the Benchmark Actually Demonstrates

This benchmark does not prove that any model detects real attacks.

What it demonstrates is something more practical and more honest: how each model ranks known behavioral deviations, how much alert volume it produces under a fixed budget, and where its blind spots lie.

These are the properties that determine whether a model is usable in production.

Key Observations Across Models

From the experiments, several consistent patterns emerge.

Isolation Forest

Best suited for: Environments where false positives are extremely costly and only strong deviations matter.

One-Class SVM

Best suited for: Well-conditioned behavioral features and environments prioritizing sensitivity.

Local Outlier Factor (LOF)

Best suited for: Scenarios where anomalies are rare relative to local behavior, not globally extreme.

Autoencoder

Best suited for: Behavioral baselining and continuous anomaly scoring pipelines.

DBSCAN

Best suited for: Noise detection, pre-filtering, or exploratory analysis — not primary alerting.

Why Alert Budgets Matter More Than Scores

A key takeaway is that alert budget dominates evaluation: models with similar score distributions can behave very differently once constrained by analyst capacity. Metrics that ignore this constraint (e.g., accuracy or unconstrained curves) can be misleading.

Synthetic Anomalies: A Tool, Not a Truth

The synthetic anomalies used here are not attacks. They provide relative insight into ranking behavior and blind spots under controlled conditions, not absolute proof of real-world detection.

How to Choose a Model in a Real Pipeline

Rather than selecting a model upfront, a more reliable approach is:

  1. Start with feature quality (bad features make every model bad).
  2. Benchmark multiple models early (prefer comparative behavior over intuition).
  3. Fix the alert budget before tuning (operational cost first).
  4. Evaluate by anomaly family (global metrics hide blind spots).

In practice, production systems often combine detectors and rely on correlation rather than single-model alerts.
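
As one illustration of combining detectors (not something implemented earlier in this series), scores on different scales can be merged by averaging their percentile ranks; score_ensemble is a hypothetical column name:

score_cols = ["score_iforest", "score_ocsvm", "score_lof", "score_autoencoder"]

# Convert each model's score to a percentile rank so different scales
# become comparable, then average the ranks into a single ensemble score.
rank_df = results_mix[score_cols].rank(pct=True)
results_mix["score_ensemble"] = rank_df.mean(axis=1)

# The ensemble can then be evaluated under the same alert budget, e.g.
# recall_at_budget(results_mix["score_ensemble"], y_mix, 1.0)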

What This Series Has Shown So Far

Across the series, we built features, trained an unsupervised baseline, and benchmarked detectors under realistic constraints—without relying on perfect labels.


What Comes Next

With benchmarking in place, the next natural steps are no longer purely ML-related.

They involve engineering: deployment, orchestration, monitoring, and integration into production pipelines.

ML Pipeline

The next part of this series will focus on: from models to pipelines — how to operationalize these detectors in a real security environment.


Part 4: From Models to Pipelines — Operationalizing Unsupervised Detection in Security Systems


