Elastic + PyTorch for Security – Part 2: Unsupervised Anomaly Detection

Introduction

In the previous post, we focused on transforming raw Windows authentication logs into a compact set of behavioral features. That step was intentionally heavy on data engineering, because anomaly detection only works if the underlying data has structure.

At this stage, we are no longer dealing with logs.

We have:

  - one row per authentication event,
  - a compact set of numerical, behavioral features describing each event, and
  - no raw identifiers (usernames, IP addresses) inside the feature matrix.

This post covers the second stage of the pipeline: learning what normal authentication behavior looks like using an unsupervised machine learning model.

Two clarifications upfront:

  1. This post is not about alerts or SOC workflows.
  2. This post is not about detecting attacks directly.

The goal is simpler:

How unusual is this authentication event compared to normal behavior?

Everything else is built on top of that signal.


Problem Framing: Behavioral Anomaly Detection Without Labels

Authentication anomaly detection is a label-poor problem.

In real environments:

  - confirmed malicious authentication events are rare,
  - reliable labels are expensive and slow to produce, and
  - ground truth, when it exists, is usually incomplete.

Because of this, supervised approaches are usually impractical.

Instead of classifying events as good or bad, we reframe the problem:

Learn what normal authentication behavior looks like, and measure deviations from that baseline.

This framing drives all design decisions that follow.

What This Model Does (and What It Does Not)

A common mistake in security ML is equating anomalies with attacks.

They are not the same.

Examples of normal but unusual behavior include:

  - an administrator logging in late at night during an incident,
  - a user authenticating from a new network while travelling, and
  - a service account whose activity spikes after a deployment.

Conversely, many real attacks appear statistically normal.

For this reason, the model in this post:

  - does not classify events as malicious or benign, and
  - does not emit alerts or verdicts on its own.

What it provides is a continuous anomaly score that reflects behavioral deviation.

Authentication failures illustrate this well.

Failures are often part of normal behavior:

  - users mistype passwords,
  - credentials expire, and
  - services retry with stale secrets.

Excluding failures from training would artificially narrow the learned behavior space and increase false positives. Instead, failures are treated as contextual signals, not anomalies by default.

Scope and Intent

This work intentionally focuses on authentication events only.

Authentication telemetry provides a consistent, high-volume record of who authenticated, from where, and when, and it is available in virtually every Windows environment.

This scope is a proof of concept, not a limitation.

The same pipeline can be extended to include additional event sources, such as process, network, or endpoint telemetry.

As additional signals are added, the notion of “normal behavior” becomes richer, while the modeling approach remains the same.


Input Data and Feature Space

Before introducing a model, it is important to clarify what the model actually sees.

The autoencoder does not operate on raw logs, usernames, or IP addresses. It operates on a numerical feature space derived from authentication behavior.

Each authentication event is represented by:

  - the hour of day (hour),
  - the day of week (dayofweek),
  - whether the attempt failed (is_failure),
  - the number of distinct source IPs observed for the user (user_ip_count), and
  - the user's overall login volume (user_login_count).

This feature set is intentionally compact. The goal is to encode behavioral regularities, not full event semantics.

Why Identifiers Are Excluded

A common question is:

Why not include usernames or IP addresses directly?

Because identifiers do not generalize.

They are:

  - high-cardinality,
  - specific to a single environment, and
  - unstable over time as users join, leave, and change machines or addresses.

Including them would encourage memorization rather than learning behavior.

Instead, identifiers are used indirectly to derive signals such as activity frequency and IP stability. Raw identifiers remain available outside the model for investigation and explainability.
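As a minimal illustration (not the actual Part 1 code), derived counts of this kind can be computed with pandas group-by aggregations. The user and source_ip column names below are hypothetical, and the exact definitions used in Part 1 may differ:

import pandas as pd

# Hypothetical raw authentication events, one row per event.
events = pd.DataFrame({
    "user": ["alice", "alice", "bob", "alice", "bob"],
    "source_ip": ["10.0.0.1", "10.0.0.1", "10.0.0.7", "10.0.0.9", "10.0.0.7"],
})

# Per-user login volume, broadcast back onto every event row.
events["user_login_count"] = events.groupby("user")["user"].transform("count")

# Number of distinct source IPs observed for each user.
events["user_ip_count"] = events.groupby("user")["source_ip"].transform("nunique")

# The raw identifiers are then dropped; only the derived counts feed the model.
print(events)

The identifiers do the grouping, but only the derived, behavior-level counts enter the feature matrix.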

From Events to Behavior

Raw event fields answer what happened. Behavioral features answer how this event relates to past behavior.

A failed login, an off-hours login, or a new IP address is not inherently suspicious. What matters is how these properties combine over time and frequency.

This shift — from events to behavior — makes unsupervised detection feasible.


Preparing the Feature Space for Modeling

Although all features are numerical, they behave very differently.

Temporal features are bounded. Frequency-based features exhibit heavy-tailed distributions, which is typical in authentication telemetry.

User Ip Count Distribution

If left untreated, these differences would cause reconstruction loss to be dominated by a small subset of features.

To address this:

  - frequency-based count features are log-transformed, and
  - all features are standardized before training.

This does not remove rare events; it mainly reduces scale effects so that one count-based feature cannot dominate the reconstruction loss.

Authentication failures are intentionally retained, as they are part of normal behavior.

At the end of this step, the feature space is numerically stable and suitable for reconstruction-based learning.


Why an Autoencoder?

The problem is now well defined:

  - learn what normal authentication behavior looks like, without labels, and
  - produce a continuous score describing how far each event deviates from that baseline.

Autoencoders fit this setting naturally.

They learn to reconstruct typical behavior and produce a continuous reconstruction error when behavior deviates from what was learned.

They:

  - require no labels,
  - output a continuous reconstruction error rather than a binary decision, and
  - stay small and fast for a low-dimensional feature space like this one.

The autoencoder used here is intentionally minimal and serves as a baseline for later comparison with other unsupervised techniques.

Autoencoder Diagram

With the modeling choice justified, we can now move from concepts to implementation.

The next section walks through the training notebook step by step.


Training an Autoencoder for Authentication Behavior

All experiments in this post were conducted using a notebook.
Rather than presenting the notebook verbatim, this section follows its structure step by step and explains the rationale behind each stage.

The objective is not to tune a model aggressively, but to demonstrate a clean, reproducible baseline for unsupervised behavioral detection.


Step 1 — Loading and Validating the Feature Dataset

The first step of the notebook focuses on validating the feature dataset produced in Part 1.

At this stage, no modeling decisions are made.
The objective is to ensure that the data is structurally sound, internally consistent, and safe to use as input for an unsupervised model.

Loading the Feature Matrix

The feature dataset is loaded from a Parquet file generated during the feature extraction phase:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

FEATURE_PATH = "../data/auth_features.parquet"
X = pd.read_parquet(FEATURE_PATH)
X.head()

Each row corresponds to a single authentication event, and each column represents a behavioral feature derived from Windows authentication logs.

Dataset Shape and Data Types

Before proceeding, the overall shape and data types are inspected:

print("Shape:", X.shape)
print("Columns:", X.columns.tolist())
print("\nDtypes:\n", X.dtypes)
Shape: (26417, 5)
Columns: ['hour', 'dayofweek', 'is_failure', 'user_ip_count', 'user_login_count']

Dtypes:
 hour                float64
dayofweek           float64
is_failure          float64
user_ip_count       float64
user_login_count    float64
dtype: object

This confirms:

  - the dataset contains 26,417 events and the 5 expected features, and
  - all columns are stored as float64.

Ensuring consistent data types at this stage avoids subtle issues later during scaling or model training.

Explicit Feature Selection

To guarantee consistency with the feature engineering performed in Part 1, the expected feature set is defined explicitly:

FEATURE_COLUMNS = [
    "hour",
    "dayofweek",
    "is_failure",
    "user_ip_count",
    "user_login_count",
]

The dataset is then validated against this expected schema:

missing = [c for c in FEATURE_COLUMNS if c not in X.columns]
extra = [c for c in X.columns if c not in FEATURE_COLUMNS]

if missing:
    raise ValueError(f"Missing columns in parquet: {missing}")
if extra:
    print(f"Warning: extra columns found (ignored): {extra}")

This defensive check ensures that:

  - any missing expected feature fails loudly, and
  - unexpected extra columns are surfaced instead of silently changing the input space.

The feature matrix is then reordered explicitly:

X = X[FEATURE_COLUMNS].copy()

This guarantees a stable and reproducible feature order across experiments.

Missing and Infinite Values

Before any modeling step, the dataset is checked for invalid values:

na_counts = X.isna().sum()
print("Null counts:\n", na_counts)

is_inf = np.isinf(X.to_numpy()).sum()
print("\nNumber of +/-inf values:", is_inf)

if na_counts.any() or is_inf > 0:
    raise ValueError("Found NaNs or infinite values. Fix before modeling.")

These checks catch common data issues early, such as:

  - NaNs introduced by gaps or joins upstream, and
  - infinite values produced during feature computation.

Failing fast at this stage prevents undefined behavior during training.

Range Validation for Bounded Features

Several features are expected to lie within strict bounds:

  - hour must lie in [0, 23],
  - dayofweek must lie in [0, 6], and
  - is_failure must be 0 or 1.

These constraints are validated explicitly:

bad_hour = ~X["hour"].between(0, 23)
bad_dow = ~X["dayofweek"].between(0, 6)
bad_fail = ~X["is_failure"].isin([0, 1])

print("Bad hour rows:", bad_hour.sum())
print("Bad dayofweek rows:", bad_dow.sum())
print("Bad is_failure rows:", bad_fail.sum())

if bad_hour.any() or bad_dow.any() or bad_fail.any():
    display(X[bad_hour | bad_dow | bad_fail].head(20))
    raise ValueError("Range check failed. Fix feature extraction before modeling.")

This step validates that feature extraction logic has preserved semantic correctness.

Descriptive Statistics

A high-level statistical summary is then computed:

X.describe().T
                    count         mean        std     min     25%     50%     75%     max
hour              26417.0    15.558845   5.703017     0.0    13.0    17.0    19.0    23.0
dayofweek         26417.0     1.948859   1.454149     0.0     1.0     2.0     3.0     5.0
is_failure        26417.0     0.000644   0.025360     0.0     0.0     0.0     0.0     1.0
user_ip_count     26417.0    50.302797  10.199039    48.0    48.0    48.0    48.0    97.0
user_login_count  26417.0  1056.680660   0.835593  1056.0  1056.0  1056.0  1057.0  1059.0

This summary shows that temporal features are bounded as expected (hour in [0, 23], dayofweek in [0, 5] in this dataset), while the frequency-based features take a narrow range here (e.g., user_login_count is tightly concentrated around ~1056 and user_ip_count around ~48–97).

Note: this dataset is synthetic, so count-based features are much less variable than what you would typically see in a production environment.

At this point, no transformations are applied. The goal is simply to understand the raw feature behavior.

Feature Correlation

Finally, linear correlations between features are examined:

corr = X.corr()

plt.figure(figsize=(6, 5))
sns.heatmap(
    corr,
    annot=True,
    cmap="coolwarm",
    center=0,
    fmt=".2f"
)
plt.title("Feature correlation matrix (Pearson)")
plt.show()

Feature Correlation Matrix

The correlation heatmap suggests that no single feature is perfectly correlated with the others (no values close to ±1), which is a good sign: each feature contributes some independent signal to the model.

Preparing the Dataset for Modeling

After validation, the full dataset is retained for training:

X_all = X.copy()
print("Total events used for training:", len(X_all))
Total events used for training: 26417

At this point, the feature space has been validated but not modified.

The next step focuses on conditioning the feature distributions to make them suitable for reconstruction-based learning.


Step 2 — Conditioning, Normalization, and Dataset Splitting

Once the feature space has been validated, the next step is to prepare it for model training.

This stage does not introduce any modeling logic yet.
Its purpose is to ensure that the numerical properties of the data are compatible with neural network training.

Why Conditioning Is Necessary

Although all features are numerical, they do not behave in the same way.

From the previous step, two important observations emerge:

  - the temporal features (hour, dayofweek) are bounded, while
  - the frequency-based features (user_login_count, user_ip_count) are count-like and concentrated in a narrow, skewed range.

If left untreated, these differences would cause the reconstruction loss to be dominated by frequency-based features.

For this reason, frequency stabilization is applied before normalization.

Log-Transforming Frequency-Based Features

The following features represent counts:

  - user_login_count
  - user_ip_count

A logarithmic transformation is applied to compress extreme values:

COUNT_COLS = ["user_login_count", "user_ip_count"]

X_tf = X_all.copy()

for c in COUNT_COLS:
    X_tf[c] = np.log1p(X_tf[c])

This transformation:

  - compresses large values while preserving their ordering, and
  - maps 0 to 0, since log1p(x) = log(1 + x).

It does not remove rare behavior — it simply expresses it on a more learnable scale.
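For intuition, here is how log1p compresses the extremes of the raw count ranges seen earlier (user_ip_count spans 48–97, user_login_count spans 1056–1059):

import numpy as np

raw_counts = np.array([48, 97, 1056, 1059])
print(np.log1p(raw_counts))
# approx: [3.8918, 4.585, 6.9632, 6.966]

A spread of 48–97 collapses to roughly 3.89–4.58, and 1056–1059 to roughly 6.963–6.966, which matches the statistics below.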

Verifying the Effect of the Transformation

After transformation, the distributions of the frequency-based features are re-examined:

X_tf[COUNT_COLS].describe().T
                    count      mean       std       min       25%       50%       75%       max
user_login_count  26417.0  6.963833  0.000790  6.963190  6.963190  6.963190  6.964136  6.966024
user_ip_count     26417.0  3.924745  0.145015  3.891820  3.891820  3.891820  3.891820  4.584967

The resulting statistics confirm that:

  - both count features now live on a compressed scale (roughly 3.9–4.6 for user_ip_count and 6.96–6.97 for user_login_count), and
  - no extreme values remain to dominate the reconstruction loss.

At this stage, no other features are modified.

Train / Validation Split

Although the task is unsupervised, a validation set is still required.

The validation split is used to:

  - monitor reconstruction loss on data the model never trains on, and
  - detect overfitting during training.

Both splits represent the same underlying behavior distribution.

from sklearn.model_selection import train_test_split

X_train, X_val = train_test_split(
    X_tf,
    test_size=0.2,
    random_state=42,
    shuffle=True,
)

print("Train shape:", X_train.shape)
print("Validation shape:", X_val.shape)
Train shape: (21133, 5)
Validation shape: (5284, 5)

No filtering is applied at this stage:

  - failures and rare events remain in both splits, and
  - no rows are labeled or removed as "anomalous" before training.

Feature Normalization

Neural networks are sensitive to feature scale.

After conditioning, features are normalized using standard scaling:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_all_scaled = scaler.transform(X_tf)

The scaler is fitted only on the training set, then applied consistently to validation and inference data.

This avoids data leakage and ensures that the learned notion of “normal” reflects what the model actually saw during training.

Sanity Check After Scaling

As a final validation step, the scaled training data is inspected:

pd.DataFrame(
    X_train_scaled,
    columns=X_train.columns
).describe().loc[["mean", "std"]]
               hour     dayofweek    is_failure  user_ip_count  user_login_count
mean  -1.075918e-16  2.790661e-17  5.715812e-18   3.802024e-15      8.857132e-13
std    1.000024e+00  1.000024e+00  1.000024e+00   1.000024e+00      1.000024e+00

The scaled training set has mean ≈ 0 and std ≈ 1 for every feature (the tiny residual means, on the order of 1e-16 to 1e-13, are ordinary floating-point noise), confirming that scaling was applied correctly.

Outcome of This Step

At the end of this phase:

  - the count features are log-transformed,
  - all features are standardized using statistics from the training set only, and
  - train and validation splits are ready for the model.

The next step introduces the autoencoder architecture and training procedure.


Step 3 — Training a Simple Autoencoder in PyTorch

With the feature space conditioned, normalized, and split into train/validation sets, we can train an autoencoder.

The purpose of this model is not classification.
It learns to reconstruct typical authentication behavior and will later be used to compute a reconstruction error score.

PyTorch Setup (Device and DataLoaders)

First, we import PyTorch components and select a device (GPU if available, otherwise CPU):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

Next, we wrap the scaled NumPy arrays into PyTorch datasets and dataloaders.

Dataloaders allow training in mini-batches, which:

  - keeps memory usage bounded, and
  - provides shuffled mini-batches for stochastic optimization.

train_ds = TensorDataset(torch.tensor(X_train_scaled, dtype=torch.float32))
val_ds   = TensorDataset(torch.tensor(X_val_scaled, dtype=torch.float32))

train_loader = DataLoader(train_ds, batch_size=256, shuffle=True)
val_loader   = DataLoader(val_ds, batch_size=256, shuffle=False)

Autoencoder Architecture

The input space is only 5-dimensional. A deep or complex network would be unnecessary and harder to reason about.

We use a small architecture that is intentionally easy to explain:

  - encoder: 5 → 16 → 3 (Linear, ReLU, Linear),
  - decoder: 3 → 16 → 5 (Linear, ReLU, Linear).

The 3-dimensional bottleneck forces the model to learn a compact representation of normal behavior.

class Autoencoder(nn.Module):
    def __init__(self, input_dim=5, hidden_dim=16, latent_dim=3):
        super().__init__()

        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat

Instantiate the model:

model = Autoencoder(
    input_dim=len(FEATURE_COLUMNS),
    hidden_dim=16,
    latent_dim=3
).to(device)

model

Loss Function and Optimizer

To train an autoencoder, we minimize reconstruction loss.

Mean squared error (MSE) is used because:

  - the inputs are standardized continuous features, and
  - squared error penalizes large per-feature deviations, which is exactly the signal we want to surface.

We optimize using Adam, with a small weight decay term for regularization.

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)

Training Loop (Train + Validation)

Training proceeds in epochs.

At each epoch:

  - the model is updated on mini-batches from the training set, and
  - the reconstruction loss is then measured on the validation set without gradient updates.

This provides a sanity check that the model generalizes beyond the training batch distribution.

A helper function runs one epoch in either training or evaluation mode:

def run_epoch(model, loader, train: bool):
    model.train(train)

    total_loss = 0.0
    n = 0

    for (xb,) in loader:
        xb = xb.to(device)

        if train:
            optimizer.zero_grad()

        xb_hat = model(xb)
        loss = criterion(xb_hat, xb)

        if train:
            loss.backward()
            optimizer.step()

        total_loss += loss.item() * xb.size(0)
        n += xb.size(0)

    return total_loss / n

Now we train the model and track train/validation loss:

EPOCHS = 50

train_losses = []
val_losses = []

for epoch in range(1, EPOCHS + 1):
    train_loss = run_epoch(model, train_loader, train=True)
    val_loss   = run_epoch(model, val_loader, train=False)

    train_losses.append(train_loss)
    val_losses.append(val_loss)

    print(f"Epoch {epoch:02d} | train={train_loss:.6f} | val={val_loss:.6f}")
Epoch 01 | train=0.978410 | val=0.716642
Epoch 02 | train=0.750922 | val=0.506557
Epoch 03 | train=0.517377 | val=0.297333
Epoch 04 | train=0.325818 | val=0.161109
Epoch 05 | train=0.228896 | val=0.109804
Epoch 06 | train=0.169867 | val=0.081227
Epoch 07 | train=0.114129 | val=0.058324
Epoch 08 | train=0.065550 | val=0.042978
Epoch 09 | train=0.042997 | val=0.032912
...

Training Curve (Convergence Check)

A training curve is plotted to confirm that the model learns stably and does not overfit.

plt.figure(figsize=(6, 4))
sns.lineplot(x=range(1, len(train_losses) + 1), y=train_losses, label="train")
sns.lineplot(x=range(1, len(val_losses) + 1), y=val_losses, label="val")
plt.title("Autoencoder Training Curve (MSE Loss)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.show()

Autoencoder Training Curve

In this run, both training and validation loss drop quickly during the first epochs and then converge smoothly, with no large gap between them in the plotted curve. This suggests the model is learning a consistent reconstruction baseline for this feature space.

Outcome of This Step

At the end of training, we have a model that can reconstruct typical authentication events.

However, the model still does not output “anomaly” decisions.

To detect anomalies, we must:

  1. run the trained model over all events,
  2. compute reconstruction error per event,
  3. and choose a threshold to flag the most unusual cases.

That is the focus of the next step.


Step 4 — From Reconstruction Error to Anomalies (Scoring + Thresholding)

After training, the autoencoder can reconstruct typical authentication behavior.
However, it does not output anomaly labels directly.

To turn reconstruction into detection, we compute a reconstruction error score for each event and define a threshold to flag the most unusual cases.

Running Inference on the Full Dataset

We apply the trained model to the entire feature matrix (X_all_scaled).

This includes:

  - the events seen during training, and
  - the events held out for validation.

model.eval()

with torch.no_grad():
    X_all_t = torch.tensor(X_all_scaled, dtype=torch.float32).to(device)
    X_hat_t = model(X_all_t)
    X_hat = X_hat_t.cpu().numpy()

Computing Reconstruction Error (Anomaly Score)

For each event, we compute mean squared reconstruction error across features:

recon_error = np.mean((X_hat - X_all_scaled) ** 2, axis=1)

This produces a single score per event:

  - low error: the event resembles behavior the model has learned, and
  - high error: the event is poorly reconstructed, i.e., unusual relative to the learned baseline.

At this stage, these are still scores — not anomaly decisions.

Inspecting the Error Distribution

Before choosing a threshold, we inspect the distribution of reconstruction error:

plt.figure(figsize=(6, 4))
sns.histplot(recon_error, bins=50, kde=True)
plt.title("Reconstruction error distribution")
plt.xlabel("Reconstruction error (MSE)")
plt.ylabel("Count")
plt.show()

In this dataset, the reconstruction errors concentrate at low values with a visible right tail. That tail is what we exploit when defining a threshold: it captures the most poorly reconstructed (i.e., most unusual) events under the learned baseline.

Reconstruction Error Distribution

This shape is exactly what makes thresholding meaningful.

Attaching Scores Back to the Feature DataFrame

To analyze results, we attach reconstruction error back to an interpretable DataFrame.

We use the transformed feature DataFrame (X_tf) rather than the standardized version, because it is easier to interpret feature values in human terms.

df_scores = X_tf.copy()
df_scores["recon_error"] = recon_error

Now each event has both:

  - its (log-transformed) feature values, and
  - its reconstruction error score.

Thresholding Strategy 1: Percentile-Based Threshold

A practical and widely used thresholding approach is to flag the top X% most unusual events.

For example, using the 99.5th percentile:

p = 99.5
threshold_pct = np.percentile(recon_error, p)
threshold_pct

We flag anomalies as those exceeding the threshold:

df_scores["is_anomaly_pct"] = (df_scores["recon_error"] > threshold_pct).astype(int)
df_scores["is_anomaly_pct"].value_counts()

This approach is attractive operationally because it provides direct control over alert volume.

For example:

  - a 99.5th-percentile threshold flags roughly 0.5% of events (about 132 of the 26,417 events in this dataset), and
  - raising the percentile directly lowers the expected alert volume.

Threshold Strategy 1
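As a quick sketch outside the notebook, the relationship between the chosen percentile and the expected alert volume can be checked directly on the scores computed above:

# How many events would be flagged at different percentile cut-offs?
for p in (99.0, 99.5, 99.9):
    thr = np.percentile(recon_error, p)
    n_flagged = int((recon_error > thr).sum())
    print(f"p={p:5.1f} | threshold={thr:.4f} | flagged={n_flagged}")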

Thresholding Strategy 2: IQR-Based Threshold (Robust Alternative)

As a second option, we can use an IQR-based rule:

threshold = Q3 + k * IQR

where IQR = Q3 - Q1.

q1, q3 = np.percentile(recon_error, [25, 75])
iqr = q3 - q1

k = 3.0
threshold_iqr = q3 + k * iqr
threshold_iqr

And flag anomalies:

df_scores["is_anomaly_iqr"] = (df_scores["recon_error"] > threshold_iqr).astype(int)
df_scores["is_anomaly_iqr"].value_counts()

Threshold Strategy 2

This method is less directly controllable than percentiles, but often more robust when the error distribution has irregular tails.
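A simple way to compare the two strategies is to cross-tabulate their flags and see how often they agree (a sketch using the columns created above):

# Overlap between percentile-based and IQR-based anomaly flags.
overlap = pd.crosstab(
    df_scores["is_anomaly_pct"],
    df_scores["is_anomaly_iqr"],
    rownames=["percentile"],
    colnames=["iqr"],
)
print(overlap)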

Inspecting Top Anomalies

Finally, we inspect the highest-scoring events:

df_scores.sort_values("recon_error", ascending=False).head(20)
       hour  dayofweek  is_failure  user_ip_count  user_login_count  recon_error  is_anomaly_pct
19430  13.0        1.0         1.0       4.584967           6.96508     3.972526               1
 2296   0.0        2.0         0.0       4.574711           6.96508     1.311368               1
14365   0.0        2.0         0.0       4.584967           6.96508     1.269451               1
14377   0.0        2.0         0.0       4.584967           6.96508     1.269451               1
14384   0.0        2.0         0.0       4.584967           6.96508     1.269451               1
14370   0.0        2.0         0.0       4.584967           6.96508     1.269451               1
14376   0.0        2.0         0.0       4.584967           6.96508     1.269451               1
14363   0.0        2.0         0.0       4.584967           6.96508     1.269451               1
14385   1.0        2.0         0.0       4.584967           6.96508     1.111287               1
14393   1.0        2.0         0.0       4.584967           6.96508     1.111287               1
19158   0.0        1.0         0.0       4.574711           6.96319     1.093410               1
19155   0.0        1.0         0.0       4.574711           6.96319     1.093410               1
19151   0.0        1.0         0.0       4.574711           6.96319     1.093410               1
19136   0.0        1.0         0.0       4.574711           6.96319     1.093410               1
19150   0.0        1.0         0.0       4.574711           6.96319     1.093410               1
 1136   0.0        1.0         0.0       4.574711           6.96319     1.093410               1
13112   0.0        1.0         0.0       4.574711           6.96319     1.093410               1
 1112   0.0        1.0         0.0       4.574711           6.96319     1.093410               1
13101   0.0        1.0         0.0       4.574711           6.96319     1.093410               1
19163   0.0        1.0         0.0       4.574711           6.96319     1.093410               1

In this run, the top-scoring rows are driven by combinations that the model reconstructs poorly. For example, the highest anomaly includes is_failure = 1 together with high user_ip_count, while many of the other top rows correspond to off-hours logins (hour 0–1) combined with high user_ip_count.

These scores are not “attack labels”; they are triage candidates that should be investigated by joining back to the raw Elasticsearch event context (user, source IP, host, logon type, etc.).
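One way to support that triage is to join the scores back to the original event context. The sketch below assumes the raw events from Part 1 are available as a DataFrame raw_events that shares the same row index as the feature matrix, with hypothetical column names; in practice, the scores could equally be written back to an Elasticsearch index and investigated there:

# Hypothetical: raw_events carries the original context fields and shares
# the same index as df_scores (user, source_ip, host, logon_type are assumed names).
top_anomalies = (
    df_scores.sort_values("recon_error", ascending=False)
    .head(20)
    .join(raw_events[["user", "source_ip", "host", "logon_type"]])
)
print(top_anomalies)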

Outcome of This Step

At the end of this scoring phase, we have:

  - a reconstruction error score for every authentication event,
  - two candidate thresholding strategies (percentile-based and IQR-based), and
  - a ranked list of the most unusual events for triage.

This completes the full unsupervised detection workflow: feature validation → conditioning and scaling → autoencoder training → reconstruction-error scoring → thresholding.

ML Pipeline


Summary

In this post, we built a complete unsupervised anomaly detection pipeline for authentication telemetry.

Starting from a compact set of behavioral features, we:

  - validated the feature dataset produced in Part 1,
  - conditioned and normalized the feature space,
  - trained a small PyTorch autoencoder on unlabeled events,
  - scored every event by reconstruction error, and
  - applied two thresholding strategies to flag the most unusual behavior.

Throughout the process, several design principles guided the implementation:

  - behavioral features instead of raw identifiers,
  - failures retained as part of normal behavior,
  - a deliberately minimal model as a baseline, and
  - strict separation of training and validation data, with the scaler fitted on training data only.

The result is not a rule engine or an attack classifier, but a behavioral baseline that highlights statistically unusual authentication activity.

This kind of detector is most effective when used as an upstream signal: to reduce noise, prioritize investigations, and complement traditional security controls.


What’s Next

While autoencoders provide a strong baseline, they are not the only option for unsupervised detection.

In the next part of this series, we will compare this approach against alternative unsupervised anomaly detection models.

The goal is not to declare a single “best” model, but to understand the trade-offs between sensitivity, stability, interpretability, and operational cost.

From there, the series will move toward orchestration, deployment, and operationalization of these models within a real security pipeline.


Part 3: Benchmarking Unsupervised Models for Authentication Anomaly Detection


