Windows Event Log Architecture: Why Your SIEM Is Probably Missing 30% of Events and How to Verify It

April 18, 2026 · 31 min read

Threat Intelligence & Detection Engineering

An analyst flags a suspicious lateral movement alert. You pull the investigation timeline. There is a 47-minute gap in process creation events from a critical server, right across the window where the attacker moved. The EDR shows nothing. The SIEM shows nothing. Post-incident forensics on the local machine reveals 6,800 events that never left the endpoint. The Security event log overwrote itself. The WEF subscription had a filter bug. The WEC server was under load. Nobody noticed because nobody measured. This scenario is not hypothetical: it is one of the most common root causes of detection gaps found during post-incident reviews, and it is almost entirely preventable.

Why This Matters More Than Any Detection Rule You'll Write

Security teams invest enormous effort writing detection rules, tuning Sigma, and expanding MITRE ATT&CK coverage. Those efforts are worthless if the underlying events never reach your SIEM.

The assumption baked into virtually every SIEM dashboard is that the event collection pipeline is working. That assumption is almost never tested, and when it fails, it fails silently. There is no alert for "we stopped receiving process creation events from this host." There is no dashboard tile that turns red when your WEC server starts dropping events under load. There is no automatic notification when a GPO conflict silently rolls back your advanced audit policy to defaults.

The result is what security engineers sometimes call coverage theater: you have the rules, you have the dashboards, you have the ATT&CK heatmap lit up, but underneath it all is a collection infrastructure with real gaps, and an attacker who understands Windows internals can operate inside those gaps without ever triggering an alert.

This post goes from first principles (how Windows event logging actually works internally) through the specific failure modes that cause events to be lost, and ends with concrete tools and scripts you can run this week to measure your actual collection fidelity.

Part 1. The Architecture: From Kernel Event to SIEM Record

Understanding where events can be lost requires understanding the full pipeline. Most practitioners know the high-level model. Few know the internals where things actually break.

1.1 Event Tracing for Windows (ETW): The Kernel Foundation

Every Windows event originates in Event Tracing for Windows (ETW), the low-level kernel subsystem that acts as the backbone for all Windows telemetry. ETW is not the same as the Windows Event Log. It is the underlying transport mechanism.

Ten distinct failure points across four layers. An event can be lost at any one of them, with no notification to the analyst on the other end.

1.2 The ETW Ring Buffer: Where Events Are Born and First Lost

ETW operates using in-memory ring buffers: circular memory regions that providers write events into. Consumers (including the Windows Event Log service) read from these buffers. When a buffer fills faster than consumers can drain it, events are lost in memory (overwritten or dropped, depending on the session mode) before they are ever written to disk.

This is not the same as log overwriting (which happens on disk). ETW ring buffer overflow is silent, in-memory loss that leaves no trace of the dropped events, not even a gap in the EventRecordID sequence.

ETW buffer parameters are configurable but almost never tuned:

:: View current ETW session configuration for a specific session
logman query "EventLog-Security" -ets

:: Sample output:
:: Name:                 EventLog-Security
:: Status:               Running
:: Root Path:            %systemdrive%\PerfLogs\Admin
:: Segment:              Off
:: Schedules:            On
:: Segment Max Size:     100 MB
:: 
:: Name:                 EventLog-Security
:: Type:                 Trace
:: Append:               Off
:: Circular:             Off
:: Overwrite:            Off
:: Buffer Size:          64              ← 64KB per buffer
:: Buffers Lost:         0               ← Watch this number
:: Buffers Written:      15432
:: Buffer Flush Timer:   1
:: Clock Type:           System
:: File Mode:            Real-time

The Buffers Lost counter is the key metric. If this is non-zero, events are being dropped in ETW before the Event Log service even sees them. Check this on domain controllers and high-activity servers:

# Confirm the Security log is enabled and see how many records it currently holds
Get-WinEvent -ListLog Security | Select-Object LogName, RecordCount, IsEnabled

# More detailed: check ETW session loss via performance counters.
# Counter and instance names can vary by Windows version; list them first with:
#   Get-Counter -ListSet 'Event Tracing for Windows Session'
$counterPaths = @(
    '\Event Tracing for Windows Session(*)\Events Lost',
    '\Event Tracing for Windows Session(*)\Events Logged per sec'
)
Get-Counter -Counter $counterPaths -SampleInterval 1 -MaxSamples 5
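
If you collect the logman output from many hosts as text, a small parser makes the Buffers Lost check scriptable. The following is an illustrative Python sketch, not part of any Microsoft tooling: parse_buffer_stats is a hypothetical helper, and it assumes the "Name: value" layout shown in the sample output above.

```python
import re

def parse_buffer_stats(logman_output: str) -> dict:
    """Extract the 'Buffers Lost' and 'Buffers Written' values from logman text."""
    stats = {}
    for line in logman_output.splitlines():
        # Tolerate leading ':: ' prefixes and any amount of column padding.
        match = re.match(r"^\W*(Buffers Lost|Buffers Written):\s*(\d+)", line)
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats

def has_etw_loss(stats: dict) -> bool:
    """Any non-zero 'Buffers Lost' value means events were dropped in ETW."""
    return stats.get("Buffers Lost", 0) > 0

# Made-up sample text for illustration only.
sample = """\
Buffer Size:          64
Buffers Lost:         12
Buffers Written:      15432
"""
stats = parse_buffer_stats(sample)
print(stats, "LOSS DETECTED" if has_etw_loss(stats) else "ok")
```

Run it against output captured on a schedule, so that a counter which starts climbing after a change is visible as a trend rather than a one-off reading.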

1.3 The EVTX File: Structure and How Overwrites Work

Windows event logs are stored as .evtx (XML Event Log) files in C:\Windows\System32\winevt\logs\. The format uses a chunked binary structure:

When the log wraps, EventRecordIDs continue incrementing; they do not reset. This means you can detect overwrite gaps by looking for discontinuities in the EventRecordID sequence. A jump from EventRecordID 482,441 to 489,209 means 6,767 events (482,442 through 489,208) were overwritten and are gone.

# Detect EventRecordID gaps that indicate log overwriting
# Run on a remote host or locally
$events = Get-WinEvent -LogName Security -MaxEvents 100 |
    Select-Object RecordId, TimeCreated, Id |
    Sort-Object RecordId

for ($i = 1; $i -lt $events.Count; $i++) {
    $gap = $events[$i].RecordId - $events[$i-1].RecordId
    if ($gap -gt 1) {
        Write-Output "GAP DETECTED: RecordId jumped from $($events[$i-1].RecordId) to $($events[$i].RecordId)"
        Write-Output "  Missing events: $($gap - 1)"
        Write-Output "  Time of gap: $($events[$i-1].TimeCreated) → $($events[$i].TimeCreated)"
    }
}

Part 2. Audit Policy: The Silent Misconfiguration

Before a single event travels anywhere, it must first be generated. Audit policy controls what the Security Reference Monitor (the kernel component that enforces security policy) actually logs. This is where the majority of defensive coverage gaps originate: not in the collection pipeline, but in the policy that controls whether events are generated at all.

2.1 Legacy vs. Advanced Audit Policy: The Conflict That Silently Disables Your Logging

Windows has two audit policy systems that can conflict:

System	Location	Granularity	Subcategories
Legacy Audit Policy	secpol.msc → Local Policies → Audit Policy	9 top-level categories	None
Advanced Audit Policy	secpol.msc → Advanced Audit Policy Configuration	10 categories, 58 subcategories	Full control

The critical, frequently unknown behavior: if both are configured, legacy policy wins by default and silently overrides advanced policy subcategories.

Example of the conflict:

The fix is one GPO setting that most organizations are missing:

GPO Path: Computer Configuration → Windows Settings → Security Settings →
          Local Policies → Security Options

Setting: "Audit: Force audit policy subcategory settings (Windows Vista or later)
          to override audit policy category settings"

Value: ENABLED

Without this setting enabled, any legacy audit policy in the GPO hierarchy silently defeats your advanced policy subcategories. You will see events being generated (because the legacy category is enabled), but you will lose the subcategory filtering that gives you specific, high-value event IDs.

2.2 Reading Your Actual Effective Audit Policy (Not What You Configured)

The GPO editor shows what you configured. auditpol.exe shows what is actually in effect on a given machine. These are often different.

:: View the complete effective audit policy (all 58 subcategories)
:: Run on a DC, critical server, or workstation you want to verify
auditpol /get /category:*

:: Sample output (showing common gap areas):
:: System audit policy
:: Category/Subcategory                      Setting
::
:: Account Logon
::   Credential Validation                   No Auditing    ← PROBLEM: logons not logged
::   Kerberos Authentication Service         Success        ← OK
::   Kerberos Service Ticket Operations      Success        ← Missing Failure events
::   Other Account Logon Events              No Auditing    ← PROBLEM
::
:: Logon/Logoff
::   Logon                                   Success and Failure
::   Logoff                                  Success
::   Account Lockout                         Success
::   Special Logon                           No Auditing    ← PROBLEM: admin logons missed
::   Other Logon/Logoff Events               No Auditing    ← PROBLEM
::
:: Object Access
::   File System                             No Auditing    ← May be intentional (too noisy)
::   Registry                                No Auditing
::   SAM                                     No Auditing    ← PROBLEM on DCs
::   Certification Services                  No Auditing    ← ADCS attacks invisible
::   Detailed File Share                     No Auditing
::   File Share                              No Auditing    ← Lateral movement via shares
::
:: Privilege Use
::   Sensitive Privilege Use                 No Auditing    ← PROBLEM: SeDebugPrivilege, etc.
::   Non Sensitive Privilege Use             No Auditing    ← Usually intentional (noisy)

Scripted audit across your fleet:

# Collect audit policy from multiple remote machines and compare against baseline
$targetHosts = @("DC01", "DC02", "SERVER01", "WSADMIN01")
$results = @()

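foreach ($computer in $targetHosts) {   # $host is a reserved automatic variable
    try {
        $output = Invoke-Command -ComputerName $computer -ScriptBlock {
            # /r emits CSV with the columns: Machine Name, Policy Target,
            # Subcategory, Subcategory GUID, Inclusion Setting, Exclusion Setting
            auditpol /get /category:* /r |
                Where-Object { $_ } |      # Drop blank lines before parsing
                ConvertFrom-Csv
        } -ErrorAction Stop

        foreach ($row in $output) {
            $results += [PSCustomObject]@{
                ComputerName  = $computer
                Category      = $row.Subcategory
                Setting       = $row.'Inclusion Setting'
            }
        }
    } catch {
        Write-Warning "Failed to query ${computer}: $_"
    }
}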

# Find hosts where "Credential Validation" is NOT audited
$results | Where-Object {
    $_.Category -like "*Credential Validation*" -and
    $_.Setting -eq "No Auditing"
} | Select-Object ComputerName, Category, Setting

# Export full comparison
$results | Export-Csv "audit_policy_fleet.csv" -NoTypeInformation
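
Once the CSV is exported, comparing it against a baseline is a simple lookup. The sketch below is illustrative Python under stated assumptions: the BASELINE dictionary is a hypothetical example rather than a recommended standard, the GUID column in the sample is a placeholder, and the column names follow the auditpol /r header (Subcategory, Inclusion Setting).

```python
import csv
import io

# Hypothetical baseline: subcategory -> minimum required setting.
BASELINE = {
    "Credential Validation": "Success and Failure",
    "Process Creation": "Success",
    "Special Logon": "Success",
}

def satisfies(actual: str, required: str) -> bool:
    """'Success and Failure' covers any requirement; otherwise require an exact match."""
    return actual == required or actual == "Success and Failure"

def audit_gaps(csv_text: str, baseline: dict) -> list:
    """Return (subcategory, actual, required) for every baseline entry not met."""
    rows = csv.DictReader(io.StringIO(csv_text.strip()))
    actual = {r["Subcategory"].strip(): r["Inclusion Setting"].strip() for r in rows}
    gaps = []
    for subcategory, required in baseline.items():
        setting = actual.get(subcategory, "MISSING")
        if not satisfies(setting, required):
            gaps.append((subcategory, setting, required))
    return gaps

# Sample shaped like `auditpol /get /category:* /r` output; GUIDs are placeholders.
SAMPLE = """
Machine Name,Policy Target,Subcategory,Subcategory GUID,Inclusion Setting,Exclusion Setting
DC01,System,Credential Validation,{placeholder-guid},No Auditing,
DC01,System,Process Creation,{placeholder-guid},Success,
DC01,System,Special Logon,{placeholder-guid},Success and Failure,
"""

for gap in audit_gaps(SAMPLE, BASELINE):
    print("GAP:", gap)
```

Keeping the baseline as data rather than code means the same check can run per host role (DC, server, workstation) with a different dictionary.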

2.3 The Subcategories That Must Be Enabled (And Why)

The following table maps the subcategories most critical for detection to the specific attack techniques they cover. This is the minimum baseline for a detection-capable environment:

Subcategory	Event IDs	Covers	Default State
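Credential Validation	4776	NTLM credential validation	❌ Disabled on many systems
Kerberos Authentication Service	4768, 4771	Kerberos TGT requests, pre-auth failure	⚠ Verify Failure is audited (4771 is a failure event)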
Kerberos Service Ticket Operations	4769	Kerberoasting, silver ticket	⚠ Success only (miss failures)
Process Creation	4688	All process executions	❌ Disabled by default
Process Termination	4689	Timeline reconstruction	❌ Disabled by default
DPAPI Activity	4693, 4694	Credential decryption by malware	❌ Disabled by default
Special Logon	4672	Admin-equivalent logon (SeDebug, etc.)	❌ Disabled on many systems
Sensitive Privilege Use	4673, 4674	Privilege escalation evidence	❌ Disabled by default
Security Group Management	4728, 4732, 4756	Group membership changes	✅ Enabled on DCs
Directory Service Access	4661, 4662	DCSync, object access on AD	⚠ Often disabled (high volume)
Directory Service Changes	5136, 5137, 5141	AD object creation/modification	⚠ Sometimes disabled
Audit Policy Change	4719	Someone changing audit policy	⚠ Often disabled
Filtering Platform Connection	5156, 5158	Network connections per process	❌ Disabled (extremely noisy)
Other Object Access Events	4698, 4700, 4702	Scheduled task creation	❌ Disabled on many systems

Critical: enabling Process Creation (4688) with command-line logging

Event 4688 logs process creation, but without an additional registry setting, the command line is NOT included, making the event largely useless for detecting LOLBin abuse, PowerShell attacks, or anything that relies on command-line arguments:

# Enable command-line logging in process creation events (4688)
# This must be set SEPARATELY from the audit policy subcategory
$registryPath = "HKLM:\Software\Microsoft\Windows\CurrentVersion\Policies\System\Audit"
if (-not (Test-Path $registryPath)) {
    New-Item -Path $registryPath -Force | Out-Null
}
Set-ItemProperty -Path $registryPath `
    -Name "ProcessCreationIncludeCmdLine_Enabled" `
    -Value 1 -Type DWord

# Verify the setting applied:
Get-ItemProperty -Path $registryPath -Name "ProcessCreationIncludeCmdLine_Enabled"

Without this registry value, you will see 4688 events whose CommandLine field is empty. Every rule you write for detecting powershell -enc, certutil -urlcache, or wmic abuse will silently never fire.

Part 3. Log Size: The Most Common Cause of Overwriting

The default log sizes for Windows security channels are laughably inadequate for enterprise environments with active security audit policies:

Log Channel	Windows Default Max Size	Events Per Day (busy DC)	Retention at Default
Security	20 MB	500,000–2,000,000+	< 1 hour
System	20 MB	10,000–50,000	8–24 hours
Application	20 MB	5,000–20,000	1–3 days
PowerShell/Operational	15 MB	20,000–200,000	1–4 hours
Sysmon/Operational	20 MB	200,000–1,000,000+	Minutes

A busy domain controller generating 1 million Security events per day (about 700 events per minute) will overwrite its 20 MB Security log roughly every 45 minutes, assuming an average on-disk event size of about 650 bytes.
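
A retention estimate of this kind is plain arithmetic: log capacity in events divided by the event rate. The sketch below is illustrative Python; the 650-byte average event size is an assumption chosen for the example, so measure your own by dividing a log's file size by its record count.

```python
def retention_minutes(max_log_bytes: int, events_per_day: int, avg_event_bytes: int) -> float:
    """Minutes of history a log holds before the oldest events are overwritten."""
    capacity_events = max_log_bytes / avg_event_bytes
    events_per_minute = events_per_day / (24 * 60)
    return capacity_events / events_per_minute

def required_log_bytes(events_per_day: int, avg_event_bytes: int, target_hours: float) -> int:
    """Log size needed to retain target_hours of events at the given rate."""
    return int(events_per_day * (target_hours / 24) * avg_event_bytes)

ASSUMED_EVENT_BYTES = 650  # assumption for illustration, not a measured value

minutes = retention_minutes(20 * 1024 * 1024, 1_000_000, ASSUMED_EVENT_BYTES)
print(f"20 MB log at 1M events/day retains about {minutes:.0f} minutes")

needed = required_log_bytes(1_000_000, ASSUMED_EVENT_BYTES, target_hours=72)
print(f"72 hours of retention needs about {needed / 1024**3:.2f} GiB")
```

Working backwards from a retention target (for example, long enough to survive a weekend collector outage) is usually a better way to pick a log size than choosing a round number.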

3.1 Setting Appropriate Log Sizes

:: Set Security log to 4GB (recommended for DCs with active audit policies)
wevtutil sl Security /ms:4294967296

:: Set Sysmon operational log to 2GB
wevtutil sl Microsoft-Windows-Sysmon/Operational /ms:2147483648

:: Set PowerShell operational log to 1GB
wevtutil sl Microsoft-Windows-PowerShell/Operational /ms:1073741824

:: Set Application log to 500MB
wevtutil sl Application /ms:524288000

:: Set System log to 500MB
wevtutil sl System /ms:524288000

:: Verify the change took effect:
wevtutil gl Security
:: Output includes:
:: maxSize: 4294967296
:: retention: false    ← "false" = overwrite as needed (correct setting)
:: autoBackup: false

Deploying via GPO (the right way to do this at scale):

GPO Path: Computer Configuration → Administrative Templates → 
          Windows Components → Event Log Service → Security

Setting: "Specify the maximum log file size (KB)"
Value: 4194304   (= 4GB for DCs)
       1048576   (= 1GB for servers)
       512000    (= 500MB for workstations)

Setting: "Control Event Log behavior when the log file reaches its maximum size"
Value: NOT configured (leave default overwrite behavior)
       [Do NOT set "Do not overwrite events" unless you have extremely fast collection]

3.2 Checking Current Log Status Across Your Fleet

# Inventory log sizes, fill percentage, and oldest retained event across hosts
$hosts = @("DC01", "DC02", "SERVER01", "SERVER02")
$logNames = @("Security", "System", "Microsoft-Windows-Sysmon/Operational",
              "Microsoft-Windows-PowerShell/Operational")
$report = @()

foreach ($computer in $hosts) {
    foreach ($logName in $logNames) {
        try {
            $log = Invoke-Command -ComputerName $computer -ScriptBlock {
                param($ln)
                $l = Get-WinEvent -ListLog $ln -ErrorAction SilentlyContinue
                if ($l) {
                    [PSCustomObject]@{
                        LogName       = $l.LogName
                        MaxSizeMB     = [math]::Round($l.MaximumSizeInBytes / 1MB, 1)
                        CurrentSizeMB = [math]::Round($l.FileSize / 1MB, 1)
                        FillPct       = [math]::Round(($l.FileSize / $l.MaximumSizeInBytes) * 100, 1)
                        RecordCount   = $l.RecordCount
                        IsEnabled     = $l.IsEnabled
                        OldestRecord  = if ($l.RecordCount -gt 0) {
                            (Get-WinEvent -LogName $ln -MaxEvents 1 -Oldest -ErrorAction SilentlyContinue).TimeCreated
                        } else { $null }
                    }
                }
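            } -ArgumentList $logName -ErrorAction Stop

            if ($log) {
                $log | Add-Member -NotePropertyName ComputerName -NotePropertyValue $computer
                $report += $log
            }
        } catch {
            # Surface unreachable hosts instead of silently skipping them
            Write-Warning "Could not query $logName on ${computer}: $_"
        }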
    }
}

# Flag any log retaining less than 24 hours of events
$report | Where-Object {
    $_.OldestRecord -and
    ((Get-Date) - $_.OldestRecord).TotalHours -lt 24
} | Select-Object ComputerName, LogName, MaxSizeMB, FillPct, OldestRecord |
    Format-Table -AutoSize

# Export full report
$report | Export-Csv "log_inventory.csv" -NoTypeInformation

Part 4. Windows Event Forwarding: The Pipeline That Silently Drops Events

For organizations using WEF/WEC rather than, or in addition to, a SIEM agent, the forwarding pipeline introduces additional failure modes that are largely invisible without explicit monitoring.

4.1 WEF Architecture and the Subscription Model

WEF uses WinRM (port 5985 HTTP / 5986 HTTPS) to transport events from source machines to a Windows Event Collector (WEC) server. The flow:

The bookmark mechanism and how it fails:

WEC maintains a bookmark per source machine per subscription, tracking the last EventRecordID successfully forwarded. When a source reconnects after going offline, forwarding resumes from the bookmark. This sounds reliable. It has two critical failure modes:

The source's local log overwrote the bookmarked position. If the source was offline and its Security log overwrote itself before reconnecting, the WEC resumes from the bookmark, which no longer exists in the log. Events between the last bookmark and the current oldest record are silently lost. The WEC receives no notification that a gap exists.
The bookmark itself is in the WEC registry and can be corrupted. If the WEC server crashes or the registry becomes inconsistent, bookmarks reset, causing either duplicate or missed events.
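
The first failure mode comes down to one comparison: is the record that follows the bookmark still present in the source's log? Here is a minimal sketch in Python, for illustration only. check_bookmark and ResumeCheck are hypothetical names, and this is not how WEC itself is implemented; it only shows how you could quantify the loss if you record both values yourself.

```python
from dataclasses import dataclass

@dataclass
class ResumeCheck:
    lost_events: int   # records overwritten before they could be forwarded
    resume_from: int   # first record ID that can still be forwarded

def check_bookmark(bookmark_record_id: int, oldest_retained_id: int) -> ResumeCheck:
    """Compare the last forwarded record ID with the oldest record still on disk."""
    next_expected = bookmark_record_id + 1
    if oldest_retained_id <= next_expected:
        # The log still contains everything after the bookmark: no loss.
        return ResumeCheck(lost_events=0, resume_from=next_expected)
    # Everything from next_expected up to oldest_retained_id - 1 is gone.
    return ResumeCheck(
        lost_events=oldest_retained_id - next_expected,
        resume_from=oldest_retained_id,
    )

print(check_bookmark(bookmark_record_id=482_441, oldest_retained_id=489_209))
```

Because the collector is never told about the gap, a check like this has to run on, or query, the source side.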

Microsoft's own documentation acknowledges this explicitly:

"When the event log overwrites existing events (resulting in data loss if the device isn't connected to the Event Collector), there's no notification sent to the WEF collector that events are lost from the client. Neither is there an indicator that there was a gap encountered in the event stream."

4.2 The Three WEF Delivery Optimization Modes

WEF offers three delivery modes that trade latency for reliability. Most organizations leave the default, which is optimized for the wrong scenario:

:: View current subscription configuration
wecutil gs "BaselineSubscription"

:: The "ConfigurationMode" field selects the delivery mode; "DeliveryMaxLatencyTime" shows the resulting batching delay:
:: 
:: Normal     (default): 15 minutes delivery delay. Batches events.
::            Events buffered on source for up to 15 minutes.
::            During a 4-minute incident, you may see NO events in SIEM.
::
:: Minimize Latency:  30 seconds delivery delay.
::            Better for detection but higher WEC load.
::
:: Minimize Bandwidth: 6 hours delivery delay.
::            Clearly wrong for security use cases.

:: Set a subscription to Minimize Latency mode:
wecutil ss "BaselineSubscription" /cm:MinLatency

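:: Or set custom timing (deliver within 30 seconds, heartbeat every 60).
:: /dmlt is DeliveryMaxLatencyTime and /hi is HeartbeatInterval, both in milliseconds:
wecutil ss "BaselineSubscription" /cm:Custom /hi:60000 /dmlt:30000

:: Verify (findstr treats space-separated words as alternatives):
wecutil gs "BaselineSubscription" | findstr /i "latency heartbeat delivery"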

In Normal mode, a 15-minute incident can generate zero SIEM alerts because events haven't been forwarded yet. This is not a theoretical concern; it is a documented behavior that directly impacts mean time to detect.

4.3 WEC Server Capacity Limits and Drop Behavior

A WEC server on commodity hardware handles approximately 3,000 events per second on average across all subscriptions. This sounds like a lot. It is not, for a large enterprise.

Calculation: 1,000 workstations × 150 events/sec each at peak (logon storms, patch Tuesday, incident response) = 150,000 events/sec. A single WEC server will be saturated at ~2% of that load.
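
The same calculation can be turned into a rough sizing check. The sketch below is illustrative Python; the headroom factor is an assumption added here (it is not a Microsoft figure) to reflect that you rarely want collectors running at their full rated throughput.

```python
import math

def peak_events_per_second(sources: int, peak_eps_per_source: float) -> float:
    """Aggregate peak load if every source bursts at the same time."""
    return sources * peak_eps_per_source

def collectors_needed(sources: int, peak_eps_per_source: float,
                      collector_eps: float, headroom: float = 0.5) -> int:
    """Collectors required to absorb peak load while using only `headroom`
    (a fraction between 0 and 1) of each collector's rated throughput."""
    usable_eps = collector_eps * headroom
    return math.ceil(peak_events_per_second(sources, peak_eps_per_source) / usable_eps)

peak = peak_events_per_second(1_000, 150)
print(f"Peak load: {peak:,.0f} events/sec")
print(f"One 3,000 eps collector covers {3_000 / peak:.0%} of peak")
print(f"Collectors needed at 50% headroom: {collectors_needed(1_000, 150, 3_000)}")
```

Simultaneous worst-case bursts on every source are pessimistic, so treat the result as an upper bound and replace the per-source rate with figures measured in your own environment.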

When the WEC server exceeds capacity:

Monitor WEC health with these performance counters:

# Run on the WEC server
$counters = @(
    '\Event Tracing for Windows Session(EventLog-ForwardedEvents)\Events Lost',
    '\Event Tracing for Windows Session(EventLog-ForwardedEvents)\Events Logged per sec',
    '\Web Service(_Total)\Current Connections',
    '\Web Service(_Total)\Maximum Connections',
    '\Processor(_Total)\% Processor Time',
    '\Memory\Available MBytes'
)

# Continuous monitoring with 10-second samples
Get-Counter -Counter $counters -SampleInterval 10 -MaxSamples 60 |
    Select-Object -ExpandProperty CounterSamples |
    Select-Object Path, CookedValue, Timestamp |
    Format-Table -AutoSize

# Watch specifically for the Events Lost counter: any non-zero value is critical
Get-Counter '\Event Tracing for Windows Session(EventLog-ForwardedEvents)\Events Lost' `
    -SampleInterval 5 -MaxSamples 12 |
    Select-Object -ExpandProperty CounterSamples |
    Where-Object { $_.CookedValue -gt 0 } |
    ForEach-Object { Write-Warning "EVENTS LOST at $($_.Timestamp): $($_.CookedValue)" }

4.4 XPath Subscription Filters: The Gaps You Introduced Intentionally

WEF subscriptions use XPath queries to filter which events are forwarded. These queries are powerful but error-prone. A syntax mistake or logic error in an XPath filter silently excludes events with no error message.

Example of a broken XPath filter that silently misses events:

<!-- BROKEN: This filter tries to catch Event ID 4688 AND 4624
     but the XPath is semantically wrong and will not match anything -->
<Query Id="0" Path="Security">
  <Select Path="Security">
    *[System[(EventID=4688)]] AND *[System[(EventID=4624)]]
  </Select>
</Query>

<!-- CORRECT: Use separate Select elements or proper XPath OR syntax -->
<Query Id="0" Path="Security">
  <Select Path="Security">
    *[System[(EventID=4688 or EventID=4624)]]
  </Select>
</Query>

Validate your XPath filters before deployment:

# Test an XPath filter against local logs before putting it in a subscription
# This reveals whether the filter syntax is correct and returns events
$xpath = "*[System[(EventID=4688 or EventID=4624 or EventID=4625)]]"
$logName = "Security"

try {
    $events = Get-WinEvent -LogName $logName -FilterXPath $xpath -MaxEvents 10 -ErrorAction Stop
    Write-Host "XPath filter valid. Matched $($events.Count) recent events."
    $events | Select-Object TimeCreated, Id, Message | Format-Table -AutoSize
} catch [System.Exception] {
    Write-Error "XPath filter INVALID or no matching events: $_"
}

# Also validate that key event IDs ARE present in the log at all
# (if they're not, the audit policy isn't generating them)
$criticalEventIDs = @(4688, 4624, 4625, 4672, 4698, 4719, 4776)
foreach ($id in $criticalEventIDs) {
    $count = @(Get-WinEvent -LogName Security -FilterXPath "*[System[EventID=$id]]" `
              -MaxEvents 1000 -ErrorAction SilentlyContinue).Count
    $status = if ($count -gt 0) { "✓ Present ($count found, capped at 1000)" } else { "⚠ ABSENT: check audit policy" }
    Write-Host "Event ID $id : $status"
}
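
Some filter mistakes can be caught before a subscription is ever deployed, with a static check on the query XML. The sketch below is illustrative Python with a hypothetical helper, lint_subscription_query; it covers only the pitfalls discussed in this section (uppercase operators, two EventID tests joined with "and", unbalanced brackets) and is no substitute for testing the filter against a real log.

```python
import re
import xml.etree.ElementTree as ET

# Two separate EventID equality tests joined by a top-level "and".
CONTRADICTORY_IDS = re.compile(
    r"EventID\s*=\s*\d+\)?\]\]\s+and\s+\*\[System\[\(?EventID\s*=\s*\d+",
    re.IGNORECASE,
)

def lint_subscription_query(query_xml: str) -> list:
    """Return a list of human-readable problems found in a <Query> element."""
    try:
        root = ET.fromstring(query_xml)
    except ET.ParseError as exc:
        return [f"XML is not well-formed: {exc}"]

    problems = []
    for select in root.iter("Select"):
        text = (select.text or "").strip()
        if not text:
            problems.append("empty Select element")
            continue
        if re.search(r"\b(AND|OR|NOT)\b", text):
            problems.append("uppercase boolean operator (XPath operators are lowercase)")
        if CONTRADICTORY_IDS.search(text):
            problems.append("two EventID tests joined with 'and' (one event has one EventID)")
        if text.count("[") != text.count("]") or text.count("(") != text.count(")"):
            problems.append("unbalanced brackets or parentheses")
    return problems

BROKEN = """<Query Id="0" Path="Security">
  <Select Path="Security">*[System[(EventID=4688)]] AND *[System[(EventID=4624)]]</Select>
</Query>"""

CORRECT = """<Query Id="0" Path="Security">
  <Select Path="Security">*[System[(EventID=4688 or EventID=4624)]]</Select>
</Query>"""

print(lint_subscription_query(BROKEN))
print(lint_subscription_query(CORRECT))
```

A lint like this fits naturally in whatever review step you already use for subscription changes, with the Get-WinEvent test above as the second gate.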

Part 5. The SIEM Agent Layer: Hidden Drop Points

SIEM agents (Splunk Universal Forwarder, Elastic Agent, Microsoft Monitoring Agent, etc.) introduce their own failure modes. These are frequently overlooked because the agent is "running" and heartbeating to the SIEM, even while dropping events.

5.1 The Bookmark Race Condition

SIEM agents reading .evtx files maintain a local bookmark (position marker) in the file they are reading. The agent reads from the bookmark forward, ships events, and updates the bookmark. The race condition:

The fix is twofold: make the log large enough that it doesn't wrap during the agent's read cycle, and ensure the agent's batch processing interval is short enough relative to the event generation rate. For Splunk UF:

# inputs.conf: Splunk Universal Forwarder tuning for high-volume Security logs
# Splunk .conf files do not support trailing comments, so each comment is on its own line.
[WinEventLog://Security]
disabled = 0
start_from = oldest
current_only = 0
# Save the checkpoint (bookmark) every 5 seconds; confirm the default for your UF version
checkpointInterval = 5
# Events read per batch (tune down on busy DCs)
batch_size = 10
# Capture full XML for field extraction
renderXml = true
# Exclude logoff events if volume is too high
blacklist1 = EventCode="4634"
# Exclude handle requests (very noisy)
blacklist2 = EventCode="4656"

[WinEventLog://Microsoft-Windows-Sysmon/Operational]
disabled = 0
start_from = oldest
checkpointInterval = 5
batch_size = 20
renderXml = true

5.2 License-Cap Induced Dropping (The Invisible Budget Problem)

Many SIEM platforms enforce daily ingestion limits based on license volume. When the daily cap is hit:

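Splunk: Exceeding the daily quota raises a license warning, and enough warnings in the rolling window put the deployment in violation. Depending on license type and version, the penalty is typically blocked search rather than halted indexing; either way, detection stops working. A warning appears in the Splunk UI, but only if someone is watching.
Microsoft Sentinel: Ingestion continues, but per-GB pricing means cost spikes, sometimes triggering organizational decisions to cap ingestion, implemented via Data Collection Rules that silently filter events.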
Elastic: License limits restrict feature use, but ingest is less commonly hard-capped.

Check your Splunk license usage:

| rest /services/licenser/pools
| table title, used_bytes, effective_quota, slave_count
| eval used_GB = round(used_bytes/1073741824, 2)
| eval quota_GB = round(effective_quota/1073741824, 2)
| eval pct_used = round((used_bytes/effective_quota)*100, 1)
| where pct_used > 80
| sort -pct_used

Check for indexing gaps in Splunk (license exceeded periods):

index=_internal source=*license_usage.log type=Usage
| timechart span=1h sum(b) as bytes_indexed
| fillnull value=0 bytes_indexed
| eval GB_indexed = round(bytes_indexed/1073741824, 2)
| where bytes_indexed = 0

Part 6. How to Actually Measure Your Collection Fidelity

Everything above describes where things go wrong. This section tells you how to measure whether they are going wrong in your environment, right now.

6.1 The EventRecordID Continuity Test

The most direct measurement: compare the EventRecordID sequence seen in your SIEM against what the source machine has generated. Any gap = events you do not have.

# On the source machine: get the current highest EventRecordID and earliest retained
$securityLog = Get-WinEvent -LogName Security -MaxEvents 1
$oldestEvent = Get-WinEvent -LogName Security -MaxEvents 1 -Oldest

$sourceStats = [PSCustomObject]@{
    LatestRecordId  = $securityLog.RecordId
    OldestRecordId  = $oldestEvent.RecordId
    OldestTimestamp = $oldestEvent.TimeCreated
    TotalRetained   = $securityLog.RecordId - $oldestEvent.RecordId + 1
}

Write-Output "Source latest RecordId: $($sourceStats.LatestRecordId)"
Write-Output "Source oldest retained: $($sourceStats.OldestRecordId) at $($sourceStats.OldestTimestamp)"
Write-Output "Events retained locally: $($sourceStats.TotalRetained)"

Now check what your SIEM has for the same host:

index=wineventlog host="DC01" source="WinEventLog:Security"
| stats min(EventRecordID) as earliest_in_siem, 
        max(EventRecordID) as latest_in_siem,
        count as total_in_siem
        by host
| eval coverage_pct = round((total_in_siem / (latest_in_siem - earliest_in_siem + 1)) * 100, 2)

If coverage_pct is substantially below 100%, events in that ID range are missing from your SIEM. The delta between source TotalRetained and SIEM total_in_siem over the same period is your gap count.
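
If you export the EventRecordID values your SIEM holds for a host, the same comparison can be done offline, and it can also tell you which ranges are missing. The sketch below is illustrative Python with hypothetical helper names. It deduplicates IDs first, because re-delivered events would otherwise inflate a plain count.

```python
def missing_ranges(record_ids):
    """Return inclusive (first_missing, last_missing) ranges between observed IDs."""
    ids = sorted(set(record_ids))
    gaps = []
    for prev, cur in zip(ids, ids[1:]):
        if cur - prev > 1:
            gaps.append((prev + 1, cur - 1))
    return gaps

def coverage_pct(record_ids):
    """Percentage of the observed ID span that is actually present."""
    ids = set(record_ids)
    if not ids:
        return 0.0
    span = max(ids) - min(ids) + 1
    return round(len(ids) / span * 100, 2)

# Made-up sample: IDs as exported from the SIEM, with one duplicate.
siem_ids = [1, 2, 3, 3, 7, 8, 10]
print(coverage_pct(siem_ids), missing_ranges(siem_ids))
```

Note that this only sees gaps inside the span the SIEM already holds; events missing before the earliest or after the latest collected ID still require the comparison against the source's own oldest and latest RecordId.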

6.2 The Event Volume Baseline Method

A subtler but more scalable approach: establish a baseline of expected event volume per host per event type, then alert on deviations.

index=wineventlog source="WinEventLog:Security" EventCode=4688
| bin _time span=1h
| stats count by host, _time
| sort 0 host, _time
| streamstats window=168 current=f global=f avg(count) as avg_count by host
| eval pct_of_avg = round((count / avg_count) * 100, 0)
| where pct_of_avg < 50

More practically, for a KQL (Microsoft Sentinel) equivalent:

// Detect hosts reporting significantly fewer events than their 7-day average
// Indicator of agent failure, log overwrite acceleration, or active suppression
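let lookback = 7d;
let evaluationWindow = 1h;
// Compare against the last complete hour; the current hour is always partial
let lastFullHour = bin(ago(evaluationWindow), evaluationWindow);

SecurityEvent
| where TimeGenerated > ago(lookback)
| where EventID == 4688  // Process creation: high volume, good baseline indicator
| summarize 
    EventCount = count() 
    by Computer, HourBin = bin(TimeGenerated, evaluationWindow)
| where HourBin <= lastFullHour  // Drop the partial current hour
| summarize 
    AvgHourlyCount = avgif(EventCount, HourBin < lastFullHour),
    StdDev = stdevif(EventCount, HourBin < lastFullHour),
    LastHourCount = take_anyif(EventCount, HourBin == lastFullHour)
    by Computer
// A host with no events in the last hour has an empty LastHourCount; count it as zero
// so that completely silent hosts are flagged rather than filtered out
| extend LastHourCount = coalesce(LastHourCount, long(0))
| extend 
    DropThreshold = AvgHourlyCount * 0.5,  // Alert if below 50% of average
    PctOfAverage = round((LastHourCount / AvgHourlyCount) * 100, 1)
| where LastHourCount < DropThreshold
| where AvgHourlyCount > 10  // Exclude hosts with low baseline (too noisy)
| project Computer, AvgHourlyCount, LastHourCount, PctOfAverage, DropThreshold
| sort by PctOfAverage asc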

This query runs every hour. Any host reporting fewer than 50% of its normal process creation events triggers an alert. The root cause could be: the machine is off (expected), the agent crashed (fix it), the log is not being collected (configuration issue), or an attacker suppressed logging (respond immediately).
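
The thresholding logic itself is independent of the query language. The sketch below restates it in Python for illustration, with hypothetical names; the 50% ratio and the minimum baseline of 10 mirror the query above. The key design point is that a host absent from the latest hour is treated as reporting zero events, not skipped.

```python
from statistics import mean

def flag_quiet_hosts(history, last_hour, drop_ratio=0.5, min_baseline=10):
    """history: host -> list of past hourly counts.
    last_hour: host -> count for the latest complete hour (absent means 0).
    Returns (host, average, current, pct_of_average), quietest first."""
    flagged = []
    for host, counts in history.items():
        if not counts:
            continue
        avg = mean(counts)
        if avg <= min_baseline:
            continue  # baseline too low to judge reliably
        current = last_hour.get(host, 0)  # silent hosts count as zero
        if current < avg * drop_ratio:
            flagged.append((host, avg, current, round(current / avg * 100, 1)))
    return sorted(flagged, key=lambda row: row[3])

# Made-up sample data for illustration.
history = {"DC01": [100, 120, 80], "WS01": [40, 60, 50], "LAB": [2, 3, 1]}
last_hour = {"DC01": 95, "LAB": 0}  # WS01 sent nothing at all

for row in flag_quiet_hosts(history, last_hour):
    print(row)
```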

6.3 The Gold Standard: Synthetic Event Injection

The most reliable test: inject known events into a source machine and verify they appear in your SIEM with the correct fields within an expected time window. This is functionally equivalent to a canary test for your collection pipeline.

# On a test or production machine:
# Inject a synthetic event into the Application log with a unique identifier
# that you can search for in your SIEM

$uniqueMarker = "SIEM-FIDELITY-TEST-$(Get-Date -Format 'yyyyMMdd-HHmmss')-$(New-Guid)"

# Write a synthetic event using .NET EventLog class
$eventSource = "SIEMFidelityTest"
if (-not [System.Diagnostics.EventLog]::SourceExists($eventSource)) {
    [System.Diagnostics.EventLog]::CreateEventSource($eventSource, "Application")
}

$log = New-Object System.Diagnostics.EventLog("Application")
$log.Source = $eventSource
$log.WriteEntry($uniqueMarker, [System.Diagnostics.EventLogEntryType]::Information, 9999)

Write-Output "Injected marker: $uniqueMarker"
Write-Output "Now search your SIEM for this string within the next 5 minutes."
Write-Output "If absent after 10 minutes, the collection pipeline has a gap."

You can wrap this into a scheduled task that runs every 4 hours, writes a unique marker, and then a separate SIEM query checks for the marker's arrival within a 15-minute window. Missing markers = pipeline failure = automatic ticket.

SIEM search to validate the marker arrived (Splunk):

index=wineventlog OR index=windows EventCode=9999 source="WinEventLog:Application"
| where Message like "%SIEM-FIDELITY-TEST%"
| rex field=Message "SIEM-FIDELITY-TEST-(?<marker_id>[^\s]+)"
| eval latency_seconds = _indextime - strptime(substr(marker_id, 1, 15), "%Y%m%d-%H%M%S")
| table _time, host, marker_id, latency_seconds
| sort -_time

If latency_seconds is consistently over 900 (15 minutes), your collection pipeline is too slow for meaningful detection of fast-moving incidents.
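
The marker format makes latency easy to compute anywhere you can see both the message and its arrival time. The sketch below is illustrative Python; marker_latency_seconds is a hypothetical helper, and it assumes the marker timestamp and the arrival timestamp are expressed in the same time zone (the injection script above uses the source machine's local time).

```python
import re
from datetime import datetime

# Marker layout: SIEM-FIDELITY-TEST-<yyyyMMdd-HHmmss>-<36-character GUID>
MARKER_RE = re.compile(r"SIEM-FIDELITY-TEST-(\d{8}-\d{6})-([0-9a-fA-F-]{36})")

def marker_latency_seconds(message: str, indexed_at: datetime):
    """Seconds between marker creation and arrival, or None if no marker is found."""
    match = MARKER_RE.search(message)
    if match is None:
        return None
    written_at = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
    return (indexed_at - written_at).total_seconds()

# Made-up sample message and arrival time.
message = "SIEM-FIDELITY-TEST-20260418-101500-123e4567-e89b-12d3-a456-426614174000"
arrived = datetime(2026, 4, 18, 10, 17, 30)
print(marker_latency_seconds(message, arrived))
```

Measuring against arrival (index) time rather than the time of the search matters: a marker that arrived in 30 seconds but is searched for an hour later should not be reported as an hour late.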

6.4 Checking WEF Subscription Health

# On the WEC server: view health of all subscriptions and their sources
wecutil es  # List all subscriptions

# For each subscription, check the runtime status of all enrolled sources
$subscriptions = wecutil es
foreach ($sub in $subscriptions) {
    Write-Host "`n=== Subscription: $sub ===" -ForegroundColor Cyan
    
    # Get full subscription config
    wecutil gs "$sub" | Select-String -Pattern "Name|Status|Enabled|Uri"
    
    # Get per-source runtime status
    wecutil gr "$sub" | ForEach-Object {
        if ($_ -match "Source|LastError|NextRetry|LastHeartbeat") {
            if ($_ -match "LastError" -and $_ -notmatch "LastError:\s*(0x)?0\s*$") {
                Write-Host $_ -ForegroundColor Red  # Non-zero error = problem
            } else {
                Write-Host $_
            }
        }
    }
}

Look for sources with LastError values other than 0x0. Common error codes and their meaning:

Error Code	Meaning	Action
0x0	OK	None needed
0x80070005	Access denied	Check WinRM configuration, DACL on subscription
0x80070776	Subscription not found	Re-apply GPO, restart WEC service
0x803300004	Connection refused	WinRM not running on source, firewall blocking 5985
0x803300005	Could not connect	DNS resolution failure, network issue
0x8033000f	No more endpoints	Source machine offline or unreachable
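Depending on the tool, the same error can show up as a hex HRESULT, an unsigned decimal, or a signed decimal, which makes table lookups awkward. The helper below is a minimal sketch that normalizes any of those forms to an 8-digit hex string. The function name is hypothetical, and whether your `wecutil` build prints decimal or hex is an assumption to verify on your own WEC server.

```powershell
# Sketch: normalize an error value to 0xXXXXXXXX for lookup in the table above.
function ConvertTo-HResultString {
    param([Parameter(Mandatory)] [string] $Value)
    $text = $Value.Trim()
    if ($text -match '^0x([0-9a-fA-F]{1,8})$') {
        $number = [Convert]::ToInt64($Matches[1], 16)
    } else {
        $number = [int64]::Parse($text, [System.Globalization.CultureInfo]::InvariantCulture)
    }
    # Keep only the low 32 bits so signed values map to the same HRESULT.
    '0x{0:X8}' -f ($number -band 4294967295)
}

ConvertTo-HResultString '-2147024891'   # signed decimal form of access denied
ConvertTo-HResultString '2147942405'    # unsigned decimal form of the same code
```

Both calls yield `0x80070005`, so one lookup key works regardless of how the value was printed.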

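# Find all WEF sources that haven't heartbeated in the last 2 hours
# These are machines with potential coverage gaps
# Assumes wecutil gr lists each source name on its own line under "EventSources:"
# (a "Source: <name>" form is also accepted), followed by its heartbeat line
$twoHoursAgo = (Get-Date).AddHours(-2)
$currentSource = $null
$inSources = $false

wecutil gr "BaselineSubscription" | ForEach-Object {
    $line = $_.Trim()
    if ($line -match '^EventSources:') {
        $inSources = $true
    } elseif ($line -match '^Source:\s*(.+)$') {
        $currentSource = $Matches[1]
    } elseif ($inSources -and $line -and $line -notmatch ':') {
        $currentSource = $line            # bare host name line
    } elseif ($line -match '^LastHeartbeat(Time)?:\s*(.+)$') {
        $heartbeatTime = [DateTime]::MinValue
        if ([DateTime]::TryParse($Matches[2], [ref]$heartbeatTime)) {
            if ($heartbeatTime -lt $twoHoursAgo) {
                Write-Warning "STALE: $currentSource last heartbeat: $heartbeatTime"
            }
        } else {
            # Values such as "N/A" mean the source has never checked in
            Write-Warning "NO HEARTBEAT: $currentSource reported '$($Matches[2])'"
        }
    }
}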

Part 7 Attackers Exploiting These Gaps: T1562.002

Everything above describes accidental gaps. Sophisticated attackers deliberately exploit them. MITRE ATT&CK T1562.002 (Impair Defenses: Disable Windows Event Logging) documents the specific techniques.

7.1 Disabling Audit Policy Mid-Attack

:: Attacker with local admin rights can disable specific audit subcategories
:: to suppress logging of their specific techniques

:: Disable process creation logging before running tools
auditpol /set /subcategory:"Process Creation" /success:disable /failure:disable

:: Disable logon event logging during lateral movement
auditpol /set /subcategory:"Logon" /success:disable

:: This generates Event ID 4719 (audit policy changed), but only if you're logging it
:: Most environments don't alert on 4719. Check yours:
auditpol /get /subcategory:"Audit Policy Change"

The defense: Alert on Event ID 4719 (system audit policy changed). This event is generated whenever auditpol modifies the local policy. It is one of the highest-fidelity indicators of active defense evasion, because audit policy almost never changes outside planned administrative work.

// KQL: alert on audit policy changes made by non-machine accounts
// SecurityEvent exposes these fields as columns, so no EventData parsing is needed
SecurityEvent
| where TimeGenerated > ago(24h)
| where EventID == 4719
| where SubjectUserName !endswith "$"  // Exclude machine accounts (GPO application)
| project TimeGenerated, Computer, SubjectUserName, SubjectLogonId, AuditPolicyChanges
| sort by TimeGenerated desc
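The 4719 alert only helps if the event reaches the SIEM, so it is worth pairing with a direct check of the effective policy on the host. The sketch below parses the CSV that `auditpol /get /category:* /r` emits and lists required subcategories that are missing or set to "No Auditing". It assumes English-language column and setting names; the function name and the default list of required subcategories are illustrative, not a recommended baseline.

```powershell
# Sketch: report required audit subcategories that are not being audited.
# Input is the raw line output of: auditpol /get /category:* /r
function Get-DisabledAuditSubcategory {
    param(
        [string[]] $AuditpolCsv,
        [string[]] $Required = @('Process Creation', 'Logon', 'Audit Policy Change')
    )
    # Drop blank lines, then let the first remaining line act as the CSV header.
    $rows = $AuditpolCsv | Where-Object { $_.Trim() } | ConvertFrom-Csv
    foreach ($name in $Required) {
        $row = $rows | Where-Object { $_.Subcategory -eq $name } | Select-Object -First 1
        if (-not $row -or $row.'Inclusion Setting' -eq 'No Auditing') {
            $name   # missing from the output, or explicitly disabled
        }
    }
}

# Usage on a live host (requires elevation):
# Get-DisabledAuditSubcategory -AuditpolCsv (auditpol /get /category:* /r)
```

Run on a schedule, any non-empty result is a signal that logging was disabled, whether by an attacker or by a policy conflict.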

7.2 Clearing the Event Log

# Attacker clears the Security log to destroy evidence
wevtutil cl Security   # Generates Event 1102 (audit log cleared)
# OR
Clear-EventLog -LogName Security  # Same result

# Remove-EventLog is more destructive: it removes the channel entirely
Remove-EventLog -LogName Security
# Does NOT generate 1102: the channel is gone before the event can be written
# Generates 104 in System log (reported by the Eventlog service)

Detecting log clearing:

// Alert on Event 1102 (Security log cleared), a rare legitimate event
SecurityEvent
| where EventID == 1102
| project TimeGenerated, Computer, Account, SubjectLogonId
| sort by TimeGenerated desc

// Also alert on Event 104 (System log), which the Eventlog service writes when a log file is cleared
Event
| where EventLog == "System" and EventID == 104
| project TimeGenerated, Computer, RenderedDescription

7.3 ETW Provider Tampering (Advanced)

A sophisticated attacker can tamper with ETW in memory, suppressing events at the source without triggering log-clearing events:

Technique: Patch the ETW write function (commonly ntdll!EtwEventWrite) in the
target process's memory so that it returns early, silently suppressing the
events that process would emit without any Event ID 1102, 4719, or 104 appearing.

Detection: 
- Compare expected vs. actual event volumes (Section 6.2)
- Monitor for Sysmon Event ID 1 (process creation) with known ETW-patching
  tool signatures in CommandLine field
- Check ETW session buffer loss counters (Section 1.2)
- Synthetic event injection will catch this (Section 6.3)

There is no single event that fires when ETW is patched in memory. Volume-based detection and synthetic injection are the only reliable detections.
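The volume comparison reduces to a ratio test against a baseline. A minimal sketch follows, assuming you can export per-host hourly event counts from your SIEM. The function name and the 50% threshold are hypothetical starting points, not tuned values.

```powershell
# Sketch: flag a host whose current event volume has dropped well below its baseline.
function Test-VolumeDrop {
    param(
        [Parameter(Mandatory)] [double[]] $BaselineCounts,  # e.g. same hour on each of the last 7 days
        [Parameter(Mandatory)] [double]   $CurrentCount,
        [double] $MinRatio = 0.5                            # hypothetical: flag below 50% of baseline
    )
    $average = ($BaselineCounts | Measure-Object -Average).Average
    if ($average -le 0) { return $false }   # no usable baseline, nothing to compare against
    ($CurrentCount / $average) -lt $MinRatio
}

# Example: a host that normally produces about 1,000 events per hour reports 200
Test-VolumeDrop -BaselineCounts 1000, 1100, 900 -CurrentCount 200
```

A per-provider baseline (for example, Sysmon separately from Security) is more sensitive than a per-host total, because patching one provider may barely move the overall count.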

Part 8 The Hardening Roadmap: Fix It This Week

Priority 1 (Do This Today)

# 1. Verify the audit policy override flag is set on all DCs and critical servers
# Expected: "Audit: Force audit policy..." = Enabled (exported as ...SCENoApplyLegacyAuditPolicy=4,1)
Invoke-Command -ComputerName "DC01","DC02","SERVER01" -ScriptBlock {
    $cfg = Join-Path $env:TEMP "secpol.cfg"
    secedit /export /cfg $cfg /quiet | Out-Null
    # No output for a host usually means the setting is not defined there
    Select-String -Path $cfg `
        -Pattern "MACHINE\\System\\CurrentControlSet\\Control\\Lsa\\SCENoApplyLegacyAuditPolicy"
    Remove-Item $cfg -ErrorAction SilentlyContinue
}

# 2. Check that process creation (4688) IS generating events on at least one DC
$recent4688 = Get-WinEvent -ComputerName "DC01" -LogName Security `
    -FilterXPath "*[System[EventID=4688 and TimeCreated[timediff(@SystemTime) <= 3600000]]]" `
    -MaxEvents 5 -ErrorAction SilentlyContinue
if (-not $recent4688) {
    Write-Warning "No 4688 events in last hour on DC01: audit policy not configured correctly"
}

# 3. Check command-line logging is enabled
$cmdLineSetting = Invoke-Command -ComputerName "DC01" -ScriptBlock {
    $path = "HKLM:\Software\Microsoft\Windows\CurrentVersion\Policies\System\Audit"
    (Get-ItemProperty -Path $path -Name "ProcessCreationIncludeCmdLine_Enabled" -EA SilentlyContinue).ProcessCreationIncludeCmdLine_Enabled
}
if ($cmdLineSetting -ne 1) {
    Write-Warning "Command-line logging NOT enabled on DC01: all 4688 events have empty CommandLine"
}

Priority 2 (This Week)

# Resize Security log on all DCs to 4GB
$dcs = (Get-ADDomainController -Filter *).Name
foreach ($dc in $dcs) {
    Invoke-Command -ComputerName $dc -ScriptBlock {
        wevtutil sl Security /ms:4294967296        # 4GB
        wevtutil sl Microsoft-Windows-Sysmon/Operational /ms:2147483648  # 2GB
        wevtutil sl Microsoft-Windows-PowerShell/Operational /ms:1073741824  # 1GB
        Write-Output "$env:COMPUTERNAME log sizes updated"
    }
}
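A Group Policy setting for log size can override a local `wevtutil sl` change, so it is worth reading the value back after resizing. The sketch below pulls `maxSize` out of `wevtutil gl` output; the function name is hypothetical, and it assumes the `maxSize: <bytes>` line format that `wevtutil gl` prints under `logging:`.

```powershell
# Sketch: read the configured maximum size (bytes) from `wevtutil gl <log>` output.
function Get-MaxSizeFromWevtutil {
    param([string[]] $WevtutilOutput)
    foreach ($line in $WevtutilOutput) {
        if ($line -match '^\s*maxSize:\s*(\d+)\s*$') {
            return [int64]$Matches[1]
        }
    }
    return $null   # no maxSize line found
}

# Usage on a live host:
# $actual = Get-MaxSizeFromWevtutil (wevtutil gl Security)
# if ($actual -lt 4GB) { Write-Warning "Security log is $actual bytes, expected 4GB" }
```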

Priority 3 (This Month)

Deploy the synthetic event injection test as a scheduled task on 10 representative hosts (DCs, critical servers, sample workstations). Run it every 4 hours and alert in the SIEM if any marker is absent after 15 minutes. This gives you continuous, automated validation of collection fidelity, which is the metric that turns this from a one-time audit into an ongoing operational control.

The Complete Gap Inventory: What to Check and How

Gap	Detection Method	Tool	Time to Verify
Audit policy not generating events	auditpol /get /category:*	auditpol.exe	5 min per host
Legacy/advanced policy conflict	Check for SCENoApplyLegacyAuditPolicy=0	secedit / registry	10 min
Command-line logging disabled	Registry check	PowerShell	2 min per host
Log sizes too small	wevtutil gl Security	wevtutil.exe	2 min per host
WEF subscription filter errors	Test XPath with Get-WinEvent -FilterXPath	PowerShell	15 min
WEC server dropping events	ETW Buffers Lost performance counter	Get-Counter	10 min
WEF delivery mode too slow	wecutil gs <subscription> DeliveryMaxLatency	wecutil.exe	5 min
Stale WEF sources	wecutil gr <subscription> LastHeartbeat	wecutil.exe	15 min
EventRecordID gaps in SIEM	Compare source RecordId vs. SIEM query	PowerShell + SIEM	30 min
Volume baseline deviation	SIEM query comparing last hour to 7-day avg	SIEM	Ongoing
Audit log cleared (1102)	Alert rule in SIEM	SIEM	Deploy now
Audit policy tampered (4719)	Alert rule in SIEM	SIEM	Deploy now
ETW tampering	Synthetic injection test	Scheduled PowerShell	Deploy weekly

References

Microsoft Learn: "Use Windows Event Forwarding to help with intrusion detection"
Palantir: windows-event-forwarding GitHub repository (production WEF architecture)
Elastic: "The Essentials of Central Log Collection with WEF/WEC"
MITRE ATT&CK T1562.002: Impair Defenses: Disable Windows Event Logging
MITRE ATT&CK T1070.001: Indicator Removal: Clear Windows Event Logs
Microsoft Learn: Event ID 1102 and 4719 documentation
NSA/CISA: "Windows Event Logging and Forwarding" (NSA-CSI-18-130)
Malware Archaeology: Windows Logging Cheat Sheet v2019
Roberto Rodriguez (Cyb3rWard0g): ThreatHunter-Playbook, ETW research
