Windows Event Log Architecture: Why Your SIEM Is Probably Missing 30% of Events and How to Verify It
An analyst flags a suspicious lateral movement alert. You pull the investigation timeline. There is a 47-minute gap in process creation events from a critical server right across the window where the attacker moved. The EDR shows nothing. The SIEM shows nothing. Post-incident forensics on the local machine reveals 6,800 events that never left the endpoint. The Security event log overwrote itself. The WEF subscription had a filter bug. The WEC server was under load. Nobody noticed because nobody measured. This scenario is not hypothetical. It is the most common root cause of detection gaps found during post-incident reviews, and it is almost entirely preventable.
Why This Matters More Than Any Detection Rule You'll Write
Security teams invest enormous effort writing detection rules, tuning Sigma, and expanding MITRE ATT&CK coverage. Those efforts are worthless if the underlying events never reach your SIEM.
The assumption baked into virtually every SIEM dashboard is that the event collection pipeline is working. That assumption is almost never tested, and when it fails, it fails silently. There is no alert for "we stopped receiving process creation events from this host." There is no dashboard tile that turns red when your WEC server starts dropping events under load. There is no automatic notification when a GPO conflict silently rolls back your advanced audit policy to defaults.
The result is what security engineers sometimes call coverage theater: you have the rules, you have the dashboards, you have the ATT&CK heatmap lit up, but underneath it all is a collection infrastructure with real gaps, and an attacker who understands Windows internals can work inside those gaps without ever triggering an alert.
This post goes from first principles (how Windows event logging actually works internally) through the specific failure modes that cause events to be lost, and ends with concrete tools and scripts you can run this week to measure your actual collection fidelity.
Part 1 - The Architecture: From Kernel Event to SIEM Record
Understanding where events can be lost requires understanding the full pipeline. Most practitioners know the high-level model. Few know the internals where things actually break.
1.1 Event Tracing for Windows (ETW): The Kernel Foundation
Every Windows event originates in Event Tracing for Windows (ETW), the low-level kernel subsystem that acts as the backbone for all Windows telemetry. ETW is not the same as the Windows Event Log. It is the underlying transport mechanism.
The full path from kernel to SIEM has ten distinct failure points across four layers. An event can be lost at any one of them, with no notification to the analyst on the other end.
1.2 The ETW Ring Buffer: Where Events Are Born and First Lost
ETW operates using in-memory ring buffers: circular memory regions that providers write events into. Consumers (including the Windows Event Log service) read from these buffers. When a buffer fills faster than consumers can drain it, new events overwrite old ones in memory before they are ever written to disk.
This is not the same as log overwriting (which happens on disk). ETW ring buffer overflow is silent, in-memory loss that leaves no trace of the dropped events, not even a gap in the EventRecordID sequence.
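To make the overflow behavior concrete, here is a toy model in Python that follows the overwrite description above. It is a sketch for intuition only, not ETW's actual buffer management; the function name and the parameters `capacity`, `produced_per_tick`, and `drained_per_tick` are illustrative assumptions.

```python
from collections import deque


def simulate_ring_buffer(capacity, produced_per_tick, drained_per_tick, ticks):
    """Toy model of a fixed-size buffer where new events overwrite unread ones.

    Returns (delivered, lost). A simplification for intuition only.
    """
    buffer = deque()
    next_id = 0
    delivered = 0
    lost = 0
    for _tick in range(ticks):
        # Producer writes events; when full, the oldest unread event is overwritten.
        for _event in range(produced_per_tick):
            if len(buffer) == capacity:
                buffer.popleft()
                lost += 1
            buffer.append(next_id)
            next_id += 1
        # Consumer drains up to its per-tick budget.
        for _read in range(min(drained_per_tick, len(buffer))):
            buffer.popleft()
            delivered += 1
    return delivered, lost
```

When the producer rate stays at or below the drain rate, nothing is lost. Once it exceeds the drain rate for long enough to fill the buffer, every additional event displaces one the consumer never saw, and the consumer has no way to tell.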
ETW buffer parameters are configurable but almost never tuned:
:: View current ETW session configuration for a specific session
logman query "EventLog-Security" -ets
:: Sample output:
:: Name: EventLog-Security
:: Status: Running
:: Root Path: %systemdrive%\PerfLogs\Admin
:: Segment: Off
:: Schedules: On
:: Segment Max Size: 100 MB
::
:: Name: EventLog-Security
:: Type: Trace
:: Append: Off
:: Circular: Off
:: Overwrite: Off
:: Buffer Size: 64 ← 64KB per buffer
:: Buffers Lost: 0 ← Watch this number
:: Buffers Written: 15432
:: Buffer Flush Timer: 1
:: Clock Type: System
:: File Mode: Real-time
The Buffers Lost counter is the key metric. If this is non-zero, events are being dropped in ETW before the Event Log service even sees them. Check this on domain controllers and high-activity servers:
# Confirm the Security channel is enabled and see how many records it currently holds
Get-WinEvent -ListLog Security | Select-Object LogName, RecordCount, IsEnabled
# ETW-level loss: sample the "Events Lost" counter for every active ETW session
# (look for the Security event log session; any non-zero value means drops)
$counterPaths = @(
'\Event Tracing for Windows Session(*)\Events Lost'
)
Get-Counter -Counter $counterPaths -SampleInterval 1 -MaxSamples 5
1.3 The EVTX File: Structure and How Overwrites Work
Windows event logs are stored as .evtx (XML Event Log) files in C:\Windows\System32\winevt\logs\. The format uses a chunked binary structure: a file header followed by fixed-size 64 KB chunks, each holding a run of event records.
When the log wraps, EventRecordIDs continue incrementing; they do not reset. This means you can detect overwrite gaps by looking for discontinuities in the EventRecordID sequence. A jump from EventRecordID 482,441 to 489,209 means 6,767 events were overwritten and are gone.
# Detect EventRecordID gaps that indicate log overwriting
# Run on a remote host or locally
$events = Get-WinEvent -LogName Security -MaxEvents 100 |
Select-Object RecordId, TimeCreated, Id |
Sort-Object RecordId
for ($i = 1; $i -lt $events.Count; $i++) {
$gap = $events[$i].RecordId - $events[$i-1].RecordId
if ($gap -gt 1) {
Write-Output "GAP DETECTED: RecordId jumped from $($events[$i-1].RecordId) to $($events[$i].RecordId)"
Write-Output " Missing events: $($gap - 1)"
Write-Output " Time of gap: $($events[$i-1].TimeCreated) → $($events[$i].TimeCreated)"
}
}
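The same gap check can be run offline against any list of record IDs, for example a column exported from your SIEM. The sketch below is a minimal Python version; `find_record_id_gaps` is a hypothetical helper name, and it assumes the IDs all come from one host and one channel.

```python
def find_record_id_gaps(record_ids):
    """Return (last_seen, next_seen, missing_count) for each break in the sequence."""
    ordered = sorted(set(record_ids))  # de-duplicate in case events were re-sent
    gaps = []
    for prev, curr in zip(ordered, ordered[1:]):
        if curr - prev > 1:
            gaps.append((prev, curr, curr - prev - 1))
    return gaps


# Usage: IDs pulled from a SIEM export for a single host and channel
for last_seen, next_seen, missing in find_record_id_gaps([482440, 482441, 489209, 489210]):
    print(f"Gap after {last_seen}: next is {next_seen}, {missing} events missing")
```

De-duplicating first matters because forwarders can resend events after a reconnect, and duplicates would otherwise be harmless but noisy.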
Part 2 - Audit Policy: The Silent Misconfiguration
Before a single event travels anywhere, it must first be generated. Audit policy controls what the Security Reference Monitor (the kernel component that enforces security policy) actually logs. This is where the majority of defensive coverage gaps originate: not in the collection pipeline, but in the policy that controls whether events are generated at all.
2.1 Legacy vs. Advanced Audit Policy: The Conflict That Silently Disables Your Logging
Windows has two audit policy systems that can conflict:
| System | Location | Granularity | Subcategories |
|---|---|---|---|
| Legacy Audit Policy | secpol.msc → Local Policies → Audit Policy | 9 top-level categories | None |
| Advanced Audit Policy | secpol.msc → Advanced Audit Policy Configuration | 10 categories, 58 subcategories | Full control |
The critical, frequently unknown behavior: if both are configured, legacy policy wins by default and silently overrides advanced policy subcategories.
Example of the conflict:
The fix is one GPO setting that most organizations are missing:
GPO Path: Computer Configuration → Windows Settings → Security Settings →
Local Policies → Security Options
Setting: "Audit: Force audit policy subcategory settings (Windows Vista or later)
to override audit policy category settings"
Value: ENABLED
Without this setting enabled, any legacy audit policy in the GPO hierarchy silently defeats your advanced policy subcategories. You will see events being generated (because the legacy category is enabled), but you will lose the subcategory filtering that gives you specific, high-value event IDs.
2.2 Reading Your Actual Effective Audit Policy (Not What You Configured)
The GPO editor shows what you configured. auditpol.exe shows what is actually in effect on a given machine. These are often different.
:: View the complete effective audit policy (all 58 subcategories)
:: Run on a DC, critical server, or workstation you want to verify
auditpol /get /category:*
:: Sample output (showing common gap areas):
:: System audit policy
:: Category/Subcategory Setting
::
:: Account Logon
:: Credential Validation No Auditing ← PROBLEM: logons not logged
:: Kerberos Authentication Service Success ← OK
:: Kerberos Service Ticket Operations Success ← Missing Failure events
:: Other Account Logon Events No Auditing ← PROBLEM
::
:: Logon/Logoff
:: Logon Success and Failure
:: Logoff Success
:: Account Lockout Success
:: Special Logon No Auditing ← PROBLEM: admin logons missed
:: Other Logon/Logoff Events No Auditing ← PROBLEM
::
:: Object Access
:: File System No Auditing ← May be intentional (too noisy)
:: Registry No Auditing
:: SAM No Auditing ← PROBLEM on DCs
:: Certification Services No Auditing ← ADCS attacks invisible
:: Detailed File Share No Auditing
:: File Share No Auditing ← Lateral movement via shares
::
:: Privilege Use
:: Sensitive Privilege Use No Auditing ← PROBLEM: SeDebugPrivilege, etc.
:: Non Sensitive Privilege Use No Auditing ← Usually intentional (noisy)
Scripted audit across your fleet:
# Collect audit policy from multiple remote machines and compare against baseline
$targetHosts = @("DC01", "DC02", "SERVER01", "WSADMIN01")
$results = @()
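# Note: $host is a reserved automatic variable in PowerShell, so the loop uses $computer
foreach ($computer in $targetHosts) {
    try {
        $output = Invoke-Command -ComputerName $computer -ScriptBlock {
            # /r emits CSV; drop blank lines before parsing
            auditpol /get /category:* /r | Where-Object { $_ } | ConvertFrom-Csv
        } -ErrorAction Stop
        foreach ($row in $output) {
            $results += [PSCustomObject]@{
                ComputerName = $computer
                # The CSV column holding the subcategory name is "Subcategory"
                Category     = $row.Subcategory
                Setting      = $row.'Inclusion Setting'
            }
        }
    } catch {
        Write-Warning "Failed to query ${computer}: $_"
    }
}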
# Find hosts where "Credential Validation" is NOT audited
$results | Where-Object {
$_.Category -like "*Credential Validation*" -and
$_.Setting -eq "No Auditing"
} | Select-Object ComputerName, Category, Setting
# Export full comparison
$results | Export-Csv "audit_policy_fleet.csv" -NoTypeInformation
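Once the CSV is collected, comparing it against a baseline is straightforward to do offline. The following Python sketch is one way to do it. It assumes the `auditpol /get /category:* /r` output has `Subcategory` and `Inclusion Setting` columns; the `BASELINE` dictionary is a hypothetical example, not a recommended policy, and the helper names are illustrative.

```python
import csv
import io

# Hypothetical minimum baseline: subcategory -> required inclusion setting.
BASELINE = {
    "Credential Validation": "Success and Failure",
    "Process Creation": "Success",
    "Special Logon": "Success",
}


def satisfies(actual, required):
    """True when the actual setting audits everything the required one does."""
    if actual == "Success and Failure":
        return True
    return actual == required


def audit_policy_deviations(csv_text, baseline=BASELINE):
    """Return (subcategory, required, found) for every baseline entry not met."""
    rows = csv.DictReader(io.StringIO(csv_text.strip()))
    actual = {r["Subcategory"]: r["Inclusion Setting"] for r in rows if r.get("Subcategory")}
    deviations = []
    for subcategory, required in baseline.items():
        # A subcategory missing from the output is treated as not audited.
        found = actual.get(subcategory, "No Auditing")
        if not satisfies(found, required):
            deviations.append((subcategory, required, found))
    return deviations
```

Treating a missing subcategory as "No Auditing" is a deliberately pessimistic choice: a parse failure then shows up as a deviation instead of passing silently.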
2.3 The Subcategories That Must Be Enabled (And Why)
The following table maps the subcategories most critical for detection to the specific attack techniques they cover. This is the minimum baseline for a detection-capable environment:
| Subcategory | Event IDs | Covers | Default State |
|---|---|---|---|
| Credential Validation | 4776 | NTLM credential validation | ❌ Disabled on many systems |
| Kerberos Authentication Service | 4768, 4771 | Kerberos TGT requests, pre-auth failure | ⚠ Often Success only |
| Kerberos Service Ticket Operations | 4769 | Kerberoasting, silver ticket | ⚠ Success only (miss failures) |
| Process Creation | 4688 | All process executions | ❌ Disabled by default |
| Process Termination | 4689 | Timeline reconstruction | ❌ Disabled by default |
| DPAPI Activity | 4693, 4694 | Credential decryption by malware | ❌ Disabled by default |
| Special Logon | 4672 | Admin-equivalent logon (SeDebug, etc.) | ❌ Disabled on many systems |
| Sensitive Privilege Use | 4673, 4674 | Privilege escalation evidence | ❌ Disabled by default |
| Security Group Management | 4728, 4732, 4756 | Group membership changes | ✅ Enabled on DCs |
| Directory Service Access | 4661, 4662 | DCSync, object access on AD | ⚠ Often disabled (high volume) |
| Directory Service Changes | 5136, 5137, 5141 | AD object creation/modification | ⚠ Sometimes disabled |
| Audit Policy Change | 4719 | Someone changing audit policy | ⚠ Often disabled |
| Filtering Platform Connection | 5156, 5158 | Network connections per process | ❌ Disabled (extremely noisy) |
| Other Object Access Events | 4698, 4700, 4702 | Scheduled task creation | ❌ Disabled on many systems |
Critical: enabling Process Creation (4688) with command-line logging
Event 4688 logs process creation, but without an additional registry setting, the command line is NOT included, making the event largely useless for detecting LOLBin abuse, PowerShell attacks, or anything that relies on command-line arguments:
# Enable command-line logging in process creation events (4688)
# This must be set SEPARATELY from the audit policy subcategory
$registryPath = "HKLM:\Software\Microsoft\Windows\CurrentVersion\Policies\System\Audit"
if (-not (Test-Path $registryPath)) {
New-Item -Path $registryPath -Force | Out-Null
}
Set-ItemProperty -Path $registryPath `
-Name "ProcessCreationIncludeCmdLine_Enabled" `
-Value 1 -Type DWord
# Verify the setting applied:
Get-ItemProperty -Path $registryPath -Name "ProcessCreationIncludeCmdLine_Enabled"
Without this registry value, you will see 4688 events with an empty command line (CommandLine: -). Every rule you write for detecting powershell -enc, certutil -urlcache, or wmic abuse will silently never fire.
Part 3 - Log Size: The Most Common Cause of Overwriting
The default log sizes for Windows security channels are laughably inadequate for enterprise environments with active security audit policies:
| Log Channel | Windows Default Max Size | Events Per Day (busy DC) | Retention at Default |
|---|---|---|---|
| Security | 20 MB | 500,000–2,000,000+ | < 1 hour |
| System | 20 MB | 10,000–50,000 | 8–24 hours |
| Application | 20 MB | 5,000–20,000 | 1–3 days |
| PowerShell/Operational | 15 MB | 20,000–200,000 | 1–4 hours |
| Sysmon/Operational | 20 MB | 200,000–1,000,000+ | Minutes |
A busy domain controller generating 1 million Security events per day (about 700 per minute) will overwrite its 20 MB Security log roughly every 30 to 45 minutes, assuming an average record size of 700 to 1,000 bytes.
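The retention arithmetic is simple enough to sketch in Python so you can plug in your own numbers. The function name is hypothetical, and the average event size is an assumption: real EVTX record sizes vary widely with event type and command-line length, so measure your own by dividing a log's file size by its record count.

```python
def hours_until_wrap(max_log_bytes, events_per_day, avg_event_bytes):
    """Estimate how long a log of a given size retains events before wrapping."""
    capacity_events = max_log_bytes / avg_event_bytes
    events_per_hour = events_per_day / 24
    return capacity_events / events_per_hour


# Usage: default 20 MB log vs. a 4 GB log at 1M events/day, ~700 bytes per event (assumed)
for label, size in (("20 MB", 20 * 1024**2), ("4 GB", 4 * 1024**3)):
    hours = hours_until_wrap(size, 1_000_000, 700)
    print(f"{label}: about {hours:.1f} hours of retention")
```

The estimate ignores file header and chunk overhead, so treat the result as an upper bound on retention.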
3.1 Setting Appropriate Log Sizes
:: Set Security log to 4GB (recommended for DCs with active audit policies)
wevtutil sl Security /ms:4294967296
:: Set Sysmon operational log to 2GB
wevtutil sl Microsoft-Windows-Sysmon/Operational /ms:2147483648
:: Set PowerShell operational log to 1GB
wevtutil sl Microsoft-Windows-PowerShell/Operational /ms:1073741824
:: Set Application log to 500MB
wevtutil sl Application /ms:524288000
:: Set System log to 500MB
wevtutil sl System /ms:524288000
:: Verify the change took effect:
wevtutil gl Security
:: Output includes:
:: maxSize: 4294967296
:: retention: false ← "false" = overwrite as needed (correct setting)
:: autoBackup: false
Deploying via GPO (the right way to do this at scale):
GPO Path: Computer Configuration → Administrative Templates →
Windows Components → Event Log Service → Security
Setting: "Specify the maximum log file size (KB)"
Value: 4194304 (= 4GB for DCs)
1048576 (= 1GB for servers)
512000 (= 500MB for workstations)
Setting: "Control Event Log behavior when the log file reaches its maximum size"
Value: NOT configured (leave default overwrite behavior)
[Do NOT set "Do not overwrite events" unless you have extremely fast collection]
3.2 Checking Current Log Status Across Your Fleet
# Inventory log sizes, fill percentage, and oldest retained event across hosts
$hosts = @("DC01", "DC02", "SERVER01", "SERVER02")
$logNames = @("Security", "System", "Microsoft-Windows-Sysmon/Operational",
"Microsoft-Windows-PowerShell/Operational")
$report = @()
foreach ($computer in $hosts) {
foreach ($logName in $logNames) {
try {
$log = Invoke-Command -ComputerName $computer -ScriptBlock {
param($ln)
$l = Get-WinEvent -ListLog $ln -ErrorAction SilentlyContinue
if ($l) {
[PSCustomObject]@{
LogName = $l.LogName
MaxSizeMB = [math]::Round($l.MaximumSizeInBytes / 1MB, 1)
CurrentSizeMB = [math]::Round($l.FileSize / 1MB, 1)
FillPct = [math]::Round(($l.FileSize / $l.MaximumSizeInBytes) * 100, 1)
RecordCount = $l.RecordCount
IsEnabled = $l.IsEnabled
OldestRecord = if ($l.RecordCount -gt 0) {
(Get-WinEvent -LogName $ln -MaxEvents 1 -Oldest -ErrorAction SilentlyContinue).TimeCreated
} else { $null }
}
}
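} -ArgumentList $logName -ErrorAction Stop
if ($log) {
$log | Add-Member -NotePropertyName ComputerName -NotePropertyValue $computer
$report += $log
}
} catch {
# Surface unreachable hosts instead of silently skipping them
Write-Warning "Failed to query $logName on ${computer}: $_"
}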
}
}
# Flag any log retaining less than 24 hours of events
$report | Where-Object {
$_.OldestRecord -and
((Get-Date) - $_.OldestRecord).TotalHours -lt 24
} | Select-Object ComputerName, LogName, MaxSizeMB, FillPct, OldestRecord |
Format-Table -AutoSize
# Export full report
$report | Export-Csv "log_inventory.csv" -NoTypeInformation
Part 4 - Windows Event Forwarding: The Pipeline That Silently Drops Events
For organizations using WEF/WEC rather than, or in addition to, a SIEM agent, the forwarding pipeline introduces additional failure modes that are largely invisible without explicit monitoring.
4.1 WEF Architecture and the Subscription Model
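WEF uses WinRM (port 5985 HTTP / 5986 HTTPS) to transport events from source machines to a Windows Event Collector (WEC) server. In outline: the source evaluates the subscription's filter against its local logs, sends matching events over WinRM, and the WEC server writes them to its ForwardedEvents log, where the SIEM picks them up.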
The bookmark mechanism and how it fails:
WEC maintains a bookmark per source machine per subscription, tracking the last EventRecordID successfully forwarded. When a source reconnects after going offline, forwarding resumes from the bookmark. This sounds reliable. It has two critical failure modes:
- The source's local log overwrote the bookmarked position. If the source was offline and its Security log overwrote itself before reconnecting, the WEC resumes from the bookmark which no longer exists in the log. Events between last bookmark and current position are silently lost. The WEC receives no notification that a gap exists.
- The bookmark itself is in the WEC registry and can be corrupted. If the WEC server crashes or the registry becomes inconsistent, bookmarks reset, causing either duplicate or missed events.
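The first failure mode reduces to a comparison between two record IDs. The Python sketch below shows the arithmetic; the function name is hypothetical, and it assumes you can obtain both the bookmarked EventRecordID and the oldest EventRecordID still retained on the source.

```python
def events_lost_behind_bookmark(bookmark_record_id, oldest_retained_record_id):
    """Count events generated after the bookmark that the source has already overwritten."""
    first_needed = bookmark_record_id + 1
    # If the oldest retained record is at or before the next one needed, nothing was lost.
    return max(0, oldest_retained_record_id - first_needed)


# Usage: the source reconnects with its log already wrapped past the bookmark
lost = events_lost_behind_bookmark(bookmark_record_id=482441, oldest_retained_record_id=489209)
print(f"Events overwritten before forwarding could resume: {lost}")
```

Nothing in the forwarding pipeline performs this comparison for you, which is why the loss goes unreported.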
Microsoft's own documentation acknowledges this explicitly:
"When the event log overwrites existing events (resulting in data loss if the device isn't connected to the Event Collector), there's no notification sent to the WEF collector that events are lost from the client. Neither is there an indicator that there was a gap encountered in the event stream."
4.2 The Three WEF Delivery Optimization Modes
WEF offers three delivery modes that trade latency for reliability. Most organizations leave the default, which is optimized for the wrong scenario:
:: View current subscription configuration
wecutil gs "BaselineSubscription"
:: The "ConfigurationMode" field shows the delivery mode ("DeliveryMaxLatencyTime" shows the resulting delay):
::
:: Normal (default): 15 minutes delivery delay. Batches events.
:: Events buffered on source for up to 15 minutes.
:: During a 4-minute incident, you may see NO events in SIEM.
::
:: Minimize Latency: 30 seconds delivery delay.
:: Better for detection but higher WEC load.
::
:: Minimize Bandwidth: 6 hours delivery delay.
:: Clearly wrong for security use cases.
:: Set a subscription to Minimize Latency mode:
wecutil ss "BaselineSubscription" /cm:MinLatency
:: Or set custom timing (delivery every 30 seconds, heartbeat every 60):
:: /dmlt = DeliveryMaxLatencyTime in ms; /dmi sets DeliveryMaxItems, not latency
wecutil ss "BaselineSubscription" /cm:Custom /hi:60000 /dmlt:30000
:: Verify (findstr treats space-separated words as alternatives):
wecutil gs "BaselineSubscription" | findstr /i "latency heartbeat delivery"
In Normal mode, an incident shorter than 15 minutes can be over before a single event has been forwarded, so it generates zero SIEM alerts while it is happening. This is not a theoretical concern; it is documented behavior that directly impacts mean time to detect.
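A quick way to reason about the trade-off is to compare an incident's duration against each mode's maximum batching delay. This Python sketch uses the delays quoted above; the dictionary and function names are illustrative assumptions, and real-world latency also includes WEC processing and SIEM ingestion time.

```python
# Maximum batching delay per mode, in seconds, as described in the text above.
MAX_DELIVERY_DELAY_S = {
    "Normal": 15 * 60,
    "MinLatency": 30,
    "MinBandwidth": 6 * 60 * 60,
}


def incident_may_end_before_delivery(incident_seconds, mode):
    """True if an incident of this length can finish before any event is forwarded."""
    return incident_seconds < MAX_DELIVERY_DELAY_S[mode]


# Usage: a 4-minute incident under each mode
for mode in MAX_DELIVERY_DELAY_S:
    print(mode, incident_may_end_before_delivery(240, mode))
```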
4.3 WEC Server Capacity Limits and Drop Behavior
A WEC server on commodity hardware handles approximately 3,000 events per second on average across all subscriptions. This sounds like a lot. It is not, for a large enterprise.
Calculation: 1,000 workstations × 150 events/sec each at peak (logon storms, patch Tuesday, incident response) = 150,000 events/sec. A single WEC server will be saturated at ~2% of that load.
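The sizing calculation generalizes to a small helper. This is a rough Python sketch: the 3,000 events/sec figure comes from the text above, while the `headroom` factor (the fraction of rated capacity you are willing to plan for) and the function name are assumptions for illustration.

```python
import math


def collectors_needed(sources, peak_events_per_sec_each, per_collector_eps=3000, headroom=0.5):
    """Return (peak_total_eps, collector_count) for a fleet at peak load."""
    peak_total = sources * peak_events_per_sec_each
    usable_eps = per_collector_eps * headroom
    return peak_total, math.ceil(peak_total / usable_eps)


# Usage: the 1,000-workstation example, planning for 50% utilization per collector
peak, count = collectors_needed(1000, 150)
print(f"Peak load {peak} events/sec needs about {count} collectors")
```

In practice most fleets reduce the per-source rate with tighter subscription filters before adding collectors, so treat the count as a worst-case bound.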
When the WEC server exceeds capacity:
Monitor WEC health with these performance counters:
# Run on the WEC server
$counters = @(
'\Event Tracing for Windows Session(EventLog-ForwardedEvents)\Events Lost',
'\Event Tracing for Windows Session(EventLog-ForwardedEvents)\Events Logged per sec',
'\Web Service(_Total)\Current Connections',
'\Web Service(_Total)\Maximum Connections',
'\Processor(_Total)\% Processor Time',
'\Memory\Available MBytes'
)
# Continuous monitoring with 10-second samples
Get-Counter -Counter $counters -SampleInterval 10 -MaxSamples 60 |
Select-Object -ExpandProperty CounterSamples |
Select-Object Path, CookedValue, Timestamp |
Format-Table -AutoSize
# Watch specifically for the Events Lost counter: any non-zero value is critical
Get-Counter '\Event Tracing for Windows Session(EventLog-ForwardedEvents)\Events Lost' `
-SampleInterval 5 -MaxSamples 12 |
Select-Object -ExpandProperty CounterSamples |
Where-Object { $_.CookedValue -gt 0 } |
ForEach-Object { Write-Warning "EVENTS LOST at $($_.Timestamp): $($_.CookedValue)" }
4.4 XPath Subscription Filters: The Gaps You Introduced Intentionally
WEF subscriptions use XPath queries to filter which events are forwarded. These queries are powerful but error-prone. A syntax mistake or logic error in an XPath filter silently excludes events with no error message.
Example of a broken XPath filter that silently misses events:
<!-- BROKEN: This filter tries to catch Event ID 4688 AND 4624
but the XPath is semantically wrong will not match anything -->
<Query Id="0" Path="Security">
<Select Path="Security">
*[System[(EventID=4688)]] AND *[System[(EventID=4624)]]
</Select>
</Query>
<!-- CORRECT: Use separate Select elements or proper XPath OR syntax -->
<Query Id="0" Path="Security">
<Select Path="Security">
*[System[(EventID=4688 or EventID=4624)]]
</Select>
</Query>
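The logic error is easier to see outside XPath. The Python sketch below models only the boolean semantics of the two filters against a handful of sample event IDs. It is not an XPath evaluator, and the function names are illustrative.

```python
def broken_filter(event_id):
    # Mirrors "EventID=4688 AND EventID=4624": a single event cannot carry two IDs.
    return event_id == 4688 and event_id == 4624


def correct_filter(event_id):
    # Mirrors "EventID=4688 or EventID=4624".
    return event_id == 4688 or event_id == 4624


sample_ids = [4624, 4688, 4625, 4672]
matched_broken = [e for e in sample_ids if broken_filter(e)]
matched_correct = [e for e in sample_ids if correct_filter(e)]
print("broken filter matched:", matched_broken)
print("correct filter matched:", matched_correct)
```

Each event has exactly one EventID, so a conjunction of two different IDs can never be true for any event, while the disjunction matches either.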
Validate your XPath filters before deployment:
# Test an XPath filter against local logs before putting it in a subscription
# This reveals whether the filter syntax is correct and returns events
$xpath = "*[System[(EventID=4688 or EventID=4624 or EventID=4625)]]"
$logName = "Security"
try {
$events = Get-WinEvent -LogName $logName -FilterXPath $xpath -MaxEvents 10 -ErrorAction Stop
Write-Host "XPath filter valid. Matched $($events.Count) recent events."
$events | Select-Object TimeCreated, Id, Message | Format-Table -AutoSize
} catch [System.Exception] {
Write-Error "XPath filter INVALID or no matching events: $_"
}
# Also validate that key event IDs ARE present in the log at all
# (if they're not, the audit policy isn't generating them)
$criticalEventIDs = @(4688, 4624, 4625, 4672, 4698, 4719, 4776)
foreach ($id in $criticalEventIDs) {
$count = (Get-WinEvent -LogName Security -FilterXPath "*[System[EventID=$id]]" `
-MaxEvents 1000 -ErrorAction SilentlyContinue).Count
$status = if ($count -gt 0) { "✓ Present ($count matches, capped at 1000)" } else { "⚠ ABSENT: check audit policy" }
Write-Host "Event ID $id : $status"
}
Part 5 - The SIEM Agent Layer: Hidden Drop Points
SIEM agents (Splunk Universal Forwarder, Elastic Agent, Microsoft Monitoring Agent, etc.) introduce their own failure modes. These are frequently overlooked because the agent is "running" and heartbeating to the SIEM, even while dropping events.
5.1 The Bookmark Race Condition
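SIEM agents reading .evtx files maintain a local bookmark (position marker) in the file they are reading. The agent reads from the bookmark forward, ships events, and updates the bookmark. The race condition: on a busy host the log can wrap past the bookmark before the agent reads those records, so they are overwritten without ever being shipped.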
The fix is twofold: make the log large enough that it doesn't wrap during the agent's read cycle, and ensure the agent's batch processing interval is short enough relative to the event generation rate. For Splunk UF:
# inputs.conf: Splunk Universal Forwarder tuning for high-volume Security logs
# Comments sit on their own lines; Splunk .conf files do not support trailing comments
[WinEventLog://Security]
disabled = 0
start_from = oldest
current_only = 0
# Flush the bookmark every 5 seconds
checkpointInterval = 5
# Events read per batch (tune on busy DCs)
batch_size = 10
# Capture full XML for field extraction
renderXml = true
# Exclude logoff events if volume is too high
blacklist1 = EventCode="4634"
# Exclude handle requests (very noisy)
blacklist2 = EventCode="4656"
[WinEventLog://Microsoft-Windows-Sysmon/Operational]
disabled = 0
start_from = oldest
checkpointInterval = 5
batch_size = 20
renderXml = true
5.2 License-Cap Induced Dropping (The Invisible Budget Problem)
Many SIEM platforms enforce daily ingestion limits based on license volume. When the daily cap is hit:
- Splunk: Indexing generally continues past the daily quota, but repeated overages put the license into violation, which (depending on license type and version) blocks search. The data may be there, but your detections cannot run against it. A warning appears in the Splunk UI, but only if someone is watching.
- Microsoft Sentinel: Ingestion continues but per-GB pricing means cost spikes, sometimes triggering organizational decisions to cap ingestion, implemented via Data Collection Rules that silently filter events.
- Elastic: License limits restrict feature use, but ingest is less commonly hard-capped.
Check your Splunk license usage:
| rest /services/licenser/pools
| table title, used_bytes, effective_quota, slave_count
| eval used_GB = round(used_bytes/1073741824, 2)
| eval quota_GB = round(effective_quota/1073741824, 2)
| eval pct_used = round((used_bytes/effective_quota)*100, 1)
| where pct_used > 80
| sort -pct_used
Check for indexing gaps in Splunk (license exceeded periods):
index=_internal source=*license_usage.log type=Usage
| timechart span=1h sum(b) as bytes_indexed
| fillnull value=0 bytes_indexed
| eval GB_indexed = round(bytes_indexed/1073741824, 2)
| where GB_indexed = 0
Part 6 - How to Actually Measure Your Collection Fidelity
Everything above describes where things go wrong. This section tells you how to measure whether they are going wrong in your environment, right now.
6.1 The EventRecordID Continuity Test
The most direct measurement: compare the EventRecordID sequence seen in your SIEM against what the source machine has generated. Any gap = events you do not have.
# On the source machine: get the current highest EventRecordID and earliest retained
$securityLog = Get-WinEvent -LogName Security -MaxEvents 1
$oldestEvent = Get-WinEvent -LogName Security -MaxEvents 1 -Oldest
$sourceStats = [PSCustomObject]@{
LatestRecordId = $securityLog.RecordId
OldestRecordId = $oldestEvent.RecordId
OldestTimestamp = $oldestEvent.TimeCreated
TotalRetained = $securityLog.RecordId - $oldestEvent.RecordId + 1
}
Write-Output "Source latest RecordId: $($sourceStats.LatestRecordId)"
Write-Output "Source oldest retained: $($sourceStats.OldestRecordId) at $($sourceStats.OldestTimestamp)"
Write-Output "Events retained locally: $($sourceStats.TotalRetained)"
Now check what your SIEM has for the same host:
index=wineventlog host="DC01" source="WinEventLog:Security"
| stats min(EventRecordID) as earliest_in_siem,
max(EventRecordID) as latest_in_siem,
count as total_in_siem
by host
| eval coverage_pct = round((total_in_siem / (latest_in_siem - earliest_in_siem + 1)) * 100, 2)
If coverage_pct is substantially below 100%, events in that ID range are missing from your SIEM. The delta between source TotalRetained and SIEM total_in_siem over the same period is your gap count.
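If you prefer to do the comparison outside the SIEM, the same calculation works on exported record IDs. The Python sketch below mirrors the SPL above, with one difference: it de-duplicates IDs first, so re-sent events cannot inflate coverage. Both function names are hypothetical.

```python
def coverage_pct(siem_record_ids):
    """Percentage of the observed ID range that is actually present in the SIEM."""
    ids = set(siem_record_ids)
    if not ids:
        return 0.0
    expected = max(ids) - min(ids) + 1
    return round(len(ids) / expected * 100, 2)


def missing_vs_source(siem_record_ids, source_oldest_id, source_latest_id):
    """IDs the source still retains that never reached the SIEM."""
    have = set(siem_record_ids)
    return [i for i in range(source_oldest_id, source_latest_id + 1) if i not in have]
```

The second function is the more useful one during an investigation: every ID it returns is an event that still exists on the endpoint and can be recovered before the log wraps.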
6.2 The Event Volume Baseline Method
A subtler but more scalable approach: establish a baseline of expected event volume per host per event type, then alert on deviations.
index=wineventlog source="WinEventLog:Security" EventCode=4688 earliest=-7d
| bin _time span=1h
| stats count by host, _time
| streamstats window=168 avg(count) as avg_hourly by host
| eval pct_of_avg = round((count / avg_hourly) * 100, 0)
| where pct_of_avg < 50
More practically, for a KQL (Microsoft Sentinel) equivalent:
// Detect hosts reporting significantly fewer events than their 7-day average
// Indicator of agent failure, log overwrite acceleration, or active suppression
let lookback = 7d;
let evaluationWindow = 1h;
// Evaluate the most recent complete hour, not the partial current one
let lastFullHour = bin(ago(evaluationWindow), evaluationWindow);
SecurityEvent
| where TimeGenerated > ago(lookback)
| where EventID == 4688 // Process creation: high volume, good baseline indicator
| summarize
EventCount = count()
by Computer, bin(TimeGenerated, evaluationWindow)
| summarize
AvgHourlyCount = avg(EventCount),
StdDev = stdev(EventCount),
// sumif yields 0 for hosts that sent nothing, so silent hosts are still reported
LastHourCount = sumif(EventCount, TimeGenerated == lastFullHour)
by Computer
| extend
DropThreshold = AvgHourlyCount * 0.5, // Alert if below 50% of average
PctOfAverage = round((LastHourCount / AvgHourlyCount) * 100, 1)
| where LastHourCount < DropThreshold
| where AvgHourlyCount > 10 // Exclude hosts with low baseline (too noisy)
| project Computer, AvgHourlyCount, LastHourCount, PctOfAverage, DropThreshold
| sort by PctOfAverage asc
This query runs every hour. Any host reporting fewer than 50% of its normal process creation events triggers an alert. The root cause could be: the machine is off (expected), the agent crashed (fix it), the log is not being collected (configuration issue), or an attacker suppressed logging (respond immediately).
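The thresholding logic itself is small. Here is a minimal Python sketch of it, operating on per-host hourly counts you might export from any SIEM. The function name and data shape are assumptions, and unlike the KQL above it excludes the evaluated hour from the baseline average.

```python
from statistics import mean


def hosts_below_baseline(hourly_counts, threshold=0.5, min_baseline=10):
    """Flag hosts whose latest hour fell below a fraction of their own average.

    hourly_counts maps host -> list of hourly event counts, oldest first;
    the final entry is the hour under evaluation.
    Returns {host: percent_of_baseline} for flagged hosts.
    """
    flagged = {}
    for host, counts in hourly_counts.items():
        if len(counts) < 2:
            continue  # not enough history to form a baseline
        baseline = mean(counts[:-1])
        last = counts[-1]
        # Skip quiet hosts: small baselines make the ratio too noisy.
        if baseline > min_baseline and last < baseline * threshold:
            flagged[host] = round(last / baseline * 100, 1)
    return flagged
```

Keeping the evaluated hour out of the baseline stops a sudden drop from dragging down the average it is being compared against.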
6.3 The Gold Standard: Synthetic Event Injection
The most reliable test: inject known events into a source machine and verify they appear in your SIEM with the correct fields within an expected time window. This is functionally equivalent to a canary test for your collection pipeline.
# On a test or production machine:
# Inject a synthetic event into the Application log with a unique identifier
# that you can search for in your SIEM
$uniqueMarker = "SIEM-FIDELITY-TEST-$(Get-Date -Format 'yyyyMMdd-HHmmss')-$(New-Guid)"
# Write a synthetic event using .NET EventLog class
$eventSource = "SIEMFidelityTest"
if (-not [System.Diagnostics.EventLog]::SourceExists($eventSource)) {
[System.Diagnostics.EventLog]::CreateEventSource($eventSource, "Application")
}
$log = New-Object System.Diagnostics.EventLog("Application")
$log.Source = $eventSource
$log.WriteEntry($uniqueMarker, [System.Diagnostics.EventLogEntryType]::Information, 9999)
Write-Output "Injected marker: $uniqueMarker"
Write-Output "Now search your SIEM for this string within the next 5 minutes."
Write-Output "If absent after 10 minutes, the collection pipeline has a gap."
You can wrap this into a scheduled task that runs every 4 hours, writes a unique marker, and then a separate SIEM query checks for the marker's arrival within a 15-minute window. Missing markers = pipeline failure = automatic ticket.
SIEM search to validate the marker arrived (Splunk):
(index=wineventlog OR index=windows) EventCode=9999 source="WinEventLog:Application"
| where Message like "%SIEM-FIDELITY-TEST%"
| rex field=Message "SIEM-FIDELITY-TEST-(?<marker_id>[^\s]+)"
| eval latency_seconds = _indextime - strptime(substr(marker_id, 1, 15), "%Y%m%d-%H%M%S")
| table _time, host, marker_id, latency_seconds
| sort -_time
If latency_seconds is consistently over 900 (15 minutes), your collection pipeline is too slow for meaningful detection of fast-moving incidents.
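For pipelines where the check runs outside the SIEM, the marker can be parsed and timed in a few lines. This Python sketch assumes the marker format produced by the injection script above and that you can obtain the time the event arrived in the SIEM; the function name and the `indexed_at` parameter are hypothetical.

```python
import re
from datetime import datetime

# Matches the timestamp and the trailing unique suffix of the injected marker.
MARKER_RE = re.compile(r"SIEM-FIDELITY-TEST-(\d{8}-\d{6})-(\S+)")


def marker_latency_seconds(message, indexed_at):
    """Seconds between the timestamp embedded in the marker and SIEM arrival.

    indexed_at must be a naive datetime in the same time zone as the source host.
    Returns None if the message carries no marker.
    """
    match = MARKER_RE.search(message)
    if not match:
        return None
    injected_at = datetime.strptime(match.group(1), "%Y%m%d-%H%M%S")
    return (indexed_at - injected_at).total_seconds()
```

Measure against arrival time, not the time you run the check; otherwise the computed latency grows with how late you look.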
6.4 Checking WEF Subscription Health
# On the WEC server: view health of all subscriptions and their sources
wecutil es # List all subscriptions
# For each subscription, check the runtime status of all enrolled sources
$subscriptions = wecutil es
foreach ($sub in $subscriptions) {
Write-Host "`n=== Subscription: $sub ===" -ForegroundColor Cyan
# Get full subscription config
wecutil gs "$sub" | Select-String -Pattern "Name|Status|Enabled|Uri"
# Get per-source runtime status
wecutil gr "$sub" | ForEach-Object {
if ($_ -match "Source|LastError|NextRetry|LastHeartbeat") {
if ($_ -match "LastError" -and $_ -notmatch 'LastError:\s*(0x)?0\s*$') {
Write-Host $_ -ForegroundColor Red # Non-zero error = problem
} else {
Write-Host $_
}
}
}
}
Look for sources with LastError values other than 0x0. Common error codes and their meaning:
| Error Code | Meaning | Action |
|---|---|---|
| 0x0 | OK | None needed |
| 0x80070005 | Access denied | Check WinRM configuration, DACL on subscription |
| 0x80070776 | Subscription not found | Re-apply GPO, restart WEC service |
| 0x803300004 | Connection refused | WinRM not running on source, firewall blocking 5985 |
| 0x803300005 | Could not connect | DNS resolution failure, network issue |
| 0x8033000f | No more endpoints | Source machine offline or unreachable |
# Find all WEF sources that haven't heartbeated in the last 2 hours
# These are machines with potential coverage gaps
# The parsing accepts both "LastHeartbeat:" and "LastHeartbeatTime:" labels, and
# source names given either as "Source: name" or on a line of their own
$twoHoursAgo = (Get-Date).AddHours(-2)
$currentSource = $null
wecutil gr "BaselineSubscription" | ForEach-Object {
    $line = $_.Trim()
    if ($line -match '^LastHeartbeat(Time)?:\s*(.+)$') {
        $heartbeatTime = [DateTime]::MinValue
        # TryParse skips placeholders such as "N/A" without throwing
        if ([DateTime]::TryParse($Matches[2], [ref]$heartbeatTime) -and
            $heartbeatTime -lt $twoHoursAgo) {
            Write-Warning "STALE: $currentSource last heartbeat: $heartbeatTime"
        }
    } elseif ($line -match '^Source:\s*(.+)$') {
        $currentSource = $Matches[1]
    } elseif ($line -and $line -notmatch ':') {
        $currentSource = $line
    }
}
Part 7 - Attackers Exploiting These Gaps: T1562.002
Everything above describes accidental gaps. Sophisticated attackers deliberately exploit them. MITRE ATT&CK T1562.002 (Impair Defenses: Disable Windows Event Logging) documents the specific techniques.
7.1 Disabling Audit Policy Mid-Attack
:: Attacker with local admin rights can disable specific audit subcategories
:: to suppress logging of their specific techniques
:: Disable process creation logging before running tools
auditpol /set /subcategory:"Process Creation" /success:disable /failure:disable
:: Disable logon event logging during lateral movement
auditpol /set /subcategory:"Logon" /success:disable
:: This generates Event ID 4719 (audit policy changed) IF you're logging it
:: Most environments don't alert on 4719. Check yours:
auditpol /get /subcategory:"Audit Policy Change"
The defense: Alert on Event ID 4719 (system audit policy changed). This event is generated whenever the local audit policy is modified, including by auditpol, provided the "Audit Policy Change" subcategory is itself being audited. It is one of the highest-fidelity indicators of active defense evasion: outside of planned administrative changes, it has almost no legitimate cause.
// KQL: alert on audit policy changes made by non-machine accounts
// Uses the parsed SecurityEvent columns; EventData is a raw string and cannot be dot-indexed
SecurityEvent
| where TimeGenerated > ago(24h)
| where EventID == 4719
| where SubjectUserName !endswith "$" // Exclude machine accounts (GPO application)
| project TimeGenerated, Computer, SubjectUserName, SubjectLogonId, AuditPolicyChanges
| sort by TimeGenerated desc
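The 4719 alert only fires if the event reaches the SIEM, so it can help to check the effective policy on the host directly as well. The sketch below parses the CSV that `auditpol /get /category:* /r` produces and lists required subcategories that are not auditing success events. The function name `Find-MissingAuditSubcategory` and the default list of required subcategories are assumptions chosen for illustration, and the `Success` match assumes an English-language OS.

```powershell
# Sketch: flag required audit subcategories that are not logging Success events.
# ASSUMPTIONS: input is the CSV from `auditpol /get /category:* /r` on an
# English-language system; the default -Required list is illustrative only.
function Find-MissingAuditSubcategory {
    param(
        [Parameter(Mandatory)] [AllowEmptyString()] [string[]] $AuditpolCsv,
        [string[]] $Required = @('Process Creation', 'Logon', 'Audit Policy Change')
    )
    # Drop blank lines before parsing so they cannot produce empty rows
    $rows = $AuditpolCsv | Where-Object { $_ -and $_.Trim() } | ConvertFrom-Csv
    foreach ($name in $Required) {
        $row = $rows | Where-Object { $_.Subcategory -eq $name } | Select-Object -First 1
        $setting = if ($row) { $row.'Inclusion Setting' } else { 'Not reported' }
        if ($setting -notmatch 'Success') {
            [pscustomobject]@{ Subcategory = $name; Setting = $setting }
        }
    }
}

# Example (elevated prompt on the host being checked):
# Find-MissingAuditSubcategory -AuditpolCsv (auditpol /get /category:* /r)
```

Running this on a schedule and comparing the result with the previous run gives a host-side signal that does not depend on 4719 being forwarded.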
7.2 Clearing the Event Log
# Attacker clears the Security log to destroy evidence
wevtutil cl Security # Generates Event 1102 (audit log cleared)
# OR
Clear-EventLog -LogName Security # Same result
# Remove-EventLog is more destructive: it removes the log entirely
Remove-EventLog -LogName Security
# Does NOT generate 1102: the channel is gone before the event can be written
# Generates 104 in the System log (Microsoft-Windows-Eventlog: a log was cleared)
Detecting log clearing:
// Alert on Event 1102 (Security log cleared): rarely a legitimate event
// Uses the parsed Account and Activity columns; EventData is a raw string and cannot be dot-indexed
SecurityEvent
| where EventID == 1102
| project TimeGenerated, Computer, Account, Activity
| sort by TimeGenerated desc
// Also alert on Event 104 (System log), which is written when an event log such as System or Application is cleared
Event
| where EventLog == "System" and EventID == 104
| project TimeGenerated, Computer, RenderedDescription
7.3 ETW Provider Tampering (Advanced)
A sophisticated attacker can tamper with ETW in memory, silencing specific providers without triggering log-clearing events:
Technique: Patch the ETW write path in the target process's memory (for
example, the start of ntdll!EtwEventWrite) so the function returns early,
silently suppressing the events that process would have emitted, without
any Event ID 1102, 4719, or 104 appearing.
Detection:
- Compare expected vs. actual event volumes (Section 6.2)
- Monitor for Sysmon Event ID 1 (process creation) with known ETW-patching
tool signatures in CommandLine field
- Check ETW session buffer loss counters (Section 1.2)
- Synthetic event injection will catch this (Section 6.3)
There is no single event that fires when ETW is patched in memory. Volume-based detection and synthetic injection are the only reliable detections.
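Volume-based detection comes down to comparing the current event count for a host against its own recent baseline. The sketch below shows that comparison under simple assumptions: the function name `Test-EventVolumeDrop`, the 7-sample baseline, and the 50% threshold are illustrative choices, not values taken from this post, and the counts are expected to come from your SIEM.

```powershell
# Sketch: flag a host whose current event volume has dropped well below baseline.
# ASSUMPTIONS: counts come from your SIEM (e.g. the same hour on each of the
# last 7 days); the 0.5 default threshold is illustrative and needs tuning.
function Test-EventVolumeDrop {
    param(
        [Parameter(Mandatory)] [double[]] $BaselineCounts,
        [Parameter(Mandatory)] [double]   $CurrentCount,
        [double] $MinRatio = 0.5
    )
    $avg   = ($BaselineCounts | Measure-Object -Average).Average
    # With an empty baseline (average 0) there is nothing to compare against
    $ratio = if ($avg -gt 0) { $CurrentCount / $avg } else { 1.0 }
    [pscustomobject]@{
        BaselineAverage = [math]::Round($avg, 1)
        Ratio           = [math]::Round($ratio, 2)
        Suspicious      = (($avg -gt 0) -and ($ratio -lt $MinRatio))
    }
}

# Example: a host that normally sends about 1,000 events per hour sends 120
# Test-EventVolumeDrop -BaselineCounts 1000,1100,900,1000,1000,1050,950 -CurrentCount 120
```

A per-host, per-event-ID baseline is more sensitive than a global one, because a patched provider in one process can vanish inside the noise of an aggregate count.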
Part 8: The Hardening Roadmap (Fix It This Week)
Priority 1 (Do This Today)
# 1. Verify the audit policy override flag is set on all DCs and critical servers
# Expected: "Audit: Force audit policy..." = Enabled (the exported line ends in "=4,1")
Invoke-Command -ComputerName "DC01","DC02","SERVER01" -ScriptBlock {
    $cfg = "$env:TEMP\secpol.cfg"
    secedit /export /cfg $cfg /quiet | Out-Null
    Select-String -Path $cfg `
        -Pattern "MACHINE\\System\\CurrentControlSet\\Control\\Lsa\\SCENoApplyLegacyAuditPolicy"
    Remove-Item $cfg -ErrorAction SilentlyContinue # Clean up the exported policy file
}
# 2. Check that process creation (4688) IS generating events on at least one DC
$recent4688 = Get-WinEvent -ComputerName "DC01" -LogName Security `
-FilterXPath "*[System[EventID=4688 and TimeCreated[timediff(@SystemTime) <= 3600000]]]" `
-MaxEvents 5 -ErrorAction SilentlyContinue
if (-not $recent4688) {
Write-Warning "No 4688 events in the last hour on DC01: audit policy is not configured correctly"
}
# 3. Check command-line logging is enabled
$cmdLineSetting = Invoke-Command -ComputerName "DC01" -ScriptBlock {
$path = "HKLM:\Software\Microsoft\Windows\CurrentVersion\Policies\System\Audit"
(Get-ItemProperty -Path $path -Name "ProcessCreationIncludeCmdLine_Enabled" -EA SilentlyContinue).ProcessCreationIncludeCmdLine_Enabled
}
if ($cmdLineSetting -ne 1) {
Write-Warning "Command-line logging NOT enabled on DC01: all 4688 events will have an empty CommandLine"
}
Priority 2 (This Week)
# Resize Security log on all DCs to 4GB
$dcs = (Get-ADDomainController -Filter *).Name
foreach ($dc in $dcs) {
Invoke-Command -ComputerName $dc -ScriptBlock {
wevtutil sl Security /ms:4294967296 # 4GB
wevtutil sl Microsoft-Windows-Sysmon/Operational /ms:2147483648 # 2GB
wevtutil sl Microsoft-Windows-PowerShell/Operational /ms:1073741824 # 1GB
Write-Output "$env:COMPUTERNAME log sizes updated"
}
}
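Whether 4GB is enough depends on how fast the host writes events. A rough retention estimate is the log size divided by the write rate. The sketch below does that arithmetic; the function name `Get-LogRetentionHours` is hypothetical, and the average event size is an assumption you should measure on your own hosts (for example, by dividing the current log file size by its record count).

```powershell
# Sketch: estimate how many hours of events a log can hold before it wraps.
# ASSUMPTION: -AvgEventBytes is a measured or estimated average on-disk size
# per event; the 1024-byte default is a placeholder, not a measured value.
function Get-LogRetentionHours {
    param(
        [Parameter(Mandatory)] [double] $MaxSizeBytes,
        [Parameter(Mandatory)] [double] $EventsPerSecond,
        [double] $AvgEventBytes = 1024
    )
    $bytesPerSecond = $EventsPerSecond * $AvgEventBytes
    [math]::Round($MaxSizeBytes / $bytesPerSecond / 3600, 1)
}

# Example: a 4GB Security log on a DC writing 50 events per second
# Get-LogRetentionHours -MaxSizeBytes 4GB -EventsPerSecond 50
```

At 50 events per second and 1 KB per event, 4GB holds roughly 23 hours, while a 128MB log at the same rate holds well under one hour. If forwarding stalls for longer than that window, events are overwritten before they leave the host.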
Priority 3 (This Month)
Deploy the synthetic event injection test as a scheduled task on 10 representative hosts (DCs, critical servers, sample workstations). Run it every 4 hours, and alert in the SIEM if any marker is still absent 15 minutes after injection. This gives you continuous, automated validation of collection fidelity, which is the metric that turns this from a one-time audit into an ongoing operational control.
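The alerting half of that control is a comparison between the markers that were injected and the markers the SIEM has actually seen. A minimal sketch of that comparison follows. The function name `Find-MissingMarker` and the object shape (`Computer`, `MarkerId`, `InjectedAt`) are hypothetical, and how you retrieve the list of seen marker IDs depends on your SIEM's query API.

```powershell
# Sketch: list synthetic markers that have not reached the SIEM within the grace period.
# ASSUMPTIONS: $Injected holds objects with Computer, MarkerId and InjectedAt
# properties (names are illustrative); $SeenMarkerIds comes from a SIEM query.
function Find-MissingMarker {
    param(
        [Parameter(Mandatory)] [object[]] $Injected,
        [string[]] $SeenMarkerIds = @(),
        [Parameter(Mandatory)] [datetime] $Now,
        [int] $GraceMinutes = 15
    )
    $seen = @{}
    foreach ($id in $SeenMarkerIds) { $seen[$id] = $true }
    foreach ($m in $Injected) {
        $age = ($Now - $m.InjectedAt).TotalMinutes
        # Only markers older than the grace period can be declared missing
        if (($age -ge $GraceMinutes) -and (-not $seen.ContainsKey($m.MarkerId))) {
            [pscustomobject]@{
                Computer       = $m.Computer
                MarkerId       = $m.MarkerId
                MinutesOverdue = [math]::Round($age - $GraceMinutes, 1)
            }
        }
    }
}
```

Any object this returns represents a host where the full path from event generation to SIEM ingestion failed at least once, whatever the underlying cause.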
The Complete Gap Inventory: What to Check and How
| Gap | Detection Method | Tool | Time to Verify |
|---|---|---|---|
| Audit policy not generating events | auditpol /get /category:* | auditpol.exe | 5 min per host |
| Legacy/advanced policy conflict | Check for SCENoApplyLegacyAuditPolicy=0 | secedit / registry | 10 min |
| Command-line logging disabled | Registry check | PowerShell | 2 min per host |
| Log sizes too small | wevtutil gl Security | wevtutil.exe | 2 min per host |
| WEF subscription filter errors | Test XPath with Get-WinEvent -FilterXPath | PowerShell | 15 min |
| WEC server dropping events | ETW Buffers Lost performance counter | Get-Counter | 10 min |
| WEF delivery mode too slow | wecutil gs <subscription> DeliveryMaxLatency | wecutil.exe | 5 min |
| Stale WEF sources | wecutil gr <subscription> LastHeartbeat | wecutil.exe | 15 min |
| EventRecordID gaps in SIEM | Compare source RecordId vs. SIEM query | PowerShell + SIEM | 30 min |
| Volume baseline deviation | SIEM query comparing last hour to 7-day avg | SIEM | Ongoing |
| Audit log cleared (1102) | Alert rule in SIEM | SIEM | Deploy now |
| Audit policy tampered (4719) | Alert rule in SIEM | SIEM | Deploy now |
| ETW tampering | Synthetic injection test | Scheduled PowerShell | Deploy weekly |
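The EventRecordID row in the table is the most direct fidelity measurement: take the record IDs that match your subscription filter on the source, take the record IDs the SIEM holds for the same host and window, and compute what is missing. The sketch below shows the comparison step only. The function name `Get-RecordIdCoverage` is hypothetical, and it assumes both lists were collected with the same filter and time range, since a filtered subscription will otherwise show gaps that are expected rather than lost.

```powershell
# Sketch: compare source-side record IDs against those present in the SIEM.
# ASSUMPTION: both lists cover the same host, time window and event filter.
function Get-RecordIdCoverage {
    param(
        [Parameter(Mandatory)] [long[]] $SourceRecordIds,
        [long[]] $SiemRecordIds = @()
    )
    $inSiem = @{}
    foreach ($id in $SiemRecordIds) { $inSiem[$id] = $true }
    $missing = @($SourceRecordIds | Where-Object { -not $inSiem.ContainsKey($_) })
    $present = $SourceRecordIds.Count - $missing.Count
    [pscustomobject]@{
        SourceCount     = $SourceRecordIds.Count
        MissingCount    = $missing.Count
        CoveragePercent = [math]::Round(100.0 * $present / $SourceRecordIds.Count, 1)
        MissingIds      = $missing
    }
}

# Example source-side collection (run on the endpoint):
# $ids = (Get-WinEvent -LogName Security -MaxEvents 5000).RecordId
```

Tracking `CoveragePercent` per host over time turns the one-off audit into a trend you can alert on.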
References
- Microsoft Learn: "Use Windows Event Forwarding to help with intrusion detection"
- Palantir: windows-event-forwarding GitHub repository (production WEF architecture)
- Elastic: "The Essentials of Central Log Collection with WEF/WEC"
- MITRE ATT&CK T1562.002: Impair Defenses: Disable Windows Event Logging
- MITRE ATT&CK T1070.001: Indicator Removal: Clear Windows Event Logs
- Microsoft Learn: Event ID 1102 and 4719 documentation
- NSA/CISA: "Windows Event Logging and Forwarding" (NSA-CSI-18-130)
- Malware Archaeology: Windows Logging Cheat Sheet v2019
- Roberto Rodriguez (Cyb3rWard0g): ThreatHunter-Playbook (ETW research)
Further Reading
- Network Forensics Without a Tap: when event logs are disabled or cleared, reconstruct movement from DNS, NetFlow, and DHCP
- How APT Groups Pivot from Initial Access to Domain Dominance in Under 4 Hours: the specific Event IDs that expose each stage of an APT attack chain
- How Attackers Abuse Entra ID & OAuth Without Malware: cloud identity events that require separate collection pipelines from on-prem logs
All commands in this post are standard Windows administrative utilities and PowerShell built-ins. They operate on logs you have administrative access to. This is a defensive operations guide.