Files
Terminal.Gui/Tests/StressTests/InvokeLeakTest_Timing_Diagram.md
Copilot cb748a1c09 Fix InvokeLeakTest debugger failure on x64 with high-resolution timing (#4304)
* Initial plan

* Add comprehensive analysis of InvokeLeakTest debugger failure

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Add XML documentation to InvokeLeakTest about debugger issues

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Add visual timing diagrams for InvokeLeakTest analysis

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Add executive summary of InvokeLeakTest investigation

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Update analysis with x64 vs ARM confirmation from @tig

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Implement Stopwatch-based timing in TimedEvents to fix x64 race condition

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Update documentation to reflect fix implementation

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Fix test issues and increase TimeSpan.Zero buffer for debugger safety

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Add MainLoop.Wakeup() call in Invoke and remove problematic test

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Auto-detect debugger and increase test timeout to 500ms

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Remove unnecessary MainLoop.Wakeup() call for v2 drivers

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Move analysis documents to Tests/StressTests folder

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Add test coverage for multiple drivers per @BDisp's suggestion

Co-authored-by: tig <585482+tig@users.noreply.github.com>

* Revert multi-driver test coverage changes per @tig request

Co-authored-by: tig <585482+tig@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tig <585482+tig@users.noreply.github.com>
2025-10-21 19:41:03 -06:00

260 lines
9.6 KiB
Markdown

# InvokeLeakTest Timing Diagram
## Normal Operation (Without Debugger)
```
Timeline (milliseconds):
0ms 10ms 20ms 30ms 40ms 50ms 60ms 70ms 80ms 90ms 100ms
|------|------|------|------|------|------|------|------|------|------|
│ Background Thread 1: Task.Run → Sleep(2-4ms) → Invoke()
│ ↓
│ Background Thread 2: Task.Run → Sleep(2-4ms) → Invoke()
│ ↓
│ Background Thread 3: Task.Run → Sleep(2-4ms) → Invoke()
│ ↓
│ [All added to _timeouts]
│ Main Loop: ──────────[Iter]───────[Iter]───────[Iter]───────[Iter]────
│ ↓ ↓ ↓ ↓
│ RunTimers RunTimers RunTimers RunTimers
│ ↓ ↓ ↓ ↓
│ Execute 0 Execute 5 Execute 10 Execute 15
│ Counter: 0 ───────→ 0 ─────→ 5 ───────→ 15 ────→ 30 ────→ 45 ─────→ 50
│ Test Check: ✓ PASS
│ └──────────────100ms window────────────────┘
```
**Result**: All invocations execute within 100ms → Test passes
---
## Problem Scenario (With Debugger - @BDisp's Machine)
```
Timeline (milliseconds):
0ms 10ms 20ms 30ms 40ms 50ms 60ms 70ms 80ms 90ms 100ms 110ms
|------|------|------|------|------|------|------|------|------|------|------|
│ Background Threads: 500 Tasks launch rapidly
│ ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓ (burst of invocations)
│ All added to _timeouts within same DateTime.UtcNow tick
│ Timestamps: T-100, T-99, T-98, T-97, ... (NudgeToUniqueKey)
│ Main Loop (SLOW due to debugger overhead):
│ ────────────────[Iter 1]────────────────────[Iter 2]──────────────
│ 25ms 60ms
│ ↓ ↓
│ RunTimers RunTimers
│ ↓ ↓
│ DateTime.UtcNow DateTime.UtcNow
│ = T0 = T1
│ ↓ ↓
│ Check: k < now? Check: k < now?
│ ↓ ↓
│ ┌─────────────────┴──────────────┐ │
│ │ Some timeouts: k >= now ! │ │ Execute some
│ │ These are NOT executed │ │ timeouts
│ │ Added back to _timeouts │ │
│ └────────────────────────────────┘ ↓
│ Counter += 300
│ Counter: 0 ────────────────→ 0 ──────────────────────→ 300 ────────────→ 450
│ Test Check at 100ms: ✗ FAIL
│ └──────────100ms window──────┘
│ Counter = 300, Expected = 500
│ Missing 200 invocations!
│ (Those 200 invocations execute later, around 110ms)
```
**Result**: Not all invocations execute within 100ms → TimeoutException
---
## The DateTime.UtcNow Resolution Problem
```
Real Time: 0.000ms 0.001ms 0.002ms 0.003ms ... 15.6ms 15.7ms
│ │ │ │ │ │
DateTime.UtcNow: T0───────────────────────────────────────→T1─────→T2
└─────────────15.6ms tick──────────────────┘
All invocations here get T0
```
**When 100 invocations happen within 15.6ms:**
```
Invoke #1: k = T0 - 100 ticks
Invoke #2: k = T0 - 99 ticks (NudgeToUniqueKey increments)
Invoke #3: k = T0 - 98 ticks
...
Invoke #100: k = T0 + 0 ticks (This is T0!)
```
**When RunTimers() checks at time T0 + 50 ticks:**
```
Invoke #1-50: k < now → Execute ✓
Invoke #51-100: k >= now → NOT executed! Added back to queue ✗
```
---
## Why Debugger Makes It Worse
### Without Debugger
```
Main Loop Iteration Time: ~10-20ms
│ Invoke batch: 10 tasks ────→ Execute 10 ────→ Next batch: 10 tasks
│ 10ms 10ms
│ Small batches processed quickly
```
### With Debugger
```
Main Loop Iteration Time: ~25-50ms (2-5x slower!)
│ Invoke batch: 100 tasks (burst!) ────────→ Execute 50 ──→ 50 still queued
│ 50ms ↓
│ Need another 50ms!
│ Large batches accumulate, processing delayed
```
**Effect**: More invocations queue up between iterations, increasing likelihood of timestamp collisions and race conditions.
---
## The Race Condition Explained
```
Thread Timeline:
Background Thread Main Thread
───────────────── ───────────
[Call Invoke()]
[Lock _timeoutsLockToken]
k = DateTime.UtcNow.Ticks
= 1000
k -= 100 (= 900)
[Add to _timeouts with k=900]
[Release lock] [Lock _timeoutsLockToken]
[Copy _timeouts]
[Release lock]
now = DateTime.UtcNow.Ticks
= 850 ⚠️ (Timer hasn't updated!)
Check: k < now?
900 < 850? → FALSE!
[Timeout NOT executed]
[Added back to _timeouts]
```
**Problem**: Between when `k` is calculated (900) and when it's checked (now=850), the system timer hasn't updated! This can happen because:
1. DateTime.UtcNow has coarse resolution (~15ms)
2. Thread scheduling can cause the check to happen "early"
3. Debugger makes timing less predictable
---
## Solution Comparison
### Current: DateTime.UtcNow
```
Resolution: ~15.6ms (Windows), varies by platform
Precision: Low
Stability: Affected by system time changes
Debugger: Timing issues
Time: 0ms ────────────→ 15.6ms ────────────→ 31.2ms
Reading: T0 T1 T2
└─────All values here are T0─────┘
```
### Proposed: Stopwatch.GetTimestamp()
```
Resolution: ~1 microsecond (typical)
Precision: High
Stability: Not affected by system time changes
Debugger: More reliable
Time: 0ms → 0.001ms → 0.002ms → 0.003ms → ...
Reading: T0 T1 T2 T3 ...
Each reading is unique and monotonic
```
**Benefit**: With microsecond resolution, even 1000 rapid invocations get unique timestamps, eliminating the NudgeToUniqueKey workaround and race conditions.
---
## Test Scenarios
### Scenario 1: Fast Machine, No Debugger
```
Iteration time: 5-10ms
Invoke rate: 20-30/ms
Result: ✓ PASS (plenty of time margin)
```
### Scenario 2: Normal Machine, No Debugger
```
Iteration time: 10-20ms
Invoke rate: 10-20/ms
Result: ✓ PASS (adequate time margin)
```
### Scenario 3: ARM Machine, Debugger (@tig's ARM Windows)
```
Iteration time: 20-30ms
Invoke rate: 15-20/ms
ARM timer resolution: Better than x64
Result: ✓ PASS (ARM timer architecture handles it)
```
### Scenario 4: x64 Machine, Debugger (@BDisp's x64, @tig's x64 Windows) - CONFIRMED
```
Iteration time: 30-50ms
Invoke rate: 10-15/ms
DateTime.UtcNow resolution: 15-20ms (x64 TSC/HPET timer)
Result: ✗ FAIL (exceeds 100ms window)
CONFIRMED: @tig reproduced on x64 but NOT on ARM
```
---
## Recommendations
### Immediate Fix (Test-Level)
```csharp
// Increase tolerance for debugger scenarios
#if DEBUG
private const int POLL_MS = Debugger.IsAttached ? 500 : 100;
#else
private const int POLL_MS = 100;
#endif
```
### Long-Term Fix (Production Code)
```csharp
// In TimedEvents.cs, replace DateTime.UtcNow with Stopwatch
private static long GetTimestampTicks()
{
return Stopwatch.GetTimestamp() * (TimeSpan.TicksPerSecond / Stopwatch.Frequency);
}
// Use in AddTimeout:
long k = GetTimestampTicks() + time.Ticks;
```
This provides microsecond resolution and eliminates the race condition entirely.