Taming the Beast: Making uWSGI's Harakiri Less Murderous
Back in the ancient times of 2021, when we were still writing all of our code and doing all troubleshooting without AI assistance, I found myself wrestling with a particularly angry piece of infrastructure: uWSGI’s harakiri mechanism.
The Problem: When Harakiri Goes Too Far
Picture this: You’re running a Python web app, and suddenly one request decides to take a leisurely stroll through molasses. Maybe it’s waiting on a slow database query, or perhaps it’s gotten lost in an infinite loop of contemplation about the meaning of life (and JSON parsing).
Enter harakiri - uWSGI’s built-in executioner. When a request exceeds the harakiri timeout, uWSGI doesn’t politely ask the worker process to wrap up. No, it brings out the big guns: SIGKILL (i.e. kill -9). Instant death. No last words. No chance to flush those precious tracing spans to Jaeger, Datadog, or Sentry.
While this is a good way to reduce the infrastructure impact of slower requests, it was a nightmare for observability. We’d sometimes have bursts of harakiri happening without a clear indicator of why. Going after the root cause could take from a few hours to multiple days due to the lack of traces.
The Solution: Harakiri with Manners
My pull request #2311 introduced three new options to give harakiri a conscience:
- harakiri-graceful-timeout - Instead of immediate death, give the process a chance to catch its breath and shut down gracefully
- harakiri-graceful-signal - Choose which signal to send for that graceful shutdown (default: SIGTERM, but you can get fancy with SIGSYS or whatever floats your boat)
- harakiri-queue-threshold - Only trigger harakiri when the listen queue is actually backed up, preventing false alarms during brief spikes
The magic happens in a two-stage process:
- First harakiri attempt: Send your graceful signal, start the graceful timeout clock
- Second harakiri attempt (if needed): Bring out the SIGKILL for the stubborn processes that refuse to die peacefully
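To make the two stages concrete, here is a small pure-Python model of the decision the master makes each time it checks a worker. It is a simplified sketch, not uWSGI code: the Worker and Config classes and the check_harakiri function are names invented for illustration, and the real implementation tracks more state.

```python
import signal
from dataclasses import dataclass


@dataclass
class Worker:
    deadline: float = 0.0      # absolute time at which harakiri fires; 0 means nothing is being timed
    pending_harakiri: int = 0  # harakiri attempts already made on this worker


@dataclass
class Config:
    graceful_timeout: int = 0                      # models harakiri-graceful-timeout
    graceful_signal: int = signal.SIGTERM          # models harakiri-graceful-signal
    queue_threshold: int = 0                       # models harakiri-queue-threshold


def check_harakiri(worker, now, backlog, cfg):
    """Return the signal the modeled master would send right now, or None."""
    # Deadline not set, or not reached yet: nothing to do.
    if worker.deadline == 0 or worker.deadline > now:
        return None
    # First attempt only: skip while the listen queue is below the threshold.
    if worker.pending_harakiri == 0 and backlog < cfg.queue_threshold:
        return None
    first_attempt = worker.pending_harakiri == 0
    worker.pending_harakiri += 1
    if first_attempt and cfg.graceful_timeout > 0:
        # Stage 1: graceful signal, and push the deadline out by the grace period.
        worker.deadline += cfg.graceful_timeout
        return cfg.graceful_signal
    # Stage 2 (or no grace period configured): the hard kill.
    return signal.SIGKILL
```

With graceful_timeout=2, a worker whose deadline was t=10 gets the graceful signal at t=10 and SIGKILL only once t=12 is reached. With graceful_timeout=0 the model goes straight to SIGKILL, which matches the old behavior.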
Python Signal Handling Example
Here’s the test case I created to verify the functionality (tests/harakiri.py):
# ./uwsgi --master --http :8080 --harakiri 1 --wsgi-file tests/harakiri.py --harakiri-graceful-timeout 1 --py-call-osafterfork --lazy-apps --enable-threads --threads 2 --harakiri-graceful-signal 31

import time
import uwsgi
import signal
import sys
import atexit


def sig_handler(n, fp):
    print("[Python App] attempting graceful shutdown triggered by harakiri (signal %d)" % n)
    exit(1)


def application(e, s):
    print("[Python App] sleeping")
    time.sleep(3)
    s('200 OK', [('Content-Type', 'text/html')])
    return [b"OK"]


def exit_handler():
    time.sleep(3)
    # Should not reach this line (graceful harakiri deadline expired)
    print("[Python App] exiting now")


atexit.register(exit_handler)
signal.signal(signal.SIGSYS, sig_handler)

C Implementation Snippet
The core change lives in core/master_checks.c where I added the uwsgi_master_check_harakiri function:
int uwsgi_master_check_harakiri(int w, int c, int harakiri) {
    /**
     * Triggers a harakiri when the following conditions are met:
     * - harakiri timeout > current time
     * - listen queue pressure (ie backlog > harakiri_queue_threshold)
     *
     * The first harakiri attempt on a worker will be graceful if harakiri_graceful_timeout > 0,
     * then the worker has harakiri_graceful_timeout seconds to shutdown cleanly, otherwise
     * a second harakiri will trigger a SIGKILL
     */
#ifdef __linux__
    int backlog = uwsgi.shared->backlog;
#else
    int backlog = 0;
#endif
    if (harakiri == 0 || harakiri > (time_t) uwsgi.current_time) {
        return 0;
    }
    // no pending harakiri for the worker and no backlog pressure, safe to skip
    if (uwsgi.workers[w].pending_harakiri == 0 && backlog < uwsgi.harakiri_queue_threshold) {
        uwsgi_log_verbose("HARAKIRI: Skipping harakiri on worker %d. Listen queue is smaller than the threshold (%d < %d)\n",
            w, backlog, uwsgi.harakiri_queue_threshold);
        return 0;
    }
    trigger_harakiri(w);
    if (uwsgi.harakiri_graceful_timeout > 0) {
        uwsgi.workers[w].harakiri = harakiri + uwsgi.harakiri_graceful_timeout;
        uwsgi_log_verbose("HARAKIRI: graceful termination attempt on worker %d with signal %d. Next harakiri: %d\n",
            w, uwsgi.harakiri_graceful_signal, uwsgi.workers[w].harakiri);
    }
    return 1;
}

And the modified trigger_harakiri function in core/master_utils.c:
void trigger_harakiri(int i) {
    int j;
    uwsgi_log_verbose("*** HARAKIRI ON WORKER %d (pid: %d, try: %d, graceful: %s) ***\n", i,
        uwsgi.workers[i].pid,
        uwsgi.workers[i].pending_harakiri + 1,
        uwsgi.workers[i].pending_harakiri > 0 ? "no": "yes");
    if (uwsgi.harakiri_verbose) {
#ifdef __linux__
        int proc_file;
#endif
        char buf[512];
    }
    uwsgi_dump_worker(i, "HARAKIRI");
    if (uwsgi.workers[i].pending_harakiri == 0 && uwsgi.harakiri_graceful_timeout > 0) {
        kill(uwsgi.workers[i].pid, uwsgi.harakiri_graceful_signal);
    } else {
        kill(uwsgi.workers[i].pid, SIGKILL);
    }
    if (!uwsgi.workers[i].pending_harakiri)
        uwsgi.workers[i].harakiri_count++;
    uwsgi.workers[i].pending_harakiri++;
}

Testing
Reproducing the actual issue in production wasn’t easy because it happened intermittently, so I wrote a small Python script that simulates slow requests and let me confirm that graceful harakiri works.
With a short graceful timeout (1 second):
[Python App] attempting graceful shutdown triggered by harakiri (signal 31)
DAMN ! worker 1 (pid: 585212) died, killed by signal 9 :( trying respawn ...

With a longer graceful timeout (3+ seconds):
[Python App] attempting graceful shutdown triggered by harakiri (signal 31)
[Python App] exiting now
DAMN ! worker 1 (pid: 627092) died :( trying respawn ...

Notice the difference? In the second case, our Python app actually got to print “exiting now” before meeting its maker. That’s observability gold right there.
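The same two outcomes can be reproduced without uWSGI at all, which is handy for getting a feel for the timing. The sketch below is an assumption-laden stand-in, not the PR’s test: a parent process plays the master, a child plays the worker, and run_case and CHILD are names made up for this example. It uses SIGTERM as the graceful signal and needs a POSIX system.

```python
import signal
import subprocess
import sys
import textwrap

# The "worker": on SIGTERM it spends some time cleaning up, then exits with status 1.
CHILD = textwrap.dedent("""
    import signal, sys, time

    def on_term(signum, frame):
        time.sleep(float(sys.argv[1]))   # stand-in for flushing traces
        sys.exit(1)

    signal.signal(signal.SIGTERM, on_term)
    print("ready", flush=True)           # tell the parent the handler is installed
    time.sleep(60)                       # the "slow request"
""")


def run_case(cleanup_seconds, graceful_timeout):
    """Signal the child gracefully, wait, then SIGKILL it if it is still alive."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(cleanup_seconds)],
        stdout=subprocess.PIPE,
        text=True,
    )
    proc.stdout.readline()               # block until the child reports "ready"
    proc.send_signal(signal.SIGTERM)     # stage 1: graceful signal
    try:
        proc.wait(timeout=graceful_timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                      # stage 2: SIGKILL
        proc.wait()
    proc.stdout.close()
    return proc.returncode
```

A child that cleans up faster than the grace period exits on its own, so run_case returns 1. A child that needs longer is killed, and run_case returns -signal.SIGKILL, the subprocess module’s way of reporting death by signal 9.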
Why This Matters
Before this change, if you wanted any kind of clean shutdown during harakiri, you were out of luck. Now you can:
- Give your tracing libraries a chance to flush their buffers and submit data
- Close database connections properly (if you’re into that sort of thing)
- Generally not feel like you’re running a digital slaughterhouse
It’s completely backwards compatible. If you don’t use any of the new options, harakiri behaves exactly as it did before - immediate SIGKILL, no questions asked.
So next time your uWSGI workers are misbehaving, remember: you don’t have to choose between system stability and debugging sanity. With harakiri-graceful-timeout, you can have your cake and eat it too - just make sure the process gets a chance to wipe its mouth first.
PR originally created May 15, 2021 • Merged March 17, 2023