Hanging Exec in Multiprocessing

I'm trying to figure out why my subprocess execution sometimes hangs. I'm using the following code to execute a command with a privilege drop:

import multiprocessing
import os
import pwd
import subprocess


def safe_exec(q, uid, gid):
    try:
        # Drop privileges: clear supplementary groups, then set gid before uid.
        os.setgroups([])
        os.setregid(gid, gid)
        os.setreuid(uid, uid)

        print("dropped")
        res = subprocess.check_output(['nm', '-D', '/lib/x86_64-linux-gnu/libc.so.6'])
        print("executed")
        # Hand the command output (or the exception) back to the parent.
        q.put(res)
    except Exception as e:
        q.put(e)


if __name__ == "__main__":
    nobody = pwd.getpwnam('nobody')
    q = multiprocessing.Queue()
    p = multiprocessing.Process(
        target=safe_exec,
        args=(q, nobody.pw_uid, nobody.pw_gid))

    p.start()
    p.join(10)
    res = q.get(False)
    if isinstance(res, Exception):
        raise res
    else:
        print(res)

This doesn't happen with all commands, but I can reliably reproduce it with nm -D .../libc.so.6 on my machine. The issue is that the hang happens at the end of safe_exec: I can see both "dropped" and "executed" printed while the process is still hanging.

Strace shows the following (abbreviated). The child starts and initializes as expected:

48858 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f9aa3eab9d0) = 48859 
48859 setgroups(0, [])     = 0 
48859 setregid(65534, 65534)   = 0 
48859 setreuid(65534, 65534)   = 0 
48859 fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(136, 6), ...}) = 0 
48859 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9aa3ec2000 
48859 write(1, "dropped\n", 8)   = 8

The parent now starts trying to join:

48858 wait4(48859, <unfinished ...> 
... 
48858 wait4(48859, 0x7fffa52cc07c, WNOHANG, NULL) = 0 
48859 <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 48860

subprocess.check_output() finishes and "executed" is reported:

48859 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=48860, si_status=0, si_utime=0, si_stime=0} --- 
48859 write(1, "executed\n", 9)   = 9

The main process starts increasing the timeouts and just waits, cycling through select/wait4 until 10 seconds have passed:

48858 wait4(48859, 0x7fffa52cc07c, WNOHANG, NULL) = 0 
48858 select(0, NULL, NULL, NULL, {0, 16000}) = 0 (Timeout) 
48858 wait4(48859, 0x7fffa52cc07c, WNOHANG, NULL) = 0 
48858 select(0, NULL, NULL, NULL, {0, 32000}) = 0 (Timeout) 
48858 wait4(48859, 0x7fffa52cc07c, WNOHANG, NULL) = 0 
48858 select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) 
48858 wait4(48859, 0x7fffa52cc07c, WNOHANG, NULL) = 0 
48858 select(0, NULL, NULL, NULL, {0, 50000}) = 0 (Timeout) 
[ + many more ]
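The select timeouts in this trace (16000, 32000, then 50000 microseconds) look like a polling loop whose sleep doubles each round up to a cap. A minimal sketch of such a loop follows. The function name `wait_with_backoff`, the initial delay of 0.0005 s and the cap of 0.05 s are assumptions chosen to match the values in the trace, not a copy of the multiprocessing source.

```python
import time


def wait_with_backoff(poll, timeout, initial_delay=0.0005, max_delay=0.05):
    """Call poll() until it returns non-None or `timeout` seconds pass.

    Returns the last poll result and the list of delays that were slept,
    so the backoff pattern can be inspected.
    """
    deadline = time.time() + timeout
    delay = initial_delay
    delays = []
    while True:
        res = poll()
        if res is not None:
            break
        remaining = deadline - time.time()
        if remaining <= 0:
            break
        # Double the delay each round, but never sleep past the deadline
        # or beyond the cap.
        delay = min(delay * 2, remaining, max_delay)
        delays.append(delay)
        time.sleep(delay)
    return res, delays
```

With a poll function that never reports completion, the loop simply keeps sleeping at the capped interval until the deadline, which is the pattern the parent shows while the child is still alive.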

Until safe_exec finally ends, right after the main process gives up on .join(10):

48858 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=48859, si_status=0, si_utime=0, si_stime=0} --- 
48858 wait4(48859, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 48859

And that's it: a normal process exit. I can't see what's wrong in this case. Nothing seems to obviously fail.

Looking at the running processes with GDB, the parent is stuck in wait inside join:

#0 0x00007fde4486b6d3 in __select_nocancel() at ../sysdeps/unix/syscall-template.S:81 
#1 0x000000000062dbd6 in floatsleep (secs=0) at ../Modules/timemodule.c:948 
#2 0x000000000062c843 in time_sleep (self=0x0, args=(<float at remote 0x1e15620>,)) at ../Modules/timemodule.c:206 
#3 0x00000000004896f9 in PyCFunction_Call (func=<built-in function sleep>, arg=(<float at remote 0x1e15620>,), kw=0x0) at ../Objects/methodobject.c:81 
#4 0x000000000052ff77 in call_function (pp_stack=0x7fff39cdc9f0, oparg=1) at ../Python/ceval.c:4356 
#5 0x000000000052a7f7 in PyEval_EvalFrameEx (
    f=Frame 0x1f643d0, for file /usr/lib/python2.7/multiprocessing/forking.py, line 165, in wait (self=<Popen(returncode=None, pid=67389) at remote 0x7fde43a83ed0>, timeout=20, deadline=<float at remote 0x1e15558>, delay=<float at remote 0x1e15620>, res=None, remaining=<float at remote 0x1e154e0>), throwflag=0) at ../Python/ceval.c:2993

and the child is also waiting for the join to finish:

#0 0x00007f96798520c9 in futex_abstimed_wait (cancel=true, private=<optimised out>, abstime=0x0, expected=0, futex=0x1c109c0) at sem_waitcommon.c:42 
#1 do_futex_wait (sem=sem@entry=0x1c109c0, abstime=0x0) at sem_waitcommon.c:208 
#2 0x00007f9679852164 in __new_sem_wait_slow (sem=0x1c109c0, abstime=0x0) at sem_waitcommon.c:277 
#3 0x00007f967985220a in __new_sem_wait (sem=<optimised out>) at sem_wait.c:28 
#4 0x00000000005787f5 in PyThread_acquire_lock (lock=0x1c109c0, waitflag=1) at ../Python/thread_pthread.h:324 
#5 0x000000000062aa9f in lock_PyThread_acquire_lock (self=0x7f9679c21c70, args=()) at ../Modules/threadmodule.c:52 
#6 0x00000000004896f9 in PyCFunction_Call (func=<built-in method acquire of thread.lock object at remote 0x7f9679c21c70>, arg=(), kw=0x0) at ../Objects/methodobject.c:81 
#7 0x000000000052ff77 in call_function (pp_stack=0x7ffda823dce0, oparg=0) at ../Python/ceval.c:4356 
#8 0x000000000052a7f7 in PyEval_EvalFrameEx (
    f=Frame 0x1c107b0, for file /usr/lib/python2.7/threading.py, line 340, in wait (self=<_Condition(_Verbose__verbose=False, _Condition__lock=<thread.lock at remote 0x7f9679c21c20>, acquire=<built-in method acquire of thread.lock object at remote 0x7f9679c21c20>, _Condition__waiters=[<thread.lock at remote 0x7f9679c21c70>], release=<built-in method release of thread.lock object at remote 0x7f9679c21c20>) at remote 0x7f9676cec1b0>, timeout=None, waiter=<thread.lock at remote 0x7f9679c21c70>, saved_state=None), throwflag=0) 
    at ../Python/ceval.c:2993

Looking at the child with GDB, its Python stack is:

_Condition.wait 
Thread.join 
Queue._finalize_join 
Finalize.__call__

I'd appreciate any ideas as to why this process takes so long.

The code was run on Python 2.7 and 3.4.

Source

2016-04-08 viraptor

Looks like it's related to a 'Queue' object, either yours or another one that 'multiprocessing' uses in the background. E.g. does 'q.put(res)' get executed?

The issue was that small or empty results in the queue were handled correctly, but long results were not, and they were blocking the process join.
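The difference between small and long results can be reproduced without the privilege drop. The sketch below is only an illustration under assumptions: `produce` and `alive_after_join` are hypothetical names, the payload is a stand-in for the nm output, and the "fork" start method (Python 3.4+, POSIX) is assumed to match the Linux setup in the question.

```python
import multiprocessing

# Assumption: POSIX system with the "fork" start method available.
ctx = multiprocessing.get_context("fork")


def produce(q, size):
    # Hypothetical stand-in for safe_exec: put `size` bytes on the queue.
    q.put(b"x" * size)


def alive_after_join(size, join_timeout=2):
    """Join before reading the queue and report whether the child is still alive."""
    q = ctx.Queue()
    p = ctx.Process(target=produce, args=(q, size))
    p.start()
    p.join(join_timeout)         # same order as in the question: join first
    still_alive = p.is_alive()   # the child may still be flushing the queue
    res = q.get(timeout=10)      # draining the queue lets the child exit
    p.join(10)
    return still_alive, len(res)
```

On a typical Linux machine, a payload of a few bytes should let the child exit before the join times out, while a payload around one megabyte should leave it alive until the parent reads from the queue, because the child waits for its queue feeder thread to flush the data into the pipe.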

The issue was fixed by getting the result from the queue before joining.
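A minimal sketch of that ordering, assuming Python 3.4+ on a POSIX system with the "fork" start method. `produce` and `run_and_collect` are hypothetical names, with `produce` standing in for safe_exec.

```python
import multiprocessing

# Assumption: POSIX system with the "fork" start method available.
ctx = multiprocessing.get_context("fork")


def produce(q, size):
    # Hypothetical stand-in for safe_exec: put a large result on the queue.
    q.put(b"x" * size)


def run_and_collect(size, timeout=10):
    """Start the child, read its result first, and only then join it."""
    q = ctx.Queue()
    p = ctx.Process(target=produce, args=(q, size))
    p.start()
    # Read before joining: this empties the pipe, so the child's feeder
    # thread can finish and the child can exit. Raises queue.Empty if
    # nothing arrives within `timeout` seconds.
    res = q.get(timeout=timeout)
    p.join(timeout)
    return res, p.exitcode
```

In the original script this amounts to replacing q.get(False) after p.join(10) with a blocking q.get(timeout=...) placed before the join.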

Source

2016-04-10 04:16:33 viraptor
