
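Reading blog posts and studying on my own, I pick things up in scattered pieces, without clear knowledge modules, and I forget them easily. So I set up my own notes repository (one I maintain long-term; if you're interested, a star is a huuuge motivation for my writing~), where I file everything I learn by category, which also makes it easy to review when needed.

This post is for learning and note-taking only; the approach described is not original.

1. What is an ANR

ANR stands for Application Not Responding. Android's intent with ANRs is for the system to put timeouts on the components it interacts with and on user interactions, so it can tell whether an app process is frozen or responding too slowly. Put simply, it is the watchdog idea found in many systems.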

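2. Causes of ANRs

An ANR caused by a slow operation is not necessarily the app's fault; in fact, a large share of ANRs are caused by the system. Below is a brief look at which causes come from the app layer and which come from the system.

ANRs caused by the app layer:

Blocked functions: e.g. infinite loops, I/O on the main thread, processing large amounts of data
Lock problems: the main thread waits for a lock held by a worker thread
Memory pressure: the system caps the memory each app can use; prolonged memory pressure causes frequent memory swapping, which in turn makes some of the app's operations time out

ANRs caused by the system:

CPU preemption: for example, a game running in the foreground can starve your background broadcast of CPU
System services not responding in time: e.g. querying the system contacts. System services all run over Binder and have limited capacity, so a system service may fail to respond for a long time and cause an ANR
Other apps using large amounts of memory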

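3. Getting ANR logs offline

adb pull /data/anr/
adb bugreport

Drawbacks:

Only works offline: when a user reports a problem, you cannot get the ANR log
There may be no stack information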

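4. ANR scenarios

Service Timeout: e.g. a foreground service that does not finish executing within 20s; the background service timeout is 10 times that, 200s
BroadcastQueue Timeout: e.g. a foreground broadcast that does not finish within 10s; 60s for background broadcasts
ContentProvider Timeout: a content provider that takes more than 10s to publish
InputDispatching Timeout: input event dispatch exceeding 5s, covering both key and touch events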

//ActiveServices.java
// How long we wait for a service to finish executing.
static final int SERVICE_BACKGROUND_TIMEOUT = SERVICE_TIMEOUT * 10;
// How long the startForegroundService() grace period is to get around to
// calling startForeground() before we ANR + stop it.
static final int SERVICE_START_FOREGROUND_TIMEOUT = 10*1000;

//ActivityManagerService.java
// How long we allow a receiver to run before giving up on it.
static final int BROADCAST_FG_TIMEOUT = 10*1000;
static final int BROADCAST_BG_TIMEOUT = 60*1000;
// How long we wait until we timeout on key dispatching.
static final int KEY_DISPATCHING_TIMEOUT = 5*1000;
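For quick reference, the timeouts listed above can be gathered into a small Java enum. This is only an illustrative sketch: AnrScenario and its wouldTimeOut helper are hypothetical names, and the values are the ones quoted in the list and constants above, which may differ across Android versions and vendor builds.

```java
// Timeouts from the list above, in milliseconds. AnrScenario is a hypothetical name.
enum AnrScenario {
    SERVICE_FOREGROUND(20_000),        // SERVICE_TIMEOUT
    SERVICE_BACKGROUND(20_000 * 10),   // SERVICE_BACKGROUND_TIMEOUT = SERVICE_TIMEOUT * 10
    BROADCAST_FOREGROUND(10_000),      // BROADCAST_FG_TIMEOUT
    BROADCAST_BACKGROUND(60_000),      // BROADCAST_BG_TIMEOUT
    PROVIDER_PUBLISH(10_000),          // ContentProvider publish timeout
    INPUT_DISPATCHING(5_000);          // KEY_DISPATCHING_TIMEOUT

    final long timeoutMs;

    AnrScenario(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    // Rough check: would a task that ran for elapsedMs have outlived this scenario's timer?
    boolean wouldTimeOut(long elapsedMs) {
        return elapsedMs > timeoutMs;
    }
}
```

For example, AnrScenario.SERVICE_BACKGROUND.timeoutMs is 200000 ms, the 200s quoted above.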

5. How ANRs are triggered

ANR triggering falls roughly into 2 kinds: ANRs triggered by a Service, Broadcast, or Provider, and ANRs triggered by Input.

5.1 ANRs triggered by Service, Broadcast, Provider

The flow breaks down into 3 steps:

Plant the time bomb
Defuse the bomb
Detonate the bomb

Let's walk through these 3 steps in detail with startService as the example:

1. Planting the time bomb

After an Activity calls startService, the call chain is: ContextImpl.startService()->ContextImpl.startServiceCommon()->ActivityManagerService.startService()->ActiveServices.startServiceLocked()->ActiveServices.startServiceInnerLocked()->ActiveServices.bringUpServiceLocked()->ActiveServices.realStartServiceLocked()

//com.android.server.am.ActiveServices.java
private final void realStartServiceLocked(ServiceRecord r,
        ProcessRecord app, boolean execInFg) throws RemoteException {
    ......
    //Post a delayed message to AMS's Handler
    bumpServiceExecutingLocked(r, execInFg, "create");

    ......
    try {
        //IPC to the app process to start the Service (runs handleCreateService)
        app.thread.scheduleCreateService(r, r.serviceInfo,
                mAm.compatibilityInfoForPackage(r.serviceInfo.applicationInfo),
                app.getReportedProcState());
    } catch (DeadObjectException e) {
    } finally {
    }
}

private final void bumpServiceExecutingLocked(ServiceRecord r, boolean fg, String why) {
    scheduleServiceTimeoutLocked(r.app);
    .....
}

final ActivityManagerService mAm;

// How long we wait for a service to finish executing.
static final int SERVICE_TIMEOUT = 20*1000;

// How long we wait for a service to finish executing.
static final int SERVICE_BACKGROUND_TIMEOUT = SERVICE_TIMEOUT * 10;

void scheduleServiceTimeoutLocked(ProcessRecord proc) {
    //mAm is AMS; mHandler is a Handler inside AMS
    Message msg = mAm.mHandler.obtainMessage(
            ActivityManagerService.SERVICE_TIMEOUT_MSG);
    msg.obj = proc;
    //Post a delayed message to that Handler inside AMS
    mAm.mHandler.sendMessageDelayed(msg,
            proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
}

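In the startService flow, before the app process is told to start the Service, a bomb is planted: a delayed message is sent to AMS's mHandler. When that Handler receives SERVICE_TIMEOUT_MSG, AMS considers the Service to have timed out and triggers an ANR. In other words, if nobody defuses the bomb within the time limit, it goes off.

2. Defusing the bomb

Once AMS's checks pass, the app side can start the Service, which brings us to ApplicationThread's scheduleCreateService method. That method runs on a binder thread, so it has to switch to the main thread to do the work, i.e. ActivityThread's handleCreateService method: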

//android.app.ActivityThread.java
@UnsupportedAppUsage
private void handleCreateService(CreateServiceData data) {
    ......
    Service service = null;
    try {
        //1. Instantiate the Service
        ContextImpl context = ContextImpl.createAppContext(this, packageInfo);
        Application app = packageInfo.makeApplication(false, mInstrumentation);
        java.lang.ClassLoader cl = packageInfo.getClassLoader();
        service = packageInfo.getAppFactory()
                .instantiateService(cl, data.info.name, data.intent);
        ......
        service.attach(context, this, data.info.name, data.token, app,
                ActivityManager.getService());
        //2. Run the Service's onCreate; startup is complete
        service.onCreate();
        mServices.put(data.token, service);
        try {
            //3. The Service has started; notify AMS
            ActivityManager.getService().serviceDoneExecuting(
                    data.token, SERVICE_DONE_EXECUTING_ANON, 0, 0);
        } catch (RemoteException e) {
        }
    } catch (Exception e) {
    }
}

After the app process has started the Service, it tells AMS over IPC that startup is complete: AMS.serviceDoneExecuting()->ActiveServices.serviceDoneExecutingLocked()

private void serviceDoneExecutingLocked(ServiceRecord r, boolean inDestroying,
        boolean finishing) {
    ......
    mAm.mHandler.removeMessages(ActivityManagerService.SERVICE_TIMEOUT_MSG, r.app);
    ......
}

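This is straightforward: it removes the SERVICE_TIMEOUT_MSG message that was posted earlier with a delay, i.e. it defuses the bomb. As long as the bomb is defused within the time limit, all is well; if not, it goes off and an ANR is triggered.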
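The plant/defuse pair can be modelled without Android at all. The sketch below is a minimal stand-in that assumes a virtual clock instead of a real Looper: BombHandler, advanceBy and explosions are hypothetical names, not the android.os.Handler API, the SERVICE_TIMEOUT_MSG value is arbitrary, and removeMessages here matches only on what (the real call also matches the ProcessRecord object). sendMessageDelayed plays the role of scheduleServiceTimeoutLocked and removeMessages the role of serviceDoneExecutingLocked.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Minimal virtual-time stand-in for AMS's mHandler. Not the android.os.Handler API.
final class BombHandler {
    static final int SERVICE_TIMEOUT_MSG = 1; // arbitrary value for this sketch

    private static final class Msg {
        final int what;
        final long when; // virtual time at which the message is due
        Msg(int what, long when) {
            this.what = what;
            this.when = when;
        }
    }

    private final List<Msg> queue = new ArrayList<>();
    private long now = 0;
    private int explosions = 0;

    // Step 1, plant: like mAm.mHandler.sendMessageDelayed(msg, timeout)
    void sendMessageDelayed(int what, long delayMs) {
        queue.add(new Msg(what, now + delayMs));
    }

    // Step 2, defuse: like mAm.mHandler.removeMessages(SERVICE_TIMEOUT_MSG, r.app)
    void removeMessages(int what) {
        queue.removeIf(m -> m.what == what);
    }

    // Step 3, detonate: advance the clock and deliver every message that has come due
    void advanceBy(long ms) {
        now += ms;
        Iterator<Msg> it = queue.iterator();
        while (it.hasNext()) {
            Msg m = it.next();
            if (m.when <= now) {
                it.remove();
                if (m.what == SERVICE_TIMEOUT_MSG) {
                    explosions++; // AMS would now call ActiveServices.serviceTimeout()
                }
            }
        }
    }

    int explosions() {
        return explosions;
    }
}
```

Because the bomb is just a delayed message, the clock starts when it is planted, not when onCreate starts running; anything that delays the defuse (a busy main thread, a slow binder call back to AMS) counts against the same budget.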

3. Detonating the bomb

Earlier, a delayed message was sent to AMS's handler with mAm.mHandler.sendMessageDelayed(msg, proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);. Let's look at how that message is handled:

//com.android.server.am.ActivityManagerService.java

final MainHandler mHandler;

final class MainHandler extends Handler {
    @Override
    public void handleMessage(Message msg) {
        switch (msg.what) {
        ......
        case SERVICE_TIMEOUT_MSG: {
            //mServices here is ActiveServices
            mServices.serviceTimeout((ProcessRecord)msg.obj);
        } break;
        }
        ......
    }
    ......
}

//com.android.server.am.ActiveServices.java
void serviceTimeout(ProcessRecord proc) {
    String anrMessage = null;
    synchronized(mAm) {
        //Check whether any service has timed out
        final long now = SystemClock.uptimeMillis();
        final long maxTime =  now -
                (proc.execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
        ServiceRecord timeout = null;
        for (int i=proc.executingServices.size()-1; i>=0; i--) {
            ServiceRecord sr = proc.executingServices.valueAt(i);
            if (sr.executingStart < maxTime) {
                timeout = sr;
                break;
            }
        }
        if (timeout != null && mAm.mProcessList.mLruProcesses.contains(proc)) {
            anrMessage = "executing service " + timeout.shortInstanceName;
        }
    }

    if (anrMessage != null) {
        //A Service has timed out; mAm is AMS, mAnrHelper is the AnrHelper
        mAm.mAnrHelper.appNotResponding(proc, anrMessage);
    }
}

If AMS receives SERVICE_TIMEOUT_MSG, the time is up and nobody defused the bomb. AMS then has ActiveServices confirm whether any Service has actually timed out, and if so, uses AnrHelper to trigger the ANR.
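The re-check matters because the timeout message only carries the ProcessRecord, not a specific service. Below is a plain-Java sketch of the loop in serviceTimeout(), assuming a hypothetical ExecutingService class in place of ServiceRecord and leaving out the mLruProcesses check:

```java
import java.util.List;

final class ServiceTimeoutCheck {
    static final long SERVICE_TIMEOUT = 20_000;
    static final long SERVICE_BACKGROUND_TIMEOUT = SERVICE_TIMEOUT * 10;

    // Hypothetical stand-in for ServiceRecord: just a name and executingStart.
    static final class ExecutingService {
        final String name;
        final long executingStart;

        ExecutingService(String name, long executingStart) {
            this.name = name;
            this.executingStart = executingStart;
        }
    }

    // Mirrors serviceTimeout(): scanning from the end, return the ANR annotation for the first
    // service that started executing before now - timeout, or null if none has timed out.
    static String findTimedOut(List<ExecutingService> executing, boolean execServicesFg, long now) {
        long maxTime = now - (execServicesFg ? SERVICE_TIMEOUT : SERVICE_BACKGROUND_TIMEOUT);
        for (int i = executing.size() - 1; i >= 0; i--) {
            ExecutingService sr = executing.get(i);
            if (sr.executingStart < maxTime) {
                return "executing service " + sr.name;
            }
        }
        return null;
    }
}
```

A service that already reported serviceDoneExecuting is no longer in the executing list, so a message that fires late for it finds nothing and no ANR is raised.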

void appNotResponding(ProcessRecord anrProcess, String activityShortComponentName,
        ApplicationInfo aInfo, String parentShortComponentName,
        WindowProcessController parentProcess, boolean aboveSystem, String annotation) {
    //Add an AnrRecord to the list
    synchronized (mAnrRecords) {
        mAnrRecords.add(new AnrRecord(anrProcess, activityShortComponentName, aInfo,
                parentShortComponentName, parentProcess, aboveSystem, annotation));
    }
    startAnrConsumerIfNeeded();
}
private void startAnrConsumerIfNeeded() {
    if (mRunning.compareAndSet(false, true)) {
        //Start a worker thread to process the records
        new AnrConsumerThread().start();
    }
}

private class AnrConsumerThread extends Thread {
    @Override
    public void run() {
        AnrRecord r;
        while ((r = next()) != null) {
            ......
            //r here is the AnrRecord
            r.appNotResponding(onlyDumpSelf);
            ......
        }
    }
}
private static class AnrRecord {
    void appNotResponding(boolean onlyDumpSelf) {
        //mApp is the ProcessRecord
        mApp.appNotResponding(mActivityShortComponentName, mAppInfo,
                mParentShortComponentName, mParentProcess, mAboveSystem, mAnnotation,
                onlyDumpSelf);
    }
}

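AnrHelper starts a worker thread, which calls ProcessRecord's appNotResponding method to run the ANR handling (showing the "app isn't responding" dialog, dumping stacks, and so on); the details are covered below. At this point the bomb has fully detonated and the ANR has been triggered.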
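The AnrHelper part boils down to a queue with at most one consumer. Here is a sketch of that pattern under stated assumptions: records are plain strings, an Executor stands in for starting a new AnrConsumerThread, and AnrQueueSketch is a hypothetical name. It also shows why the running flag is cleared under the same lock as the queue: otherwise a record added just as the consumer exits could be left behind.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Simplified model of AnrHelper's single-consumer queue; names are hypothetical.
final class AnrQueueSketch {
    private final Deque<String> records = new ArrayDeque<>();
    private final AtomicBoolean running = new AtomicBoolean(false);
    private final Executor executor;       // AnrHelper starts a new AnrConsumerThread instead
    private final Consumer<String> onAnr;  // stands in for AnrRecord.appNotResponding()

    AnrQueueSketch(Executor executor, Consumer<String> onAnr) {
        this.executor = executor;
        this.onAnr = onAnr;
    }

    void appNotResponding(String record) {
        synchronized (records) {
            records.addLast(record);
        }
        // Only the caller that flips running from false to true starts a consumer.
        if (running.compareAndSet(false, true)) {
            executor.execute(this::consumeAll);
        }
    }

    private String next() {
        synchronized (records) {
            String r = records.pollFirst();
            if (r == null) {
                running.set(false); // drained: the next appNotResponding() starts a new consumer
            }
            return r;
        }
    }

    private void consumeAll() {
        String r;
        while ((r = next()) != null) {
            onAnr.accept(r);
        }
    }

    boolean isRunning() {
        return running.get();
    }
}
```

Passing Runnable::run as the executor runs the consumer inline, which makes the behaviour easy to test; in system_server a separate thread is used, presumably so the caller is not held up by the slow dump.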

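5.2 ANRs triggered by Input

Input timeout detection is completely different from Service, Broadcast, and Provider: the bomb does not necessarily go off when time runs out. Instead, the check happens only while later input events are being processed, so it is more like minesweeping.

Why is the input timeout mechanism minesweeping rather than a time bomb? For input, even if one event takes longer than the timeout, no ANR is triggered as long as the user generates no further input events. "Minesweeping" here means: while the input system is handling a slow event, every subsequent input event checks whether the previous in-flight event has timed out (entering the minesweeping state), i.e. whether the time since the previous event was dispatched exceeds the timeout. If it does not, the ANR timeout is reset and nothing goes off.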
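The minesweeping behaviour can be captured in a toy model. This is a simplification of the description above, not InputDispatcher's real code: InputAnrModel and its methods are hypothetical, and the 5s value is KEY_DISPATCHING_TIMEOUT from section 4. The key property is that nothing is checked until a new event arrives.

```java
// Toy model of the "minesweeping" input ANR check described above; not InputDispatcher code.
final class InputAnrModel {
    static final long INPUT_TIMEOUT_MS = 5_000; // KEY_DISPATCHING_TIMEOUT
    private long pendingSince = -1;              // dispatch time of the unfinished event, -1 if none

    // The app finished handling the in-flight event.
    void onEventFinished() {
        pendingSince = -1;
    }

    // A new input event arrives at `now`. Returns true if the previous event is still
    // unfinished and has been pending longer than the timeout (the "mine" goes off).
    boolean onNewEvent(long now) {
        if (pendingSince >= 0) {
            return now - pendingSince > INPUT_TIMEOUT_MS; // new event waits behind the slow one
        }
        pendingSince = now; // nothing in flight: dispatch this event and start tracking it
        return false;
    }
}
```

If the app stays blocked on the first event but the user never touches the screen again, onNewEvent is never called and no ANR is reported, however long the block lasts.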

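5.3 Which paths can cause an ANR?

Slowness in any one or more of the paths between planting and defusing the bomb can cause an ANR. Taking Service as an example:

The Service's lifecycle callbacks run slowly
Other slow messages in the main thread's message queue keep the Service callbacks from running
SharedPreferences (sp) operations are slow
The binder threads of the system_server process are busy, so the defuse instruction is not received in time

5.4 Main ANR dump flow

The ANR flow is carried out almost entirely in the system_server process, whose behaviour is hard for us to observe. To monitor ANRs, we have to start from the boundary where the system process and the app process communicate, and see whether there is anything we can hook into there.

However an ANR happens, it always ends up in appNotResponding; for example, the input-timeout path: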

ActivityManagerService#inputDispatchingTimedOut
AnrHelper#appNotResponding
AnrConsumerThread#run
AnrRecord#appNotResponding
ProcessRecord#appNotResponding

So let's go straight to the appNotResponding method:

//com.android.server.am.ProcessRecord.java
void appNotResponding(String activityShortComponentName, ApplicationInfo aInfo,
        String parentShortComponentName, WindowProcessController parentProcess,
        boolean aboveSystem, String annotation, boolean onlyDumpSelf) {
    ArrayList<Integer> firstPids = new ArrayList<>(5);
    SparseArray<Boolean> lastPids = new SparseArray<>(20);

    mWindowProcessController.appEarlyNotResponding(annotation, () -> kill("anr",
                ApplicationExitInfo.REASON_ANR, true));

    long anrTime = SystemClock.uptimeMillis();
    if (isMonitorCpuUsage()) {
        mService.updateCpuStatsNow();
    }

    final boolean isSilentAnr;
    synchronized (mService) {
        //Note 1
        // PowerManager.reboot() can block for a long time, so ignore ANRs while shutting down.
        //Shutting down / rebooting
        if (mService.mAtmInternal.isShuttingDown()) {
            Slog.i(TAG, "During shutdown skipping ANR: " + this + " " + annotation);
            return;
        } else if (isNotResponding()) {
            //Already in the ANR flow
            Slog.i(TAG, "Skipping duplicate ANR: " + this + " " + annotation);
            return;
        } else if (isCrashing()) {
            //Currently crashing
            Slog.i(TAG, "Crashing app skipping ANR: " + this + " " + annotation);
            return;
        } else if (killedByAm) {
            //App already killed by AM
            Slog.i(TAG, "App already killed by AM skipping ANR: " + this + " " + annotation);
            return;
        } else if (killed) {
            //App has already died
            Slog.i(TAG, "Skipping died app ANR: " + this + " " + annotation);
            return;
        }

        // In case we come through here for the same app before completing
        // this one, mark as anring now so we will bail out.
        //Mark the process as not responding
        setNotResponding(true);

        // Log the ANR to the event log.
        EventLog.writeEvent(EventLogTags.AM_ANR, userId, pid, processName, info.flags,
                annotation);

        // Dump thread traces as quickly as we can, starting with "interesting" processes.
        firstPids.add(pid);

        // Don't dump other PIDs if it's a background ANR or is requested to only dump self.
        //Note 2
        //Silent ANR: here it means a background ANR
        isSilentAnr = isSilentAnr();
        if (!isSilentAnr && !onlyDumpSelf) {
            int parentPid = pid;
            if (parentProcess != null && parentProcess.getPid() > 0) {
                parentPid = parentProcess.getPid();
            }
            if (parentPid != pid) firstPids.add(parentPid);

            if (MY_PID != pid && MY_PID != parentPid) firstPids.add(MY_PID);
                        
            //Note 3: select the processes to dump
            for (int i = getLruProcessList().size() - 1; i >= 0; i--) {
                ProcessRecord r = getLruProcessList().get(i);
                if (r != null && r.thread != null) {
                    int myPid = r.pid;
                    if (myPid > 0 && myPid != pid && myPid != parentPid && myPid != MY_PID) {
                        if (r.isPersistent()) {
                            firstPids.add(myPid);
                            if (DEBUG_ANR) Slog.i(TAG, "Adding persistent proc: " + r);
                        } else if (r.treatLikeActivity) {
                            firstPids.add(myPid);
                            if (DEBUG_ANR) Slog.i(TAG, "Adding likely IME: " + r);
                        } else {
                            lastPids.put(myPid, Boolean.TRUE);
                            if (DEBUG_ANR) Slog.i(TAG, "Adding ANR proc: " + r);
                        }
                    }
                }
            }
        }
    }

    ......

    int[] pids = nativeProcs == null ? null : Process.getPidsForCommands(nativeProcs);
    ArrayList<Integer> nativePids = null;

    if (pids != null) {
        nativePids = new ArrayList<>(pids.length);
        for (int i : pids) {
            nativePids.add(i);
        }
    }

    // For background ANRs, don't pass the ProcessCpuTracker to
    // avoid spending 1/2 second collecting stats to rank lastPids.
    StringWriter tracesFileException = new StringWriter();
    // To hold the start and end offset to the ANR trace file respectively.
    final long[] offsets = new long[2];
    //Note 4
    File tracesFile = ActivityManagerService.dumpStackTraces(firstPids,
            isSilentAnr ? null : processCpuTracker, isSilentAnr ? null : lastPids,
            nativePids, tracesFileException, offsets);
        ......
}

The code is long, so let's go through it step by step.

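At Note 1, a few special cases are handled first: shutting down, already in the ANR flow, crashing, already killed by AM, or already dead. In these cases there is no ANR to process and the method returns immediately.

At Note 2, isSilentAnr indicates whether this is a background ANR. Background and foreground ANRs behave differently: a foreground ANR shows the "app isn't responding" dialog, while a background ANR kills the process outright. A foreground ANR is one where the user can perceive the process in which it occurred; anything else is a background ANR.

At Note 3, the processes to dump are selected. When an ANR occurs, a lot of information is dumped into the trace file to help locate the problem. The trace file contains traces of the processes related to the ANR, because the ANR may be caused by another process hogging resources, or by an IPC call into another process getting stuck. The processes to dump fall into 3 categories:

firstPids: the important processes to dump first. The ANR process must be dumped no matter what, and it is dumped first, so it is the first one added to firstPids. For a SilentAnr (background ANR), no other processes are added. Otherwise, more processes are added: if the ANR process is not system_server, system_server is added; then AMS's LRU process list is scanned, and any recently used process that is persistent or carries the *BIND_TREAT_LIKE_ACTIVITY* flag is added to firstPids.
extraPids: all other processes in the LRU list are first put into lastPids; the processes in lastPids with the highest recent CPU usage are then selected to form extraPids;
nativePids: the simplest category, a fixed set of native system processes defined in Watchdog.java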
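A simplified model of how the first two categories are formed, following the appNotResponding() code above. Proc is a hypothetical stand-in for ProcessRecord carrying only the fields the selection reads, silentAnr also covers the onlyDumpSelf case, and the CPU-based narrowing of lastPids into extraPids is left out:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of the firstPids/lastPids split in ProcessRecord.appNotResponding().
final class DumpPidSelection {
    // Hypothetical stand-in for ProcessRecord with only the fields the selection reads.
    static final class Proc {
        final int pid;
        final boolean persistent;
        final boolean treatLikeActivity;

        Proc(int pid, boolean persistent, boolean treatLikeActivity) {
            this.pid = pid;
            this.persistent = persistent;
            this.treatLikeActivity = treatLikeActivity;
        }
    }

    final List<Integer> firstPids = new ArrayList<>();
    final List<Integer> lastPids = new ArrayList<>();

    // parentPid == anrPid when there is no parent; lru is ordered least-recently-used first, as in AMS.
    DumpPidSelection(int anrPid, int parentPid, int systemServerPid, boolean silentAnr, List<Proc> lru) {
        firstPids.add(anrPid); // the ANR process is always dumped, and first
        if (silentAnr) {
            return; // background ANR (or onlyDumpSelf): dump only the ANR process
        }
        if (parentPid != anrPid) firstPids.add(parentPid);
        if (systemServerPid != anrPid && systemServerPid != parentPid) firstPids.add(systemServerPid);
        for (int i = lru.size() - 1; i >= 0; i--) { // walk from the most recently used end
            Proc r = lru.get(i);
            if (r.pid <= 0 || r.pid == anrPid || r.pid == parentPid || r.pid == systemServerPid) {
                continue;
            }
            if (r.persistent || r.treatLikeActivity) {
                firstPids.add(r.pid);
            } else {
                lastPids.add(r.pid); // later narrowed to extraPids by recent CPU usage
            }
        }
    }
}
```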

At Note 4, with the pids of all processes to dump in hand, AMS dumps their stacks in the order firstPids, nativePids, extraPids. This part is important, so let's step into it and see what it does.

public static Pair<Long, Long> dumpStackTraces(String tracesFile, ArrayList<Integer> firstPids,
        ArrayList<Integer> nativePids, ArrayList<Integer> extraPids) {

    // Dump for at most 20 seconds in total
    long remainingTime = 20 * 1000;

    // First collect all of the stacks of the most important pids.
    if (firstPids != null) {
        int num = firstPids.size();
        for (int i = 0; i < num; i++) {
            final int pid = firstPids.get(i);
            final long timeTaken = dumpJavaTracesTombstoned(pid, tracesFile, remainingTime);
            remainingTime -= timeTaken;
            if (remainingTime <= 0) {
                Slog.e(TAG, "Aborting stack trace dump (current firstPid=" + pid
                        + "); deadline exceeded.");
                return firstPidStart >= 0 ? new Pair<>(firstPidStart, firstPidEnd) : null;
            }
        }
    }
    ......
}

It takes the pids passed in (firstPids, nativePids, extraPids) in order and dumps every thread of each process. This is a very heavy operation: a single process has many threads, let alone this many processes. So a maximum dump time of 20 seconds is imposed, and the method returns as soon as it is exceeded, which ensures the ANR dialog appears (or the process is killed) promptly. Next, let's follow dumpJavaTracesTombstoned. After a chain of calls: ActivityManagerService#dumpJavaTracesTombstoned() → Debug#dumpJavaBacktraceToFileTimeout() → android_os_Debug#android_os_Debug_dumpJavaBacktraceToFileTimeout() → android_os_Debug#dumpTraces() → debuggerd_client#dump_backtrace_to_file_timeout() → debuggerd_client#debuggerd_trigger_dump().
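The 20-second budget is shared by all pids in order, which is why processes late in the list may not be dumped at all. A minimal sketch of that budget loop, assuming a hypothetical dumpOne callback that returns how long a single dump took (the real code also passes the remaining time to each dump as its timeout, which is omitted here):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToLongFunction;

final class DumpBudget {
    static final long DUMP_BUDGET_MS = 20_000; // "dump for at most 20 seconds"

    // Dumps pids in the given order (firstPids, then nativePids, then extraPids) until the shared
    // budget is used up. dumpOne performs one dump and returns how long it took, in ms.
    static List<Integer> dumpWithBudget(List<Integer> orderedPids, IntToLongFunction dumpOne) {
        List<Integer> dumped = new ArrayList<>();
        long remaining = DUMP_BUDGET_MS;
        for (int pid : orderedPids) {
            remaining -= dumpOne.applyAsLong(pid);
            dumped.add(pid);
            if (remaining <= 0) {
                break; // deadline exceeded: stop so the ANR dialog or kill is not delayed further
            }
        }
        return dumped;
    }
}
```

With four processes that each take 8s, the budget runs out after the third, so the fourth is never dumped.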

bool debuggerd_trigger_dump(pid_t tid, DebuggerdDumpType dump_type, unsigned int timeout_ms, unique_fd output_fd) {
    //pid comes from AMS: the process whose stacks should be dumped
    pid_t pid = tid;
    //......

    // Send the signal.
    //Coming from android_os_Debug_dumpJavaBacktraceToFileTimeout, dump_type is kDebuggerdJavaBacktrace
    const int signal = (dump_type == kDebuggerdJavaBacktrace) ? SIGQUIT : BIONIC_SIGNAL_DEBUGGER;
    sigval val = {.sival_int = (dump_type == kDebuggerdNativeBacktrace) ? 1 : 0};
    //sigqueue: queue a signal plus data to the given process; returns 0 on success
    if (sigqueue(pid, signal, val) != 0) {
        log_error(output_fd, errno, "failed to send signal to pid %d", pid);
        return false;
    }
    //......
    LOG(INFO) << TAG "done dumping process " << pid;
    return true;
}

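Note that this effectively means AMS indirectly sends a SIGQUIT signal to the process whose stacks need to be dumped, and that process starts dumping once it receives SIGQUIT. This is the boundary mentioned earlier. So it appears that when a process hits an ANR, it receives a SIGQUIT signal. If we can monitor the SIGQUIT signal sent by the system, we may be able to tell that an ANR has occurred, which is exactly what monitoring needs.

A quick note on how processes handle signals: every process except Zygote creates a SignalCatcher daemon thread that catches the SIGQUIT and SIGUSR1 signals and responds accordingly.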

//art/runtime/signal_catcher.cc
void* SignalCatcher::Run(void* arg) {
  SignalCatcher* signal_catcher = reinterpret_cast<SignalCatcher*>(arg);
  CHECK(signal_catcher != nullptr);
  Runtime* runtime = Runtime::Current();
  //Attach the current thread to the Android Runtime (the CHECK fails if attaching does not succeed)
  CHECK(runtime->AttachCurrentThread("Signal Catcher", true, runtime->GetSystemThreadGroup(), !runtime->IsAotCompiler()));

  Thread* self = Thread::Current();
  DCHECK_NE(self->GetState(), kRunnable);
  {
    MutexLock mu(self, signal_catcher->lock_);
    signal_catcher->thread_ = self;
    signal_catcher->cond_.Broadcast(self);
  }

  SignalSet signals;
  signals.Add(SIGQUIT); //handle SIGQUIT
  signals.Add(SIGUSR1); //handle SIGUSR1

  //Loop forever, waiting for either of the 2 signals to arrive
  while (true) {
    //Wait for a signal; this call blocks
    int signal_number = signal_catcher->WaitForSignal(self, signals);
    //If signal catching should stop, detach this thread from the Android Runtime
    if (signal_catcher->ShouldHalt()) {
      runtime->DetachCurrentThread();
      return nullptr;
    }
    switch (signal_number) {
    case SIGQUIT:
      signal_catcher->HandleSigQuit(); //dump thread traces
      break;
    case SIGUSR1:
      signal_catcher->HandleSigUsr1(); //force a GC
      break;
    default:
      LOG(ERROR) << "Unexpected signal %d" << signal_number;
      break;
    }
  }
}

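In the SignalCatcher thread, an infinite loop uses WaitForSignal to listen for SIGQUIT and SIGUSR1. This is where the SIGQUIT sent earlier by the system_server process is received, after which the stack dump begins.

Let's now summarize the whole ANR flow:

After the system detects an ANR in an app, it collects the pids of related processes (including the ANR process) and prepares to have them dump their stacks, producing the ANR trace file
The system sends SIGQUIT to these processes; each process starts dumping its stacks once it receives SIGQUIT

Diagram of the whole process: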

[Figure: ANR flow diagram]

(Image from the WeChat client technical team)

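As the diagram shows, of everything that happens after a process hits an ANR, only the stack dump takes place in that process; everything else is handled in the system process, where we cannot observe it. The in-process part starts when SIGQUIT is received and ends when the trace has been written over the socket; control then returns to the system process to finish the rest of the ANR flow. These 2 boundaries are where we can intervene, as described in detail later.

6. ANR monitoring

Since Android M (6.0), apps can no longer detect ANRs directly by watching the data/anr/trace file. The usable approaches I know of are mainly these 2:

6.1 WatchDog

Start a worker thread that keeps posting messages to the main thread with a timeout check; if a message still hasn't run when the timeout expires, an ANR may have occurred. Then query the system service for more data (the process's ANR info is available through ActivityManager.getProcessesInErrorState(), which is served by ActivityManagerService) to decide whether an ANR really occurred.

The open-source library for this approach is ANR-WatchDog. Its source is simple, only 2 source files. A quick walkthrough of the core code: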


private final Handler _uiHandler = new Handler(Looper.getMainLooper());
private final int _timeoutInterval;
private volatile long _tick = 0;
private volatile boolean _reported = false;

private final Runnable _ticker = new Runnable() {
    @Override public void run() {
        _tick = 0;
        _reported = false;
    }
};

@Override
public void run() {
    setName("|ANR-WatchDog|");

    //_timeoutInterval is the configured timeout
    long interval = _timeoutInterval;
    while (!isInterrupted()) {
        //_tick is a flag: once the main thread runs the _ticker Runnable posted below, _tick is reset to 0
        boolean needPost = _tick == 0;
        //Make the flag non-zero here on the worker thread, so we can tell later whether the main thread ran _ticker
        _tick += interval;
        if (needPost) {
            //Post a message to the main thread
            _uiHandler.post(_ticker);
        }

        //Sleep for a while; if _tick has not been reset to 0 on waking, the main thread was too busy or stuck to run the message
        try {
            Thread.sleep(interval);
        } catch (InterruptedException e) {
            _interruptionListener.onInterrupted(e);
            return ;
        }

        // If the main thread has not handled _ticker, it is blocked. ANR.
        if (_tick != 0 && !_reported) {
            //noinspection ConstantConditions
            //Ignore the case where a debugger is attached
            if (!_ignoreDebugger && (Debug.isDebuggerConnected() || Debug.waitingForDebugger())) {
                Log.w("ANRWatchdog", "An ANR was detected but ignored because the debugger is connected (you can prevent this with setIgnoreDebugger(true))");
                _reported = true;
                continue ;
            }

            //A custom Interceptor can tell the watchdog whether reporting is appropriate in the current context
            interval = _anrInterceptor.intercept(_tick);
            if (interval > 0) {
                continue;
            }

            //Report the thread stacks
            final ANRError error;
            if (_namePrefix != null) {
                error = ANRError.New(_tick, _namePrefix, _logThreadsWithoutStackTrace);
            } else {
                error = ANRError.NewMainOnly(_tick);
            }
            //Callback
            _anrListener.onAppNotResponding(error);
            interval = _timeoutInterval;
            _reported = true;
        }
    }
}

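The core code is very concise and is essentially an implementation of the approach above. One thing to add: you still need to query the system service for more data (the process's ANR info is available through ActivityManager.getProcessesInErrorState(); how to do this is explained in detail below) to decide whether an ANR really occurred. You can implement this in a custom _anrInterceptor.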
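A sketch of such an interceptor follows. It redeclares an interface with the same shape as ANR-WatchDog's ANRInterceptor (long intercept(long duration), where a positive return value postpones the report by that many milliseconds) so that the example compiles on its own. ErrorStateInterceptor, the injected BooleanSupplier (on Android, something like the checkErrorState() query shown later), the 1s re-check interval, and the give-up deadline are all assumptions.

```java
import java.util.function.BooleanSupplier;

// Same shape as ANR-WatchDog's ANRInterceptor, redeclared so this sketch compiles on its own.
interface ANRInterceptor {
    // Return 0 (or less) to report now, or a positive delay in ms to postpone the report.
    long intercept(long duration);
}

// Hypothetical interceptor: report only once the system has flagged us NOT_RESPONDING.
final class ErrorStateInterceptor implements ANRInterceptor {
    static final long RECHECK_MS = 1_000; // assumption: re-check once per second

    private final BooleanSupplier isNotResponding; // e.g. a checkErrorState()-style query
    private final long giveUpAfterMs;              // assumption: report anyway after this long

    ErrorStateInterceptor(BooleanSupplier isNotResponding, long giveUpAfterMs) {
        this.isNotResponding = isNotResponding;
        this.giveUpAfterMs = giveUpAfterMs;
    }

    @Override
    public long intercept(long duration) {
        if (isNotResponding.getAsBoolean() || duration >= giveUpAfterMs) {
            return 0; // let the watchdog report the ANRError
        }
        return RECHECK_MS; // main thread is blocked, but not (yet) a system-confirmed ANR
    }
}
```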

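6.2 Monitoring the SIGQUIT signal

This approach truly monitors ANRs; matrix and xCrash both use it. It has been proven in household-name apps such as WeChat, so its stability and reliability can be trusted.

In the ANR flow analysis above, we found the boundaries between the system and the process in which the ANR occurs (1 and 2 in the diagram below). Can we listen for the SIGQUIT signal the system sends us? The answer is yes.

[Figure: ANR flow diagram]

Some background is needed first. What is SIGQUIT? As mentioned earlier, every process except Zygote has a Signal Catcher thread that keeps listening for SIGQUIT via sigwait and dumps thread stacks when it arrives. To intercept or listen for SIGQUIT, we first need to understand the signal-handling functions such as kill, signal, sigaction, sigwait, and pthread_sigmask. This article won't go into their usage in detail; to learn more, I recommend "Advanced Programming in the UNIX Environment".

Below is the core code of my SIGQUIT-monitoring demo; the full source is here: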

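void signalHandler(int sig, siginfo_t *info, void *uc) {
    __android_log_print(ANDROID_LOG_DEBUG, "xfhy_anr", "Caught SIGQUIT; an ANR may have occurred");
    //Note: __android_log_print is not guaranteed to be async-signal-safe; fine for a demo,
    //but a production handler should do as little as possible here.

    //Dump the main thread's stack here
}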

extern "C"
JNIEXPORT jboolean JNICALL
Java_com_xfhy_watchsignaldemo_MainActivity_startWatch(JNIEnv *env, jobject thiz) {
    sigset_t set, old_set;
    sigemptyset(&set);
    sigaddset(&set, SIGQUIT);
        
    /*
     * SIG_UNBLOCK is required here. When Zygote forks the target process, the main thread
     * inherits the signal mask of Zygote's main thread, and Zygote blocked SIGQUIT with
     * pthread_sigmask(SIG_BLOCK) during initialization. So we call pthread_sigmask(SIG_UNBLOCK)
     * on our own process's main thread. As a consequence, SIGQUIT now reaches our handler
     * instead of the original SignalCatcher's sigwait, which stops receiving it.
     */
    int r = pthread_sigmask(SIG_UNBLOCK, &set, &old_set);
    if (0 != r) {
        return JNI_FALSE;
    }

    struct sigaction sa{};
    sa.sa_sigaction = signalHandler;
    sa.sa_flags = SA_ONSTACK | SA_SIGINFO | SA_RESTART;

    return sigaction(SIGQUIT, &sa, nullptr) == 0;
}

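Android blocks SIGQUIT by default, so only the Signal Catcher thread's sigwait receives it, and a handler registered with sigaction never gets it; this has to be dealt with. By using pthread_sigmask or sigprocmask to set SIGQUIT to UNBLOCK, the next SIGQUIT will enter our signalHandler.

Beyond that, note one more thing: since our sigaction handler takes SIGQUIT away from the Signal Catcher thread, that thread no longer receives the signal and the system's original stack-dump flow is lost, which is not acceptable. So we need to send the signal again so that the Signal Catcher thread receives it.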

int tid = getSignalCatcherThreadId(); //walk /proc/[pid] to find the SignalCatcher thread's tid
tgkill(getpid(), tid, SIGQUIT);
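getSignalCatcherThreadId() itself is not shown. One way to implement the lookup, sketched here in Java under the assumption that the thread's name in /proc/self/task/<tid>/comm is exactly "Signal Catcher" (on Android, /proc/self is the app's own process), is to scan the task directory. ThreadFinder and findTidByName are hypothetical names; the returned tid would then be handed to native code for tgkill.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class ThreadFinder {
    // Returns the tid of the first thread under taskDir whose comm equals name, or -1.
    static int findTidByName(Path taskDir, String name) {
        try (DirectoryStream<Path> tasks = Files.newDirectoryStream(taskDir)) {
            for (Path task : tasks) {
                String threadName;
                try {
                    // comm holds the thread name followed by a newline
                    threadName = new String(Files.readAllBytes(task.resolve("comm")),
                            StandardCharsets.UTF_8).trim();
                } catch (IOException e) {
                    continue; // thread exited while we were scanning, or no comm file
                }
                if (threadName.equals(name)) {
                    try {
                        return Integer.parseInt(task.getFileName().toString());
                    } catch (NumberFormatException ignored) {
                        // not a numeric tid directory; keep scanning
                    }
                }
            }
        } catch (IOException e) {
            // task directory unreadable: treat as "not found"
        }
        return -1;
    }

    // On Android, /proc/self/task lists the threads of the app's own process.
    static int findSignalCatcherTid() {
        return findTidByName(Paths.get("/proc/self/task"), "Signal Catcher");
    }
}
```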

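With that, we have a fairly complete SIGQUIT-monitoring mechanism that does not change system behaviour. It isn't perfect, but it is the foundation of ANR monitoring. Let's improve it step by step.

6.2.1 A complete ANR monitoring scheme

Catching SIGQUIT is not the same as catching an ANR.

6.2.1.1 False positives

A process in which an ANR occurs always receives SIGQUIT, but a process that receives SIGQUIT has not necessarily hit an ANR.

There are 2 possible cases:

Another process's ANR: after an ANR, the ANR process is not the only one whose stacks get dumped; the system collects many other processes to dump. So when one app hits an ANR, other apps may also receive SIGQUIT. If we receive SIGQUIT because another process had an ANR, reporting it would be a false positive.
SIGQUIT sent for non-ANR reasons: sending SIGQUIT is trivially easy, and both the system and ordinary apps can do it: in Java via android.os.Process.sendSignal, in native code via kill or tgkill. A SIGQUIT we receive may not come from the ANR flow at all, which also produces false positives.

How can we solve these 2 false-positive problems? Let's go back to where the ANR flow starts and take a closer look: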

//com.android.server.am.ProcessRecord.java
void appNotResponding(String activityShortComponentName, ApplicationInfo aInfo,
        String parentShortComponentName, WindowProcessController parentProcess,
        boolean aboveSystem, String annotation, boolean onlyDumpSelf) {
    //......
    synchronized (mService) {
        //Note: for a background ANR, the process is killed and the method returns right here; it never reaches makeAppNotRespondingLocked below, so the process never gets the NOT_RESPONDING flag
        if (isSilentAnr() && !isDebugging()) {
            kill("bg anr", ApplicationExitInfo.REASON_ANR, true);
            return;
        }

        // Set the app's notResponding state, and look up the errorReportReceiver
        makeAppNotRespondingLocked(activityShortComponentName,
                annotation != null ? "ANR " + annotation : "ANR", info.toString());

        // show ANR dialog ......
    }
}

private void makeAppNotRespondingLocked(String activity, String shortMsg, String longMsg) {
    setNotResponding(true);
    // mAppErrors can be null if the AMS is constructed with injector only. This will only
    // happen in tests.
    if (mService.mAppErrors != null) {
        notRespondingReport = mService.mAppErrors.generateProcessError(this,
                ActivityManager.ProcessErrorStateInfo.NOT_RESPONDING,
                activity, shortMsg, longMsg, null);
    }
    startAppProblemLocked();
    getWindowProcessController().stopFreezingActivities();
}

void setNotResponding(boolean notResponding) {
    mNotResponding = notResponding;
    mWindowProcessController.setNotResponding(notResponding);
}

Before the ANR dialog is shown, makeAppNotRespondingLocked runs and marks the ANR process with the NOT_RESPONDING flag. This flag can be read through ActivityManager:

private static boolean checkErrorState() {
    try {
        Application application = sApplication == null ? Matrix.with().getApplication() : sApplication;
        ActivityManager am = (ActivityManager) application.getSystemService(Context.ACTIVITY_SERVICE);
        List<ActivityManager.ProcessErrorStateInfo> procs = am.getProcessesInErrorState();
        if (procs == null) return false;
        for (ActivityManager.ProcessErrorStateInfo proc : procs) {
            if (proc.pid != android.os.Process.myPid()) continue;
            if (proc.condition != ActivityManager.ProcessErrorStateInfo.NOT_RESPONDING) continue;
            return true;
        }
        return false;
    } catch (Throwable t){
        MatrixLog.e(TAG,"[checkErrorState] error : %s", t.getMessage());
    }
    return false;
}

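After catching SIGQUIT, we keep polling for up to 20 seconds (20 seconds being the ANR dump timeout) to see whether our process carries the NOT_RESPONDING flag. As soon as the flag appears, we can conclude that an ANR has occurred.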
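The polling loop can be sketched as below. The clock and the sleep function are injected so the loop can be exercised without real waiting; on Android they could be SystemClock::uptimeMillis and SystemClock::sleep, and isNotResponding would be a checkErrorState()-style query. The 500ms interval and the class name ErrorStatePoller are assumptions.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

final class ErrorStatePoller {
    static final long ANR_DUMP_WINDOW_MS = 20_000; // the ANR dump timeout discussed above
    static final long POLL_INTERVAL_MS = 500;      // assumption: any modest interval will do

    // After SIGQUIT: poll until the NOT_RESPONDING flag shows up or the window closes.
    static boolean confirmAnr(BooleanSupplier isNotResponding, LongSupplier clock, LongConsumer sleeper) {
        long deadline = clock.getAsLong() + ANR_DUMP_WINDOW_MS;
        while (clock.getAsLong() < deadline) {
            if (isNotResponding.getAsBoolean()) {
                return true; // the system flagged us: this SIGQUIT was our own ANR
            }
            sleeper.accept(POLL_INTERVAL_MS);
        }
        return false; // no flag: another process's ANR, a stray SIGQUIT, or a missed case
    }
}
```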

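P.S. You might wonder: with such a convenient method, isn't monitoring SIGQUIT redundant? Why not just loop forever checking the flag and catch the ANR as soon as it appears? You could, but it isn't elegant and has drawbacks: it is inefficient, drains the battery, wastes resources, and can't solve the missed-report problem described below.

6.2.1.2 Missed reports

A process in the NOT_RESPONDING state has definitely hit an ANR, but a process that hits an ANR is not necessarily put into the NOT_RESPONDING state.

There are 2 special cases:

Background ANR (SilentAnr): if the ANR is classified as a background ANR (SilentAnr), the process is killed and the method returns directly, without reaching makeAppNotRespondingLocked, so the process never gets the NOT_RESPONDING flag. This means background ANRs can't be caught this way. Yet background ANRs are numerous, and since they kill the process outright they are very bad for the user experience; missing such a large share of ANRs is clearly unacceptable.
Crash-like ANRs: quite a few device models (e.g. OPPO and VIVO devices on recent Android versions) have modified the ANR flow so that even a foreground ANR shows no dialog and the process is simply killed, i.e. the app appears to crash.

For these 2 cases, we need a mechanism that, after SIGQUIT arrives, very quickly determines whether our own process is already in an ANR state, then dumps and reports quickly. We can do this by checking whether the main thread is stuck. How can we tell that quickly? Through the main Looper's message queue: its mMessages field is the head message, the next one due, and that message's when field is the uptime at which it was due to be dispatched. The gap between when and the current time tells us how long the message has been waiting; if it has waited too long, the main thread is stuck. Receiving SIGQUIT in that state basically means an ANR really has occurred: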

private static boolean isMainThreadStuck(){
    try {
        MessageQueue mainQueue = Looper.getMainLooper().getQueue();
        Field field = mainQueue.getClass().getDeclaredField("mMessages");
        field.setAccessible(true);
        final Message mMessage = (Message) field.get(mainQueue);
        if (mMessage != null) {
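            long when = mMessage.getWhen();
            if(when == 0) {
                //when == 0 means the message was posted at the front of the queue; ignore it
                return false;
            }
            //Negative when the head message is overdue; its magnitude is how long it has waited
            long time = when - SystemClock.uptimeMillis();
            //So both thresholds must be negative constants; foreground tracks whether the app is visible
            long timeThreshold = BACKGROUND_MSG_THRESHOLD;
            if (foreground) {
                timeThreshold = FOREGROUND_MSG_THRESHOLD;
            }
            return time < timeThreshold;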
        }
    } catch (Exception e){
        return false;
    }
    return false;
}

Combining these mechanisms to decide whether a received SIGQUIT really means an ANR occurred minimizes both false positives and missed reports.
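Put together, the decision after a SIGQUIT might look like the sketch below. It works with positive "overdue" durations instead of the negative thresholds in the reflection code; the 2s foreground and 10s background thresholds and the class name AnrDecision are assumptions, not values taken from matrix. The caller is assumed to pass the head message's when only if the queue is non-empty.

```java
final class AnrDecision {
    // Assumed thresholds for "main thread blocked"; tune per app.
    static final long FOREGROUND_BLOCKED_MS = 2_000;
    static final long BACKGROUND_BLOCKED_MS = 10_000;

    // headWhen is the head message's when; 0 means "posted at the front of the queue" and is ignored.
    static boolean isMainThreadStuck(long headWhen, long now, boolean foreground) {
        if (headWhen == 0) {
            return false;
        }
        long overdue = now - headWhen; // positive when the message should already have run
        return overdue > (foreground ? FOREGROUND_BLOCKED_MS : BACKGROUND_BLOCKED_MS);
    }

    // SIGQUIT received: treat it as our ANR if the system flagged us or the main thread is stuck.
    static boolean isOurAnr(boolean notRespondingFlag, long headWhen, long now, boolean foreground) {
        return notRespondingFlag || isMainThreadStuck(headWhen, now, foreground);
    }
}
```

The stuck check covers background and crash-like ANRs that never get the NOT_RESPONDING flag, while the flag covers cases where the main thread happens to look idle at the moment of the check.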

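6.2.1.3 Obtaining the ANR trace

Back to the ANR flow diagram above: the Signal Catcher thread writing the trace is also a boundary, and it writes the trace with the socket's write call. If we hook that write, we get the ANR trace content dumped by the system directly. This content is very comprehensive: the state, locks, and stacks (including native stacks) of every thread, which is extremely useful for troubleshooting, especially native problems and deadlocks. The native hook uses PLT hooking, which is very stable; this approach has been proven in WeChat.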

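//Globals shared by the hooks below (declared here so the snippet is complete)
static bool isTraceWrite = false;   //true after Signal Catcher opened/connected to the trace sink
static pid_t signalCatcherTid = 0;  //tid of the thread that did so

//API 27+: Signal Catcher connects to tombstoned's Java trace socket
int (*original_connect)(int __fd, const struct sockaddr* __addr, socklen_t __addr_length);
int my_connect(int __fd, const struct sockaddr* __addr, socklen_t __addr_length) {
    //For AF_UNIX, sa_data overlaps sockaddr_un.sun_path, so this compares the socket path
    if (__addr != nullptr && __addr->sa_family == AF_UNIX
            && strcmp(__addr->sa_data, "/dev/socket/tombstoned_java_trace") == 0) {
        isTraceWrite = true;
        signalCatcherTid = gettid();
    }
    return original_connect(__fd, __addr, __addr_length);
}

//Below API 27: Signal Catcher opens /data/anr/traces.txt directly
int (*original_open)(const char *pathname, int flags, mode_t mode);
int my_open(const char *pathname, int flags, mode_t mode) {
    if (strcmp(pathname, "/data/anr/traces.txt") == 0) {
        isTraceWrite = true;
        signalCatcherTid = gettid();
    }
    return original_open(pathname, flags, mode);
}

ssize_t (*original_write)(int fd, const void* const __pass_object_size0 buf, size_t count);
ssize_t my_write(int fd, const void* const buf, size_t count) {
    //Only the first write from the Signal Catcher thread after open/connect carries the trace
    if(isTraceWrite && signalCatcherTid == gettid()) {
        isTraceWrite = false;
        signalCatcherTid = 0;
        //buf is not NUL-terminated: only count bytes are valid, so a real printAnrTrace should take count too
        char *content = (char *) buf;
        printAnrTrace(content);
    }
    return original_write(fd, buf, count);
}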

void hookAnrTraceWrite() {
    int apiLevel = getApiLevel();
    if (apiLevel < 19) {
        return;
    }
    if (apiLevel >= 27) {
        plt_hook("libcutils.so", "connect", (void *) my_connect, (void **) (&original_connect));
    } else {
        plt_hook("libart.so", "open", (void *) my_open, (void **) (&original_open));
    }

    if (apiLevel >= 30 || apiLevel == 25 || apiLevel ==24) {
        plt_hook("libc.so", "write", (void *) my_write, (void **) (&original_write));
    } else if (apiLevel == 29) {
        plt_hook("libbase.so", "write", (void *) my_write, (void **) (&original_write));
    } else {
        plt_hook("libart.so", "write", (void *) my_write, (void **) (&original_write));
    }
}

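A few points to note:

Hook only during the ANR flow: in some cases the connect/open/write functions in the base libraries may be called quite frequently, so we want to minimize the hook's impact. We install the hooks only after receiving SIGQUIT (before resending SIGQUIT to the Signal Catcher thread) and unhook once the ANR flow ends.
Handle only the first write after the Signal Catcher thread's open/connect: write calls anywhere other than the Signal Catcher's trace-dump flow don't concern us and need no handling.
Hook points differ by API level: the library containing the write to hook differs across Android versions, so each must be handled separately.

That covers the core logic of matrix's approach of monitoring ANRs via SIGQUIT; for the full source, see the matrix repository.

To sum up: the approach listens for SIGQUIT to learn that the current process may have hit an ANR, then checks whether the process is in the NOT_RESPONDING state and whether the main thread is stuck, to avoid misjudgements. Once we register for SIGQUIT, the system's original Signal Catcher thread no longer receives it, so the signal must be forwarded to it to avoid side effects. When the current process's Signal Catcher thread dumps stacks, it sends the dumped data to the system_server process via the socket's write; by hooking that write we obtain the ANR trace dumped by the system. In effect we don't disturb any system flow and still get what we want. Because this approach captures information during the system's normal ANR trace dump, what it gets is more comprehensive; however, the system dump itself has a significant performance cost and takes a long time.

7. ANR analysis

Monitoring matters, but what matters more is analyzing what caused the ANR and then fixing it.

7.1 Analyzing the trace file

Let's take a trace file and analyze it in detail: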

----- pid 7761 at 2022-11-02 07:02:26 -----
Cmd line: com.xfhy.watchsignaldemo
Build fingerprint: 'HUAWEI/LYA-AL00/HWLYA:10/HUAWEILYA-AL00/10.1.0.163C00:user/release-keys'
ABI: 'arm64'
Build type: optimized
Zygote loaded classes=11918 post zygote classes=729
Dumping registered class loaders
#0 dalvik.system.PathClassLoader: [], parent #1
#1 java.lang.BootClassLoader: [], no parent
#2 dalvik.system.PathClassLoader: [/system/app/FeatureFramework/FeatureFramework.apk], no parent
#3 dalvik.system.PathClassLoader: [/data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/base.apk:/data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/base.apk!classes2.dex:/data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/base.apk!classes4.dex:/data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/base.apk!classes3.dex], parent #1
Done dumping class loaders
Intern table: 44132 strong; 436 weak
JNI: CheckJNI is off; globals=681 (plus 67 weak)
Libraries: /data/app/com.xfhy.watchsignaldemo-4tkKMWojrpHAf-Q3iecaHQ==/lib/arm64/libwatchsignaldemo.so libandroid.so libcompiler_rt.so libhitrace_jni.so libhiview_jni.so libhwapsimpl_jni.so libiAwareSdk_jni.so libimonitor_jni.so libjavacore.so libjavacrypto.so libjnigraphics.so libmedia_jni.so libopenjdk.so libsoundpool.so libwebviewchromium_loader.so (15)
// Heap size is 26MB, of which 2442KB is in use; 74512 objects are currently on the heap
Heap: 90% free, 2442KB/26MB; 74512 objects

Total number of allocations 120222 // objects allocated since the process started
Total bytes allocated 10MB         // bytes allocated since the process started
Total bytes freed 8173KB           // bytes freed since the process started
Free memory 23MB                   // memory available without growing the heap
Free memory until GC 23MB          // memory available before the next GC
Free memory until OOME 381MB       // memory available before an OutOfMemoryError; a small value means memory is tight, possibly because the app uses too much memory
Total memory 26MB                  // current total heap (used + free)
Max memory 384MB                   // maximum heap the process can request

..... // GC-related information omitted


// The process has 17 threads in total
DALVIK THREADS (17):

// Call stack of the Signal Catcher thread
"Signal Catcher" daemon prio=5 tid=4 Runnable
  | group="system" sCount=0 dsCount=0 flags=0 obj=0x18c84570 self=0x7252417800
  | sysTid=7772 nice=0 cgrp=default sched=0/0 handle=0x725354ad50
  | state=R schedstat=( 16273959 1085938 5 ) utm=0 stm=1 core=4 HZ=100
  | stack=0x7253454000-0x7253456000 stackSize=991KB
  | held mutexes= "mutator lock"(shared held)
  native: #00 pc 000000000042f8e8  /apex/com.android.runtime/lib64/libart.so (art::DumpNativeStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, int, BacktraceMap*, char const*, art::ArtMethod*, void*, bool)+140)
  native: #01 pc 0000000000523590  /apex/com.android.runtime/lib64/libart.so (art::Thread::DumpStack(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool, BacktraceMap*, bool) const+508)
  native: #02 pc 000000000053e75c  /apex/com.android.runtime/lib64/libart.so (art::DumpCheckpoint::Run(art::Thread*)+844)
  native: #03 pc 000000000053735c  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::RunCheckpoint(art::Closure*, art::Closure*)+504)
  native: #04 pc 0000000000536744  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::Dump(std::__1::basic_ostream<char, std::__1::char_traits<char>>&, bool)+1048)
  native: #05 pc 0000000000536228  /apex/com.android.runtime/lib64/libart.so (art::ThreadList::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char>>&)+884)
  native: #06 pc 00000000004ee4d8  /apex/com.android.runtime/lib64/libart.so (art::Runtime::DumpForSigQuit(std::__1::basic_ostream<char, std::__1::char_traits<char>>&)+196)
  native: #07 pc 000000000050250c  /apex/com.android.runtime/lib64/libart.so (art::SignalCatcher::HandleSigQuit()+1356)
  native: #08 pc 0000000000501558  /apex/com.android.runtime/lib64/libart.so (art::SignalCatcher::Run(void*)+268)
  native: #09 pc 00000000000cf7c0  /apex/com.android.runtime/lib64/bionic/libc.so (__pthread_start(void*)+36)
  native: #10 pc 00000000000721a8  /apex/com.android.runtime/lib64/bionic/libc.so (__start_thread+64)
  (no managed stack frames)

"main" prio=5 tid=1 Sleeping
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x73907540 self=0x725f010800
  | sysTid=7761 nice=-10 cgrp=default sched=1073741825/2 handle=0x72e60080d0
  | state=S schedstat=( 281909898 5919799 311 ) utm=20 stm=7 core=4 HZ=100
  | stack=0x7fca180000-0x7fca182000 stackSize=8192KB
  | held mutexes=
  at java.lang.Thread.sleep(Native method)
  - sleeping on <0x00f895d9> (a java.lang.Object)
  at java.lang.Thread.sleep(Thread.java:443)
  - locked <0x00f895d9> (a java.lang.Object)
  at java.lang.Thread.sleep(Thread.java:359)
  at android.os.SystemClock.sleep(SystemClock.java:131)
  at com.xfhy.watchsignaldemo.MainActivity.makeAnr(MainActivity.kt:35)
  at java.lang.reflect.Method.invoke(Native method)
  at androidx.appcompat.app.AppCompatViewInflater$DeclaredOnClickListener.onClick(AppCompatViewInflater.java:441)
  at android.view.View.performClick(View.java:7317)
  at com.google.android.material.button.MaterialButton.performClick(MaterialButton.java:1219)
  at android.view.View.performClickInternal(View.java:7291)
  at android.view.View.access$3600(View.java:838)
  at android.view.View$PerformClick.run(View.java:28247)
  at android.os.Handler.handleCallback(Handler.java:900)
  at android.os.Handler.dispatchMessage(Handler.java:103)
  at android.os.Looper.loop(Looper.java:219)
  at android.app.ActivityThread.main(ActivityThread.java:8668)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:513)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1109)

  ... // the remaining N threads are omitted
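The memory lines in the dump above have rough app-side counterparts in java.lang.Runtime, which can be logged alongside an ANR report to see whether memory was tight. The sketch below assumes that "Max memory" corresponds to maxMemory(), "Total memory" to totalMemory() and "Free memory" to freeMemory(), and approximates "Free memory until OOME" as maxMemory() minus the bytes currently in use. The class name and the 10% threshold used for "tight" are hypothetical.

```java
/** Hypothetical helper mirroring the memory lines of an ANR dump with java.lang.Runtime. */
final class HeapSnapshot {
    final long maxBytes;    // ~ "Max memory"
    final long totalBytes;  // ~ "Total memory"
    final long freeBytes;   // ~ "Free memory"

    HeapSnapshot(long maxBytes, long totalBytes, long freeBytes) {
        this.maxBytes = maxBytes;
        this.totalBytes = totalBytes;
        this.freeBytes = freeBytes;
    }

    /** Reads the current values from the running VM. */
    static HeapSnapshot capture() {
        Runtime rt = Runtime.getRuntime();
        return new HeapSnapshot(rt.maxMemory(), rt.totalMemory(), rt.freeMemory());
    }

    long usedBytes() {
        return totalBytes - freeBytes;
    }

    /** Approximation of "Free memory until OOME": how much more the heap can hold. */
    long freeUntilOomeBytes() {
        return maxBytes - usedBytes();
    }

    /** True when less than the given fraction of the max heap is still available. */
    boolean isTight(double minFreeFraction) {
        return freeUntilOomeBytes() < maxBytes * minFreeFraction;
    }
}
```

With the dump's own numbers (Max 384MB, Total 26MB, Free 23MB) this gives 381MB until OOME, matching the "Free memory until OOME 381MB" line, so this process was far from memory pressure.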

Detailed interpretation of the trace fields:

"Signal Catcher" daemon prio=5 tid=4 Runnable
  | group="system" sCount=0 dsCount=0 flags=0 obj=0x18c84570 self=0x7252417800
  | sysTid=7772 nice=0 cgrp=default sched=0/0 handle=0x725354ad50
  | state=R schedstat=( 16273959 1085938 5 ) utm=0 stm=1 core=4 HZ=100
  | stack=0x7253454000-0x7253456000 stackSize=991KB
  | held mutexes= "mutator lock"(shared held)

Line 1:

"Signal Catcher" daemon prio=5 tid=4 Runnable

"Signal Catcher" daemon: the thread name; "daemon" means it is a daemon thread
prio: thread priority
tid: the thread's internal ID
Thread state: Runnable
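When scanning dozens of threads, it helps to pull these fields out mechanically. A minimal regex-based sketch for header line 1, assuming exactly the layout shown above (quoted name, optional daemon, prio=, tid=, then the state word); the class and field names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the first header line of a thread in an ANR trace. */
final class ThreadHeader {
    // "name" [daemon] prio=N tid=N State
    private static final Pattern LINE1 =
            Pattern.compile("^\"([^\"]*)\"( daemon)? prio=(\\d+) tid=(\\d+) (\\w+)");

    final String name;
    final boolean daemon;
    final int prio;
    final int tid;
    final String state;

    private ThreadHeader(String name, boolean daemon, int prio, int tid, String state) {
        this.name = name;
        this.daemon = daemon;
        this.prio = prio;
        this.tid = tid;
        this.state = state;
    }

    /** Returns null if the line is not a thread header line. */
    static ThreadHeader parse(String line) {
        Matcher m = LINE1.matcher(line.trim());
        if (!m.find()) {
            return null;
        }
        return new ThreadHeader(m.group(1), m.group(2) != null,
                Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)), m.group(5));
    }
}
```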

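ANR thread state reference table

ps: As a rule of thumb, if the main thread is in the BLOCKED, WAITING or TIMED_WAITING state, the ANR is most likely caused by a blocking call; if the main thread shows nothing abnormal, look at the CPU load and the memory situation instead.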
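The rule of thumb above can be written down as a tiny triage function over the state word in the header. This is a sketch under assumptions: it assumes the trace prints Blocked, Waiting, TimedWaiting and Sleeping for states that correspond to Java's BLOCKED, WAITING and TIMED_WAITING (Thread.sleep shows up as Sleeping, as in the trace above), and it separates Runnable out, since a main thread running app code is itself suspect (see case 7.2.2). The enum and method names are hypothetical.

```java
/** Hypothetical triage of the main thread's state word, following the rule of thumb above. */
final class MainThreadTriage {
    enum Verdict {
        BLOCKED, // blocking call: inspect the stack and the locks it waits on
        BUSY,    // executing code: look at the top frames for long-running work
        OTHER    // e.g. Native: check for an idle Looper or a Binder call, then CPU and memory
    }

    private MainThreadTriage() {}

    static Verdict triage(String state) {
        switch (state) {
            case "Blocked":
            case "Waiting":
            case "TimedWaiting":
            case "Sleeping":      // assumed: Thread.sleep maps to Java's TIMED_WAITING
                return Verdict.BLOCKED;
            case "Runnable":
                return Verdict.BUSY;
            default:
                return Verdict.OTHER;
        }
    }
}
```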

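Line 2:

| group="system" sCount=0 dsCount=0 flags=0 obj=0x18c84570 self=0x7252417800

group: the thread group the thread belongs to
sCount: the thread's suspend count
dsCount: the suspend count requested by the debugger
obj: the Java thread object associated with this thread
self: the address of the current (native) thread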

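Line 3:

| sysTid=7772 nice=0 cgrp=default sched=0/0 handle=0x725354ad50

sysTid: the thread's real (kernel-level) tid
nice: scheduling priority; the smaller the value, the higher the priority
cgrp: the scheduling cgroup the thread belongs to
sched: scheduling policy
handle: address of the native thread handle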

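Line 4:

| state=R schedstat=( 16273959 1085938 5 ) utm=0 stm=1 core=4 HZ=100

state: thread state
schedstat: CPU scheduling statistics. The three numbers in the parentheses are, in order, Running, Runnable and Switch: Running is the time spent running on a CPU (ns), Runnable is the time spent waiting in the run queue (ns), and Switch is the number of CPU context switches.
utm/stm: CPU time spent in user mode / kernel mode, counted in clock ticks
core: the CPU core the thread last ran on
HZ: clock frequency (ticks per second)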
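The units on this line differ: schedstat is in nanoseconds, while utm and stm are clock ticks, each lasting 1/HZ seconds. A small sketch that converts both to milliseconds so they can be compared; the parser assumes the exact layout shown above, and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the "| state=... schedstat=(...) utm=.. stm=.. core=.. HZ=.." line. */
final class SchedLine {
    private static final Pattern P = Pattern.compile(
            "schedstat=\\(\\s*(\\d+)\\s+(\\d+)\\s+(\\d+)\\s*\\)\\s+utm=(\\d+)\\s+stm=(\\d+)\\s+core=(\\d+)\\s+HZ=(\\d+)");

    final long runningNs;   // time spent running on a CPU
    final long runnableNs;  // time spent waiting in the run queue
    final long switches;    // number of context switches
    final long utmTicks;    // user-mode CPU time, in clock ticks
    final long stmTicks;    // kernel-mode CPU time, in clock ticks
    final int hz;           // clock ticks per second

    private SchedLine(long runningNs, long runnableNs, long switches,
                      long utmTicks, long stmTicks, int hz) {
        this.runningNs = runningNs;
        this.runnableNs = runnableNs;
        this.switches = switches;
        this.utmTicks = utmTicks;
        this.stmTicks = stmTicks;
        this.hz = hz;
    }

    /** Returns null if the line does not contain the expected fields. */
    static SchedLine parse(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new SchedLine(Long.parseLong(m.group(1)), Long.parseLong(m.group(2)),
                Long.parseLong(m.group(3)), Long.parseLong(m.group(4)),
                Long.parseLong(m.group(5)), Integer.parseInt(m.group(7)));
    }

    long runningMs() {
        return runningNs / 1_000_000;
    }

    long runnableMs() {
        return runnableNs / 1_000_000;
    }

    /** utm + stm converted from clock ticks to milliseconds (one tick = 1000 / HZ ms). */
    long cpuTimeMs() {
        return (utmTicks + stmTicks) * 1000 / hz;
    }
}
```

For the busy main thread in case 7.2.2 below (schedstat=( 6746757297 5887495 256 ) utm=670 stm=4 HZ=100), runningMs() gives 6746 and cpuTimeMs() gives 6740, so the two sources agree, and a Runnable time of about 5ms shows the thread was not starved of CPU. A Runnable time that is large relative to Running would instead point toward CPU contention.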

Line 5:

| stack=0x7253454000-0x7253456000 stackSize=991KB

stack: the address range of the thread's stack
stackSize: the stack size

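Line 6:

| held mutexes= "mutator lock"(shared held)

held mutexes: the mutexes the thread holds and how it holds them, either exclusively (exclusive) or shared (shared); here the mutator lock is held shared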

7.2 ANR case studies

7.2.1 Main thread not stuck: a normal-looking stack

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x74b38080 self=0x7ad9014c00
  | sysTid=23081 nice=0 cgrp=default sched=0/0 handle=0x7b5fdc5548
  | state=S schedstat=( 284838633 166738594 505 ) utm=21 stm=7 core=1 HZ=100
  | stack=0x7fc95da000-0x7fc95dc000 stackSize=8MB
  | held mutexes=
  kernel: __switch_to+0xb0/0xbc
  kernel: SyS_epoll_wait+0x288/0x364
  kernel: SyS_epoll_pwait+0xb0/0x124
  kernel: cpu_switch_to+0x38c/0x2258
  native: #00 pc 000000000007cd8c  /system/lib64/libc.so (__epoll_pwait+8)
  native: #01 pc 0000000000014d48  /system/lib64/libutils.so (android::Looper::pollInner(int)+148)
  native: #02 pc 0000000000014c18  /system/lib64/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+60)
  native: #03 pc 00000000001275f4  /system/lib64/libandroid_runtime.so (android::android_os_MessageQueue_nativePollOnce(_JNIEnv*, _jobject*, long, int)+44)
  at android.os.MessageQueue.nativePollOnce(Native method)
  at android.os.MessageQueue.next(MessageQueue.java:330)
  at android.os.Looper.loop(Looper.java:169)
  at android.app.ActivityThread.main(ActivityThread.java:7073)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:536)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:876)

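This main thread stack looks perfectly normal: the main thread is idle, sitting in nativePollOnce waiting for new messages. If an ANR still occurred in this state, there are two likely causes:

The stack was dumped too late: the ANR had already happened, and by the time the stack was dumped the main thread had recovered.
The ANR was caused by other factors such as CPU contention or memory pressure.

In this situation, first analyze CPU and memory usage. Also check how far apart the log capture time and the ANR time are; if the gap is too large, this stack is not worth analyzing.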
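Spotting this pattern across many reports is easy to automate: take the first managed ("at ...") frame of the main thread and check whether it is MessageQueue.nativePollOnce. A minimal sketch, assuming the main-thread block is already available as a list of lines; the helper name is hypothetical.

```java
import java.util.List;

/** Hypothetical check for an idle Looper in a main-thread stack from an ANR trace. */
final class IdleStackCheck {
    private IdleStackCheck() {}

    /** True if the first managed frame is MessageQueue.nativePollOnce, i.e. the Looper is waiting for messages. */
    static boolean isIdleInLooper(List<String> mainThreadLines) {
        for (String raw : mainThreadLines) {
            String line = raw.trim();
            if (line.startsWith("at ")) {
                // Only the top managed frame matters; native/kernel frames above it are skipped.
                return line.startsWith("at android.os.MessageQueue.nativePollOnce(");
            }
        }
        return false; // no managed frames found
    }
}
```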

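7.2.2 Main thread performing a long-running operation

// Simulates a long-running operation on the main thread; called when the View is clicked
fun makeAnr(view: View) {
    var s = 0L
    // ~1e11 iterations keep the main thread busy; s silently overflows, which does not matter here
    for (i in 0..99999999999) {
        s += i
    }
    Log.d("xxx", "s=$s")
}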

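When the main thread reaches makeAnr, it stays there computing for a long time, and anything else that needs the main thread has to wait until makeAnr returns. While makeAnr is running, input events cannot be processed; once the user has tapped the screen a few times, input dispatching times out and an InputDispatching Timeout ANR is triggered. Conversely, if nothing else needed the main thread while this long operation was running, no ANR would occur.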
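The key point is that the timeout is only noticed when a later event is waiting behind the slow one. A toy model with a single-threaded executor standing in for the main thread's message queue makes this concrete. It is not how InputDispatcher is implemented, the class name is hypothetical, and the millisecond values are scaled down from the real 5s timeout for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model: a single-threaded executor plays the role of the main thread's message queue. */
final class InputWaitModel {
    private InputWaitModel() {}

    /** How long (ms) an "input event" queued right behind a task of busyMs had to wait. */
    static long inputWaitMs(long busyMs) {
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            mainThread.submit(() -> sleepQuietly(busyMs));            // the makeAnr-style long task
            long enqueuedAt = System.nanoTime();                        // the user taps now
            Callable<Long> inputEvent = () -> (System.nanoTime() - enqueuedAt) / 1_000_000;
            Future<Long> waited = mainThread.submit(inputEvent);        // runs only after the long task
            return waited.get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            mainThread.shutdownNow();
        }
    }

    /** The "ANR" fires only if an input event actually waited longer than the timeout. */
    static boolean wouldTimeOut(long busyMs, long timeoutMs) {
        return inputWaitMs(busyMs) > timeoutMs;
    }

    private static void sleepQuietly(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

If no input event is queued, the long task simply runs to completion and nothing measures a timeout, which mirrors the last sentence of the paragraph above.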

suspend all histogram:  Sum: 206us 99% C.I. 0.098us-46us Avg: 7.629us Max: 46us
DALVIK THREADS (16):
"main" prio=5 tid=1 Runnable
  | group="main" sCount=0 dsCount=0 flags=0 obj=0x73907540 self=0x725f010800
  | sysTid=32298 nice=-10 cgrp=default sched=1073741825/2 handle=0x72e60080d0
  | state=R schedstat=( 6746757297 5887495 256 ) utm=670 stm=4 core=6 HZ=100
  | stack=0x7fca180000-0x7fca182000 stackSize=8192KB
  | held mutexes= "mutator lock"(shared held)
  at com.xfhy.watchsignaldemo.MainActivity.makeAnr(MainActivity.kt:58)
  at java.lang.reflect.Method.invoke(Native method)
  at androidx.appcompat.app.AppCompatViewInflater$DeclaredOnClickListener.onClick(AppCompatViewInflater.java:441)
  at android.view.View.performClick(View.java:7317)
  at com.google.android.material.button.MaterialButton.performClick(MaterialButton.java:1219)
  at android.view.View.performClickInternal(View.java:7291)
  at android.view.View.access$3600(View.java:838)
  at android.view.View$PerformClick.run(View.java:28247)
  at android.os.Handler.handleCallback(Handler.java:900)
  at android.os.Handler.dispatchMessage(Handler.java:103)
  at android.os.Looper.loop(Looper.java:219)
  at android.app.ActivityThread.main(ActivityThread.java:8668)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:513)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1109)

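The log shows that the main thread was executing (Runnable), not idle, when the ANR happened, which means com.xfhy.watchsignaldemo.MainActivity.makeAnr is doing time-consuming work.

7.2.3 Main thread blocked on a lock

Simulate the main thread waiting for a lock held by a worker thread: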

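// Member of MainActivity; uses android.os.SystemClock, android.view.View and kotlin.concurrent.thread
fun makeAnr(view: View) {

    val obj1 = Any()
    val obj2 = Any()

    // Create a deadlock: the two threads wait for each other

    // Worker thread: takes obj1, then wants obj2
    thread(name = "臥槽") {
        synchronized(obj1) {
            SystemClock.sleep(100)  // give the main thread time to take obj2
            synchronized(obj2) {
            }
        }
    }

    // Main thread: takes obj2, then wants obj1
    synchronized(obj2) {
        SystemClock.sleep(100)
        synchronized(obj1) {
        }
    }
}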

"main" prio=5 tid=1 Blocked
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x73907540 self=0x725f010800
  | sysTid=19900 nice=-10 cgrp=default sched=0/0 handle=0x72e60080d0
  | state=S schedstat=( 542745832 9516666 182 ) utm=48 stm=5 core=4 HZ=100
  | stack=0x7fca180000-0x7fca182000 stackSize=8192KB
  | held mutexes=
  at com.xfhy.watchsignaldemo.MainActivity.makeAnr(MainActivity.kt:59)
  - waiting to lock <0x0c6f8c52> (a java.lang.Object) held by thread 22   // note 1
  - locked <0x01abeb23> (a java.lang.Object)
  at java.lang.reflect.Method.invoke(Native method)
  at androidx.appcompat.app.AppCompatViewInflater$DeclaredOnClickListener.onClick(AppCompatViewInflater.java:441)
  at android.view.View.performClick(View.java:7317)
  at com.google.android.material.button.MaterialButton.performClick(MaterialButton.java:1219)
  at android.view.View.performClickInternal(View.java:7291)
  at android.view.View.access$3600(View.java:838)
  at android.view.View$PerformClick.run(View.java:28247)
  at android.os.Handler.handleCallback(Handler.java:900)
  at android.os.Handler.dispatchMessage(Handler.java:103)
  at android.os.Looper.loop(Looper.java:219)
  at android.app.ActivityThread.main(ActivityThread.java:8668)
  at java.lang.reflect.Method.invoke(Native method)
  at com.android.internal.os.RuntimeInit$MethodAndArgsCaller.run(RuntimeInit.java:513)
  at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:1109)

"臥槽" prio=5 tid=22 Blocked  // note 2
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x12c8a118 self=0x71d625f800
  | sysTid=20611 nice=0 cgrp=default sched=0/0 handle=0x71d4513d50
  | state=S schedstat=( 486459 0 3 ) utm=0 stm=0 core=4 HZ=100
  | stack=0x71d4411000-0x71d4413000 stackSize=1039KB
  | held mutexes=
  at com.xfhy.watchsignaldemo.MainActivity$makeAnr$1.invoke(MainActivity.kt:52)
  - waiting to lock <0x01abeb23> (a java.lang.Object) held by thread 1
  - locked <0x0c6f8c52> (a java.lang.Object)  
  at com.xfhy.watchsignaldemo.MainActivity$makeAnr$1.invoke(MainActivity.kt:49)
  at kotlin.concurrent.ThreadsKt$thread$thread$1.run(Thread.kt:30)

......

Pay attention to the following lines:

"main" prio=5 tid=1 Blocked
  - waiting to lock <0x0c6f8c52> (a java.lang.Object) held by thread 22
  - locked <0x01abeb23> (a java.lang.Object)

"臥槽" prio=5 tid=22 Blocked
  - waiting to lock <0x01abeb23> (a java.lang.Object) held by thread 1
  - locked <0x0c6f8c52> (a java.lang.Object)

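The main thread (tid 1) is Blocked, waiting for the Object 0x0c6f8c52, which is held by thread 22; meanwhile the main thread itself holds the lock on 0x01abeb23. The thread "臥槽" (tid 22) is also Blocked, and the lock it wants and the lock it holds are exactly the reverse of the main thread's. That identifies the cause of the ANR: the two threads are deadlocked, so thread 22 never releases its lock and the main thread waits on it until an ANR is triggered. In production, a common lock-related ANR scenario is SharedPreferences writes.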
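In a large trace, lock cycles like this can be found mechanically by building a wait-for graph from the "waiting to lock <...> held by thread N" lines. A minimal sketch, assuming each thread has already been reduced to its tid and that one line; it is not a general trace parser, and the class name is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical deadlock finder over "waiting to lock ... held by thread N" trace lines. */
final class LockCycleFinder {
    private static final Pattern WAITING =
            Pattern.compile("waiting to lock <0x[0-9a-fA-F]+> .* held by thread (\\d+)");

    // waiter tid -> tid of the thread holding the lock it waits for
    private final Map<Integer, Integer> waitsFor = new HashMap<>();

    /** Feed the "waiting to lock" line of thread tid; lines without that pattern are ignored. */
    void addWaitLine(int tid, String line) {
        Matcher m = WAITING.matcher(line);
        if (m.find()) {
            waitsFor.put(tid, Integer.parseInt(m.group(1)));
        }
    }

    /** Follows waiter -> holder edges from start; returns the tids forming a cycle, or an empty set. */
    Set<Integer> cycleFrom(int start) {
        Set<Integer> path = new LinkedHashSet<>();
        Integer current = start;
        while (current != null && path.add(current)) {
            current = waitsFor.get(current);
        }
        if (current == null) {
            return new HashSet<>(); // the chain ends at a thread that is not waiting: no deadlock
        }
        // current was already on the path: keep only the part of the path that loops back.
        Set<Integer> cycle = new LinkedHashSet<>();
        boolean inCycle = false;
        for (Integer tid : path) {
            if (tid.equals(current)) {
                inCycle = true;
            }
            if (inCycle) {
                cycle.add(tid);
            }
        }
        return cycle;
    }
}
```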

7.2.4 CPU contention

CPU usage from 0ms to 10625ms later (2020-03-09 14:38:31.633 to 2020-03-09 14:38:42.257):
  543% 2045/com.test.demo: 54% user + 89% kernel / faults: 4608 minor 1 major // look here
  99% 674/android.hardware.camera.provider@2.4-service: 81% user + 18% kernel / faults: 403 minor
  24% 32589/com.wang.test: 22% user + 1.4% kernel / faults: 7432 minor 1 major
  ......

Here this process takes as much as 543% CPU, grabbing most of the CPU resources and thereby causing the ANR; this kind of ANR has nothing to do with our app.
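Per-process percentages in this section are summed across cores, so 543% means the process kept roughly five and a half cores busy. The sketch below extracts the total, pid and process name from such a line and, given a core count, expresses the load as a share of the whole machine. The core count is an input you would have to supply (on a device, for example, from Runtime.availableProcessors()), and the class name is hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for a per-process line of the "CPU usage from ..." section. */
final class CpuUsageLine {
    // e.g. "  543% 2045/com.test.demo: 54% user + 89% kernel / faults: ..."
    private static final Pattern P = Pattern.compile("^\\s*([\\d.]+)% (\\d+)/([^:]+):");

    final double percent;
    final int pid;
    final String process;

    private CpuUsageLine(double percent, int pid, String process) {
        this.percent = percent;
        this.pid = pid;
        this.process = process;
    }

    /** Returns null for lines that are not per-process entries (e.g. the TOTAL line). */
    static CpuUsageLine parse(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new CpuUsageLine(Double.parseDouble(m.group(1)),
                Integer.parseInt(m.group(2)), m.group(3));
    }

    /** Share of the whole machine (0..1), assuming percent is summed over all cores. */
    double shareOfMachine(int cores) {
        return percent / (100.0 * cores);
    }
}
```

On an 8-core device (an assumed core count), 543% works out to about 68% of total CPU capacity taken by a single process.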

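7.2.5 ANR caused by memory pressure

If the CPU section and the stacks in an ANR log both look normal, consider memory pressure. Check the memory-related part of the ANR log. You can also search the log for onTrimMemory; if such entries appear around the time the ANR log was dumped, memory was probably tight.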

10-31 22:37:19.749 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
10-31 22:37:33.458 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
10-31 22:38:00.153 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
10-31 22:38:58.731 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
10-31 22:39:02.816 20733 20733 E Runtime : onTrimMemory level:80,pid:com.xxx.xxx:Launcher0
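Searching by hand is fine for one report; a helper can count the onTrimMemory entries within a window before the ANR dump and decode the level. Level 80 is ComponentCallbacks2.TRIM_MEMORY_COMPLETE, and the sketch maps the other levels defined in ComponentCallbacks2 to their names as well. The log timestamp format (MM-dd HH:mm:ss.SSS, no year) is taken from the excerpt above; the class name, the default year and the window length are assumptions.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical scan of logcat lines for onTrimMemory entries shortly before an ANR. */
final class TrimMemoryScan {
    // Logcat timestamps carry no year; default one so they compare (assumes no year rollover).
    private static final DateTimeFormatter TS = new DateTimeFormatterBuilder()
            .appendPattern("MM-dd HH:mm:ss.SSS")
            .parseDefaulting(ChronoField.YEAR, 2000)
            .toFormatter();
    private static final Pattern LEVEL = Pattern.compile("onTrimMemory level:(\\d+)");

    private TrimMemoryScan() {}

    /** Names of the levels defined in android.content.ComponentCallbacks2. */
    static String levelName(int level) {
        switch (level) {
            case 80: return "TRIM_MEMORY_COMPLETE";
            case 60: return "TRIM_MEMORY_MODERATE";
            case 40: return "TRIM_MEMORY_BACKGROUND";
            case 20: return "TRIM_MEMORY_UI_HIDDEN";
            case 15: return "TRIM_MEMORY_RUNNING_CRITICAL";
            case 10: return "TRIM_MEMORY_RUNNING_LOW";
            case 5:  return "TRIM_MEMORY_RUNNING_MODERATE";
            default: return "UNKNOWN(" + level + ")";
        }
    }

    /** Counts onTrimMemory lines with level >= minLevel in [anrTime - window, anrTime]. */
    static int countBefore(List<String> logLines, String anrTime, Duration window, int minLevel) {
        LocalDateTime anr = LocalDateTime.parse(anrTime, TS);
        LocalDateTime from = anr.minus(window);
        int count = 0;
        for (String line : logLines) {
            Matcher m = LEVEL.matcher(line);
            if (!m.find() || Integer.parseInt(m.group(1)) < minLevel || line.length() < 18) {
                continue;
            }
            // The first 18 characters hold "MM-dd HH:mm:ss.SSS".
            LocalDateTime t = LocalDateTime.parse(line.substring(0, 18), TS);
            if (!t.isBefore(from) && !t.isAfter(anr)) {
                count++;
            }
        }
        return count;
    }
}
```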

7.2.6 ANR caused by a system service timeout

A system service timeout usually shows the keyword BinderProxy.transactNative. Here is a log excerpt:

"main" prio=5 tid=1 Native
  | group="main" sCount=1 dsCount=0 flags=1 obj=0x727851e8 self=0x78d7060e00
  | sysTid=4894 nice=0 cgrp=default sched=0/0 handle=0x795cc1e9a8
  | state=S schedstat=( 8292806752 1621087524 7167 ) utm=707 stm=122 core=5 HZ=100
  | stack=0x7febb64000-0x7febb66000 stackSize=8MB
  | held mutexes=
  kernel: __switch_to+0x90/0xc4
  kernel: binder_thread_read+0xbd8/0x144c
  kernel: binder_ioctl_write_read.constprop.58+0x20c/0x348
  kernel: binder_ioctl+0x5d4/0x88c
  kernel: do_vfs_ioctl+0xb8/0xb1c
  kernel: SyS_ioctl+0x84/0x98
  kernel: cpu_switch_to+0x34c/0x22c0
  native: #00 pc 000000000007a2ac  /system/lib64/libc.so (__ioctl+4)
  native: #01 pc 00000000000276ec  /system/lib64/libc.so (ioctl+132)
  native: #02 pc 00000000000557d4  /system/lib64/libbinder.so (android::IPCThreadState::talkWithDriver(bool)+252)
  native: #03 pc 0000000000056494  /system/lib64/libbinder.so (android::IPCThreadState::waitForResponse(android::Parcel*, int*)+60)
  native: #04 pc 00000000000562d0  /system/lib64/libbinder.so (android::IPCThreadState::transact(int, unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+216)
  native: #05 pc 000000000004ce1c  /system/lib64/libbinder.so (android::BpBinder::transact(unsigned int, android::Parcel const&, android::Parcel*, unsigned int)+72)
  native: #06 pc 00000000001281c8  /system/lib64/libandroid_runtime.so (???)
  native: #07 pc 0000000000947ed4  /system/framework/arm64/boot-framework.oat (Java_android_os_BinderProxy_transactNative__ILandroid_os_Parcel_2Landroid_os_Parcel_2I+196)
  at android.os.BinderProxy.transactNative(Native method)   // key line!
  at android.os.BinderProxy.transact(Binder.java:804)
  at android.net.IConnectivityManager$Stub$Proxy.getActiveNetworkInfo(IConnectivityManager.java:1204)   // key line!
  at android.net.ConnectivityManager.getActiveNetworkInfo(ConnectivityManager.java:800)
  at com.xiaomi.NetworkUtils.getNetworkInfo(NetworkUtils.java:2)
  at com.xiaomi.frameworkbase.utils.NetworkUtils.getNetWorkType(NetworkUtils.java:1)
  at com.xiaomi.frameworkbase.utils.NetworkUtils.isWifiConnected(NetworkUtils.java:1)

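The stack shows that the ANR happened while fetching network information: getActiveNetworkInfo. System services are all served over Binder with a limited thread pool (on the order of 16 threads), so their capacity is limited, and a system service that does not respond for a long time can cause an ANR. If other apps occupy all of the service's Binder threads, the current app can only wait. You can search further for the keyword blockUntilThreadAvailable: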

at android.os.Binder.blockUntilThreadAvailable(Native method)

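If you find a thread whose stack contains this frame, look further at that stack to determine which system service it was calling. This kind of ANR is also a problem with the system environment; if it happens frequently on a particular phone model, the app can consider a workaround strategy.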
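Both checks can be run over a stack mechanically: find the BinderProxy.transactNative frame and report the first frame below it that belongs to an AIDL proxy (a class name containing $Stub$Proxy), which names the remote method being called, and flag Binder.blockUntilThreadAvailable. The helper below is a hypothetical sketch built only on the frame text shown in the log above.

```java
import java.util.List;

/** Hypothetical helpers for spotting Binder calls in a thread stack from an ANR trace. */
final class BinderStackCheck {
    private BinderStackCheck() {}

    /**
     * Returns the remote call, e.g. "android.net.IConnectivityManager$Stub$Proxy.getActiveNetworkInfo",
     * or null if the stack shows no Binder transaction.
     */
    static String remoteCall(List<String> stack) {
        boolean inBinderCall = false;
        for (String raw : stack) {
            String line = raw.trim();
            if (line.startsWith("at android.os.BinderProxy.transactNative(")) {
                inBinderCall = true;
            } else if (inBinderCall && line.startsWith("at ") && line.contains("$Stub$Proxy.")) {
                int paren = line.indexOf('(');
                return line.substring(3, paren < 0 ? line.length() : paren);
            }
        }
        return null;
    }

    /** True if the stack shows the thread waiting for a free Binder thread. */
    static boolean waitsForBinderThread(List<String> stack) {
        for (String line : stack) {
            if (line.contains("android.os.Binder.blockUntilThreadAvailable")) {
                return true;
            }
        }
        return false;
    }
}
```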

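8. Factors that affect ANRs

Even if all the tricks above let us capture the trace stacks when an ANR happens, in practice many of those stacks do not point at the root cause. A trace may suggest that some Service or Receiver caused the ANR, when the component itself is very likely not the problem. The reasons are explained below.

Broadly, the essential factors behind an ANR fall into two groups: the app's internal environment and the system environment. In the first, the system load is normal but the app's main thread has too many messages or severely time-consuming ones; in the second, the system, or other threads or resources inside the app, are overloaded, so the main thread is severely starved of scheduling.

We can't do much about high system load, but when the system load is normal, the main-thread scheduling problems are mainly these:

The business logic in the current trace stack is itself severely time-consuming
The business logic in the current trace stack is not very slow, but one earlier message in the scheduling history was severely slow
The business logic in the current trace stack is not very slow, but several earlier messages in the scheduling history were slow
The business logic in the current trace stack is not very slow, but the scheduling history contains a huge number of repeated messages (the business posts messages too frequently)
The business logic in the current trace stack is not slow, but other threads are heavily competing for resources such as IO, memory and CPU;
The business logic in the current trace stack is not slow, but other processes are heavily competing for resources such as IO, memory and CPU.

Note that of these six factors, all but the first may be impossible to identify from the ANR trace. As a result, the business logic shown on the main thread stack in an ANR trace often is not actually slow (the time may have been consumed by earlier messages that have already finished). How can we solve this?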

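9. Filling the gaps

ByteDance has an internal monitoring tool, Raster, built specifically to solve the problem above. Unfortunately it has not been open-sourced yet, but ByteDance's published articles on Raster explain how it works in detail. Original article: Toutiao ANR Optimization Practice Series - Monitoring Tools and Analysis Approach

How Raster works, roughly: the tool monitors the main thread's message dispatching and aggregates the data according to certain policies, keeping the tool's own impact on app performance and memory churn to a minimum. For messages that take a long time, it captures the main thread stack, so you know what the slow message was actually doing and can optimize it specifically. It also monitors the execution of messages for the four major app components, so their scheduling and durations can be tracked and recorded. In addition, it collects statistics on the message currently being dispatched and the messages waiting in the queue, so that when a problem occurs the main thread's overall scheduling can be replayed. Finally, the library ports the CheckTime mechanism used by system services to the app side as a thread CheckTime mechanism, so that when system information is insufficient, the timeliness of thread scheduling can be used to infer the system load and scheduling over a past period. In one sentence: from a point to the whole picture, replaying the past, the present and the future.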
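The core of the "replay the past" idea can be sketched as a bounded history of main-thread messages in which consecutive fast messages are merged into aggregate entries and slow ones are kept individually. This is a minimal sketch of that idea, not Raster's implementation (which is not public); the class name, capacity and threshold are assumptions. On a device, message durations could be obtained, for example, by installing a Printer with Looper.setMessageLogging and timing the ">>>>> Dispatching" and "<<<<< Finished" lines it receives.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical Raster-style history of main-thread messages (not Raster's code). */
final class MessageHistory {
    /** A single slow message, or a batch of consecutive fast messages merged together. */
    static final class Entry {
        final String label;
        final int count;
        final long wallMs;

        Entry(String label, int count, long wallMs) {
            this.label = label;
            this.count = count;
            this.wallMs = wallMs;
        }

        @Override
        public String toString() {
            return label + " x" + count + " (" + wallMs + "ms)";
        }
    }

    private final int capacity;
    private final long slowThresholdMs;
    private final Deque<Entry> entries = new ArrayDeque<>();
    private int pendingCount;   // fast messages not yet flushed into an aggregate entry
    private long pendingMs;

    MessageHistory(int capacity, long slowThresholdMs) {
        this.capacity = capacity;
        this.slowThresholdMs = slowThresholdMs;
    }

    /** Records one finished message and how long it took. */
    void record(String what, long wallMs) {
        if (wallMs >= slowThresholdMs) {
            flushPending();                      // keep order: the earlier fast batch goes first
            append(new Entry(what, 1, wallMs));  // slow messages are kept individually
        } else {
            pendingCount++;
            pendingMs += wallMs;
            if (pendingMs >= slowThresholdMs) {  // cap how much time one aggregate can hide
                flushPending();
            }
        }
    }

    /** Oldest-first copy of the history, e.g. for an ANR report. Flushes pending fast messages. */
    List<Entry> snapshot() {
        flushPending();
        return new ArrayList<>(entries);
    }

    private void flushPending() {
        if (pendingCount > 0) {
            append(new Entry("aggregated", pendingCount, pendingMs));
            pendingCount = 0;
            pendingMs = 0;
        }
    }

    private void append(Entry e) {
        if (entries.size() == capacity) {
            entries.removeFirst();               // bounded buffer: drop the oldest entry
        }
        entries.addLast(e);
    }
}
```

At ANR time, such a snapshot would show whether the message on the trace's main-thread stack was itself slow, or whether an earlier message consumed the time, which is exactly the ambiguity described in section 8.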

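A closer look at thread CheckTime: a separate worker thread runs a periodic check. At each wake-up it reads the current time; the gap since the previous wake-up, minus the configured interval, is how late this run was scheduled. For example, if the thread is set to run every 300ms but the actual interval sometimes exceeds 300ms, then the larger the deviation, the less promptly the thread is being scheduled, which indicates that the system's responsiveness is getting worse. This way, even when system logs are unavailable in production, the impact of system load on thread scheduling at different times can be observed indirectly. Several consecutive severe delays indicate that thread scheduling has been affected.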
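The arithmetic above can be sketched as follows: lateness is the measured gap minus the configured interval, and a run of consecutive late wake-ups signals degraded scheduling. The 300ms interval comes from the text; the class name, the lateness threshold and the "longest run" rule are assumptions of this sketch, not Raster's exact policy.

```java
import java.util.Arrays;

/** Hypothetical thread CheckTime sketch: how late does a periodic wake-up actually happen? */
final class CheckTime {
    private CheckTime() {}

    /** Lateness of one wake-up: actual gap since the previous wake-up minus the configured interval. */
    static long lateness(long prevWakeMs, long nowMs, long intervalMs) {
        return Math.max(0, (nowMs - prevWakeMs) - intervalMs);
    }

    /** Longest run of consecutive wake-ups whose lateness exceeds lateThresholdMs. */
    static int longestLateRun(long[] wakeTimesMs, long intervalMs, long lateThresholdMs) {
        int run = 0;
        int best = 0;
        for (int i = 1; i < wakeTimesMs.length; i++) {
            if (lateness(wakeTimesMs[i - 1], wakeTimesMs[i], intervalMs) > lateThresholdMs) {
                run++;
                best = Math.max(best, run);
            } else {
                run = 0;
            }
        }
        return best;
    }

    /** Records the real wake-up times of the current thread sleeping intervalMs between checks. */
    static long[] sample(int ticks, long intervalMs) {
        long[] wakeTimes = new long[ticks + 1];
        wakeTimes[0] = System.nanoTime() / 1_000_000; // monotonic clock, unaffected by wall-clock changes
        for (int i = 1; i <= ticks; i++) {
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return Arrays.copyOf(wakeTimes, i);   // only the samples taken so far
            }
            wakeTimes[i] = System.nanoTime() / 1_000_000;
        }
        return wakeTimes;
    }
}
```

For wake-ups at 0, 300, 600, 1000, 1400 and 1700ms with a 300ms interval, the gaps are 300, 300, 400, 400 and 300ms, so two consecutive wake-ups are 100ms late.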

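With these monitoring capabilities, we can clearly see the main thread's historical message scheduling when an ANR happens, together with sampled stacks of the severely slow messages; we also know how long the currently executing message has been running and the state of the messages waiting in the queue. Meanwhile, the thread CheckTime mechanism indirectly reflects how responsively threads are being scheduled. Together, these give app-side monitoring coverage that goes from a single point to the whole picture.

Someone has built a similar open-source library, MoonlightTreasureBox, based on the principles in that article (see the MoonlightTreasureBox open-source repository).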

10. QA

10.1 Does sleeping in Activity#onCreate cause an ANR?

No. There are only four ANR scenarios: Service Timeout, BroadcastQueue Timeout, ContentProvider Timeout and InputDispatching Timeout.

Of course, if the user taps the screen while Activity#onCreate is sleeping, an InputDispatching Timeout may be triggered.

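11. Summary

Congratulations, you have made it through the whole article.

ANR is an old topic. This article approaches it from several angles (definition, causes, scenarios, triggering process, monitoring and analysis) and tries to round out the knowledge around ANR as fully as possible.

ANR occurs in only four scenarios: Service Timeout, BroadcastQueue Timeout, ContentProvider Timeout and InputDispatching Timeout, but the causes are many and varied: the app may be at fault, or the system may. The triggering process falls roughly into two kinds: ANRs triggered by Service, Broadcast and Provider (plant the bomb, defuse the bomb, detonate the bomb), and ANRs triggered by Input (when processing a subsequent event, check whether the previous one timed out). Once an ANR is triggered, the system runs the ANR trace dump, collecting the stacks of the relevant processes and writing them to a file. We can listen for the SIGQUIT signal to sense that the system is running the ANR trace dump, further confirm whether the current process is actually in an ANR state, and then, by hooking at the boundary between the system and the app, obtain the ANR trace content the system dumped. With the ANR trace in hand, the next step is of course analysis; see the article above for details. Sometimes, however, the ANR trace cannot reveal the real cause of the ANR. That is when ByteDance's in-house heavy weapon, Raster, comes in: although it has not been open-sourced, ByteDance has shared its principles in full. Raster shows the main thread's message scheduling in the past, the present and the future, and combined with thread CheckTime for sensing scheduling responsiveness, it makes diagnosis far easier than analyzing the ANR trace alone.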

12. References

Thanks to the authors of all the excellent articles below.

Jank, ANR and deadlocks: how to monitor them in production? https://juejin.cn/post/6973564044351373326#heading-34
You call this thing IO multiplexing? https://mp.weixin.qq.com/s?__biz=Mzk0MjE3NDE0Ng==&mid=2247494866&idx=1&sn=0ebeb60dbc1fd7f9473943df7ce5fd95&chksm=c2c5967ff5b21f69030636334f6a5a7dc52c0f4de9b668f7bac15b2c1a2660ae533dd9878c7c&mpshare=1&scene=1&srcid=04239yXVUr6ekmLg7ZSKlFpa&sharer_sharetime=1619147468052&sharer_shareid=2498540345d210ebc4198a40ae94e9ec#rd
What is the principle behind epoll or kqueue? https://www.zhihu.com/question/20122137/answer/14049112
Gityuan: Understanding how Android collects ANR information http://gityuan.com/2016/12/02/app-not-response/
Gityuan: Understanding how Android ANRs are triggered http://gityuan.com/2016/07/02/android-anr
Gityuan: Input system: ANR principle analysis http://gityuan.com/2017/01/01/input-anr/
Gityuan: Thoroughly understanding the Android app-not-responding mechanism http://gityuan.com/2019/04/06/android-anr/
Gityuan: Input system: the complete event handling process http://gityuan.com/2016/12/31/input-ipc/
WeChat Android client's jank monitoring solution https://mp.weixin.qq.com/s/3dubi2GVW_rVFZZztCpsKg
How touch events reach the Activity http://m.itdecent.cn/p/7d442ed0a355
A brief analysis of Android input event handling (Part 1) https://zhuanlan.zhihu.com/p/26893970
[Android] The event handling system https://www.cnblogs.com/lcw/p/3373214.html
Design and implementation of the Android input system & ANR mechanism https://mp.weixin.qq.com/s/OyyP_BQqz0gLOfmZffoD1A
Android PLT hook overview https://github.com/iqiyi/xHook/blob/master/docs/overview/android_plt_hook_overview.zh-CN.md
Toutiao ANR Optimization Practice Series - Design Principles and Influencing Factors https://mp.weixin.qq.com/s/ApNSEWxQdM19QoCNijagtg
Toutiao ANR Optimization Practice Series - Monitoring Tools and Analysis Approach https://mp.weixin.qq.com/s/_Z6GdGRVWq-_JXf5Fs6fsw
Matrix - ANR principle analysis https://www.dalvik.work/2021/12/03/matrix-anr/
Xigua Video stability governance, part 3: Sliver principles and practice https://mp.weixin.qq.com/s/LW3eMK9O2tfFtZcu5eqitg (This article points out that Looper message-dispatch monitoring and SIGQUIT monitoring may fail to capture the real ANR, because by the time the stack is dumped the real moment may have passed; the stacks of the messages preceding the dump are needed. Matrix seems to have this; to be looked at later.)
Xigua jank & ANR optimization, governance and monitoring system https://mp.weixin.qq.com/s/2sjG5qkrUNQsI0jEsnh4kQ
WeChat Android client's ANR monitoring solution: monitoring signals https://blog.csdn.net/stone_cold_cool/article/details/119464855
Toutiao ANR Optimization Practice Series - A Collection of Case Studies https://mp.weixin.qq.com/s/4-_SnG4dfjMnkrb3rhgUag
Toutiao ANR Optimization Practice Series - Main Thread Apparent Hang Caused by a Barrier https://mp.weixin.qq.com/s/OBYWrUBkWwV8o6ChSVaCvw
Toutiao ANR Optimization Practice Series - Goodbye to SharedPreference Waits https://mp.weixin.qq.com/s/kfF83UmsGM5w43rDCH544g
Understanding how process killing is implemented - Gityuan blog | Yuan Huihui's tech blog
Understanding the Android process creation flow - Gityuan blog | Yuan Huihui's tech blog
[ANR] Android SIGQUIT(3) signal interception and handling - Alibaba terminal technology blog, CSDN
Practical guide: a comprehensive analysis of ANR logs https://zhuanlan.zhihu.com/p/378902923
Android ANR http://m.itdecent.cn/p/487771a67d1b

ANR Triggering, Monitoring and Analysis: All in One
