Closed Bug 1122119 Opened 9 years ago Closed 9 years ago

[System] Device randomly white-screens, becomes unresponsive, and spams computer (if connected) with 6 device folders.

Categories

(Core :: Hardware Abstraction Layer (HAL), defect)

37 Branch
ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking


VERIFIED FIXED
2.2 S5 (6feb)
blocking-b2g 2.2+
Tracking Status
b2g-v2.0 --- unaffected
b2g-v2.1 --- unaffected
b2g-v2.2 --- verified
b2g-master --- verified

People

(Reporter: jmitchell, Assigned: gsvelto)

References

Details

(4 keywords, Whiteboard: [3.0-Daily-Testing][fromAutomation])

Attachments

(8 files)

Description:
This has been seen by 5 different testers, all doing something completely different. The effect is an immediate white screen; everything becomes unresponsive - the Home button has no effect and no tactile feedback, the volume rocker does nothing, and the power button does nothing. If the device is connected (or is connected afterwards) to a computer, it will open the following drives:
1) Internal SD
2) 2.2 GB System
3) 75 MB Filesystem  with: Cache2, lost+found folder
4) 377 MB Filesystem with: b2g, bin, etc, fonts, lib, lost+found, tts, usr, vendor folders
5) 67 MB Filesystem  with: image, verinfo folders
6) 34 MB Filesystem  with: lost+found*, svoperapps* folders and WCNSS_qcom_wlan_nv.bin* file
* denotes the item has a lock symbol over it - presumably locked (Ubuntu laptop)

This occurs even if USB sharing is set to 'off'.
The user must remove the battery to recover.


Repro Steps:
1) Update a Flame to 20150115010229
2) Use the device normally (no consistent STR yet; see Notes below)

Actual:
The device white-screens and everything becomes unresponsive (Home button, volume rocker, power button)

Expected:
The device remains responsive; no white screen

Notes: 
This has been encountered under the following conditions
1) 3.0 Flame device (v18d-1) while in browser: searching Youtube.com (portrait mode) over wifi (possibly panning); not plugged in, two SIMs, USB sharing NOT turned on
2) 2.2 Flame device (v18d-1) going from Music to Homescreen to Settings; not plugged in, Sim in slot 1, USB sharing NOT turned on.
3) 2.2 device while in browser in landscape mode
4) 3.0 Flame right after fresh flash - in FTU, typing in password for wi-fi
5) Flame 3.0, OTA from prior day
6) Flame 3.0, in task manager, flashed build


Environmental Variables:
Device: Flame Master
Build ID: 20150115010229
Gaia: bcc76f93f5659ac1eb8a769167109fd2d7ca4fbd
Gecko: c1f6345f2803
Gonk: a814b2e2dfdda7140cb3a357617dc4fbb1435e76
Version: 38.0a1 (Master)
Firmware Version: V18d-1
User Agent: Mozilla/5.0 (Mobile; rv:38.0) Gecko/38.0 Firefox/38.0


Repro frequency: 6 times - random
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(pbylenga)
Could be related to bug 1121374.

Going to nominate to block based upon severity, need to find STRs.
blocking-b2g: --- → 2.2?
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(pbylenga)
I managed to reproduce this locally using our automation suite.

Here is a video of the issue:
https://www.dropbox.com/s/dylaf16mr13sdu0/VID_20150119_170223.mp4?dl=0

This issue is hitting our automation grid very hard. We have to manually reset the devices every day.

Also, as Joshua described in comment 0, the issue is very intermittent and we currently don't have any STR.

I have reproduced this using build:
Gaia-Rev        0f65b258bceddd9d479b3c027d9bd234c1e99aaf
Gecko-Rev       https://hg.mozilla.org/mozilla-central/rev/6446c26b45f9
Build-ID        20150119010205
Version         38.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150119.044519
FW-Date         Mon Jan 19 04:45:30 EST 2015
Bootloader      L1TC10011880

While running 
Also, the battery drain while the screen is white is very high: I lost 25% of the battery in 15-20 minutes.

After the reboot (removing the battery) adb is no longer available; it was active before I reproduced the issue. USB storage is also disabled.
Blocks: 1121374
Geo, this is the same issue we are seeing in our tests.
I managed to reproduce this while running: test_sms_with_attachments.py
Flags: needinfo?(gmealer)
Keywords: qablocker
If this issue is actually the one we hit in automation, this crash also prevents smoketests from being run efficiently.
Keywords: smoketest
This is incredibly urgent for us. I'm having to bring up almost the entire device lab again every day because the crash drains the batteries until the phones turn off.

How can we get this prioritized?
Flags: needinfo?(gmealer) → needinfo?(bbajaj)
@vchang The log has some interesting wifi activity in it, including an uncaught exception in modifyRoute. Is this potentially an issue?
Flags: needinfo?(vchang)
I think this one is going to be hard to window because it's very intermittent. It occurs extremely often across the lab as a whole, but that's with a bunch of phones running builds all day.

That said, the first time I saw a phone die because of this was 15.1 on Friday 1/9, where I asked davehunt about it in email. On Monday, I came in and half the lab was down. That's a pretty good bet we introduced it either Thursday or Friday.

So, I did go through all the machines I restarted in bug 1120307 (last Monday). Here's information for the earliest failures, in hopes that this provides at least a hint of a window.

One caveat: I think it takes a while to show up on the phone. Just because I found an earliest build doesn't mean it was introduced in that build. It might have been introduced well before that.

Also, getting the Gecko revision is tricky. Unfortunately, we always download and test against "/latest," so there's no hint of which actual revision got downloaded. 

So, I've included the timestamp of when we downloaded against latest and tried to trace this back to a revision via the last-modified timestamp on the server. Links into the internal ftp have been redacted.

All times are PST, aside from the Gecko download directories, which are UTC-stamped.

First failure I think is related:

b2g-19.1:

Failed Build #3554 (Jan 9, 2015 1:48:28 PM)
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3554/

Gaia: 8e16fa9e481eceefa1ea6fd7fa86c32969cde41f origin/master

Gecko:

http://jenkins1.qa.scl3.mozilla.com/job/flame-kk.b2g-inbound.download/2112/
--2015-01-09 13:48:15--  https://xxx/tinderbox-builds/b2g-inbound-flame-kk-eng/latest/flame-kk.zip

I think this is /20150109122431, Mercurial-Information: <project name="https://hg.mozilla.org/integration/b2g-inbound" path="gecko" remote="hgmozillaorg" revision="83a760aa8fc5"/>

Next failure I found:

b2g-15.1:

Failed Build #2014 (Jan 9, 2015 2:33:28 PM)
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/2014/

Gaia: 
60b1002a771e289f2dd4195908973a0f5f65ab20 origin/master

Gecko:

http://jenkins1.qa.scl3.mozilla.com/job/flame-kk.b2g-inbound.download/2113/
--2015-01-09 14:33:14--  https://xxx/tinderbox-builds/b2g-inbound-flame-kk-eng/latest/flame-kk.zip

I think this is /20150109135741, Mercurial-Information: <project name="https://hg.mozilla.org/integration/b2g-inbound" path="gecko" remote="hgmozillaorg" revision="40853b2ad772"/>

Finally, three I'm not sure are related:

These were device dropouts from adb, similar to above, but they were far enough back I don't think it's the same issue. Conversely, all the other nodes I checked failed sometime between the above and 2015-01-11. However, like I said, this might have taken a bit to manifest so these might be relevant.

b2g-20.1:

Failed Build #1952 (Jan 7, 2015 2:14:44 AM) 
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/1952/

Gaia: 2735cad8e836318d493c8c47e72991bc06b8019a origin/master

Gecko:

http://jenkins1.qa.scl3.mozilla.com/job/flame-kk.b2g-inbound.download/2051/
--2015-01-07 02:14:31--  https://xxx/tinderbox-builds/b2g-inbound-flame-kk-eng/latest/flame-kk.zip

I think this is /20150107002802, Mercurial-Information: <project name="https://hg.mozilla.org/integration/b2g-inbound" path="gecko" remote="hgmozillaorg" revision="37b3e957eda8"/>

b2g-27.2:

Failed Build #559 (Jan 7, 2015 10:14:37 PM) 
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.monkey/559/

Gaia: fd6b86eb3d443f26539520580805715b1d980465 origin/master

Gecko:

http://jenkins1.qa.scl3.mozilla.com/job/flame-kk.b2g-inbound.debug.download/571/
--2015-01-07 22:14:24--  https://xxx/tinderbox-builds/b2g-inbound-flame-kk-eng-debug/latest/flame-kk.zip

I think this is /20150107202757, Mercurial-Information: <project name="https://hg.mozilla.org/integration/b2g-inbound" path="gecko" remote="hgmozillaorg" revision="5db76886bbd2"/>

b2g-28.1:

Failed Build #1988 (Jan 8, 2015 9:04:43 AM) 
http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.ui.functional.smoke/1988/

Gaia: f14b329b8596592ecb563a8718efaabbdf491f57 origin/master

Gecko:

http://jenkins1.qa.scl3.mozilla.com/job/flame-kk.b2g-inbound.download/2087/
--2015-01-08 09:04:31--  https://xxx/tinderbox-builds/b2g-inbound-flame-kk-eng/latest/flame-kk.zip

I think this is /20150108072311, Mercurial-Information: <project name="https://hg.mozilla.org/integration/b2g-inbound" path="gecko" remote="hgmozillaorg" revision="b4926227268f"/>

Hope this at least narrows things considerably.
One other thing to look at would be if/when releng started mixing 18d-1 binaries into the builds. 

We've been doing fully automated testing on 188-1 while we piloted 18d-1 at our desks. However, if the releng packages have 18d-1 bits in them, we're overwriting part (but maybe not all) of the 188-1 code. That means an 18d-1-only issue might still manifest in the lab, or even that a package mismatch might be behind our higher frequency.
Could we get the dmesg log please?  That log might give us more of a clue on what happened after the crash/reboot.

adb shell dmesg

I've seen issues where plugging in the USB cable causes the low-memory killer (LMK) to kick in; I wasn't sure if it was just my build or not since I was making special builds.
Flags: needinfo?(jmitchell)
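For anyone grabbing logs from an affected device, a minimal sketch of pulling both dumps in one go over adb (assuming a single connected device; the output file names are just examples):

adb wait-for-device
adb shell dmesg > dmesg.txt
adb logcat -v threadtime -d > logcat.txt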
Found one in the lab in a white-screened state (smoking gun!)

Here's an adb logcat -v threadtime -d. Note the jump near the end from 09:55 to 13:08, which I think is probably when we white-screened (I rebooted at 13:08).

Last few lines look like it was trying and failing to poll for an update due to offline network.

01-20 09:55:50.650   205   205 I Gecko   : *** AUS:SVC UpdateService:onError - error during background update. error code: 111, status text: Network is offline (go online)
01-20 09:55:50.650   205   205 I GeckoConsole: AUS:SVC UpdateService:onError - error during background update. error code: 111, status text: Network is offline (go online)
01-20 09:55:50.650   205   205 I Gecko   : *** AUS:SVC UpdateService:_registerOnlineObserver - waiting for the network to be online, then forcing another check
01-20 09:55:50.650   205   205 I GeckoConsole: AUS:SVC UpdateService:_registerOnlineObserver - waiting for the network to be online, then forcing another check
01-20 09:55:52.910   209   209 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
01-20 09:55:56.490   205   205 I GeckoBluetooth: ServiceChanged: 1 client 0. new mClientId 0
01-20 09:55:57.920   209   209 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]

I went through Bebe's log a little to see if I could find something that matched. Searching on "error during background update" finds two instances, one of which looks like it might be right before the restart. Hard to tell without the timestamps though.
Thanks Geo!

clearing NI
Flags: needinfo?(jmitchell)
Worth uploading another one if the problem reappears. It'd be nice to have two to compare. Same with the adb logcat (please use "-v threadtime" per my command above)
Flags: needinfo?(jmitchell)
Attached file logcat after Reboot
Flags: needinfo?(jmitchell)
Attached file dmesg after reboot
Marty S. and I have been seeing this repro FAIRLY consistently in the Clock App

STR
1) Launch Clock App
2) Switch to Stopwatch tab
3) Hit Start

I have gotten the white screen 4 or 5 times today just letting the device soak on the stopwatch screen. Once at the 54-second mark but also at about the 60-minute mark, so no consistency in duration so far.
Hrm.  Reboot reset the logging I think.  When you white screen do you still have adb access?  If so, can you get the logcat and dmesg logs at that point?
Flags: needinfo?(jmitchell)
Removing 'smoketest' keyword since there are no consistent steps to reproduce yet.
Keywords: smoketest
(In reply to Naoki Hirata :nhirata (please use needinfo instead of cc) from comment #20)
> Hrm.  Reboot reset the logging I think.  When you white screen do you still
> have adb access?  If so, can you get the logcat and dmesg logs at that point?

When the device is white-screened you do not have adb access.
Flags: needinfo?(jmitchell)
Attached file logcat_1433.txt
There are no fixed steps to reproduce this issue. Sometimes the screen turns white when we perform some operation in different apps. This is my logcat from Bug 1123205; hope this helps.
Hi Henry, can you help check whether this is caused by Bug 1104664, which landed recently for the L porting work?
Flags: needinfo?(vchang)
(In reply to Geo Mealer [:geo] from comment #10)
> Also, getting the Gecko revision is tricky. Unfortunately, we always
> download and test against "/latest," so there's no hint of which actual
> revision got downloaded. 

When we flash we report the various revisions, so you should be able to determine this from any downstream jobs. 

For example, your first job: http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3554/

Has the following in the console log:

Showing version details
 0:02.00 LOG: MainThread mozversion INFO application_buildid: 20150109122431
 0:02.00 LOG: MainThread mozversion INFO application_changeset: 83a760aa8fc5
 0:02.00 LOG: MainThread mozversion INFO application_display_name: B2G
 0:02.00 LOG: MainThread mozversion INFO application_id: {3c2e2abc-06d4-11e1-ac3b-374f68613e61}
 0:02.00 LOG: MainThread mozversion INFO application_name: B2G
 0:02.00 LOG: MainThread mozversion INFO application_remotingname: b2g
 0:02.00 LOG: MainThread mozversion INFO application_repository: https://hg.mozilla.org/integration/b2g-inbound
 0:02.00 LOG: MainThread mozversion INFO application_vendor: Mozilla
 0:02.00 LOG: MainThread mozversion INFO application_version: 37.0a1
 0:02.00 LOG: MainThread mozversion INFO build_changeset: e0c735ec89df011ea7dd435087a9045ecff9ff9e
 0:02.00 LOG: MainThread mozversion INFO device_firmware_date: 1420837769
 0:02.00 LOG: MainThread mozversion INFO device_firmware_version_base: L1TC10011880
 0:02.00 LOG: MainThread mozversion INFO device_firmware_version_incremental: eng.cltbld.20150109.160920
 0:02.00 LOG: MainThread mozversion INFO device_firmware_version_release: 4.4.2
 0:02.00 LOG: MainThread mozversion INFO device_id: flame
 0:02.00 LOG: MainThread mozversion INFO gaia_changeset: 2c7d14040149e1f9b1bb3972ff150be0472fa6b6
 0:02.00 LOG: MainThread mozversion INFO gaia_date: 1420821049
 0:02.00 LOG: MainThread mozversion INFO platform_buildid: 20150109122431
 0:02.00 LOG: MainThread mozversion INFO platform_changeset: 83a760aa8fc5
 0:02.00 LOG: MainThread mozversion INFO platform_repository: https://hg.mozilla.org/integration/b2g-inbound
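For older runs where only the console log survives, the same fields can be pulled back out of Jenkins' plain-text console endpoint; a rough sketch, with the grep pattern simply matching the mozversion lines shown above:

curl -s http://jenkins1.qa.scl3.mozilla.com/job/flame-kk-319.b2g-inbound.perf.gaia/3554/consoleText | grep -E 'application_(buildid|changeset)|gaia_changeset'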
I can confirm the STR from comment #19

STR
1) Launch Clock App
2) Switch to Stopwatch tab
3) Hit Start

After a few minutes (1 to 20 in my case) we get the white screen crash.
From my side, I have never seen this issue on my device because I usually shallow flash due to bugs like bug 1104338.

Now that we seem to have STRs (comment 19), I compared 2 builds: one shallow-flashed[1] and one full-flashed[2]. The shallow-flashed one has run the stopwatch for an hour now. The full-flashed one has crashed 4 times in the same amount of time.

As inferred in comment 24, there is something wrong with Gonk. Moving the bug there.

Adding Smoketest keyword now that we have STRs.

[1] Gaia-Rev        5e98dc164b17fd6decb48a9eaddef0e55b82e249
Gecko-Rev       https://hg.mozilla.org/mozilla-central/rev/540077a30866
Build-ID        20150121010204
Version         38.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  65
FW-Date         Mon Dec 15 18:51:29 CST 2014
Bootloader      L1TC000118D0

[2] Gaia-Rev        5e98dc164b17fd6decb48a9eaddef0e55b82e249
Gecko-Rev       https://hg.mozilla.org/mozilla-central/rev/540077a30866
Build-ID        20150121010204
Version         38.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150121.043209
FW-Date         Wed Jan 21 04:32:21 EST 2015
Bootloader      L1TC000118D0
Component: Gaia::System → GonkIntegration
Keywords: steps-wanted, smoketest
See Also: 1123205
NI :henry here per comment #24. As stated, this is an urgent QA blocker and we need help on this.
blocking-b2g: 2.2? → 2.2+
Flags: needinfo?(bbajaj) → needinfo?(hchang)
(In reply to bhavana bajaj [:bajaj] from comment #28)
> NI :henry here per comment #24. As stated, this is an urgent QA blocker and
> we need help on this.

I don't think it has anything to do with Bug 1104664 but I am willing 
to try to get rid of that freaking error message.
Flags: needinfo?(hchang)
*** STR
Same as in comment 19, https://bugzilla.mozilla.org/show_bug.cgi?id=1122119#c19
Sometimes 5 minutes, sometimes it needs 30+ mins.

*** Log before white screen & during
whitescreen-take-02.log
Reproduce with logcat on (wifi output, bt output, gaia debug, console enabled)

06-03 02:10:03.082   214   214 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
06-03 02:10:08.092   214   214 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
06-03 02:10:13.102   214   214 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
06-03 02:10:18.112   214   214 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
06-03 02:10:23.122   214   214 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
06-03 02:10:26.572  1911  1911 I Clock   : Content JS LOG: [Clock] =====================================
06-03 02:10:26.572  1911  1911 I Clock   :  
06-03 02:10:26.572  1911  1911 I Clock   :     at logForAlarmDebugging (app://clock.gaiamobile.org/js/startup.js:5394:4)
06-03 02:10:26.572  1911  1911 I Clock   : Content JS LOG: [Clock] Alarm Debug: {"now":"2014-06-02T21:10:26.538Z","tz":-300} 
06-03 02:10:26.572  1911  1911 I Clock   :     at logForAlarmDebugging (app://clock.gaiamobile.org/js/startup.js:5395:0)
06-03 02:10:26.592  1911  1911 I Clock   : Content JS LOG: [Clock] ===== Raw IndexedDB Alarm Data: =====
06-03 02:10:26.592  1911  1911 I Clock   :  
06-03 02:10:26.592  1911  1911 I Clock   :     at logForAlarmDebugging/< (app://clock.gaiamobile.org/js/startup.js:5401:6)
06-03 02:10:26.592  1911  1911 I Clock   : Content JS LOG: [Clock] -------------------------------------
06-03 02:10:26.592  1911  1911 I Clock   :  
06-03 02:10:26.592  1911  1911 I Clock   :     at logForAlarmDebugging/< (app://clock.gaiamobile.org/js/startup.js:5406:6)
06-03 02:10:26.602  1911  1911 I Clock   : Content JS LOG: [Clock] ======= Remaining mozAlarms: ========
06-03 02:10:26.602  1911  1911 I Clock   :  
06-03 02:10:26.602  1911  1911 I Clock   :     at logForAlarmDebugging/request.onsuccess (app://clock.gaiamobile.org/js/startup.js:5411:6)
06-03 02:10:26.602  1911  1911 I Clock   : Content JS LOG: [Clock] -------------------------------------
06-03 02:10:26.602  1911  1911 I Clock   :  
06-03 02:10:26.602  1911  1911 I Clock   :     at logForAlarmDebugging/request.onsuccess (app://clock.gaiamobile.org/js/startup.js:5421:6)
06-03 02:10:28.132   214   214 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
06-03 02:10:33.142   214   214 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]
06-03 02:10:38.152   214   214 V WLAN_PSA: NL MSG, len[048], NL type[0x11] WNI type[0x5050] len[028]


*** Flame KK + v18D + Full flash
Serial: e47cd843 (State: device)
Build ID               20150120162501
Gaia Revision          b9da64ae4e704476b69102b874f1282ed2fd9598
Gaia Date              2015-01-20 20:30:31
Gecko Revision         https://hg.mozilla.org/releases/mozilla-b2g37_v2_2/rev/bf3726b91827
Gecko Version          37.0a2
Device Name            flame
Firmware(Release)      4.4.2
Firmware(Incremental)  eng.cltbld.20150120.194830
Firmware Date          Tue Jan 20 19:48:41 EST 2015
Bootloader             L1TC000118D0

*** Reference
http://youtu.be/yOIDK0VW7C0
Another log, with membuster eating memory; not sure if this is relevant.
Hi Henry, please let us know if the log helps, thanks..
Flags: needinfo?(hchang)
(In reply to Eric Chang [:ericcc] [:echang] from comment #32)
> Hi Henry, please let us know if the log helps, thanks..

Unfortunately none of the logs has relevant information, and none of them
points to a potential component that causes the white screen.
Flags: needinfo?(hchang)
I'm not sure it's a GonkIntegration bug since I can reproduce it with v18D + shallow gecko/gaia
Here's the version information:

Gaia-Rev        966b3a7a13a7f0d5b86cbc9e64cb78d43ec7dba8
Gecko-Rev       f92b976370502c2a6cf91a38ede0fa7e1459e57a
Build-ID        20150122154757
Version         38.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  65
FW-Date         Mon Dec 15 18:51:29 CST 2014
Bootloader      L1TC000118D0

When the issue happened, it looked like a kernel panic:
[  738.832088] kernel BUG at /local/build/foxfone-one-v1.4-release/v18D/kernel/kernel/workqueue.c:1137!
[  738.841205] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP ARM
[  738.847019] Modules linked in: wlan(O)
[  738.850753] CPU: 1    Tainted: G        W  O  (3.4.0-g6dc52cf-00031-g20e5d55 #1)
[  738.858137] PC is at queue_delayed_work_on+0x10c/0x12c
[  738.863252] LR is at queue_delayed_work_on+0x24/0x12c
[  738.868288] pc : [<c01925b8>]    lr : [<c01924d0>]    psr: 00000013
[  738.868293] sp : ee2b3f50  ip : 00000000  fp : ee7e8d80
[  738.879753] r10: c06018fc  r9 : c0ee77fc  r8 : c0e67df0
[  738.884953] r7 : 00000005  r6 : 00000001  r5 : ee28ff00  r4 : c19a7724
[  738.891536] r3 : c19a7728  r2 : 00000000  r1 : c19a7724  r0 : 00000000
…
[  739.735494] [<c01925b8>] (queue_delayed_work_on+0x10c/0x12c) from [<c06018fc>] (dbs_sync_thread+0x1f0/0x208)
[  739.745294] [<c06018fc>] (dbs_sync_thread+0x1f0/0x208) from [<c0197ae0>] (kthread+0x84/0x90)
[  739.753714] [<c0197ae0>] (kthread+0x84/0x90) from [<c0106820>] (kernel_thread_exit+0x0/0x8)
[  739.762041] Code: e1a00008 e1a0100a ebffd25a eaffffd3 (e7f001f2)
[  739.778449] ---[ end trace 4458f004cff388cb ]---
[  739.782152] Kernel panic - not syncing: Fatal exception
[  739.787260] CPU0: stopping
[  739.789958] [<c010bd74>] (unwind_backtrace+0x0/0xf8) from [<c010af7c>] (handle_IPI+0x1b8/0x1f8)
[  739.798625] [<c010af7c>] (handle_IPI+0x1b8/0x1f8) from [<c0100438>] (gic_handle_irq+0xa0/0xe4)
[  739.807223] [<c0100438>] (gic_handle_irq+0xa0/0xe4) from [<c08899bc>] (__irq_usr+0x3c/0x60)
[  739.815547] Exception stack(0xebfdbfb0 to 0xebfdbff8)
[  739.820585] bfa0:                                     00000000 00000000 00000000 00000000
[  739.828744] bfc0: ae3c48c0 ae8c2938 ae3c48a0 b6815794 b3a0e998 b3a0ea24 b3a0e9f2 ffffffff
[  739.836903] bfe0: 00000001 b3a0e7b8 b50d27e9 b5c10ea8 800f0030 ffffffff
[  740.843568] wcnss crash shutdown 0
[  740.845952] Rebooting in 5 seconds..
[  745.849278] Going down for restart now
[  745.852803] Calling SCM to disable SPMI PMI

However, it should not be purely a kernel issue, since the device can run for over 3 hrs if I only flash my local boot.img on pure v18D.
https://copy.com/zpB4FbaqPu0UaCY9
Some gecko/gaia change is likely what causes this bug.
For checking combinations of gaia, gecko & base image:
https://taiwan.etherpad.mozilla.org/BUG1122119
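For anyone who wants to repeat the kernel-only isolation above (a locally built boot.img on an otherwise pure v18D), the flashing step is roughly the following; the image path is just an example and depends on where your local build put it:

adb reboot bootloader
fastboot flash boot out/target/product/flame/boot.img
fastboot reboot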
FYI, I hit the white screen issue with the following setup:
1. Update the Flame base image to v188.
2. Update to today's gecko/gaia built by myself (not a full flash).
3. Wifi is turned off, so I assume there is no wifi driver running.
4. Run the Stopwatch app.

I can't reproduce the problem if I update the gecko to an earlier commit. I think we can bisect the gecko commits to find the problem.
Flags: needinfo?(echang)
(In reply to Vincent Chang[:vchang] from comment #36)
> I can't reproduce the problem if I update the gecko to an earlier commit.
Which gecko commit are you referring to?
Flags: needinfo?(vchang)
Hi Norry, Could you show us the info from your tests today? Thank you.
Flags: needinfo?(echang) → needinfo?(fan.luo)
Some details are missing from comment 34:
1) Since v18D ships with gecko/gaia v2.0, we can't reproduce this issue on pure v18D.
2) So far we can't reproduce it with v18D + v2.1 full images, but it happens easily with v18D + v2.2 full images.
   The next step should be to find out the difference between 2.1 and 2.2.

Not sure whether a 2.2 gecko/gaia change hits a timing issue in the kernel; we need more time to find out.

*Note that 2.1 and 2.2 use the same kernel: Linux version 3.4.0-g19b9a16-00073-g0865bc4 (mock_mozilla@bld-linux64-spot-080.build.releng.use1.mozilla.com) (gcc version 4.7 (GCC) ) #1 SMP PREEMPT Thu Jan 22 20:11:00 EST 2015
I suggest trying without this changeset: https://hg.mozilla.org/mozilla-central/rev/59452425e446

I've received other reports of weird kernel-related issues since it landed and I fear the changes in that code might have tripped over some kernel bug.
This bug affects our ability to keep reporting on the Gaia performance tests. Without these tests running and reporting, there is no simple way for us to know when regressions occur.
Here's my initial window based on the STRs in https://bugzilla.mozilla.org/show_bug.cgi?id=1124612#c5. For me it frequently took much more than 5 minutes to reproduce the bug on YouTube. As I was unable to reproduce after approximately 1 hour on multiple devices, I decided the last working build is BuildID: 20150108170901.

Initial Regression Window:

Last Working Environmental Variables:
Device: Flame 2.2
BuildID: 20150108170901
Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
Gecko: b3f84cf78dc2
Version: 37.0a1 (2.2) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0

First Broken Environmental Variables:
Device: Flame 2.2
BuildID: 20150109050134
Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
Gecko: ed280f6c7b39
Version: 37.0a1 (2.2) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0

Last Working Gaia First Broken Gecko: Issue DOES reproduce 
Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
Gecko: ed280f6c7b39

First Broken Gaia Last Working Gecko: Issue does NOT reproduce
Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
Gecko: b3f84cf78dc2

http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b3f84cf78dc2&tochange=ed280f6c7b39
(I saw the changeset mentioned in Comment 40 in this pushlog.)

Please let me know if I should continue with the secondary window on Mozilla-inbound.
QA Whiteboard: [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
(In reply to Yeojin Chung [:YeojinC] from comment #42)
> Last Working Gaia First Broken Gecko: Issue DOES reproduce 
> Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
> Gecko: ed280f6c7b39
> 
> First Broken Gaia Last Working Gecko: Issue does NOT reproduce
> Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
> Gecko: b3f84cf78dc2

Bug 1082290 is within that range; I really think we're hitting a kernel bug here. I'll try testing this today with these two changesets backed out and report here:

https://hg.mozilla.org/mozilla-central/rev/f9625445803a
https://hg.mozilla.org/mozilla-central/rev/59452425e446

If others want to try this too you're welcome to do it as this seems to take a while to reproduce.
(In reply to Gabriele Svelto [:gsvelto] from comment #44)
> If others want to try this too you're welcome to do it as this seems to take
> a while to reproduce.

My duped bug has easy STR:

1- Make sure to have a SIM in SIM1 (or just a SIM in full stop on the Open C)
2- Make sure the SIM is connected to the network
3- Connect to a wifi network
4- Install and launch the YouTube application from the marketplace
5- Search for "Top 10 Funniest Movie Insults"
6- Watch the video from WatchMojo.com (14:26 long)

This will cause the problem anything from a few seconds to a few minutes, but on average around 2/3 minutes on the Open C and 4/5 on the Flame.

I'll try with those change-sets backed out.
Gabriele, this is what I told you the other day on IRC. I was not experiencing this with self-built and v180 or v188 blobs. This started with v18D.
Flags: needinfo?(gsvelto)
(In reply to Alexandre LISSY :gerard-majax from comment #46)
> Gabriele, this is what I told you the other day on IRC. I was not
> experiencing this with self-built and v180 or v188 blobs. This started with
> v18D.

Contrary to this, I see this on v188 and on the Open C, using my STR in comment #45. The random white screen almost never happens (though I have seen it on the Open C), but playing a video triggers it quite easily. This makes me wonder what fix was actually employed to help with bug 975739...
Chris, I said v180 or v188 because I don't remember. Chances are I was still on v180 blobs :)
(In reply to Alexandre LISSY :gerard-majax from comment #46)
> Gabriele, this is what I told you the other day on IRC. I was not
> experiencing this with self-built and v180 or v188 blobs. This started with
> v18D.

I've tried v188 + master and I can't seem to reproduce it on my Flame. I'll now try with v18D + master. If this is really due to bug 1081871 then we're most likely looking at a kernel bug in the completely fair scheduler (CFS). Contrary to what I mentioned in the previous comment, bug 1082290 is not what I think triggers the bug; it might just have made it more apparent.

(In reply to Chris Lord [:cwiiis] from comment #45)
> My duped bug has easy STR:
> 
> 1- Make sure to have a SIM in SIM1 (or just a SIM in full stop on the Open C)
> 2- Make sure the SIM is connected to the network
> 3- Connect to a wifi network
> 4- Install and launch the YouTube application from the marketplace
> 5- Search for "Top 10 Funniest Movie Insults"
> 6- Watch the video from WatchMojo.com (14:26 long)
> 
> This will cause the problem anything from a few seconds to a few minutes,
> but on average around 2/3 minutes on the Open C and 4/5 on the Flame.
> 
> I'll try with those change-sets backed out.

Thanks, I'll try your STR too. If my hunch is correct and this is due to a bug in the kernel CFS then any kind of task running continuously will trigger this sooner or later.
Flags: needinfo?(gsvelto)
(In reply to Chris Lord [:cwiiis] from comment #45)
> (In reply to Gabriele Svelto [:gsvelto] from comment #44)
> > If others want to try this too you're welcome to do it as this seems to take
> > a while to reproduce.
> 
> I'll try with those change-sets backed out.

With these change-sets backed out, I can't reproduce using my STR in comment #45.
(In reply to Chris Lord [:cwiiis] from comment #50)
> (In reply to Chris Lord [:cwiiis] from comment #45)
> > (In reply to Gabriele Svelto [:gsvelto] from comment #44)
> > > If others want to try this too you're welcome to do it as this seems to take
> > > a while to reproduce.
> > 
> > I'll try with those change-sets backed out.
> 
> With these change-sets backed out, I can't reproduce using my STR in comment
> #45.

Actually, looks like I spoke too soon - got the white screen of death while playing a song in the music app with the screen visible (and this is how I've got it previously too).
(In reply to Chris Lord [:cwiiis] from comment #51)
> Actually, looks like I spoke too soon - got the white screen of death while
> playing a song in the music app with the screen visible (and this is how
> I've got it previously too).

OK, I'm half relieved as this seems to rule out the cgroups & CFS (which would be disastrously bad) though I'll still try more testing in that direction because even though disabling bug 1081871 makes us not use cgroups, part of the Android bits we use still make use of them.

Unfortunately I cannot seem to reproduce any of the STRs of this bug on my phone, neither with v188+master nor with v18D+master.

I'm also curious about comment 34. Viral, do you always get that kernel stack trace when you experience this issue?
Flags: needinfo?(vwang)
(In reply to Gabriele Svelto [:gsvelto] from comment #52)
> (In reply to Chris Lord [:cwiiis] from comment #51)
> > Actually, looks like I spoke too soon - got the white screen of death while
> > playing a song in the music app with the screen visible (and this is how
> > I've got it previously too).
> 
> OK, I'm half relieved as this seems to rule out the cgroups & CFS (which
> would be disastrously bad) though I'll still try more testing in that
> direction because even though disabling bug 1081871 makes us not use
> cgroups, part of the Android bits we use still make use of them.
> 
> Unfortunately I cannot seem to reproduce any of the STRs of this bug on my
> phone, neither with v188+master nor with v18D+master.
> 
> I'm also curious about comment 34. Viral, do you always get that kernel
> stack trace when you experience this issue?

Yes, I have a UART cable and I can get the kernel stack every time I hit this issue.

I also reverted the commit from comment 40 on 4 Flames; they all ran for over 2 hours and didn't hit this bug (before I reverted the commit, all 4 devices hit the white screen in less than one hour)
Flags: needinfo?(vwang)
(In reply to viral [:viralwang] from comment #53)
> Yes, I have a UART cable and I can get the kernel stack every time I hit
> this issue.
> 
> I also reverted the commit from comment 40 on 4 Flames; they all ran for
> over 2 hours and didn't hit this bug (before I reverted the commit, all 4
> devices hit the white screen in less than one hour)

Thanks, then I'd say we have to revert both and reopen both bug 1081871 and bug 1082290 and see if it helps with automation. BTW I don't think this is inconsistent with the result in comment 51. Even without bug 1081871 the system is already using the CFS, so we might hit the bug even without my patch. It's just that once we start using CFS for all processes in the system we're probably increasing the chance of triggering the bug.

This puts us in a very nasty situation however: cgroup support is a committed feature for 2.2 and some of the work we'd have done on top of them is important for low-memory devices. Dave, what's your take on this? Without a way of triggering the crash consistently (and in a short time) debugging the kernel is going to be a nightmare.
Flags: needinfo?(dhylands)
I used this commit. It has run over the weekend without problem. http://hg.mozilla.org/mozilla-central/rev/379ff1cbb3ca
Flags: needinfo?(vchang)
(In reply to Gabriele Svelto [:gsvelto] from comment #54)
> (In reply to viral [:viralwang] from comment #53)
> > Yes, I have a UART cable and I can get the kernel stack every time I hit
> > this issue.
> > 
> > I also reverted the commit from comment 40 on 4 Flames; they all ran for
> > over 2 hours and didn't hit this bug (before I reverted the commit, all 4
> > devices hit the white screen in less than one hour)
> 
> Thanks, then I'd say we have to revert both and reopen both bug 1081871 and
> bug 1082290 and see if it helps with automation. BTW I don't think this is
> inconsistent with the result in comment 51. Even without bug 1081871 the
> system is already using the CFS, so we might hit the bug even without my
> patch. It's just that once we start using CFS for all processes in the
> system we're probably increasing the chance of triggering the bug.

On the Open C, I'm hitting hard locks really frequently after reverting those patches - always while playing music and doing something on the phone (often when loading a new app)... But I definitely wasn't hitting this bug a few weeks before the 2.2 branch (and definitely not on 2.1). Do you know what other commits would have changed this behaviour so I can test reverting those too?
(In reply to Chris Lord [:cwiiis] from comment #56)
> On the Open C, I'm hitting hard locks really frequently after reverting
> those patches - always while playing music and doing something on the phone
> (often when loading a new app)... But I definitely wasn't hitting this bug a
> few weeks before the 2.2 branch (and definitely not on 2.1). Do you know
> what other commits would have changed this behaviour so I can test reverting
> those too?

I can't think of anything else. Are you hitting it more or less at the same frequency with that patch backed out?

It would be great if we could figure out if what you're experiencing on the ZTE Open C is the same bug as the one we're seeing on the Flame. There's two more things making this hard to debug:

- IIRC the ZTE Open C uses a different kernel version than the Flame since it's based on JB so we might be looking at a different issue altogether

- As I mentioned in my previous post my patch is exercising some functionality that is already there and it's already in use so you might be hitting the same bug even with my patch reverted
We encountered this bug in MTBF testing. It's bad, and the device stops charging.
(In reply to Gabriele Svelto [:gsvelto] from comment #57)
> (In reply to Chris Lord [:cwiiis] from comment #56)
> > On the Open C, I'm hitting hard locks really frequently after reverting
> > those patches - always while playing music and doing something on the phone
> > (often when loading a new app)... But I definitely wasn't hitting this bug a
> > few weeks before the 2.2 branch (and definitely not on 2.1). Do you know
> > what other commits would have changed this behaviour so I can test reverting
> > those too?
> 
> I can't think of anything else. Are you hitting it more or less at the
> same frequency with that patch backed out?
> 
> It would be great if we could figure out if what you're experiencing on the
> ZTE Open C is the same bug as the one we're seeing on the Flame. There's two
> more things making this hard to debug:
> 
> - IIRC the ZTE Open C uses a different kernel version than the Flame since
> it's based on JB so we might be looking at a different issue altogether
> 
> - As I mentioned in my previous post my patch is exercising some
> functionality that is already there and it's already in use so you might be
> hitting the same bug even with my patch reverted

I may have been seeing a separate, unrelated hard-lock due to the vsync stuff (which I'd enabled to test a while ago) - disabling that to check.

Note that I'm seeing two hard-locks that are a bit different... The white screen one, as described, and a separate one where the phone stops responding completely, but whatever's on screen remains there and if there's a song playing, it plays to the end before stopping (but the next song won't start).

I suppose these are separate issues and I'll try to keep track over the next few days.
Nical had the same issue a couple of minutes ago, and he runs 2.1.
Flags: needinfo?(nical.bugzilla)
(In reply to Alexandre LISSY :gerard-majax from comment #60)
> Nical had the same issue a couple of minutes ago, and he runs 2.1.

As mentioned on IRC this would rule out bug 1081871. I'm really confused now :-/
(In reply to Gabriele Svelto [:gsvelto] from comment #61)
> (In reply to Alexandre LISSY :gerard-majax from comment #60)
> > Nical had the same issue a couple of minutes ago, and he runs 2.1.
> 
> As mentioned on IRC this would rule out bug 1081871. I'm really confused now
> :-/

So I have seen this on a Flame without bug 1081871 (like twice), but not with the frequency and ease with which you can reproduce it now. I'd never seen it on the Open C until around the time of bug 1081871, but I wouldn't necessarily conclude that that's the culprit.

This has been triggered before by other issues, (e.g. bug 975739). My steps in comment #45 were the new issue that I was seeing that I cannot reproduce in that way pre-2.2.
We're blind here: we cannot easily know for sure whether everyone is experiencing the same kernel crash. Bug 1025265 would help with this.
Depends on: 1025265
(In reply to Alexandre LISSY :gerard-majax from comment #60)
> Nical had the same issue a couple of minutes ago, and he runs 2.1.

Yep. Not sure what to add to answer this needinfo. Took the flame out of my pocket, screen was white (so not sure what steps to reproduce) and I had to remove the battery to get things to work again. It was the first time it happened. It's a 2.1 flame updated this morning.
Flags: needinfo?(nical.bugzilla)
I started working on a regression window. We don't know when the first broken build was, but given that this bug was filed on Jan 15 and the automation lab started to blow up at the same time, I would expect the regression to have landed no more than 7 days before January 15th. I set up a range of 7 full-flashed devices running 1-day-spaced versions of master[1].

I'll update this thread once I get some crashes.

[1] Build IDs: 20150115010229, 20150114010205, 20150113010202, 20150112010228, 20150111010223, 20150110010205, 20150109010206
Looks like my guess was not quite right. Every build but one since 1/10 has crashed; 1/9 is still running. I'll check builds from 20150109160257 and 20150104010206 from now on.

Here are the details:

Build ID       | Time until white screen
---------------|-------------------------
20150109010206 | Still running
20150110010205 | 3'30"
20150111010223 | 4'05"
20150112010228 | 2'30"
20150113010202 | 13'30"
20150114010205 | Still running
20150115010229 | 8'50"
I tend to confirm the dates of the regression window shown in comment 42 when I follow the STR in comment 19. With the data I currently have, the regression happened on 1/9 between [1] and [2] on mozilla-central.

Build ID       | Time until white screen
---------------|-------------------------
20150104010206 | Still running after 40 minutes
20150105010205 | Idem
20150106010234 | Idem
20150107010216 | Idem
20150108010221 | Idem
20150109010206 | Still running after 100 minutes
20150109160257 | 2'15"
...
20150114010205 | 35'30"

I'll keep the device with 20150109010206 running and use the others to check the tinderbox builds.


[1] Gaia-Rev        2c7d14040149e1f9b1bb3972ff150be0472fa6b6
Gecko-Rev       https://hg.mozilla.org/mozilla-central/rev/86396560012
Build-ID        20150109160257
Version         37.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150109.192019
FW-Date         Fri Jan  9 19:20:28 EST 2015
Bootloader      L1TC000118D0


[2] Gaia-Rev        5f0dd37917c4a6d8fa8724715d4d3797419f9013
Gecko-Rev       https://hg.mozilla.org/mozilla-central/rev/b3f84cf78dc2
Build-ID        20150109010206
Version         37.0a1
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20150109.043919
FW-Date         Fri Jan  9 04:39:31 EST 2015
Bootloader      L1TC000118D0
Being introduced early in the day on the 9th is consistent with what I saw in the lab, per my comments above.
After double-checking the regression window on mozilla-central, I tend to confirm again the one found in comment 42. I will be sure after 12 hours. Here are the details:

Tinderbox Build ID | Time until white screen
(on moz-central)   |
-------------------|-------------------------
20150108060800     | Still running after 20 minutes
20150108165956     | Still running after 25 minutes
20150108170901     | Still running after 70 minutes
20150109050134     | 2'15
20150109051730     | 25'
20150109113431     | 2'50"
20150109114232     | 40 seconds
20150109114631     | 30 seconds
20150109185347     | 16'

If none of the first 3 builds above fails, the regression window in comment 42 will be confirmed: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=b3f84cf78dc2&tochange=ed280f6c7b39. I'll leave them running for the next 12 hours.

Yeojin, would you mind narrowing down the regression window in the meantime?

Extra note: mozilla-central bi-daily build 20150109010206 is still running after 185 minutes and 20150108010221 is too after 125 minutes.
Flags: needinfo?(ychung)
Mozilla-inbound Regression Window:

Last Working Environmental Variables:
Device: Flame Master
Build ID: 20150108234132
Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
Gecko: 8a01e921708c
Version: 37.0a1 (Master) 
Firmware Version: v18D-1

First Broken Environmental Variables:
Device: Flame Master
Build ID: 20150109001243
Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
Gecko: ed280f6c7b39
Version: 37.0a1 (Master) 
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0

Last Working Gaia First Broken Gecko: Issue DOES reproduce 
Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
Gecko: ed280f6c7b39

First Broken Gaia Last Working Gecko: Issue does NOT reproduce
Gaia: 5f0dd37917c4a6d8fa8724715d4d3797419f9013
Gecko: 8a01e921708c

http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=8a01e921708c&tochange=ed280f6c7b39

In this pushlog, I see the changeset for bug 1081871 again.
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage?]
Flags: needinfo?(ychung) → needinfo?(ktucker)
(In reply to Gabriele Svelto [:gsvelto] from comment #54)
> (In reply to viral [:viralwang] from comment #53)
> > Yes, I have a UART cable and I can get the kernel stack every time I hit
> > this issue.
> > 
> > I also reverted the commit from comment 40 on 4 Flames; they all ran for
> > over 2 hours and didn't hit this bug (before I reverted the commit, all 4
> > devices hit the white screen in less than one hour)
> 
> Thanks, then I'd say we have to revert both and reopen both bug 1081871 and
> bug 1082290 and see if it helps with automation. BTW I don't think this is
> inconsistent with the result in comment 51. Even without bug 1081871 the
> system is already using the CFS, so we might hit the bug even without my
> patch. It's just that once we start using CFS for all processes in the
> system we're probably increasing the chance of triggering the bug.
> 
> This puts us in a very nasty situation however: cgroup support is a
> committed feature for 2.2 and some of the work we'd have done on top of them
> is important for low-memory devices. Dave, what's your take on this? Without
> a way of triggering the crash consistently (and in a short time) debugging
> the kernel is going to be a nightmare.

I think that committing to cgroups is a nice idea, but if the kernel has issues, then we probably have to back out, since for the most part, bugs in the kernel are beyond our control.

If possible, we should even try to take FxOS out of the picture, and see if we can reproduce the crash using a few standalone apps
Flags: needinfo?(dhylands)
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
(In reply to Johan Lorenzo [:jlorenzo] (QA) from comment #69)
Update:

Tinderbox Build ID | Time until white screen
(on moz-central)   |
-------------------|-------------------------
20150108060800     | Still running after 15 hours
20150108165956     | Still running after 15 hours
20150108170901     | Still running after 15 hours
20150109050134     | 2'15
20150109051730     | 25'
...

I'll make builds within the regression range given in comment 70 to make sure if the cgroups are the issue or not.
(In reply to Johan Lorenzo [:jlorenzo] (QA) from comment #72)
> I'll make builds within the regression range given in comment 70 to make
> sure if the cgroups are the issue or not.

Thanks Johan this is appreciated.

(In reply to Dave Hylands [:dhylands] from comment #71)
> I think that committing to cgroups is a nice idea, but if the kernel has
> issues, then we probably have to back out, since for the most part, bugs in
> the kernel are beyond our control.

Yes, if it's confirmed then I think we should back out both bug 1081871 and bug 1082290, then I'll try to re-land bug 1081871 using nice values instead of cgroups but keeping more or less the same infrastructure so we can re-land bug 1082290 on top of it.

> If possible, we should even try to take FxOS out of the picture, and see if
> we can reproduce the crash using a few standalone apps

Yes, if the problem lies within the scheduler then just launching a bunch of shells running an empty loop should be enough to confirm it. The STR in comment 19 isn't doing much more except for redrawing the screen, but that already involves the graphics stack.
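A rough sketch of that kind of standalone check, assuming an adb shell on the device; the loop count is arbitrary (one busy loop per core is a reasonable start):

adb shell
# inside the device shell: start four CPU-bound busy loops, one per core
for i in 1 2 3 4; do sh -c 'while :; do :; done' & done
# leave them running with the screen on and watch for the white screen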
One note about comment 70: Gonk hasn't changed from build 20150108234132 to build 20150109001243 (it remained at revision cd63c7ae655ee08ffac32ce36a188f8fefc4b272). 

Regarding the results I've had since yesterday, the maximum time until a crash was 35 minutes. The 2 builds before bug 1081871 have exceeded more than twice this amount of time. Hence, bug 1081871 is effectively the cause of the regression (I'll keep them running to be 100% sure).

git commit (release/gecko)               | Bug         | Time until white screen
-----------------------------------------|-------------|-------------------------
0d551ec8d96c37303e666bdaaab9b3da757b815e | Bug 966157  | Still running after 90 minutes
dbfb26f4e14eba6de27f4aa08b68d06541c431a4 | Bug 1112112 | Still running after 90 minutes
b173507d675a472a432ecfee99ed9ed335c01f35 | Bug 1081871 | 10'
03363ed023916397c77bee715b94b2c37ec38811 | Bug 1039884 | 22'
ea0e5ac1190ce786a54c3d40a03f461d3372605a | Bug 1024809 | 4'

Gabriele, would you mind doing the back out?
Blocks: 1081871
Component: GonkIntegration → Hardware Abstraction Layer (HAL)
Flags: needinfo?(gsvelto)
Keywords: crash
Product: Firefox OS → Core
Target Milestone: --- → mozilla37
Version: unspecified → 37 Branch
Flags: needinfo?(fan.luo)
I've been running with the reverts and vsync disabled on the Open C since my last comment and haven't seen any kind of hard locks so far, even while playing music and watching videos. Results seem pretty conclusive anyway, but in case further evidence was needed.

Out of interest, are we certain this is a kernel bug and not an obscure bug in our code? I'm wondering if there's a simpler test to see if cgroups cause this issue without making major changes in Gecko.
Thanks everybody for the thorough testing. I've flagged bug 1081871 for back out; we'll probably re-land it with an option to switch between cgroup and regular nice support so we can keep the feature but only enable it if we're sure it won't run us into trouble again.
Flags: needinfo?(gsvelto)
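For context, the two mechanisms being weighed look roughly like this from an adb root shell. The /dev/cpuctl path follows the usual Android cpu-cgroup layout and is an assumption about this particular Gonk build, renice assumes the toolbox applet is present, and <pid> is a placeholder:

# cgroup-based: move a process into the background CPU cgroup (<pid> is a placeholder)
echo <pid> > /dev/cpuctl/bg_non_interactive/tasks
# nice-based: lower the same process's scheduling priority instead
renice 10 <pid>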
(In reply to Chris Lord [:cwiiis] from comment #75)
> Out of interest, are we certain this is a kernel bug and not an obscure bug
> in our code? I'm wondering if there's a simpler test to see if cgroups cause
> this issue without making major changes in Gecko.

Viral's stack trace in comment 34 is the smoking gun IMO, plus the fact that when this happens we completely brick the phone until reboot; doing that from within Gecko alone is practically impossible. To add some more circumstantial evidence, keep in mind that :jlebar and I tried enabling cgroup support in FxOS version 1.0 but ended up not doing it because it proved to be unstable on ICS. It seems the situation did not improve much on KK; at least not with the kernels we're provided with.
Bug 1081871 backed out from trunk and b2g37.
Assignee: nobody → gsvelto
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: mozilla37 → 2.2 S5 (6feb)
One last update:

git commit (release/gecko)               | Bug         | Time until white screen
-----------------------------------------|-------------|-------------------------
0d551ec8d96c37303e666bdaaab9b3da757b815e | Bug 966157  | Still running after 7 hours
dbfb26f4e14eba6de27f4aa08b68d06541c431a4 | Bug 1112112 | Still running after 7 hours

I'll turn these devices off from now on.
This issue is verified fixed on Flame 2.2 and Master.

Result: After playing a YouTube video (1 hour 40 min) three times, the white screen did not appear on either Flame 2.2 or Master.

Device: Flame 2.2 (319mb, full flash)
Build ID: 20150128002506
Gaia: cd42b034fd2825c3675ace3a67f5775eb61c2d60
Gecko: d824c65a6a2b
Gonk: e7c90613521145db090dd24147afd5ceb5703190
Version: 37.0a2 (2.2)
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:37.0) Gecko/37.0 Firefox/37.0

Device: Flame Master (319mb, full flash)
Build ID: 20150128010234
Gaia: 1d53fb07984298253aad64bfa4236b7167ee3d4d
Gecko: b2b10231606b
Gonk: e7c90613521145db090dd24147afd5ceb5703190
Version: 38.0a1 (3.0)
Firmware Version: v18D-1
User Agent: Mozilla/5.0 (Mobile; rv:38.0) Gecko/38.0 Firefox/38.0
Status: RESOLVED → VERIFIED
QA Whiteboard: [QAnalyst-Triage+] → [QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Keywords: verifyme
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage+]
Flags: needinfo?(ktucker)
See Also: → 1130035
See Also: → 1161445
See Also: → 1177020
Adding [fromAutomation] as it was also found in bug 1121374.
Whiteboard: [3.0-Daily-Testing] → [3.0-Daily-Testing][fromAutomation]
