Closed Bug 1749908 Opened 2 years ago Closed 2 years ago

Infinite loop in HTTP3 hangs socket thread

Categories

(Core :: Networking, defect)

Desktop
All
defect

VERIFIED DUPLICATE of bug 1749910

People

(Reporter: heftig, Unassigned)

Details

While browsing, pages suddenly stopped loading. A look with htop showed the main process' socket thread eating 100% CPU and a look with perf top showed it jumping around Http3 ReadSegments and OnReadSegment code.

I killed Firefox and restarted it, and it immediately reproduced the issue again. I closed it and let the shutdown hang handler take a dump:

https://crash-stats.mozilla.org/report/index/07b0fd37-7f94-4371-bb05-0b78f0220113
Thread 7 is the Socket Thread.
Built from https://hg.mozilla.org/mozilla-central/rev/9487d469939ee838cecf62a96acc5236716e6b3e

The next start was with http3 disabled, which did not reproduce the issue.

This appears to be affecting all Firefox upgraded overnight, e.g. https://old.reddit.com/r/firefox/comments/s2u7eg/is_firefox_down/ or https://news.ycombinator.com/item?id=29918052 or a search for Firefox on Twitter. I hope the auto-updater can bypass http3, otherwise I'm not sure how it's going to update to fix the issue.

It's not related to a specific version; we're getting reports that even ESR is affected. The suspicion is that some long-existing HTTP3 bug is being triggered by an update to an external service.

Hundreds of my customers are impacted; this is an apocalypse.

If anyone needs to work around it, please open "about:config" in a new tab.
Search for "network.http.http3.enabled",
change it to false, then restart Firefox.
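
For anyone who has to apply the same workaround on many machines, a minimal sketch follows. It assumes the preference is set through a user.js file in the Firefox profile directory, which Firefox reads at startup; the function name disable_http3 and the profile path you pass in are hypothetical, and this is not an official Mozilla tool.

```python
from pathlib import Path

# The preference named in this comment, written in user.js syntax.
PREF_LINE = 'user_pref("network.http.http3.enabled", false);'


def disable_http3(profile_dir):
    """Append the HTTP/3 workaround pref to user.js in profile_dir.

    Returns True if the line was added, False if it was already present.
    profile_dir is an assumption: pass the path of your own profile.
    """
    user_js = Path(profile_dir) / "user.js"
    existing = user_js.read_text(encoding="utf-8") if user_js.exists() else ""
    if PREF_LINE in existing:
        return False
    # Keep the new pref on its own line if the file lacks a trailing newline.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    user_js.write_text(existing + separator + PREF_LINE + "\n", encoding="utf-8")
    return True
```

Running it twice on the same profile leaves a single copy of the line. Firefox has to be restarted for the setting to take effect, and the line should be removed from user.js again once a real fix ships.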

See Also: → foxstuck

That is a workaround, not a fix. It will break Firefox when HTTP/2 is deprecated in the future.

Comment from a reddit thread [1]

Other workaround: Go to preferences -> Firefox Data Collection and uncheck everything. Then restart Firefox

If that's correct, it might point to a service that has been updated and is exposing this bug?

[1] https://old.reddit.com/r/firefox/comments/s2utvv/psa_solution_for_firefox_not_working_right_now/

OS: Linux → All
Hardware: x86_64 → Desktop

(In reply to Gian-Carlo Pascutto [:gcp] from comment #2)

It's not related to a specific version; we're getting reports that even ESR is affected. The suspicion is that some long-existing HTTP3 bug is being triggered by an update to an external service.

This is anecdotal, but someone on 95.0.2 did not have an issue this morning until they tried to visit a Google doc, and then it started.

Recent bugs that might be relevant:
https://bugzilla.mozilla.org/show_bug.cgi?id=1700703 (Recent Firefox Nightly with HTTP3 enabled has problems loading HTTPS sites on Cloudflare)
https://bugzilla.mozilla.org/show_bug.cgi?id=1734110 (HTTP/3 stalls when switching to network with MTU<=1350)

(In reply to Glenn Watson [:gw] from comment #6)

Comment from a reddit thread [1]

Other workaround: Go to preferences -> Firefox Data Collection and uncheck everything. Then restart Firefox

If that's correct, it might point to a service that has been updated and is exposing this bug?

[1] https://old.reddit.com/r/firefox/comments/s2utvv/psa_solution_for_firefox_not_working_right_now/

I disabled Firefox Data Collection and Firefox indeed started working again.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #2)

It's not related to a specific version; we're getting reports that even ESR is affected. The suspicion is that some long-existing HTTP3 bug is being triggered by an update to an external service.

Yes. After disabling Firefox Data Collection here and re-enabling HTTP3, I'm writing this in Firefox just fine.

(In reply to Glenn Watson [:gw] from comment #6)

Comment from a reddit thread [1]

Other workaround: Go to preferences -> Firefox Data Collection and uncheck everything. Then restart Firefox

If that's correct, it might point to a service that has been updated and is exposing this bug?

[1] https://old.reddit.com/r/firefox/comments/s2utvv/psa_solution_for_firefox_not_working_right_now/

I disabled just "Allow Firefox to install and run studies" and it started working after a restart. Could a study have caused the error?

(In reply to Chris Hills from comment #15)

(In reply to mhoermann from comment #13)

I am all for limiting telemetry, but let's not argue in bad faith here. At best telemetry triggered an existing bug; it did not cause it.

I agree it is not the root cause but it certainly caused a massive problem for many users of Firefox, the majority of whom are opted-in by default. If they had not been, they would not have been affected.

We have other services with the same type of load balancer in front of them, and we currently suspect it is an HTTP/3 load balancing problem. Telemetry has nothing to do with this; it just happens to be one of the first services with an H3 load balancer.

I think I'm seeing this issue as well. I have a website pinned as the first tab, and that website is behind Cloudflare. If I open Firefox with that website open there, the network hangs. But if I close the tab, open Firefox, and then open that website, everything seems to work just fine.

I have all telemetry disabled and do not use DNS over HTTPS. I am still affected by the bug: as soon as I open Slack, Firefox starts to eat all the CPU and the page never loads. I don't think I have updated since yesterday, as I use the Fedora package. Version 95.0.2.

Our current suspicion is that a cloud provider or load balancer that fronts one of our own servers got an update that triggers an existing HTTP3 bug. Telemetry was first implicated because it's one of the first services a normal Firefox configuration will connect to, but presumably the bug will trigger with any other connection to such a server (so disabling telemetry is pointless). Our current plan is to disable HTTP3 to mitigate until we can locate the exact bug in the networking stack. The problem appears to be gone; we'll update on further steps.

(In reply to Xidorn Quan [:xidorn] UTC+11 from comment #18)

I think I'm seeing this issue as well. I have a website pinned as the first tab, and that website is behind Cloudflare. If I open Firefox with that website open there, the network hangs. But if I close the tab, open Firefox, and then open that website, everything seems to work just fine.

Can you provide the URL so we can try to reproduce? Thanks!

Flags: needinfo?(xidorn+moz)

(In reply to Gian-Carlo Pascutto [:gcp] from comment #21)

Our current suspicion is that Google Cloud Load Balancer (or a similar Cloudflare service) that fronts one of our own servers got an update that triggers an existing HTTP3 bug. Telemetry was first implicated because it's one of the first services a normal Firefox configuration will connect to, but presumably the bug will trigger with any other connection to such a server. Our current plan is to disable HTTP3 to mitigate until we can locate the exact bug in the networking stack.

For example, I can use Cloudflare's HTTP/3 test page:
https://cloudflare-quic.com/

It works just fine.

"Does my browser support HTTP/3 & QUIC?
When loading this page from Cloudflare's edge network, your browser used HTTP/3."

(In reply to Christian Holler (:decoder) from comment #22)

(In reply to Xidorn Quan [:xidorn] UTC+11 from comment #18)

I think I'm seeing this issue as well. I have a website pinned as the first tab, and that website is behind Cloudflare. If I open Firefox with that website open there, the network hangs. But if I close the tab, open Firefox, and then open that website, everything seems to work just fine.

Can you provide the URL so we can try to reproduce? Thanks!

I can no longer reproduce this issue with the website. It seems it uses HTTP/2 now. Maybe Cloudflare has rolled back some deployment?
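
One way to check whether a site still offers HTTP/3 is to look at the Alt-Svc response header it sends, since servers commonly advertise HTTP/3 there with an "h3" entry. The sketch below only parses a header value you have already captured (for example from the browser's network panel); the function names are hypothetical, the parsing is simplified (it splits on commas and does not handle commas inside quoted values), and a browser may also learn about HTTP/3 by other means, so treat a negative result as a hint rather than proof.

```python
def advertised_protocols(alt_svc):
    """Return the protocol IDs listed in an Alt-Svc header value, in order."""
    protocols = []
    # The special value "clear" means the server advertises no alternatives.
    if alt_svc.strip().lower() == "clear":
        return protocols
    for entry in alt_svc.split(","):
        # Each entry looks like: h3=":443"; ma=86400
        first_param = entry.split(";")[0].strip()
        if "=" not in first_param:
            continue
        protocols.append(first_param.split("=", 1)[0].strip())
    return protocols


def advertises_http3(alt_svc):
    """True if any entry is "h3" or a draft version such as "h3-29"."""
    return any(p == "h3" or p.startswith("h3-")
               for p in advertised_protocols(alt_svc))
```

For a header value such as `h3=":443"; ma=86400, h3-29=":443"; ma=86400` this reports HTTP/3 as advertised, while a value that lists only `h2` or is `clear` does not.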

Flags: needinfo?(xidorn+moz)

(In reply to Michel Zehnder from comment #25)

For example, I can use Cloudflare's HTTP/3 test page:
https://cloudflare-quic.com/

It works just fine.

I can verify Michel's observation. HTTP/3 works on that page.

The hint to work around the issue by setting network.http.http3.enabled to false is already getting major social media coverage, so a plan to rename this setting or to revert it to true may be needed in the future.

PS: Yeah, my Firefox also suddenly stopped working.

The problem is known; there's no need to add more evidence. How to deal with the fallout of temporary workarounds will also be part of the conversation.

Please be mindful when commenting, and sending notifications to hundreds of people. At this point, comments should be limited to those helping to fix the specific issue.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → DUPLICATE
Flags: needinfo?(shansense5)
Restrict Comments: true