Monday, September 17, 2012

Significant keep-alive performance flaw in iOS' NSURLConnection

The problem

I've recently done a lot of research to understand a perplexing performance problem haunting the iOS port of a commercial component I maintain at work (originally developed for Android).  Specifically, the problem presents itself when the user navigates repeatedly to a specific screen in the application which issues an HTTP HEAD request (via HTTPS) to determine if a particular static resource has been modified.  Normally, you would think, a quick and light-weight operation.

However, the performance penalty we were seeing in this simple usage of NSURLConnection was a measurable delay of approximately 600ms, despite a round-trip time to our servers of only ~80ms over our local Wi-Fi network.  Now I'm sure at this point a well-informed reader is going to exclaim: but wait, this is HTTPS, there's a well-known and significant handshaking penalty versus HTTP!  Of course, this same penalty exists for our Android version of the application, yet this workflow on Android does not present any significant delay.

So, what's going on?

The short answer: the connection has been closed.  While NSURLConnection does support HTTP keep-alive, it has a client-imposed timeout of just 12 seconds even for HTTPS.  This may seem like a reasonable timeout when you look at the problem with 1995 goggles: loading a complete web page efficiently without requiring a separate connection for each resource on the page.  However, in today's world the model is much richer.  Web services dominate the landscape of mobile applications (even when the browser is the client), using the web for more discrete data exchange where appropriate and moving the rest client-side.

And this data exchange frequently happens over HTTPS.  Borrowing from Google's own SPDY proposal: the future of the web depends on a secure network connection.  SPDY requires SSL and yet is designed for efficiency.  A paradox only if you don't embrace long keep-alive timeouts.

Does it matter?

The practical implication of NSURLConnection's short timeouts for users is a noticeable but not immediately reproducible delay experienced within applications that utilize web services over HTTPS.  A common user scenario making for an easy repro case on a large number of mobile apps out there (go ahead, try yours) is the login screen.  Logging in implies a secure connection, and often users are expected to take some time to awkwardly type their cat's name: Mr!Cuddl3s.  I timed myself with this exercise: 13 seconds.  In this amount of time, any HTTPS connections already opened to your application's domain have been closed and so here comes the 600ms+ SSL connection penalty before the user sees a frustrating message: "Your username or password does not match".  Returning to the input field, you recheck your username, delete the password, and start again.  If you were fast enough, you're quickly ushered off to the main screen for the application.  If, however, you took more than 12 seconds again to retype things, you will see that penalty one more time.  This is likely to repeat many times in the user's interaction with your application: idling on various screens reading content, setting the phone down for just a moment, etc.
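To make the login scenario concrete, here is a minimal sketch in Python of a client that drops any connection idle for longer than a fixed timeout.  It is a toy model rather than a description of NSURLConnection's internals: the 600ms and 80ms figures are the rough numbers quoted earlier, the session timeline and the function name are hypothetical, and treating every reused-connection request as costing exactly one round trip is a simplifying assumption.

```python
COLD_MS = 600  # approximate delay observed when a new HTTPS connection is needed
WARM_MS = 80   # approximate round trip on a reused connection (modelling assumption)


def request_delays(request_times_s, idle_timeout_s):
    """Estimate the delay of each request for a client that closes
    connections left idle for longer than idle_timeout_s seconds."""
    delays = []
    last_used_s = None  # no connection exists before the first request
    for t in request_times_s:
        reused = last_used_s is not None and t - last_used_s <= idle_timeout_s
        delays.append(WARM_MS if reused else COLD_MS)
        last_used_s = t
    return delays


# Hypothetical session: the login screen loads at t=0, then the user submits
# two login attempts, each after 13 seconds of typing.
login_times_s = [0, 13, 26]
print(request_delays(login_times_s, idle_timeout_s=12))   # [600, 600, 600]
print(request_delays(login_times_s, idle_timeout_s=120))  # [600, 80, 80]
```

Under these assumptions, a 12 second timeout makes every request in the session pay the full connection penalty, while a timeout measured in minutes pays it only once.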

TCP/SSL handshaking overhead shown in the Amazon Mobile app (Android left, iOS right)
The above image and linked YouTube video demonstrate the impact of the 12 second connection timeout from the user's perspective.  As you can see, the first request is much slower on iOS because Android has re-used the HTTPS connection established through prior interaction with the app.  Subsequent requests are identical in performance until the user idles for 12 seconds again, at which point the advantage returns to Android.

But what about optimization best practices?

There is significant precedent on the web suggesting that this is not a reasonable default.  A variety of libraries that I sampled do not exhibit this behaviour (Apache HttpClient, Android's implementation of HttpURLConnection, and serf).  None had such a short timeout; in fact, none initiated a client-side shutdown at all.  The same conclusion holds true for HTTPS servers in production: a sampling of Google, Facebook, and Amazon servers suggests they are all willing to let HTTPS connections linger for minutes, not seconds.
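For readers who want to run a similar check themselves, the following is a rough sketch of one way to probe how long a server tolerates an idle connection, using Python's standard http.client.  It is not the method used for the sampling above; the helper name survives_idle and its parameters are made up for illustration, and a False result only says the connection did not survive, not which side (or which intermediary) closed it.

```python
import http.client
import time


def survives_idle(conn, idle_s, path="/", sleep=time.sleep):
    """Return True if a second HEAD request succeeds on the same connection
    after it has been left idle for idle_s seconds."""
    try:
        conn.request("HEAD", path)
        conn.getresponse().read()  # drain the response so the connection can be reused
        if conn.sock is None:
            # The server answered with "Connection: close"; nothing to keep alive.
            return False
        sleep(idle_s)
        # The socket is still attached, so this request goes out on the
        # original connection rather than a freshly opened one.
        conn.request("HEAD", path)
        conn.getresponse().read()
        return True
    except (http.client.HTTPException, OSError):
        # A peer that dropped the idle connection typically surfaces here.
        return False
    finally:
        conn.close()
```

A call such as survives_idle(http.client.HTTPSConnection("example.com", timeout=10), idle_s=30) would test a 30 second idle period, with example.com standing in for whatever host you want to examine.  Repeating the call with increasing idle_s values gives a rough lower bound on the keep-alive window.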

Surely there is some reasoned thinking behind this behaviour, right?  Well, perhaps.  I can find little material publicly available except a brief thread on Apple's Mac Network Programming mailing list where an Apple engineer draws the conclusion that the closure is to support the radio's ability to enter an idle state.  While it is true that cellular radios have complex power saving state management (searching the web for material on Radio Resource Control [RRC] or Fast Dormancy has no shortage of white papers), I question the argument that the connection closure truly is in alignment with the details of the radio.

The first piece of evidence against this comes from the nature of RRC's lack of specificity or standard on timing windows.  Apple simply cannot know whether the network would ask the device to enter this idle state in 2 seconds, 10 seconds, or 15 seconds.  So a naive, hardcoded timeout of 12 seconds is seemingly just as likely to incur the worst possible performance as it is the best.

Furthermore, the hardcoded timeout is consistent between 3G and Wi-Fi connections, which of course have very different radio stacks and performance optimizations.  In the case of Wi-Fi, this appears to have even changed significantly from iOS 4 to iOS 5, with no change to NSURLConnection's behaviour.  In my extremely informal testing on iOS 5, I found that the radio entered a low power state in just a few seconds, indicating that the radio has to resume normal operation to handle the connection closure if the user switches off the screen just a few seconds after an HTTP request.  Hardly an edge case.

Admittedly, we're getting into white paper territory ourselves here to properly and convincingly prove that this timeout harms battery life (let alone carrier network performance!).  I'm not prepared to go through that much rigor just yet.  Instead, I'd like to simply make a call to Apple to reconsider this behaviour and re-evaluate their own internal research.  And, if you're reading, Apple, please feel free to reach out to me if there is interest in a thorough and more academic study on this topic.  I'd be happy to help.

What next?

NSURLConnection is intentionally opaque, offering no features to customize the underlying mechanics, making it easy for Apple to adjust them without fear of breaking backward compatibility.  Unfortunately this means that there's no convenient way to work around the problem if your app uses NSURLConnection.  While it is possible to switch to a third-party library such as ASIHTTPRequest, I don't personally recommend this approach, as this transition makes it more difficult for Apple to implement or enforce reasonable connection management policies.

The API is designed for Apple to implement best practices, so let's ask that they do exactly that.