April 24, 2006

Comments

Phil Wilson

Bob, I notice that draft-saintandre-atompub-notify-04 has expired[1], seemingly without a replacement. I flagged this on the pubsub blog in January[2]. Will there be a new draft?

[1] http://www.xmpp.org/drafts/draft-saintandre-atompub-notify-04.html
[2] http://sandbox.pubsub.com/blog/?p=35

rob r


On the "push .. NAT .. Jabber" issues, it would be interesting to see more about using keep-alive and push over "Comet"
[http://ajaxian.com/archives/comet-a-new-approach-to-ajax-applications].
Then the implementation of "Observer" patterns
[http://en.wikipedia.org/wiki/Observer_pattern] might be transparent even when HTTP is the only protocol available to the remote observer objects.

Dave Winer

FYI, the scaling issues on weblogs.com were on the ping-handling side, not in handling the requests from people polling changes.xml.

Also, you're right, the term "doomsayer" was not a good thing to say, and I apologize for any hurt it might have caused. However, you do a lot of name-calling in this piece, so much so that I'm reluctant to point to it, so as not to encourage this level of discourse. If you have something to say about this, can you do it without being so disrespectful about it?

Dave Winer

Also, do you really think we should re-do the network defined by RSS to build something different around notifications? Do you think RSS is failing to scale? Serious questions.

Randy Charles Morin

Sorry Bob, but the whole RSS ping thing is a failure. Both Technorati and PubSub rely on it and fail to index the vast majority of posts because the pings are simply dropped. My own evidence shows that PubSub responds to fewer than 10% of pings.

I've documented this countless times on The RSS Blog.
http://www.kbcafe.com/rss/?guid=20060409191336

Blogosphere search engines that use polling in addition to pings (IceRocket and Google Blog Search) are reliable and index almost all of the content of the feeds they track. Blogosphere search engines that rely solely on pings are not, and they index only a small percentage of the actual content.

This is most evident this month with PubSub where you've experienced massive outages for days on end. A few weeks ago, I talked with Salim about this and he confirmed the ping infrastructure wasn't working.

Dave Winer

Randy: WOW!

G. Roper

Here's why I believe Dave is correct:

Notification:
To receive notification, the listener must have a socket pre-allocated. Suppose the listener receives 100 notifications simultaneously. The listener will fork or spawn a new thread, create a new socket and repeat that 100 times. Each thread/process must be processed. At some level N of simultaneous or near-simultaneous notifications, the listening system will be swamped: N may be 100, 1000, or 1 million, but at some level the system _will_ necessarily fail due to lack of resources (memory or CPU). Alternatively, notifying systems may time out. The listening system has few choices: either die or discard notifications.

Polling:
Polling can be done at the leisure of the polling system. The number of connections Q can be limited to whatever the polling system can handle. There is zero, nada, NIL chance of the system being swamped, since at any time it is polling at most Q sites.

Summary:
While a polling system will be updated more slowly, it will scale linearly:
time to process M sites ~= M x (time to process 1 site)

Since the laws of probability _guarantee_ that a system based on notifications will eventually encounter a situation when it is saturated by notifications, using a notification system guarantees eventual failure.
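
A minimal sketch of the bounded-polling idea described above, assuming a thread pool stands in for the Q connections. The names poll_all and fake_fetch and the example site list are hypothetical; fake_fetch is only a stand-in for an HTTP GET of a feed, so nothing here touches the network.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def poll_all(sites, fetch, q):
    """Poll every site with at most q fetches in flight at any moment."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def guarded(site):
        # Record how many fetches are running so the cap can be checked.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        try:
            return fetch(site)
        finally:
            with lock:
                state["active"] -= 1

    # max_workers is the cap Q: the pool never runs more than q fetches at once.
    with ThreadPoolExecutor(max_workers=q) as pool:
        results = list(pool.map(guarded, sites))
    return results, state["peak"]


def fake_fetch(site):
    # Hypothetical stand-in for fetching the site's feed over HTTP.
    return f"feed for {site}"


sites = [f"site{i}.example" for i in range(50)]
results, peak = poll_all(sites, fake_fetch, q=4)
```

The peak count never exceeds q however long the site list grows; only the total time to get through the list increases.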

Danny

It *is* remarkable how successful polling-based syndication has been to date, and says a lot for the design of the web's key specification, HTTP. Saying a software system "doesn't scale" without any kind of qualification is pretty meaningless - anything can be scaled by throwing more iron at it. But there's no getting around Bob's basic point - push is inherently more efficient than polling.

Randy, two cases of (presumed) failure of the ping approach don't mean the technique is flawed. I believe there is a lot of potential, and it would be a shame for it to be neglected.

G. Roper, I'm afraid that analysis is a non-starter. It's just as reasonable to limit the number of connections at a push-based receiver as it is for a polling system to decide not to poll beyond its capabilities. Or if you prefer, "Since the laws of probability _guarantee_ that a system based on polling will eventually encounter a situation when it is saturated by subscriptions, using a polling system guarantees eventual failure."
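
To illustrate the point about limiting connections at a push-based receiver, here is a rough sketch that assumes a fixed-capacity in-memory queue. BoundedReceiver and its method names are hypothetical, and a real receiver would sit behind a network listener rather than a direct method call.

```python
import queue


class BoundedReceiver:
    """Accepts pushed notifications up to a fixed capacity and discards the rest."""

    def __init__(self, capacity):
        self.pending = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def notify(self, message):
        """Queue a notification if there is room; otherwise count it as dropped."""
        try:
            self.pending.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self):
        """Hand back everything queued so far, at the receiver's own pace."""
        handled = []
        while True:
            try:
                handled.append(self.pending.get_nowait())
            except queue.Empty:
                return handled


receiver = BoundedReceiver(capacity=100)
accepted = sum(receiver.notify(f"ping {i}") for i in range(1000))
handled = receiver.drain()
```

The receiver stays within its capacity no matter how many notifications arrive at once; the cost is the discarded notifications counted in dropped, which is the same trade a poller makes when it declines to poll beyond its capabilities.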

Ok, that's a bit flippant. But polling can never be more efficient than push. Want a proof? Consider a one-entry feed. To operate without dropping any entries, a polling system would have to ensure its polling frequency is high enough that the window between polls is narrower than the *minimum* time between new entries. Over time, the number of bits that are transferred will be the sum of a value proportional to the polling frequency multiplied by the number of bits transferred for each 'miss' (and any transport overhead, presumably constant), plus the total number of information bits (and any transport overhead). The number of bits transferred in a push system will simply be of the order of the total number of information bits. So the amount of data that has to be transferred in a polling system will be *at least* as much as in a push system.
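
The bit-counting argument can be put into rough numbers. This is only a sketch under stated assumptions: the function names are hypothetical, every figure below is made up for illustration, and a 'miss' is assumed to carry no body at all, which is the best case for polling.

```python
def polling_bits(polls, new_entries, entry_bits, miss_bits, overhead_bits):
    """Bits moved by a poller over some period of time.

    Every poll pays the transport overhead. Polls that find a new entry also
    carry the entry; the remaining polls ('misses') carry miss_bits each.
    """
    if polls < new_entries:
        raise ValueError("a one-entry feed needs at least one poll per new entry")
    misses = polls - new_entries
    return polls * overhead_bits + misses * miss_bits + new_entries * entry_bits


def push_bits(new_entries, entry_bits, overhead_bits):
    """Bits moved by a push system: one message per new entry."""
    return new_entries * (entry_bits + overhead_bits)


# Illustrative numbers only: one poll a minute for a day, ten new entries.
poll_total = polling_bits(polls=1440, new_entries=10, entry_bits=8000,
                          miss_bits=0, overhead_bits=2000)
push_total = push_bits(new_entries=10, entry_bits=8000, overhead_bits=2000)
ratio = poll_total / push_total
```

With these numbers the poller moves 2,960,000 bits against 100,000 for push. The two totals are equal only when every single poll finds a new entry.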

In practice the polling window is expanded considerably by allowing multiple entries in each document, so fewer calls are needed. But this comes at a price: when there are new entries, all the already-received data in the feed gets transferred again (unless, as Bob suggests, you only pass deltas).

It's been a long time since I read it, but I believe there are some of the relevant sums in Rohit Khare's dissertation (no coincidence that he's also the guy behind mod_pubsub and KnowNow).

There are trade-offs between the approaches, and I suspect further down the road we'll see more interesting hybrids. What we won't see is a mathematical proof from Dave.

Eugene Y. Jen

The 'Mathematical Proof' rests on the assumption that a receiver has to maintain persistent connections in order to be notified. But, just like the assumption that G. Roper employed in the polling case, notification can be handled at the receiver's own PACE or CAPACITY. SMTP is the most popular push-based notification system, and it scales pretty well almost 30 years after its introduction. The problem for the push model is how to make a receiver's well-known endpoint visible outside NATs and firewalls. As long as this problem is dealt with, the receiver's end can take notifications via REST, XML-RPC, SOAP, SMTP or XMPP without problem at ITS own PACE. The remaining problem will be how pushers schedule their updates to achieve minimum cost, maximum efficiency and shortest latency for notification delivery.

Danny

Oh yeah, what I forgot to mention was things get worse when you start chaining polling systems - the simplest case being synthesized feeds (aggregate & republish). They have to cache otherwise the polling window will have to be reduced at every stage to prevent missing entries. The "Reading List" case is slightly different, but I suspect that's unlikely to scale indefinitely unless you're prepared to accept either a certain proportion of lost data, or significant redundancy in caching. This isn't "doomsaying", such costs might be worth it, and in practice such systems as a whole may turn out to be more useful than their push-based counterparts. But there's no sense in pretending the scalability issues don't exist.

Salim Ismail

Hoy - Randy. I gotta argue with your quote. I don't remember ever saying the 'ping infrastructure wasn't working'. More likely I said something about spings or the fact that some blogs don't yet ping or something like that. The ping infrastructure is the foundation of anything to do with syndication and is constantly evolving. It took roughly 10 years each for SMTP and HTTP to fully iron themselves out - it'll also take a while for this new wave to do so. However, once it's fully there, a whole new class of applications and services gets enabled, which is what's so exciting.

Randy Charles Morin

Danny, your proof left out the reason the blogosphere ping has failed.
http://en.wikipedia.org/wiki/Sping

[Bob Wyman responds: Randy, the "blogosphere ping" has NOT failed. Sure, we receive lots of spam pings, but we can recognize most of them for what they are and we filter them out. All pings we receive at PubSub are verified before we forward them to the FeedMesh or other subscribers. (By "verify" I mean that we verify that the ping corresponds to an actual change in a feed.) If nothing else, spam pings are a great indicator of who the spammers are! The vast majority of sites that ping "too fast" or that send "fake pings" (pings for feeds that have not changed) are run by spammers. So, spammers who are spingers just draw attention to themselves and will end up being blocked.

bob wyman ]
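
One way to picture the "verify" step described above, offered only as a sketch and not as PubSub's actual method: keep a digest of each feed as last seen and treat a ping as genuine only when the fetched feed body has changed. PingVerifier and its method are hypothetical names, and the caller is assumed to have already fetched the feed body as bytes.

```python
import hashlib


class PingVerifier:
    """Treats a ping as real only if the feed body differs from the last one seen."""

    def __init__(self):
        # Maps feed URL -> SHA-256 hex digest of the most recently seen body.
        self.last_digest = {}

    def verify(self, feed_url, feed_body):
        """Return True if feed_body is new for feed_url, False for a fake ping."""
        digest = hashlib.sha256(feed_body).hexdigest()
        if self.last_digest.get(feed_url) == digest:
            return False  # nothing changed, so the ping was fake or repeated
        self.last_digest[feed_url] = digest
        return True


verifier = PingVerifier()
first = verifier.verify("http://blog.example/feed", b"<rss>entry 1</rss>")
repeat = verifier.verify("http://blog.example/feed", b"<rss>entry 1</rss>")
changed = verifier.verify("http://blog.example/feed", b"<rss>entry 2</rss>")
```

A sender whose pings keep failing this check is pinging for feeds that have not changed, which matches the "fake pings" described above.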

Randy Charles Morin

Bob, if the spam pings are so easily filtered, then why are most of my referrers in PubSub splogs?

Here's a particularly bad day just a week ago.
http://www.pubsub.com/site_inlinks.php?site=kbcafe.com&linktype=in&date=20060417

[Bob Wyman responds:
Randy, spam pings and spam are two often related yet independent problems. Not all spammers produce spam pings -- some are quite proper in the way that they ping to notify us of updates to their spam... Not all spam pings are produced by spammers -- some otherwise respectable bloggers generate vast quantities of repeated pings, fake pings, etc.
We have a variety of methods to detect and filter ping spam and we have a different set of methods to detect and filter spam itself. Success in one of these two areas helps in the other but doesn't "solve" the other problem.


bob wyman ]

G. Roper

Danny,

bob wyman's original post stated:
"Winer claims that he can produce a "mathematical proof" that polling does, in fact, scale."

I showed that polling scales. I didn't argue that polling was "more efficient than push". Nor does my argument claim that polling systems cannot be saturated.

I forgot to mention the use of caching on the WWW, which improves the efficiency of HTTP and polling. I am uncertain whether/how these caches work in the case of push technologies. Perhaps someone else can tell us?

Eugene Y. Jen

Caching does not make polling more efficient than pushing. A network of N nodes has to transmit at least N-1 times from one node to another to spread a piece of information from the original node to every node in the network. What Danny pointed out is that, by applying the Nyquist–Shannon–Kotelnikov sampling theorem, a polling system has to poll at a frequency twice the publisher's frequency of changes to catch every update from a publisher. Therefore any polling system has to consume at least 200% of the bandwidth that a push system does, if you expect a polling system to behave exactly as a push system.

I don't think that caching applies to push systems as it does to polling systems. Instead, a push system needs some QoS mechanism to avoid saturation and a routing strategy to relay messages among nodes.

Randy Charles Morin

Bob, you keep telling us this and that works. Of course, we're not on your end, we have no choice but to believe you. What we see is that PubSub and Technorati go weeks and months without indexing blogs that are updated regularly. Somewhere in there is a disconnect. You'll have to tell us what's broken, because something is.

The comments to this entry are closed.