Back in February 2004, I blogged about the "Hyperbole" number that we use to monitor PubSub's matching rate. In 2004, we were proud to say that we were regularly matching at a rate of 3 to 7 billion matches per day. But, times have changed... These days, the daytime Hyperbole number is usually over 1 trillion matches per day. A quick check, just a moment ago, shows a rate of 1,604,636,741,992 matches/day. We've seen the number go much higher.
As I wrote before, the Hyperbole number is meant to impress... It is "a number designed to sound impressive while also being completely truthful". But recently, we've found that the number is getting so large that it is hard for folk to get their heads around it. Even though it is much lower than we are designed to handle, the Hyperbole number has gotten to the point where it's beginning to sound like "science fiction" when initially presented to potential investors and others who are trying to evaluate our technology and approach. Fortunately, once folk recover from their initial shock (a trillion of anything is a lot...), it is fairly easy to show why the number makes sense. For instance, the LinkCount charts we publish on our site show that we've processed an average of 1.5 million new feed entries per day over the last month. During the last few days we've been handling over 2 million new entries per day and the number of new entries per day is only increasing. Since we service well over 300,000 subscriptions and since we match each new entry against every one of the subscriptions the instant each new entry is discovered, we're performing an average of close to 500 billion matches each day. Of course, the rate at which entries arrive during the day varies greatly. During "night-time" in the US, publishing rates are lower and they peak during the day. Thus, it shouldn't be hard to see that we often need to process new entries at a rate of 4x or more the average rate.
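The arithmetic behind that estimate can be sketched in a few lines of Python. The figures are the rounded ones quoted above (1.5 million entries per day, 300,000 subscriptions, a 4x daytime peak); the function and variable names are made up for illustration and are not part of PubSub's system.

```python
# Rounded figures taken from the post; the names are illustrative only.
ENTRIES_PER_DAY = 1_500_000   # average new feed entries per day
SUBSCRIPTIONS = 300_000       # "well over 300,000" subscriptions
PEAK_FACTOR = 4               # daytime rate of "4x or more" the average


def effective_matches_per_day(entries_per_day: int, subscriptions: int) -> int:
    """Every new entry is matched against every subscription."""
    return entries_per_day * subscriptions


average = effective_matches_per_day(ENTRIES_PER_DAY, SUBSCRIPTIONS)
peak_rate = average * PEAK_FACTOR  # expressed as a matches/day rate

print(f"average:   {average:,} matches/day")    # 450,000,000,000
print(f"peak rate: {peak_rate:,} matches/day")  # 1,800,000,000,000
```

With these rounded inputs the average comes out at 450 billion matches per day, and a 4x daytime peak corresponds to a rate of 1.8 trillion matches per day, which is in line with the 1.6 trillion reading quoted earlier.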
One solution to the problem we have in getting people to understand the Hyperbole number might be to change the time period over which it is computed. For instance, instead of reporting the rate of matches per day, we could report on matches/second. That would get us from "trillions" back down to the "millions" that people are more comfortable with. A rate of 1.6 trillion matches/day is only about 18 million matches/second. The two rates are mathematically equivalent, but the second is easier for folk to think about.
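For anyone who wants to check the conversion, here is a minimal Python sketch. The only input is the daily rate quoted at the top of this post; the helper name is invented for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def matches_per_second(matches_per_day: float) -> float:
    """Convert a matches/day rate into a matches/second rate."""
    return matches_per_day / SECONDS_PER_DAY


rate = matches_per_second(1_604_636_741_992)
print(f"{rate:,.0f} matches/second")
```

The result is a little over 18.5 million matches per second, the same rate expressed over a shorter period.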
While even 18 million matches per second sounds impressive, that number will look tiny in just a year or two. Today, we only monitor 14 million blog feeds but we expect that by the end of the year we'll be monitoring around 100 million. Today, we only service several hundred thousand subscriptions but we expect to be servicing many millions of subscriptions in the very near future. Given the growth of blogging, data syndication, and structured publishing, a massive increase in the number of items/events that we -- and anyone else in this business -- must handle is inevitable. Fortunately, we're designed to handle the load today. For instance, we regularly show people the results of benchmarks that we've done on our raw matching engine (with parsing, network, and delivery costs removed) that show that the engine can handle up to 3 billion matches per second on my desktop machine (2.4 GHz single processor). Of course, we'll never be able to build a message handling system that can drive the matching engine at full speed. However, the power is there...
It is important to realize, when considering the Hyperbole number, that we don't actually do 18 million operations per second in order to get the effect of having done that many comparisons. The only reason this system works as it does is that we've got an algorithm that is "sub-linear" to the number of effective matches we perform. What I mean by this is that as the number of matches required increases, the work we need to do increases at a lower rate. Thus, a doubling in the number of effective matches required might only result in an increase of 10% or 30% in the amount of work that our machines need to do. Given this, you should understand the Hyperbole number as an indication of the "value" of the work we've done rather than an indication of the number of operations that were actually performed by our machines.
There are many ways to do matching. The simplest and most naive methods require that each subscription be explicitly and directly compared to every event or message. These naive approaches result in systems whose work-load is "linear" with respect to the number of messages and subscriptions processed since they will always do a number of operations which is approximated by multiplying the number of messages by the number of subscriptions. A sub-linear system, which uses more sophisticated algorithms, will, on the other hand, normally do less than "one more unit of work" for each additional subscription added to the system. If a matching system is sub-linear then each new subscription is usually just a little bit cheaper to handle than any subscriptions that were registered before it. PubSub's matching algorithm is very sub-linear. If our algorithm weren't sub-linear, we'd need massively more machines than we currently use for matching. (i.e., we'd need more than one...)
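To make the distinction concrete, here is a toy Python sketch that contrasts the naive approach with one well-known way of avoiding it: an inverted index combined with hit counting. This is not PubSub's algorithm, which is not described here. It assumes a much simpler problem than the real one: each subscription is a non-empty set of keywords that must all appear in a message. All names are hypothetical.

```python
from collections import defaultdict


def naive_match(subscriptions, message_terms):
    """Compare every subscription with the message: work grows with
    the total number of subscriptions."""
    return {sid for sid, required in subscriptions.items()
            if required <= message_terms}


def build_index(subscriptions):
    """Map each keyword to the subscriptions that require it."""
    index = defaultdict(list)
    for sid, required in subscriptions.items():
        for term in required:
            index[term].append(sid)
    return index


def indexed_match(index, sizes, message_terms):
    """Visit only subscriptions that share a keyword with the message.
    A subscription matches when all of its keywords were hit."""
    hits = defaultdict(int)
    for term in message_terms:
        for sid in index.get(term, ()):
            hits[sid] += 1
    return {sid for sid, count in hits.items() if count == sizes[sid]}


# Assumed toy data: subscription id -> set of required keywords.
subscriptions = {
    "a": {"python", "rss"},
    "b": {"atom"},
    "c": {"rss", "xml"},
}
index = build_index(subscriptions)
sizes = {sid: len(required) for sid, required in subscriptions.items()}

message = {"rss", "xml", "feed"}
naive_result = naive_match(subscriptions, message)
indexed_result = indexed_match(index, sizes, message)
print(naive_result, indexed_result)  # both are {'c'}
```

In the indexed version, subscription "b" is never examined for this message because it shares no keyword with it. In this toy model, adding a subscription whose keywords rarely appear in messages adds almost no per-message work, which is the flavor of behavior that "sub-linear" refers to above.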
Much has changed for us since my first note on the Hyperbole number in February 2004. We've gone from several hundred subscriptions to several hundred thousand subscriptions. We've grown from four employees to around 20. And, while we were only monitoring a few hundred thousand feeds when we got started, we're now monitoring 14 million feeds. But, one thing that hasn't changed is that we are still using only one dual processor Pentium machine to do our matching and we've still got processing power to spare before we'll need to partition and scale. Of course, the other thing that hasn't changed is that we're still having a great deal of fun making this system work!
bob wyman
Nice try to make this sound like a big deal, but matching a few hundred thousand subscribers to 2 million entries per day is easy, even on a dual PC as you point out. I fail to see the hard problem that is being solved.
[Bob Wyman responds: Well, if you think it is easy, then try to do it yourself! If you can demonstrate it to us, we'll probably offer you a job. Or, identify a system that already does this... You'll find that it is harder than you think. Remember, you need to re-evaluate every one of the "few hundred thousand" subscriptions 2 million times each day in order to deliver the same low-latency matching that we do. Solutions that involve batching data, and thus increasing latency, aren't fair comparisons. You can only match one message at a time. Also, your system should be able to match arbitrarily complex boolean expressions on multi-element XML-formatted messages. In any case, remember as well that I wrote that we're currently running far below capacity... ]
Posted by: MikeT | August 05, 2005 at 16:49
Bob, Count me impressed. As you rightly point out, it's significant particularly if no one else is doing it, and there is a "market for it".
The key to the future of information management and delivery is personalization, and this is a significant step in that direction.
Q
Posted by: Peter Q | August 06, 2005 at 05:16
I'll take you up on that challenge, at least as far as thinking up an algorithm. I'm not looking for a job, but this looks like a fun problem!
http://www.myelin.co.nz/post/2005/8/12/#200508122
Let me know what you think.
Posted by: Phillip Pearson | August 12, 2005 at 07:10
Nice one, Bob.
Posted by: BixbeyH | August 24, 2005 at 16:56
What's Google's corresponding Hyperbole Number?
Posted by: Firedog | September 25, 2005 at 04:21