SD TImes: Software Development Click for BZ Media
Search
NewsCurrent IssueBack IssuesColumnsOpinionsResource LinksSubscription services
About UsMedia KitSite MapContact UsHome

Advertisement







INTEGRATION WATCH:
Solving the Problems of Distributed Computing

By Andrew Binstock

Andrew Binstock
October 1, 2004 — Commercial distributed computing increasingly looks like grids or clusters and less like the traditional multitiered paradigm. More and more sites are forgoing the expensive monolithic server in favor of large numbers of inexpensive computers hooked together in a network of peers. The reason sites prefer this design is simple: price. There are other benefits as well, but price is the overriding consideration.

While distributed computing leverages inexpensive systems to distribute the computational load and thereby boost performance, it does little to address the bottleneck caused by database access. Moreover, in a distributed system, the possibility of multiple nodes needing the same data is great, meaning that the addition of new nodes will increase disk I/O and ultimately reduce overall performance rather than enhance it. As sites are beginning to appreciate the magnitude of the problem, systemwide caches of database query results are quickly becoming popular.

My column of Aug. 15 (“Caching In on Java Caches,” page 29) discussed advanced caching solutions that are emerging in the Java market to handle the bottleneck of the database access and disk I/O. Because Java is the de facto runtime for the business logic platforms, most commercial caching solutions are exclusively Java products. However, many Web sites today do not use Java. They use PHP, Perl, .NET or custom solutions to handle the Web requests and dynamically generate pages. For their caching, they need a solution that is not Java-based. Some of them are turning to a free, open-source product named memcached.

It was first conceived by Brad Fitzpatrick of the Danga Group. He was running a blogging site called LiveJournal.com that used approximately 70 PCs to support the more than 2.5 million accounts. Fitzpatrick quickly came to realize that the I/O thrash caused by hitting the database was by far the most important gating factor on performance. He had expected the caching mechanism in the database server to alleviate some of this load. However, he found that when tables in the database were updated, the cache entries for the entire table were invalidated. Because a journal site updates tables frequently, the database cache was mostly invalid, and so it offered little, if any, benefit.

In response, Fitzpatrick developed memcached, which is a coordinated caching system. You can run as many instances of it as you wish on different machines, and they will all work together to provide a very substantial caching resource. At LiveJournal, Fitzgerald puts memcached on most of the systems, including all 32-bit systems with more than 4GB of RAM. He uses the RAM above 4GB for cache slices. In this way, he provides an enormous cache that can be scaled up by the addition of more PCs or by increasing the RAM in existing systems. It’s clean, simple and very effective.

At LiveJournal, he cached the most popular 30GB of data and received a 92 percent hit rate—all of those hits being ones the database did not have to fulfill. The site averages about 20 million hits per day. After successful deployment there, memcached is now in production at the famed Slashdot Web site and at Wikipedia. Source Forge is about to begin deploying it as well. In other words, this is software that has been banged on and has proven itself.

The cache slices run on Linux systems, and sport an API for clients that use PHP, Python, Perl or Java. The API is simple and has only a few calls, consisting mostly of inserting and retrieving items from the cache plus a few housekeeping functions.

Configuration of the cache is not terribly difficult: To get it running, you use command-line parameters to specify the amount of memory and the TCP/IP port the cache should listen to for requests. From then on, memcached more or less runs by itself. Some sites report running it for months at a time without needing to tend it.

There is a certain robustness to memcached: If a node running an instance goes down, the other caches keep right on working. If a cache request is sent to the down node, the lack of response is translated as a cache miss, which is the equivalent of “go fetch this from the database.” When the node is returned to action, its cache slice is gradually replenished and the whole system keeps right on rolling.

As I mentioned earlier, memcached is open source and free. It’s distributed under the very permissive BSD license, which in rough terms states: Do what you want with this code and product, as long as you print the copyrights and waive liabilities and warranties. Because the product is open source, it’s attracting support for new platforms and being tested regularly on new projects, so it should only be improving. The home site for memcached is www.danga.com/memcached.


Andrew Binstock is the principal analyst at Pacific Data Works LLC.






Andrew Binstock is the principal analyst at Pacific Data Works LLC


Columns
Alan Watch

And Another Thing...

First Look


Industry Watch

INTEGRATION WATCH

Java Watch

Windows & .NET Watch

E-mail your comments to Andrew Binstock

Advertisement



Click here for a complete listing of Integration Watch Columns

Click here to see a complete Column Archive.


 


 

 

 

 

 

 

 

 

 

 

 

 

  





 Back to Top



news
| current issue | back issues | columns | opinions | resource links
about | site map | subscriptions | media kit | contact us

Copyright © 1999-2004 BZ Media, LLC, all rights reserved.
Phone: 631-421-4158 • E-mail: info@bzmedia.com