Contents:
* Web hosting comparison service
* Borland/Codegear is sponsoring the PHP Programming Innovation Award
* Surviving Digg
- 1. Avoid accessing databases
- 2. Cache Web pages
- 3. Avoid needless personalization
- 4. Queue tasks that may take too long
- 5. Move images, CSS and Javascript to a multi-threaded Web server
- 6. Minimize page serving time with page compression
- 7. Put the Web, mail and database servers in different partitions
- 8. Distribute the load when the servers limit is reached
* Your tips to handle traffic peaks
* Web hosting comparison service
Before proceeding to the main topic of this post, I would like to announce a new service available now to the PHPClasses site users.
Everybody needs to choose a hosting service. But which company provides the best hosting service for your needs? Price is one factor, but there are others, such as Web server speed, uptime, support response time, etc.
Starting this month, the PHPClasses site is partnering with a company named RealMetrics. They audit Web hosting services and compile statistics that let you compare them, so you can find the best hosts for your needs.
Now you may browse many of the available hosting companies in a sub-site of the PHPClasses site. If you forget the URL, just look at the PHPClasses home page and follow the link that says: "Web Hosting Comparisons".
http://hosting-metrics.phpclasses.org/
If you have a Web hosting company, send a message to hosting-metrics@phpclasses.org to learn how to get your company listed.
Keep in mind that this is an experimental partnership service. Feel free to post a comment on this article about this service, so we know how it goes.
* Borland/Codegear is sponsoring the PHP Programming Innovation Award
I also have another interesting announcement. Codegear, a division of Borland focused on software development tools, has just joined the already long list of sponsors of the PHP Programming Innovation Award.
If you joined PHPClasses recently and are not aware of this award, it is an initiative meant to distinguish developers that submit the most innovative components to the PHPClasses site. Every month, awarded authors may win prizes of their choice from all those made available by the sponsors.
http://www.phpclasses.org/award/innovation/
Starting next month, Codegear will sponsor the award by providing a copy of Delphi for PHP. For those not familiar with Delphi for PHP, here is a review that I published some time ago. Translations of the review to German and Portuguese are also available there.
http://www.phpclasses.org/reviews/id/B000NOIR8U.html
Hopefully, the participation of another important sponsor providing a valuable prize will encourage more authors to make even more innovative contributions.
* Surviving Digg
Last month's post about defensive programming practices was so appreciated that it reached the front page of Digg.com. Thank you to all the users that voted for that article.
http://www.phpclasses.org/blog/post/65-8-defensive-programmi ...
When the post reached Digg's front page, all of a sudden thousands of users came to the post page. I had never had an article on Digg's front page, so I had always wondered whether this site was ready to handle traffic peaks caused by many users led here by busy news sites like Digg or Slashdot.
After watching how the site reacted to the traffic surge, it was with great relief that I could confirm that the site server is ready to handle significant traffic loads.
Over the years I invested a lot in measures that would minimize the load caused by serving many simultaneous users. However, only after surviving a flood of accesses from users sent by Digg can I be confident that the site is ready to survive significant traffic peaks.
Since the peak was caused by the promotion of the article on Digg, I am glad to follow up with this article as a thank-you to the users that voted for it. Here I provide more details on the specific measures that I have implemented to make the site more robust and capable of handling this kind of traffic load.
- 1. Avoid accessing databases
On database-driven content sites like this one, the slowest task is accessing the database to retrieve the content to display.
If a site is not fast enough to serve its pages, many simultaneous user accesses force the Web server to create more processes to handle all the requests. Consequently, more database connections need to be opened at the same time to generate the database driven content pages.
This is bad because it demands more server memory. Once the server RAM is exhausted, the machine starts using the virtual memory, making the server even slower, until it halts completely.
Nowadays, RAM chips are cheap but you can only put a limited amount of RAM on each server. So, adding more RAM to a server is only a viable solution until you reach the server RAM limit.
Furthermore, if you do not own your server machine, hosting companies may add significant charges to put more RAM in your server. So, saving RAM usage is very important.
Since RAM usage may be aggravated by excessive accesses to slow components, such as databases, it is better to avoid accessing the database as much as possible.
As I always say, the fastest database applications are those that avoid accessing the database as much as possible.
- 2. Cache Web pages
If the site needs to access the database to retrieve the content to display, what can we do to minimize the database accesses?
First, we need to focus on what kind of information the site retrieves from databases. The most common type of data is information used to build the HTML pages.
The fact is that the information in the database does not change that frequently. As a matter of fact, different users usually see the exact same HTML content when they access the same page.
It is not unusual to execute many SQL queries to retrieve all the data that is necessary to build a single HTML page. So, it is evident that it would be much more efficient if the sites could cache the HTML of each page after it is accessed for the first time.
That is exactly what the PHPClasses site does. It uses this general purpose caching class to store whole pages, or specific page sections, in cache files on the server.
http://www.phpclasses.org/filecache
Let me emphasize that the PHPClasses site does very aggressive server side caching. In practice this means that currently it is using 1.2GB of disk space to store over 160,000 cache files. So, when I say aggressive caching, I mean caching practically all the site pages based on content retrieved from a database.
So, what if the database content changes? No problem, I just call a function of the class above that invalidates all the page caches that depend on the content of the changed database table rows. This forces the caches to be rebuilt on the next access to the site pages, so the pages are always up to date.
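The cache-then-invalidate cycle described above can be sketched roughly as follows. This is only an illustration under assumptions: the function names and the key format are hypothetical and are not the API of the caching class linked above, and the sketch invalidates one page by key rather than tracking which pages depend on which table rows.

```php
<?php
// Hypothetical helper names; this is not the API of the filecache class.

function cache_path(string $dir, string $key): string
{
    // Hash the key so that any page identifier maps to a safe file name.
    return $dir . '/' . sha1($key) . '.cache';
}

function cached_page(string $dir, string $key, callable $build): string
{
    $path = cache_path($dir, $key);
    if (is_file($path)) {
        // Cache hit: no database access is needed.
        return file_get_contents($path);
    }
    // Cache miss: build the page (this is where the SQL queries would run).
    $html = $build();
    // Write to a temporary file and rename it, so that concurrent readers
    // never see a partially written cache file.
    $tmp = $path . '.' . uniqid('', true) . '.tmp';
    file_put_contents($tmp, $html);
    rename($tmp, $path);
    return $html;
}

function invalidate_page(string $dir, string $key): void
{
    $path = cache_path($dir, $key);
    if (is_file($path)) {
        unlink($path);
    }
}

// Usage example with a stand-in for the expensive page builder.
$dir = sys_get_temp_dir() . '/page-cache-demo-' . getmypid();
if (!is_dir($dir)) {
    mkdir($dir, 0700, true);
}
$builds = 0;
$build = function () use (&$builds): string {
    $builds++;
    return "<html>build $builds</html>";
};

invalidate_page($dir, 'package/42');                // start from a clean state
$first = cached_page($dir, 'package/42', $build);   // builds the page
$second = cached_page($dir, 'package/42', $build);  // served from the cache file
invalidate_page($dir, 'package/42');                // the content changed
$third = cached_page($dir, 'package/42', $build);   // rebuilt on next access
```

The page builder only runs when the cache file is missing, so repeated accesses to an unchanged page never reach the database.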
- 3. Avoid needless personalization
What about pages that appear differently to each user that accesses them? No problem, I use separate cache files depending on the user accessing the pages.
However, it would be useless if the site used a separate cache file to store the HTML that each individual user sees. The benefit of reusing cached information would be lost.
Actually, in most cases I do not need a separate cache file for each user, as users under the same circumstances often see the same information.
I use the concept of contextual user profiles. Many users may share the same contextual user profile. For instance, there are user profiles for: anonymous users (those that have not logged in), logged-in users, authors, administrator users, etc.
Under each page context, users of different profiles see different pages that are stored in distinct cache files.
To maximize the efficiency of this approach you should minimize the number of user profiles that may be used for each page context. Therefore, it is very important to avoid personalization as much as you can.
What I mean is avoiding needless personalization details, like showing the date and hour, or presenting personalized messages based on the user name like "You are logged in as Manuel Lemos", etc.
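One way to picture contextual user profiles is as part of the cache key. The sketch below is an assumption-laden illustration: the profile names follow the examples given above, but the user array fields (`is_admin`, `is_author`) and the key format are hypothetical.

```php
<?php
// Hypothetical profile names and user fields, chosen only for illustration.

function contextual_profile(?array $user): string
{
    if ($user === null) {
        return 'anonymous';
    }
    if (!empty($user['is_admin'])) {
        return 'administrator';
    }
    if (!empty($user['is_author'])) {
        return 'author';
    }
    return 'logged-in';
}

function page_cache_key(string $page, ?array $user): string
{
    // The key depends on the profile, not on the user's identity, so all
    // users with the same profile share one cache entry per page.
    return $page . '#' . contextual_profile($user);
}

$alice = ['id' => 1, 'name' => 'Alice'];
$bob = ['id' => 2, 'name' => 'Bob'];
$carol = ['id' => 3, 'name' => 'Carol', 'is_author' => true];

$keys = [
    page_cache_key('/browse/latest', null),
    page_cache_key('/browse/latest', $alice),
    page_cache_key('/browse/latest', $bob),
    page_cache_key('/browse/latest', $carol),
];
// Four visitors produce only three distinct cache entries.
$distinct = count(array_unique($keys));
```

Anything that depends on the individual user, such as a greeting with the user name, would have to enter the key and would split the cache into one entry per user.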
Site newsletters are another example of excessive personalization of arguable importance. Sometimes I see newsletters that start with "Dear John, .... If you want to unsubscribe send a message to unsubscribe-john@somesite.com". That may be nice, but it also means that you need to rebuild the newsletter message body with data from the database to send it to each subscriber.
The truth is that often that is not really necessary. To avoid having your newsletter messages be taken as spam, the To: header must contain the e-mail address of the newsletter subscriber. The message body does not really need to be personalized.
To make the newsletter deliveries more efficient, I use this MIME message composing and sending class. It can cache the message body when I deliver the newsletter message to the first subscriber. I can still change any headers to send the same newsletter to different subscribers.
http://www.phpclasses.org/mimemessage
This way I avoid the overhead of rebuilding the messages for each subscriber. Since I avoided personalizing the message body, I also do not need any more data from the database besides the subscribers' e-mail addresses.
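The idea of building the body once and varying only the headers can be sketched as below. This is not the API of the MIME message class mentioned above; the function names, addresses and subject are hypothetical, and the messages are only assembled as strings instead of being sent.

```php
<?php
// Hypothetical function names; this does not use the MIME message class API.

function build_newsletter_body(array $articles): string
{
    // In a real site this is the expensive step (templates, database reads).
    $lines = ['This month on the site:'];
    foreach ($articles as $title) {
        $lines[] = '* ' . $title;
    }
    $lines[] = '';
    $lines[] = 'To unsubscribe, send a message to unsubscribe@example.com';
    return implode("\r\n", $lines);
}

function newsletter_headers(string $to): string
{
    // Only the headers differ from one subscriber to the next.
    return implode("\r\n", [
        'From: newsletter@example.com',
        'To: ' . $to,
        'Subject: Site newsletter',
        'Content-Type: text/plain; charset=UTF-8',
    ]);
}

$subscribers = ['john@example.com', 'mary@example.com', 'ana@example.com'];

// Build the body once, before the delivery loop.
$body = build_newsletter_body(['Surviving Digg', 'New award sponsor']);

$messages = [];
foreach ($subscribers as $address) {
    // A real script would hand the message to the mail server here.
    $messages[$address] = newsletter_headers($address) . "\r\n\r\n" . $body;
}
```

The per-subscriber work is reduced to assembling a few header lines, and the only data needed from the database is the list of addresses.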
For users that want to unsubscribe by e-mail, I just provide a single unsubscription address which is associated with a POP3 mailbox. There is a script that regularly polls that mailbox using this POP3 client class.
http://www.phpclasses.org/pop3class
The script checks the From: header of the unsubscribe request message to figure out which user is trying to unsubscribe. The script sends an unsubscribe confirmation request message to avoid bogus unsubscription attempts.
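The two steps of that script, identifying the sender and requiring a confirmation, might look something like this. It is a sketch under assumptions: it does not use the POP3 class, the function names are hypothetical, and the signed confirmation token is just one possible way to implement the confirmation request, not necessarily how the site does it.

```php
<?php
// Hypothetical function names; the token scheme is an assumption.

function sender_address(string $fromHeader): ?string
{
    // Accept both "Name <user@example.com>" and a bare "user@example.com".
    if (preg_match('/<([^<>\s]+)>/', $fromHeader, $matches)) {
        $candidate = $matches[1];
    } else {
        $candidate = trim($fromHeader);
    }
    $valid = filter_var($candidate, FILTER_VALIDATE_EMAIL);
    return $valid === false ? null : $valid;
}

function confirmation_token(string $address, string $secret): string
{
    // The From: header is easy to forge, so the unsubscription is only
    // performed after the user returns this token from the confirmation
    // message sent to the address.
    return hash_hmac('sha256', $address, $secret);
}

function is_valid_confirmation(string $address, string $token, string $secret): bool
{
    return hash_equals(confirmation_token($address, $secret), $token);
}

$address = sender_address('John Doe <john@example.com>');
$token = confirmation_token($address, 'demo-secret');
```

Because the token is derived from the address and a server-side secret, the script does not need to store pending confirmations in the database.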
Back to the personalization issue: during the dot com bubble days, I used to hear a lot about personalization and how important it was for the user. It may be nice to address users by their names. However, when personalization imposes a very high site maintenance cost, it is something that should make you reconsider.
That does not mean that you should completely avoid personalization. For instance, the PHPClasses home page for logged-in users is personalized.
For now it shows certain statistics about the usage of the site by the current user. I plan to make the home page more configurable letting each user see different site widgets according to personal preferences.
I have not yet decided if this level of personalization will be made available to all site users. I need to balance the pros and cons of each personalization aspect that is provided.
- 4. Queue tasks that may take too long
Caching is great to avoid repeated accesses to the same content stored in a database.
However, caching only applies to accesses that retrieve data from databases. Operations that update the database content do not benefit from caching, and when done too frequently, database write accesses may cause server overload.
That problem happened with the PHPClasses site in three main situations. One of them was due to the fact that the site takes note of which packages are downloaded by each user.
This is useful to build accurate top download rankings, and also to send alert messages to users that downloaded a package when the author updates its files.
These are great site features, but as the site audience grows the tables that record package download accesses have obviously become very large.
Currently, one of the tables has 5.3 million rows. Just counting the number of rows in that table takes about 1m 40s. Updating this table when a user downloads a package became too slow. It is not just a matter of changing the table rows. It also takes a lot of time to update the table indexes. Over time this situation became completely unviable.
The solution that I found was to not update that table in real time. Instead, I created a similar table that acts as a queue. The queue table has no indexes. Periodically, a background task started from cron moves the queue table data to the main table.
The result in terms of efficiency gain was phenomenal. Recording package downloads has almost no overhead, as the queue table has no indexes to update. Updating the main table is also much faster, as there is only one process updating that table at a given time.
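The queue table approach can be sketched as follows. The table and column names are hypothetical, and an in-memory SQLite database accessed through PDO stands in for the real database only so that the sketch is self-contained (it assumes the pdo_sqlite extension is available).

```php
<?php
// Sketch only: SQLite in memory stands in for the real database, and the
// table and column names are hypothetical.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Main table: indexed, expensive to update row by row.
$db->exec('CREATE TABLE downloads (user_id INTEGER, package_id INTEGER, downloaded_at INTEGER)');
$db->exec('CREATE INDEX downloads_user ON downloads (user_id)');
$db->exec('CREATE INDEX downloads_package ON downloads (package_id)');

// Queue table: same columns, no indexes, so inserts stay cheap.
$db->exec('CREATE TABLE downloads_queue (user_id INTEGER, package_id INTEGER, downloaded_at INTEGER)');

// Called from the Web request: a single cheap insert into the queue.
function record_download(PDO $db, int $userId, int $packageId): void
{
    $insert = $db->prepare(
        'INSERT INTO downloads_queue (user_id, package_id, downloaded_at) VALUES (?, ?, ?)'
    );
    $insert->execute([$userId, $packageId, time()]);
}

// Called periodically from a cron job: move queued rows to the main table.
function flush_download_queue(PDO $db): int
{
    $db->beginTransaction();
    $moved = $db->exec(
        'INSERT INTO downloads SELECT user_id, package_id, downloaded_at FROM downloads_queue'
    );
    $db->exec('DELETE FROM downloads_queue');
    $db->commit();
    return $moved;
}

// Three downloads are recorded during Web requests...
record_download($db, 1, 42);
record_download($db, 2, 42);
record_download($db, 1, 7);
$queuedBefore = (int) $db->query('SELECT COUNT(*) FROM downloads_queue')->fetchColumn();

// ...and later moved in one batch by the background task.
$moved = flush_download_queue($db);
$queuedAfter = (int) $db->query('SELECT COUNT(*) FROM downloads_queue')->fetchColumn();
$mainCount = (int) $db->query('SELECT COUNT(*) FROM downloads')->fetchColumn();
```

On a database server with concurrent writers, the flush would also need to make sure that rows queued while it runs are not deleted before they are copied, for example by locking the queue table during the move.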
Another similar situation that made me use a queue table is the trackback submission handling. For those that are not familiar with trackbacks, those are signals sent by one site to notify another site, to let it know that there is a new page with a link pointing to the other site page.
Trackbacks are often used by blogs to notify each other about posts commenting on each other's posts. The PHPClasses site supports trackback submission to keep track of blogs and forums that mention classes, reviews and blog articles posted on PHPClasses.
http://www.phpclasses.org/tips.html?tip=trackback-links
http://www.phpclasses.org/blog/post/42-Classes-trackback-blo ...
That would be all fine and dandy, if it were not for trackback spammers. Unfortunately, there are spammers that use zombie computers infected by spyware to post fake trackback notices to sites that support the trackback protocol.
The PHPClasses site detects trackback spam using its own system, which automatically blocks spammer machines. There is also an alternative named Akismet, which was not available when the PHPClasses trackback system was implemented. It is a product developed by the WordPress developers.
http://akismet.com/
The problem of any trackback processing system is that it is very hard to keep up with all the trackback submissions sent by spammers. If you try to process each submission in real time, it takes a lot of time and your Web server starts increasing the number of processes to handle so many simultaneous requests.
Once again the solution is to avoid executing tasks that may take too long on each Web server request. Instead, I used a queue table that holds all trackback request information to be processed later periodically by a script started from a cron job.
The last situation of long tasks that had to be processed in the background is the delivery of site newsletters. This is a little different because it is a task that does not change the database.
However, querying a database table with hundreds of thousands of records to extract the list of newsletter subscribers is a task that may take too long. It may even put on hold other processes attempting to update the same database table. Obviously, this is also a task handled by a script running in the background, started by a cron job.
- 5. Move images, CSS and Javascript to a multi-threaded Web server
Another detail that may cause too many Web server accesses is the static content, such as site images, CSS and Javascript files.
Browsers usually cache static content, but when a user comes to a site for the first time, the browser generates a lot of requests to retrieve all the site images, CSS and Javascript.
If your site gets a flood of users from Digg, Slashdot or some other popular site, the simultaneous requests generated by all the incoming users may easily overload your server.
This would not be such a problem if PHP could run well in multi-threaded Web servers. A multi-threaded Web server can handle many thousands of simultaneous requests without using too much memory.
Unfortunately, it is very hard to guarantee that PHP can run on multi-threaded Web servers without crashing its threads. The problem is not so much in PHP itself, but rather in the external C libraries and extensions that it needs, which may not be thread-safe.
For now it is better to stay away from multi-threaded Web servers, unless you run PHP in CGI mode. That is a solution that is much slower than the traditional pre-forked Web servers. FastCGI may be an option, but I have not investigated it.
On the other hand, serving static content using multi-threaded Web servers is fine and highly recommended. If you move your images, CSS and Javascript to a separate multi-threaded Web server, not only will it be much faster, but it will also take much less memory.
There are several well known Web servers that avoid the one-process-per-request model, using threads or a single-process event loop, like Apache 2, Microsoft IIS, Tux, Boa, Lighttpd, thttpd, etc. The PHPClasses site uses thttpd, but most other alternatives would probably be fine.
http://www.acme.com/software/thttpd/
thttpd runs under the domain files.phpclasses.org. That domain is associated with a separate IP address. This is necessary to make sure all the static content is served via a Web server running on port 80. In the past I tried using the same IP address as www.phpclasses.org with a different port, but that was not a good idea. Some corporations block Web accesses to any port other than 80.
- 6. Minimize page serving time with page compression
Serving a Web page is not just a matter of generating the page content. The content must also be received by the browser. Until the user's browser has received all the page data, the Web server process that is serving the page remains busy.
This may tie up too many Web server processes for too long. Those processes keep consuming memory, even though they are doing nothing besides waiting for the user's browser to receive all the page data.
This is particularly bad if your site is being accessed by many users on slow networks. So, it is very convenient to avoid serving Web pages that are too large. Actually, the smaller the better.
One way to drastically reduce the amount of data sent to the browsers, is to use HTTP compression. I already mentioned this topic in a past post:
http://www.phpclasses.org/blog/post/58-Responsive-AJAX-appli ...
Nowadays all browsers support compression. Typical HTML compression rates are about 5:1. This means that a 50K page may be compressed to only 10K.
PHP can add compression to your pages easily using the output buffer gzip handler:
http://www.php.net/ob_gzhandler
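As a rough illustration, the sketch below measures what gzip does to a synthetic, repetitive HTML page and shows in a comment how the output buffer handler is enabled. It assumes the zlib extension is available. The generated page is made up for the example, so its compression ratio is not representative of real pages.

```php
<?php
// Assumes the zlib extension, which provides gzencode() and ob_gzhandler.

// Build a repetitive HTML page, similar in nature to real markup.
$rows = [];
for ($i = 1; $i <= 500; $i++) {
    $rows[] = "<tr><td class=\"name\">Package $i</td><td class=\"count\">$i downloads</td></tr>";
}
$html = "<html><body><table>\n" . implode("\n", $rows) . "\n</table></body></html>";

// Measure what gzip compression does to the page size.
$compressed = gzencode($html, 6);
$ratio = strlen($html) / strlen($compressed);

// In a page script, compression is enabled by starting an output buffer
// with the gzip handler before any output is sent:
//
//     ob_start('ob_gzhandler');
//
// The handler checks the request's Accept-Encoding header and only
// compresses the output when the browser supports it.
```

Comparing `strlen($html)` with `strlen($compressed)` on your own pages is a quick way to estimate how much transfer time compression would save.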
If you do not want to spend too much time adding compression to your site pages, you can use mod_gzip with Apache 1.x or mod_deflate with Apache 2.x.
http://schroepl.net/projekte/mod_gzip/
http://httpd.apache.org/docs/2.2/mod/mod_deflate.html
The PHPClasses site uses mod_gzip, as it can also compress HTML and other types of data that are not generated by PHP.
- 7. Put the Web, mail and database servers in different partitions
Although the PHPClasses site is often very busy, for now it can still run on a single dedicated server machine.
Currently, there are three types of server programs that cause most of the load on the server machine: Web servers (Apache 1.x and thttpd), the mail server (Qmail) and the database server (MySQL).
All of these servers perform a lot of I/O operations, especially the database and mail servers. This means that they compete for disk access.
Although there is only one disk (mirrored by hardware RAID 1), Greg Saylor, the systems administrator that has been selling me hosting services since 1998 when I launched my first PHP site, suggested that I split the main disk space into three partitions.
This is a good idea because not only will it be safer to recover data in case of damaged partitions, but it will eventually be faster, as each partition will be accessed using separate processes, each handling the access to the data of separate server programs.
- 8. Distribute the load when the servers limit is reached
The PHPClasses site growth never ends. Sooner or later the site server machine will reach the limits of its hardware. I may still upgrade some hardware parts, but that will only postpone some server architecture rethinking for a little while.
So what shall I do then? Obviously I must start distributing the load of the servers between more machines. The current partitioning is helpful, as it will allow easy migration of each group of servers to different machines.
However, certain solutions that were valid with a single server, will no longer be very efficient in a distributed architecture.
I can distribute the mail server between several machines using SMTP relays. I can make MySQL run on a clustered architecture using TCP connections. I can balance the load of the Web accesses between several server machines using reverse proxies.
However, one problem that arises is the use of file based caches. That is a fast solution if files reside in the same machine. However, if I need to use NFS or a similar file distribution protocol, I am afraid it will have a significant impact on the site performance.
I could rethink the site architecture to use an application server approach. That would require a major restructure. It does not seem to be a viable solution.
An alternative solution is to move from file based caches to memcached based caches. For those not familiar with memcached, this is a distributed shared memory solution developed by the fine folks of LiveJournal.com.
http://www.danga.com/memcached/
You can have one or more server machines that store cached data in RAM instead of disk files. Despite the networking overhead, it can still be a fast solution. You can use PHP persistent socket connections to avoid part of the networking connections overhead, just like with persistent database connections.
It would not be hard to build a new version of the file cache class above to provide the same API to access memcached servers. That would minimize the refactoring efforts.
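The refactoring idea above amounts to hiding the storage behind one interface. The sketch below is hypothetical throughout: the interface and class names are invented for illustration, and an in-memory array backend stands in for a memcached-backed one so that the example runs without a memcached server or extension.

```php
<?php
// Hypothetical interface and class names. A memcached-backed class would
// implement the same interface; an array backend stands in for it here.

interface PageCache
{
    public function get(string $key): ?string;
    public function set(string $key, string $value): void;
    public function delete(string $key): void;
}

final class FilePageCache implements PageCache
{
    private $dir;

    public function __construct(string $dir)
    {
        if (!is_dir($dir)) {
            mkdir($dir, 0700, true);
        }
        $this->dir = $dir;
    }

    private function path(string $key): string
    {
        return $this->dir . '/' . sha1($key) . '.cache';
    }

    public function get(string $key): ?string
    {
        $path = $this->path($key);
        if (!is_file($path)) {
            return null;
        }
        $data = file_get_contents($path);
        return $data === false ? null : $data;
    }

    public function set(string $key, string $value): void
    {
        file_put_contents($this->path($key), $value);
    }

    public function delete(string $key): void
    {
        if (is_file($this->path($key))) {
            unlink($this->path($key));
        }
    }
}

final class MemoryPageCache implements PageCache
{
    private $entries = [];

    public function get(string $key): ?string
    {
        return $this->entries[$key] ?? null;
    }

    public function set(string $key, string $value): void
    {
        $this->entries[$key] = $value;
    }

    public function delete(string $key): void
    {
        unset($this->entries[$key]);
    }
}

// Page code depends only on the interface, not on where the data is stored.
function render_cached(PageCache $cache, string $key, callable $build): string
{
    $html = $cache->get($key);
    if ($html === null) {
        $html = $build();
        $cache->set($key, $html);
    }
    return $html;
}

$backends = [
    'file' => new FilePageCache(sys_get_temp_dir() . '/cache-backend-demo-' . getmypid()),
    'memory' => new MemoryPageCache(),
];
$results = [];
foreach ($backends as $name => $cache) {
    $builds = 0;
    $build = function () use (&$builds): string {
        $builds++;
        return '<html>page</html>';
    };
    $cache->delete('home');                 // start from a clean state
    render_cached($cache, 'home', $build);  // miss: builds and stores
    render_cached($cache, 'home', $build);  // hit: served from the backend
    $results[$name] = $builds;
}
```

Since the page code never touches files directly, swapping the backend would not require changes to the pages that use the cache.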
Of course, this is all theory. I will have to get there and put it into practice to make sure it can work as I imagine. Anyway, the migration path looks like it will be smooth, as I am sure I will be able to migrate each component one at a time.
Source : PHPClasses
 
