- Internet Information Server 4.0, 5.0, 5.1, 6.0
- Windows NT 4.0, 2000, 2003 Server
There is an argument to be had over the relevance of this as an issue in the greater scheme of problems on the modern internet; you’ll either agree with me and find some worth in the ideas presented here, or you’ll want to move along quite quickly. The issue is, however, one of domain consistency: how you present your website domain as a branded entity in your marketing, in the media and on the web.
Let me also begin by stating that this discussion is not designed to focus on this, my personal site, which does not adhere to the conventions discussed below, but more practically on corporate sites and targeted offerings such as by far my largest web creation, HPC:Factor.
Allow me to elaborate with a question. What do the following addresses all have in common?
The answer? Without too much expenditure of your grey matter you have probably worked out that they are all identical paths into HPC:Factor. Three separate, but crucially unique (from the perspective of a search engine) copies of the same site. So why should it matter?
Let me clarify the matter at hand with a practical example.
HPC:Factor was set up on its own IP range, on its own internally hosted server in 2003 and went live to the world as www.hpcfactor.com. Up until we started providing public forum and feedback systems, this is the way it remained. Search engines were harmonious in their listings from the site because, as everything on it was under editorial control – and root relative – everything resolved back to www.hpcfactor.com.
Two things changed this:
- The human condition – laziness: if I can, I will. Why type www. ?
- Troubleshooting users’ DNS issues, we (I) would often suggest people visit http://22.214.171.124/ to find out if they were having a connection or plain DNS issue.
Once links without the www., or links to the IP, start to appear as hyperlinks in the page body, the search engine spiders will naturally take a look at them and do what spiders do: follow them.
Except that, as a dumb object, the spider can only make one assumption: that it is now leaving the website www.hpcfactor.com and going off somewhere else, to a new website which HPC:Factor is linking to.
The spider/bot now sees a completely “new” website (that is strangely familiar when compared to the previous one due to root relative URL’s). The links are followed, the site is indexed and content begins to appear on the listings.
Unfortunately, the listings can become impacted by this as your site fights itself to appear in the search results rankings. In 2006 searching for Handheld PC related content on Google would often produce identical pages three times; disparate copies of the same pages on the three domains outlined above.
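To make the spider's point of view concrete, here is a small sketch in Python (used purely for illustration; the site itself runs ASP). It assumes a naive crawler that identifies a page by host name plus path, and then shows how mapping every alias onto one preferred host collapses the duplicates. The alias list and the 192.0.2.10 address are placeholders of my own, not HPC:Factor's real configuration.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical aliases for one site; 192.0.2.10 is a documentation-only IP address.
CANONICAL_HOST = "www.hpcfactor.com"
ALIASES = {"www.hpcfactor.com", "hpcfactor.com", "192.0.2.10"}


def naive_key(url):
    """What a simple spider sees: host and path together identify a page."""
    parts = urlsplit(url)
    return (parts.netloc.lower(), parts.path)


def canonical_url(url):
    """Rewrite any known alias to the canonical host, keeping path and query."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host in ALIASES:
        host = CANONICAL_HOST
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


urls = [
    "http://www.hpcfactor.com/hcl/",
    "http://hpcfactor.com/hcl/",
    "http://192.0.2.10/hcl/",
]

print(len({naive_key(u) for u in urls}))      # distinct pages as the naive spider counts them
print(len({canonical_url(u) for u in urls}))  # distinct pages once aliases are collapsed
```

The same page is counted once per entry point until something, either the search engine or the webmaster, declares which host is the real one.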
Why does it matter?
The Vicious Cycle
It can therefore be seen that cross domain exposure can get out of control. The more it is displayed on the search engines using a name other than the one you intended, the more people click into your site on that name. The more people click in, the more that they bookmark – indelibly gluing their entry point to the secondary entry point. The more they remain on that entry point, the more likely people will copy/paste and distribute URL’s from your site to others using that entry point and we come full circle with the search engines using referrals to gauge page ranking.
The Vicious Cycle of multiple entry points
At this point the process begins to run away from itself, as the loop gets larger, the number of referential links increases and so the secondary entry points begin to rise in prevalence in the search rankings. When the spider finds a hard coded link back to the original site, it once more believes that it is being asked to exit the site and so the loop will continue.
For each entry point instance the loops will interconnect cyclically across the exit point acting detrimentally to the socially responsible presence your organisation has on the search engines and acting as a potentially negative influence over your page ranking.
The Cross Domain Interconnection Cycle
What happens next can take a long time, but when I began to see IP based URL’s out-ranking those listings indexed against the official domain name, enough was enough in our case.
Your site fights itself, your visitors become confused and perhaps worse it isn’t an impossible stretch of the imagination that clean-up filters may see the duplicate content as SPAM – wiping legitimate data from search engine listings.
HPC:Factor content has been cited in a large number of academic works, from secondary school through to PhD doctoral dissertations, something of which we are understandably incredibly proud. The problem is that as academic institutions quite rightly crack down on plagiarism, the automated systems that are being brought in to catch the cheats are not quite up to the challenge of dealing with the domains.
Caching and search results for the same content can reveal the three different copies of the site at different stages, and to such systems it can look like HPC:Factor plagiarised its own content – and when you are backing up a PhD dissertation, this is not a position you want to find yourself in.
Having a controlled, universal entry point to the content is therefore advantageous in preventing problems in this regard.
If you use webmaster tools from different sources which rely upon verification and/or service provision based upon domain name, having several entry points may mean doing the work three times. As a practical example, I dislike the hot linking which is caused by Google/MSN/Yahoo’s image search features and requested all content from www.hpcfactor.com be removed from the index – it was worth it for the bandwidth savings!
All three sites did this (eventually) only for me to find hpcfactor.com and 126.96.36.199 still listed on Google and Yahoo.
To use a (crude) modern net culture paradigm, as far as the likes of Google are concerned (ignoring formatting and layout), the three addresses are akin to the rafts of website who re-host content from Wikipedia – same distributed content on a different domain name. Up until very recently, search engines have had no apparent capabilities in screening for these subtle naming differences. Google is the implied modern white knight here, with a checkbox on their webmaster tools to specify an override using your preference between http://www.hpcfactor.com and http://hpcfactor.com.
The potential issue with such a tool is that clearly Google isn’t the only search engine in the vast, international expanse of the World Wide Web. It may cover your base for a considerable percentage of your users; however, if 11 years of web mastering has taught me anything, it is that the minority percentage invariably spring out of the woodwork with the best challenges and innovation possibilities for the vigilant webmaster to seize upon.
Google’s facility is not unwelcome, however it avoids the complications of the third doorway, the numerical IP or indeed any other DNS based entry point into your web presence. Crucially (and above all) is the fact that the Google system is a process that removes control from the webmaster, placing control over your domain and its use rigidly into the hands of a multinational’s latest beta project.
Before the forums and the public feedback/community systems there were the statistics. We have always paid close attention to our traffic (because it costs us money) and way back in our darkest past the original problem stemmed from the interpretation of the statistics generation agent of the different DNS / non-DNS entry points into the site. Some of the reports were becoming cluttered with unnecessary data and being the stickler for order that I am, I clearly didn’t approve.
In the end I did script the issue out of the stats system; however it remains a valid part of the thought process on this issue.
“Image is nothing…” laughed the “Sprite” marketeer (all the way to the bank in his Armani suit, Ferrari and for adornment one or more vacuous blonds).
HPC:Factor’s front end web entry point was intended to exclusively be www.hpcfactor.com. This was a branding decision made by yours truly when we started, and has been a decision that we have diligently adhered to for the last 5 years.
Referring back to the list above, comparatively the ‘www.’ prefixed address is the more balanced: the www psychologically removes the alliterated h between http and hpc and clearly identifies the site as being of World Wide Web origin.
Don’t worry too much if that last statement seems alien, if you do not understand the use and origin of DNS understand that the domain prefix is often used to identify the intended protocol that the particular server is listening for – without having to front-end a single setup on a catch all domain.
www.hpcfactor.com has always been used universally by the owners and staff of the site as it is our “domain” brand. Signatures, letter heads, support posts, public events, even threatening letters to copyright violators have all maintained a consistency on the domain address.
You should not underestimate the power of consistency; it is a cornerstone of Human Computer Interaction/Interface (HCI) doctrine – as well as marketing best practice – and one which has withstood the test of time.
Despite what Sprite may have tried to make us all believe; Image is EVERYTHING.
If you or your organisation have given this any thought, or worse paid an advertising agency to give it some thought, then you clearly have a desire to make the most of your trade mark as much and as often as you can. Seeing 188.8.131.52 on a search engine fails to do your brand justice; if users copy/paste from their address bar onto other sites, it is your brand that loses out on the exposure afforded through viral distribution.
Very few people possess a visually stimulated subconscious mind which can, at a glance assimilate an IP address, let alone mentally associate that IP address with your company.
‘www.’ is a recognisable comfort marker. For most people, seeing ‘www.’ aids in the process of “recognition rather than recall” (Nielsen), in that they immediately, and on a subconscious level, perceive what they are reading to be a Web address. An address without this prefix is less likely to evoke instant associative recall in novice or new Internet users – and atop this there is an even bigger debate to be had over using http:// on branding.
This part of the argument is of course very subjective, and revolves around the general competency of your user base.
The main point is however that the mind plays a part in your brand awareness. Consistency, symmetry and psychology all play a part in this, and if potential customers are seeing a rendition of your web address which you did not intend, then there may be unforeseen consequences due to loss of control – as happened with us.
Of course there is a very real argument for the psychological premise that on-line people dislike having to type sub-domains (‘www.’ is a sub-domain of the domain ‘hpcfactor’ of the root domain ‘.com’). There is a large, well established body of support for this, and in my own experience in respect of this as an issue over the use of ‘www.’, most people are conditioned on the use of www, even electing to add it when marketing and promotional media does not specifically reference it.
If therefore you brand hpcfactor.com and fail to activate www.hpcfactor.com you will almost certainly lose out on traffic when marketing through non-digital media.
If you are considering breaking away from sub-domains other than ‘www.’, you may want to glean some form of market research (at least informally gathered by your webmaster) over how your visitors best see your domain name presence: customers.mydomain.com vs. www.mydomain.com vs. mydomain.com.
In this instance size really does matter.
Domain Schema Continuity
It is important that as a matter of capital investment, your domain name work for you and be available to work for you when you want it.
Domains are not just for websites. You may not realise that entire companies can be structurally organised around a domain name, from security to workstations and servers, customers, clients and staff. They are not just for use as a World Wide Web anchor. If you suddenly need exclusive access to the root domain (hpcfactor.com) and have to pull it off of the web server, the traffic loss due to the failed search engine requests could be devastating for your business. Yes, technically there are things which can be done to work around and through this – but why suffer the expense, time and stress in the first place.
With the numeric IP listings, think what happens if you are forced to change IP or ISP. At the flick of a switch all your search engine listings are instantaneously useless, and could very easily wind up pointing to someone else’s website! Highly ranked index positions will no longer work for you and inexorably you will lose traffic to your site.
From experience, I can assure you that de-indexing listings can take months, and in some extreme cases years; especially when you no longer have delegative control of the DNS/IP involved.
A centralised search engine presence is the best solution to prevent this from occurring. Tight control around a unified policy – while allowing for entry via other mediums – has to be considered the most efficient approach to your domain naming.
We will now explore the available options for IIS users, and discuss with reference to HPC:Factor the problems we have faced over the years as a result of action as well as inaction on our part.
What to do
Stopping website access to all but the intended domain
In this first option, barring access to the entry points you do not want people to use is certainly a possibility, in fact it is by far the most easily implemented of the suggestions I am going to offer; but not one I recommend.
If your site is already indexed against undesired entry points, terminating access will be detrimental to your visitor statistics. It will break swathes of search engine links for operators of large sites and could cause specific upset to users who have been used to accessing your cyber-presence through other means. Very simply it is clear that such harsh action can rapidly equate into missed sales opportunities for your business and even in people thinking that your company has vanished from the Internet altogether.
Existing link-throughs from other websites will likewise cease to work for you at the flick of a switch. Most modern search engines calculate listing ranking based upon “credible referrers”, and in the presence of inactive links that are no longer working for your business your position on search engine listings will suffer.
Unless you are certain of the impact this will have, and are willing to suffer any pain, this is not an option I recommend you explore.
There is also an additional consideration to give thought to, and that once again relates to the idea of your domains working for you. While you may not want the specific traffic on the undesirable address, you almost certainly do want that traffic. If you stop serving requests against any particular entry point, your visitors lose the convenience – something particularly relevant to your organisation’s root domain (hpcfactor.com) rather than the IP address.
People are lazy, it is a fact of nature. Visitors may try to access your site without ‘www.’ simply because they can and have experience of doing so from other websites. When presented with a 404 ‘not found’ error they will simply leave – losing the business opportunity for your organisation and indelibly associating your domain with failure in the mind of the consumer. An apt marketing anecdote by Smith from 1993 tells of the dangers of presenting bad cyber-service (on-line customer service). In his research he stated that when confronted with bad customer service, the average consumer may tell upwards of 11 others before feeling satisfied and alleviated of the injustice.
Smith’s research was not aimed towards Internet customer service, and particularly not at the unique customer service issues which a broken website can create. Even so, a look on any technology forum or consumer soapbox site will demonstrate just how vocal certain consumers can be at the slightest hint of a bad experience.
As you can see, in juxtaposition it is therefore important not to remove functionality in the reclamation of your net assets, but to find a compromise which gives you, the webmaster the flexibility you need, yet at the same time acknowledging the needs of the customer.
I know that from personal experience, upon finding domains which have inadequate DNS configuration – either lacking www. or failing to capitalise on the importance of the root domain – I have made mental notes about the poorly considered decisions made (presumably by inadequate IHP’s) and grumbling ALT+D’d back to the address bar.
Create Separate Websites for the Domains
There are a number of ways that you can approach a sub-site approach to controlling your on-line presence.
The first is to create a homepage on any entry point you do not want to use and very simply present the user with “click here to visit HPC:Factor” to drop them back to where you would like them to be.
A modification of this would be to use a META refresh – a process where the web browser itself automatically redirects the user without the need for any click through. This is more convenient for the user, and can even be set to instantly redirect as soon as the browser has parsed the script – but it adds its own set of complications in that the browser must be capable of providing meta redirect functionality. Contingencies must be made for those browsers which cannot support it; a very real case in point being HPC:Factor and its requirement for legacy browser support.
To further complicate matters, my browser is configured to deny meta refresh scripts from running due to concerns I have over nefarious purposes and advertising use – rendering the modifications moot.
The intrinsic flaw in this approach is twofold. First, the idea only captures hits to the site’s index page (http://hpcfactor.com). If I were to click through from an existing link, or type into the address bar (http://hpcfactor.com/hcl/) I would be presented with a 404 File Not Found error message because the site is unaware of content found on the primary website.
The second problem is in how Search Engines perceive such scripts. They simply are not interested in the purpose of the script, and will see that website as a single page site, with no content on it; and it will be indexed as such.
Under these conditions a search for HPC:Factor would produce a page stating “Click here to continue”, a problem that SmartMobileAssets.com suffered with for many years, and for any corporate entity is ultimately very unprofessional and disliked by users who do not want to “click here” to view your homepage.
Taking a step up from the previous home page method is to manipulate the 404 error sent by your web server when any invalid request is sent from the client.
Unlike a static home page, the 404 error will catch all failed requests and give the user the option to “Click here” or be meta redirected to your home page.
There is naturally an issue here with URL preservation. While in accessing http://hpcfactor.com/hcl/ the server will have sent some content back to my browser, the redirect/link can only send me to one place – presumably www.hpcfactor.com.
Your consumer wanted to go to www.hpcfactor.com/hcl/ though, meaning that they must navigate through your site to the content or edit the page address themselves. Ultimately such a process is neither elegant nor customer service friendly.
It does however have a very real, very effective purpose for webmasters, that being if you are making significant structural or infrastructure changes to your site.
The Static 404 error is nothing more significant than a web page; however behind the scenes of http, the client (be it a user or a spider/robot) has been informed that the page content is no longer available and the address no longer valid.
The notification to the search engine spiders will force a clean, gradual removal from the search engines of your invalid content – unlike the dirty removal process caused by disabling the domain.
Your site will not lose all the traffic, and in the interim users will be able to click back into the site. So what is meant by structural or infrastructure changes?
When we started HPC:Factor, reviews content was dropped unceremoniously into sub-folders beneath the root folder /reviews/. The directory quickly filled up, it looked untidy and there was no separation between hardware / software reviews and editorials.
We cleaned this up in the folder structure and migrated to the H/PC:QLink system, HPC:Factor’s perma-link engine to ensure that any future structural changes did not invalidate all our search engine review links – in fact all our content links on Google are QLink, and not root relative URL’s.
Infrastructure changes are a step up, and relate to changing the server make-up of your organisation. Specifically with the introduction of sub-domains to categorise content.
The best use for the static 404 system would be exemplified by Rich Hawley’s HPC:NEC site when he migrated from fully static content up to a Content Management System (CMS).
A static URL (http://hpcfactor.com/reviews/software/microsoft/activesync-4-0/) has absolutely no bearing on the CMS generated URL (http://www.hpcfactor.com/reviews/article.asp?aid=12874) and cannot be (sensibly) mapped at this juncture in a site’s evolution to the new CMS address using a dynamic process.
It is therefore better to catch as many users as possible, get them back onto the new homepage and acclimatise them to the new system rather than to lose them altogether.
IIS 6.0 Configuration for Static 404 redirection
A benefit of using IIS in this regard is in that you can modify the 404 redirect functionality at both a site and a directory level. By combining the two, you can enhance the convenience of the process by setting content on your old site to 404 and redirect to the section (or as close as is possible) on the new. The redirect can of course be cross domain/site and if used with some thought provide a more convenient solution for your users.
In order to appreciate the difference between a static 404 and a dynamic one, it is important that you understand the process involved in the decision making and display processes surrounding HTTP error codes.
When the browser requests content from the web server, the browser initiates a path request to the server, which is parsed by the HTTPd (HTTP Daemon) server against the physical content map in the server file system.
The domain information is used only to get the request to the correct web server, and plays no further part in the request (ordinarily).
If the physical content is found, the server returns an HTTP status code of 200, and starts to process and transmit the file data. If however the physical file content cannot be found, the server transmits an HTTP status code of 404 (File Not Found) back to the client and transmits the specified 404 error page (usually the server default) as the response data.
Any information in the querystring is superfluous to the 404 request, as the querystring data is attended to by the scripting provider (e.g. ASP, PHP, Perl, Cold Fusion), and not directly by the HTTPd server. It is however of significance to webmasters of automated sites the likes of HPC:Factor.
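The decision described above can be sketched in a few lines of Python. This is an illustration only and not how IIS is implemented: the dictionary stands in for the server's file system, and the page names, the default document name and the error page text are all assumptions of mine.

```python
# Hypothetical content map standing in for the web server's file system.
CONTENT = {
    "/index.html": "<h1>Home</h1>",
    "/hcl/index.html": "<h1>Hardware Compatibility List</h1>",
}
NOT_FOUND_PAGE = "<h1>404 File Not Found</h1>"


def handle_request(path):
    """Return (status_code, body) for a request path.

    The host name is deliberately absent: once the request has reached
    the right server, only the path is matched against the content.
    """
    if path.endswith("/"):
        path += "index.html"  # implied default document
    body = CONTENT.get(path)
    if body is None:
        return 404, NOT_FOUND_PAGE  # status code plus the configured error page
    return 200, body


print(handle_request("/hcl/"))
print(handle_request("/reviews/old-page.asp"))
```

Note that the 404 case still sends a page body; the status code is what tells a browser or spider that the address is invalid.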
The premise of a Dynamic 404 in this example is very simple. The HTTPd server receives the user’s request, acknowledges that the file is no longer available and returns the status code 404 to the client, just as with the static approach.
Instead of transmitting static page content, a scripting provider is used to generate content for the 404 error in the process of being relayed to the client’s web browser.
The scripting provider takes the http address request, replaces the undesirable domain with the desired one and issues a command to the browser that it must go to the newly created address. For example, the script would turn http://hpcfactor.com/community/default.asp or http://184.108.40.206/community/default.asp into http://www.hpcfactor.com/community/default.asp without any involvement from the user.
The dynamic redirect is performed at the server side, and assumes that the requested page content can be retrieved from the identical file location on the primary entry point domain. The user will be redirected immediately to the content without realising that the redirect has taken place; save for a change of domain name in the address bar.
Such systems preserve the convenience and functionality principles of good web site customer service while freeing up secondary entry points for use by the webmaster. In addition, because the 404 error is still being relayed, defunct or undesired links from search engines will slowly begin to be cleaned up as the spider performs index housekeeping. The spider will also receive instructions to perform the redirect to the new location, which may have additional benefits in terms of making your site more visible to the crawler, although it should be emphasised that this is speculation.
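The core of the rewrite is plain string handling, sketched here in Python for illustration. It assumes the failed address reaches the error script in the form 404;http://host/path, which is the 4 character prefix the ASP script strips. The function name and its parameters are my own, and unlike the ASP version this sketch does not lower-case the path.

```python
DESTINATION = "http://www.hpcfactor.com/"  # desired root URL, with a trailing slash


def build_redirect(error_query, request_host, destination=DESTINATION):
    """Turn an IIS-style '404;http://host/path' string into a URL on the primary domain."""
    target = error_query.strip()
    if target.startswith("404;"):
        target = target[len("404;"):]        # drop the error-code marker
    if target.lower().startswith("http://"):
        target = target[len("http://"):]     # drop the scheme
    host = request_host.lower()
    if target.lower().startswith(host):
        target = target[len(host):]          # drop the undesired entry point
    return destination + target.lstrip("/")  # avoid a doubled slash after the root


print(build_redirect("404;http://hpcfactor.com/community/default.asp", "hpcfactor.com"))
```

Whatever the entry point, the path survives and only the host changes, which is exactly the assumption stated above: the content must live at the identical location on the primary domain.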
C:Amie’s simple Dynamic Redirect Script
<%
' © Chris Tilley : C:Amie (not) Com (http://www.c-amie.co.uk/) & HPC:Factor (http://www.hpcfactor.com/).
' All Rights Reserved. 2007.
' This script is free for non-commercial use.
Sub RedirectToPrimaryDomain()
  ' Dim the variables
  Dim strDestination, strTemp, strSetDebug, var
  ' Edit the below to match the full desired root URL with HTTP and a trailing /
  strDestination = "http://www.hpcfactor.com/"
  ' Perform the content filter
  ' IIS hands the failed address to the 404 page as "404;http://host/path"
  strTemp = LCase(Request.ServerVariables("QUERY_STRING"))
  strSetDebug = strTemp
  strTemp = Trim(strTemp)
  ' Strip the leading "404;" (guarded so that Right() never receives a negative length)
  If Len(strTemp) > 4 Then strTemp = Right(strTemp, Len(strTemp)-4)
  strTemp = Replace(strTemp, "http://","")
  strTemp = Replace(strTemp, LCase(Request.ServerVariables("HTTP_HOST")), "")
  If Left(strTemp, 1) = "/" Then
    strTemp = Right(strTemp, Len(strTemp)-1)
  End If
  ' Are we in a debug loop or redirecting the client?
  ' To use the debug just access a content document less URL thus http://mydomain.com/file/?debug=1
  If InStr(strSetDebug, "debug=1") = 0 Then
    ' In the event of an old browser, write out a hyperlink in simple HTML
    Response.Write("<h1>HPC:Factor Content Redirect</h1><p>You have clicked on an invalid URL.")
    Response.Write(" Please use the link below to be redirected to the content you requested.")
    Response.Write("<br><a href=""" & strDestination & strTemp & """>" & strDestination & strTemp & "</a></p>")
    Response.Redirect(strDestination & strTemp)
  Else
    ' Debug mode: list every server variable instead of redirecting
    Response.Write("<table width=""95%"" align=""center"">")
    Response.Write("<tr><td colspan=""2""><h3><b>Server Variable</b></h3></td></tr>")
    For Each var In Request.ServerVariables
      Response.Write("<tr>")
      Response.Write(" <td><b>" & var & "</b></td>")
      Response.Write(" <td>" & Request.ServerVariables(var) & "</td>")
      Response.Write("</tr>")
    Next
    Response.Write("<tr><td><b>REFERER</b></td><td>" & Request.ServerVariables("HTTP_REFERER") & "</td></tr>")
    Response.Write("</table>")
  End If
End Sub
' Call the Sub Procedure (the procedure name used here is illustrative)
Call RedirectToPrimaryDomain()
%>
At the time of writing, a similar system is in use on http://hpcfactor.com and http://220.127.116.11/ as a method of cleansing our run-away search engine listings. For the most part the scripting is straightforward, elegant and effective; however there are some caveats with IIS 5.0/6.0 and ASP.
IIS 6.0 Configuration for Dynamic 404 redirection
Internet Information Server, whether it is a bug or intended, appears to struggle with retention of the querystring segment of a URL GET request.
Using my own server side script, and using the following URL:
You would expect the redirect to send the client to:
However, the reality is that IIS redirects the client to http://www.hpcfactor.com/forums/forums/thread-view.asp, dropping the querystring completely and sending the user to the forum index page instead of the thread requested. The querystring information is not passed through the HTTPd server and into the Request.ServerVariables ASP object, so it is dropped from the server initiated redirect.
I have discovered an exception to this rule, and that is in an implied content document. Take for example a H/PC:QLink address:
This will redirect correctly to http://www.hpcfactor.com/qlink/?linkID=1 because there is no content document listed in the request URL. If however the URL is expanded to become literal (http://hpcfactor.com/qlink/default.asp?linkID=1; the implied default.asp is added) the redirect will no longer work (Redirecting to http://www.hpcfactor.com/qlink/default.asp; which returns the client to the Home Page).
There is a discernible reason for this. The details we use from Request.ServerVariables("QUERY_STRING") to ascertain the originating URL have been re-encoded as a querystring of the 404 page, not as a querystring of the originating failure page. Consequently, the querystring from the original URL has been dropped from the request as it has not been escaped into a repeatable character format. This is an inherent limitation of this script and of using ASP to perform this kind of server side process.
For content which does not require a querystring (most of our content) or for our perma-link system (our most important content) which does still work, this is an acceptable trade-off for our situation, and we have been using it effectively for some time.
The process is also proving effective at helping us to strip out unwanted ghost copies of the site, as demonstrated via Google here: http://www.google.com/search?q=site:18.104.22.168 which at the time of writing is down to only 6 remaining pages of Google content on our IP address.
Of course no webmaster wants to sacrifice traffic, and there is a far more efficient and effective way of doing this. Yet this script example is valid for anyone who does not wish to allow IIS to manage their redirection or, as may be the case, has limited console access to IIS metabase configuration.
IIS Home Directory Redirection (The Easy Way)
If you are an Apache user, you will already know that httpd.conf can be used to automate the process of domain interlinking and automated redirection. It therefore should come as no surprise that it is possible to deliver the same results from IIS for each of your IIS hosted sites.
As an option, this requires full configuration over the metabase, either through a metabase editor or through the IIS Admin MMC snap-in.
The IIS approach can be configured to forward http connections to a single, universal file (such as sending everybody to the same “click here to view the <name> home page”) or using wildcard variables to manipulate the domain, file system path and querystring of any given URL.
There are three key wildcard variables involved in this example: $S, $Q and $P.
$S
The suffix variable should be considered as the path difference analyser wild card. This variable carries the path across the substitution between the old and new domain name. For example, given an origin URL whose path is /folder/subfolder/script.asp, the /folder/subfolder/script.asp would be relayed through to the destination URL, without the domain or the querystring portion of the GET request.
$Q
This variable passes the full querystring statement from the origin URL into the destination URL, as-is. For example, if the origin URL ended in ?opt1=var1&opt2=var2, then the ?opt1=var1&opt2=var2 would be relayed to the destination URL, in full.
$P
The Parameter wildcard is very similar to the $Q one responsible for seizing the querystring, with one exception. By using $P the wildcard returns only the parameters specified in the querystring, and not the entire querystring. In practice this simply means that the ? is dropped from the URL. Drawing upon our example URL, the data passed through to the destination URL would be opt1=var1&opt2=var2.
In practice this is primarily useful for redirecting dynamic scripts where you need to pass additional variables between an old and a new version (for example, you want to hard code additional querystring options into the URL, i.e. http://www.hpcfactor.com/script.asp?olduser=true&opt1=var1&opt2=var2).
The configuration process is relatively straightforward (although there are other options which you can find outlined in IIS’s on-line help).
Assuming that we are configuring a virtual server for http://hpcfactor.com/ and wish all connections to be redirected to http://www.hpcfactor.com/, the redirect target becomes http://www.hpcfactor.com$S$Q, or translated, http://www.hpcfactor.com<full-path><full-querystring>.
IIS 6.0 Home Directory Wild Card Redirection
Note that the first / after .com is formed as part of the $S statement and should not be manually entered after the URL. If you do, then the redirect will resolve to http://www.hpcfactor.com//folder…
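To make the substitution concrete, here is a minimal Python sketch that models how the three wildcards expand for a redirect configured at the root of a site. It is an illustration of the behaviour described above, not IIS code: the function name expand_redirect is hypothetical, and treating the variables as case-insensitive is an assumption of the sketch.

```python
import re
from urllib.parse import urlsplit


def expand_redirect(template: str, origin_url: str) -> str:
    """Model the $S, $Q and $P wildcards for a site-root redirect (sketch only)."""
    parts = urlsplit(origin_url)
    values = {
        "S": parts.path,                                 # suffix: path, leading / included
        "Q": "?" + parts.query if parts.query else "",   # querystring, with the ?
        "P": parts.query,                                # parameters, without the ?
    }
    # Replace each wildcard in a single pass so that substituted text
    # is never re-scanned for further wildcards.
    return re.sub(
        r"\$([SQP])",
        lambda match: values[match.group(1).upper()],
        template,
        flags=re.IGNORECASE,
    )


origin = "http://hpcfactor.com/folder/subfolder/script.asp?opt1=var1&opt2=var2"

# Path and querystring carried across to the www. domain.
print(expand_redirect("http://www.hpcfactor.com$S$Q", origin))

# A manually entered / before $S produces the double slash.
print(expand_redirect("http://www.hpcfactor.com/$S$Q", origin))

# $P lets extra options be hard coded ahead of the original parameters.
print(expand_redirect("http://www.hpcfactor.com$S?olduser=true&$P", origin))
```

The second call shows why the trailing / must be left off the configured URL: $S already begins with one.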
What about the Search Engines?
Up until this point, all the solutions I have discussed have relied upon transmitting 404 errors to any http client connecting to the server. This system is independent of the custom error system or client side scripting of previous cogitations. So how will using this system impact a desired search engine clean-up?
The answer to this question is that in reality it depends on the credibility of the search engine spider and whether or not you tick a box on the IIS configuration.
IIS implements its redirection functionality using HTTP status codes 301 and 302. 302 is a “temporary redirect”, i.e. the server is instructing the client that the redirection will be undone at some point in the future, while 301 is a “permanent redirect”. In IIS Admin the difference is toggled by selecting or deselecting the “This is a permanent redirection for this resource” check box, and for the devout search engine follower, you will want to ensure that your system is sending a 301 response, the permanent redirect.
In the case of a 404 error, the search engine spider is not given a choice. Usually, after between 3 and 6 randomly interspersed visits, if the page has consistently generated a 404 it will be considered for de-listing. Note that getting a root page de-listed is a lot harder and far more time consuming!
With both 301 and 302, there is always content being sent back to the spider, so it is at the discretion and standards fidelity of the crawler in question as to whether it considers consistent redirections to be worthy of de-listing. If the bot is standards based, you can assume that by returning a 302 the redirect will not result in a de-listing, at least not a quick one. If the engine is on the ball, then a 301 will more rapidly de-list content. Of course, on the other hand, the spider could ignore the status code completely and re-index both!
Ultimately for HPC:Factor, we elected to use 404’s to clear the bulk of the unwanted content as fast as possible. This is a policy which has worked for us and has not caused us any observable problems over the course of the experiment. Once the listings had cleaned up, particularly with regard to IP based entry points, we switched over to IIS’s redirect to catch stray links.
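The status codes discussed in this section can be checked against the table in Python’s standard library. The sketch below maps the state of the IIS “permanent redirection” check box to the code sent, and summarises how a standards-following spider would be expected to react. The function names and the wording of the reactions are illustrative assumptions; real crawlers are free to behave differently, as noted above.

```python
from http import HTTPStatus


def redirect_status(permanent: bool) -> HTTPStatus:
    """Status code sent for a redirect, given the state of the check box."""
    return HTTPStatus.MOVED_PERMANENTLY if permanent else HTTPStatus.FOUND


def expected_listing_effect(status: int) -> str:
    """Hypothetical summary of how a standards-following spider reacts."""
    effects = {
        301: "old URL should be replaced by the redirect target",
        302: "old URL is likely to stay listed",
        404: "old URL becomes a candidate for de-listing after repeated visits",
    }
    return effects.get(status, "no expectation defined in this sketch")


for ticked in (True, False):
    status = redirect_status(ticked)
    print(ticked, status.value, status.phrase, "->", expected_listing_effect(status))
```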
IIS ISAPI (The Hard Way)
The ISAPI module is yet another stage higher than the previous methods. Such a solution would be required only by the smallest number of users who have very specific needs which are not catered for through any other method. The ISAPI approach extends upon the dynamic script, but instead of being a web process, the filtering and URL creation are performed by the HTTP server itself. An ISAPI is a custom DLL which you write and install onto the IIS box, and which can be used to selectively redirect site content from one domain to another in exactly the same fashion as the Dynamic 404 script. The difference is that it is able to intercept and maintain the original querystring information, so that 100% of existing URLs are transposed in the redirect, and it can be used to perform other, more advanced processing.
An example of such server side functionality would be run-time filtering of domain selection through variables such as browser encoding or regional locale. You can likewise dynamically adjust the response codes being served back to the client, providing flexibility as broad as the imagination of the programmer.
In order to program an ISAPI, you need to be familiar with low level Windows languages (C/C++) and be comfortable with the Windows API, something I am not willing to offer up code samples for due to lack of experience. The potential does, however, exist.
The limitation of both the 404 and ISAPI options is, however, that you require access to the MetaBase/IIS console/IIS Web Manager in order to re-configure the server. If you are on a low cost Windows hosting plan, you are not guaranteed to be afforded this functionality.
Additionally, there are known issues with very old browsers that are not able to interpret a server initiated redirect, and consequently provision must be made for the very unlikely possibility of such a visit; namely by displaying a text hyperlink (as is done in my Dynamic 404 script).
There’s always a bigger fish. Oh yes.
The previous methods have all assumed that the content maps nicely between the different domains. If you are having a clean-up of your DNS access then chances are it may well do. However, if you are undertaking this as part of a broader structural or infrastructure project then the literal path remapping may be utterly useless. So, at the top of the pile we have the Intelligent 404.
An intelligent 404 is really only as intellectual as the programmer that builds it, and in the context of this article it must be used in conjunction with other methods as part of an overall package. Suffice it to say, the process is designed to maximise the chance that a lost user who is following a dead link or dead domain finds what they are looking for.
Parts of Microsoft.com make use of an intelligent 404 system; in fact, Microsoft is the only example I can think of that makes use of such a system.
If I were writing an intelligent 404 system (and I have been considering it), the premise would be that it uses a combination of URL archiving, URL analysis, keyword generation and content searching to generate a match, or a list of matches, for where it thinks the user originally intended to go.
If the backend database contains a URL history, this can be very simple – H/PC:QLink as a perma-link engine is capable of maintaining such an archive list. Through a coupler into a dynamic 404 the database can be queried and the user sent on their way.
A more advanced approach would be to break the URL down into keywords and perform a ranking analysis on content that matches. Take for example the CESD URL http://hpcfactor.com/support/cesd/c/0031.asp. The 404 manager would be able to break the URL down into “cesd”, “c” and “0031”. By searching the CMS title and page body fields, the engine would find the link to the CESD and redirect the user. The URL is so clear in this instance that the user could be sent there automatically, without being prompted on the screen.
For more complicated URLs, or for searches with less obvious result sets, the user can be presented with content suggestions, allowing them to make their own judgement over where it is they want to go.
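The keyword approach can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a design for HPC:Factor: the pages dictionary stands in for a CMS title/body search, its URLs and text are invented for the example, the function names are hypothetical, and the auto-redirect threshold of 3 matching keywords is arbitrary. Unlike the breakdown above, the sketch also keeps “support” as a keyword.

```python
import re
from typing import Dict, List, Tuple
from urllib.parse import urlsplit


def url_keywords(url: str) -> List[str]:
    """Split a URL path into lower-case keywords, dropping the file extension."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    if segments:
        segments[-1] = segments[-1].rsplit(".", 1)[0]
    return [s.lower() for s in segments if s]


def rank_pages(keywords: List[str], pages: Dict[str, str]) -> List[Tuple[str, int]]:
    """Score each page by how many keywords appear as whole words in its text."""
    ranked = []
    for page_url, text in pages.items():
        tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
        score = sum(1 for keyword in keywords if keyword in tokens)
        if score:
            ranked.append((page_url, score))
    # Best score first; ties are broken alphabetically for a stable order.
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


def resolve(url: str, pages: Dict[str, str], auto_threshold: int = 3) -> Tuple[str, List[str]]:
    """Redirect on a clear winner, otherwise return suggestions for the user."""
    ranked = rank_pages(url_keywords(url), pages)
    clear_winner = (
        bool(ranked)
        and ranked[0][1] >= auto_threshold
        and (len(ranked) == 1 or ranked[0][1] > ranked[1][1])
    )
    if clear_winner:
        return ("redirect", [ranked[0][0]])
    return ("suggest", [page_url for page_url, _ in ranked])


# Hypothetical stand-in for the CMS content index.
pages = {
    "/support/cesd/article.asp?id=31": "CESD C 0031: support article",
    "/support/": "Support index",
    "/forums/": "Community forums",
}

print(resolve("http://hpcfactor.com/support/cesd/c/0031.asp", pages))
print(resolve("http://hpcfactor.com/support/missing.asp", pages))
```

The first request matches one page on every keyword and is redirected; the second only matches on “support”, so the visitor is offered both candidates instead.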
The advantages of an intelligent 404 system should be clear. The redirection success rate will be far higher, particularly when transitioning from static content to the complex querystring based URLs used by CMS operated sites, or indeed when operating a very large website such as Microsoft.com.
The downside is the workload involved in the creation of such a system. Familiarity with server side scripting and databases is required over and above the need to understand the internal workings of your CMS technology, and there is a distinct possibility that you will need to write ISAPIs to make the project operate smoothly across both GET and more sophisticated POST requests.
For an average on-line company the time and expense involved far outweigh the benefits of such a system, but nonetheless intelligent 404s are a titillating solution for anyone looking to clean up their domain presence.
What not to do
At this stage you might be thinking “why not just perform a client side check on the domain”, or “force everyone onto the correct URL in-place when they load the page”.
Let me guess:
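<% Dim strDomain
' Host name the visitor used to reach this page
strDomain = Request.ServerVariables("SERVER_NAME")
If strDomain <> "www.hpcfactor.com" Then
    ' Send the visitor to the same path on the preferred domain
    Response.Redirect("http://www.hpcfactor.com" & Request.ServerVariables("PATH_INFO"))
End If %>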
You could do that, and you could drop a sizeable portion of your users into a recursive refresh loop. Somehow I don’t really think you want to though!
Attempting to do this through content that all your viewers will see, on every page is firstly a spectacularly phenomenal waste of processing time, and places you in a position where you are liable to regret installing it in the first place.
The check only needs to be made once, and not when the visitor is in the correct place. Save it for when they are in the wrong place; at least there any browser incompatibilities will be confined to those following bad links or rapidly disappearing search engine listings. If I haven’t sold you on this yet, just remember: the search engine spider is not necessarily interested in your redirect. This system will not be beneficial in the pursuit of clean search engine listings, as the spider only wants to see the status code (404/301).
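If a host check is made at all, on the wrong-domain entry point rather than on every page, it should do nothing when the visitor is already in the right place and should always redirect to an absolute URL on the preferred host. The Python sketch below models that decision only; the function name canonical_redirect and the constant CANONICAL_HOST are illustrative, and the http:// scheme is assumed.

```python
from typing import Optional

CANONICAL_HOST = "www.hpcfactor.com"


def canonical_redirect(host: str, path: str, query: str = "") -> Optional[str]:
    """Return the URL to redirect to, or None if the host is already correct."""
    if host.lower() == CANONICAL_HOST:
        return None  # already in the right place: never redirect again
    target = "http://" + CANONICAL_HOST + path
    if query:
        target += "?" + query
    return target


# A visitor on the bare domain is sent to the preferred host once.
print(canonical_redirect("hpcfactor.com", "/support/", "id=1"))

# The redirect target itself produces no further redirect, so no loop.
print(canonical_redirect("www.hpcfactor.com", "/support/", "id=1"))
```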
As a corporate entity, if you are spending any money on the Internet, then your brand and your own awareness of your brand should be just as important as the aesthetic look and feel of your web page.
If your IHP has not configured your domain points of entry correctly then potentially you may be losing traffic which could be working for your business. Professionally I have seen this countless times over the years as institutions fail to understand the DNS system, and IHP’s strive to provide as little as possible for as much as they can – particularly in the UK.
Opinions over the significance of the issues raised in this article vary. You must weigh the impact of your entry points, search engine listings (both pros and cons) and visual branding against the needs of your business. As an administrator, if and when the time comes that you are faced with the challenge of reconsolidating your domain presence, some thought must be given to the impact of such consolidation on your company’s exposure on-line. For any on-line commercial entity, research and judgment must come as a matter of course before that switch is thrown, condemning potentially thousands of users to error messages, failed access and bad cyber-service.
If people cannot click through into your web site, you may as well not have been on-line in the first place!