ControlFreak
LANGUAGES: VB.NET | C#
ASP.NET VERSIONS: 2.x
Prepare to Be Searched
Get Your Site Noticed by the People Who Matter Most
By Steve C. Orr
If your Web site provides useful content, services, or products, there are people out there who want to know about it. But how do you get the word out? You could send out copious amounts of spam to get noticed, but that's not likely to earn the kind of reputation that most organizations crave. Other forms of marketing and advertising are likely to bring more positive results, but just because you don't have an advertising budget doesn't mean you're out of luck. Read on to find out free ways to maximize your Web site's status and get found by the people you're trying to reach.
Robots and Crawling Spiders
Sounds like an introduction to a sci-fi movie, doesn't it? Actually, robots, crawlers, and spiders are all names for custom software from search engines like Google, Yahoo, and MSN Search that investigates what's currently out on the Internet. If you have a public Web site, chances are it has already been visited, scanned, and thoroughly indexed by one of these ominous-sounding pieces of software. As intimidating as they sound, spiders can be your best friend if you take the time to understand them, because they hold the key to every Web site's search ranking. If your site sells discount toothpicks, then your site needs to appear near the top of the list when users search for discount toothpicks, and the spiders hold the power to make that happen.
Functionally speaking, spiders do little more than record key pieces of your Web page's HTML and follow the hyperlinks to see where they lead. Conceptually, it's not very difficult to design a basic spider yourself. The .NET WebRequest class is all you really need to retrieve the HTML of a page so you can parse it and extract the hyperlinks to recursively parse other related Web pages. Along the way, you can store important pieces of text in a database for querying. Sites like Google and Yahoo have become masters of this technique, and by understanding some details about how they do it, you can use their global dominance to advance your own agenda.
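To make the idea concrete, here's a minimal sketch of such a spider in C# 2.0. It is only an illustration under some loud assumptions: the TinySpider class and its method names are hypothetical, the regular expression is a naive stand-in for a real HTML parser, and a production spider would also honor robots.txt, throttle its requests, and store what it finds in a database.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class TinySpider
{
    // Naive pattern for href="..." or href='...' attributes.
    // A real crawler would use a proper HTML parser instead.
    private static readonly Regex HrefPattern = new Regex(
        "href\\s*=\\s*[\"'](?<url>[^\"']+)[\"']",
        RegexOptions.IgnoreCase);

    // Downloads the raw HTML of a page using WebRequest.
    public static string Fetch(Uri pageUri)
    {
        WebRequest request = WebRequest.Create(pageUri);
        using (WebResponse response = request.GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Extracts the distinct http/https link targets found in a page's HTML,
    // resolving relative links against the address of the page itself.
    public static List<Uri> ExtractLinks(Uri pageUri, string html)
    {
        List<Uri> links = new List<Uri>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            Uri link;
            if (!Uri.TryCreate(pageUri, match.Groups["url"].Value, out link))
                continue;
            if (link.Scheme != Uri.UriSchemeHttp && link.Scheme != Uri.UriSchemeHttps)
                continue; // skip mailto:, javascript:, and so on

            // Drop any #fragment so the same page isn't queued twice.
            link = new Uri(link.GetLeftPart(UriPartial.Query));
            if (!links.Contains(link))
                links.Add(link);
        }
        return links;
    }

    // Breadth-first crawl that stops after maxPages pages have been visited.
    public static List<Uri> Crawl(Uri start, int maxPages)
    {
        List<Uri> visited = new List<Uri>();
        Queue<Uri> pending = new Queue<Uri>();
        pending.Enqueue(start);

        while (pending.Count > 0 && visited.Count < maxPages)
        {
            Uri current = pending.Dequeue();
            if (visited.Contains(current)) continue;
            visited.Add(current);

            string html;
            try { html = Fetch(current); }
            catch (WebException) { continue; } // unreachable page; move on

            // A real spider would record the title and key text here.
            foreach (Uri link in ExtractLinks(current, html))
            {
                if (!visited.Contains(link)) pending.Enqueue(link);
            }
        }
        return visited;
    }
}
```

Calling TinySpider.Crawl(new Uri("http://www.SomeSite.com/"), 25) would walk up to 25 pages reachable from that address. The maxPages cap is a deliberate design choice; without it, even a toy spider will happily wander off across the Internet.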
A primary technique that spiders employ is to examine the words used most often in your Web pages. Therefore, the text content of your Web site is important for determining the ranking of your site in relation to specific words and phrases. It's not very feasible (or advisable) to make major changes to the content of a Web site just to increase search rankings. Instead, there are other techniques that are likely to give better results. For example, another extremely important item search engines examine is the title of a page. In a basic HTML page, the title would be defined like this:
<html>
<head>
<title>THIS IS THE TITLE</title>
</head>
<body>
Hello World
</body>
</html>
When the page is viewed by a user, its title shows up in the title bar of the browser, as shown in Figure 1. As far as search engines are concerned, it is best to have the title consist of a good sentence or two filled with highly descriptive words about the page and/or Web site. This will help search engines "understand" the primary focus of the Web page, thereby increasing the site's ranking when people search for related topics.
Figure 1: A Web page's title shows up in the title bar of the user's browser. It's a key element that is examined by most major search engines to determine the subject matter of a Web page.
In ASP.NET 2.0, you're likely to have a master page, so the simplest way to specify the title for each page will be more like this:
<%@ Page TITLE="MY PAGE TITLE" Language="VB"
MasterPageFile="~/MyMasterPage.master" %>
<asp:Content ID="Content1"
ContentPlaceHolderID="CPH1" Runat="Server">
Hello World
</asp:Content>
This technique is fine for a small Web site, but for larger sites you're in for a major maintenance chore if you ever decide to change the titles of all the pages in your Web site. Luckily, ASP.NET 2.0 makes it easy to change a page's title programmatically from the page's (or master page's) code-behind file:
Page.Title = "Discount Toothpicks" 'VB 2005
Page.Title = "Discount Toothpicks"; //C# 2.0
Now all that's needed is a way to programmatically set the page title from some kind of a data source. Luckily, the SiteMapDataSource is perfect for this kind of thing. For more information about site maps, I suggest you read Automate Navigation Chores. Once a site map is set up, it only takes a tidbit of code in the master page's code-behind to set the page title to the associated title specified in the site map:
'VB 2005
If SiteMap.CurrentNode IsNot Nothing Then
  Page.Title = SiteMap.CurrentNode.Title
End If
//C# 2.0
if (SiteMap.CurrentNode != null)
  Page.Title = SiteMap.CurrentNode.Title;
Descriptions, Keywords, and Meta Tags
Virtually all search engines make use of the page title, so there is a high payoff in ensuring that each page is thoroughly titled. However, there are other specific HTML elements that some search engines also value highly in their rankings. For example, Yahoo and MSN Search use the description meta tag when present; Yahoo uses the keywords meta tag, as well. Here's a syntactically correct example of these meta tags in action:
<html>
<head>
<title>THIS IS THE PAGE TITLE</title>
<meta name="description" runat="server"
content="Discount Toothpicks" id="description" />
<meta name="keywords" runat="server"
content="toothpicks, discount, teeth, cheap" id="keywords" />
</head>
<body>
Get yer cheap toothpicks here
</body>
</html>
Technically, from an HTML perspective, the runat and id attributes are not required, but including them permits you to adjust the tags' values via server-side code. For example, you can use a site map for the description meta tag in much the same way that the page title was set in the previous example:
'VB 2005
If SiteMap.CurrentNode IsNot Nothing Then
  Me.description.Content = SiteMap.CurrentNode.Description
  'Assumes every node defines the custom "keyword" attribute
  Me.keywords.Content = _
    SiteMap.CurrentNode("keyword").ToString()
End If
//C# 2.0
if (SiteMap.CurrentNode != null)
{
  this.description.Content = SiteMap.CurrentNode.Description;
  //Assumes every node defines the custom "keyword" attribute
  this.keywords.Content =
    SiteMap.CurrentNode["keyword"].ToString();
}
While site maps don't officially support the keyword attribute, you can add it anyway because extraneous attributes are permitted and can be accessed programmatically using the syntax listed above.
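For reference, a Web.sitemap file that feeds the code above might look something like the following sketch. The page names and text are made up for illustration; title and description are standard siteMapNode attributes, while keyword is the custom, unofficial attribute just described.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<siteMap xmlns="http://schemas.microsoft.com/AspNet/SiteMap-File-1.0">
  <!-- Hypothetical pages; substitute your own URLs and text. -->
  <siteMapNode url="~/Default.aspx"
      title="Discount Toothpicks"
      description="Bulk toothpicks at discount prices"
      keyword="toothpicks, discount, teeth, cheap">
    <siteMapNode url="~/Flavored.aspx"
        title="Flavored Discount Toothpicks"
        description="Mint and cinnamon toothpicks on the cheap"
        keyword="toothpicks, flavored, mint, cinnamon" />
  </siteMapNode>
</siteMap>
```

With a file like this in place, changing a page's title, description, or keywords becomes a one-line edit in a single file rather than a hunt through every page on the site.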
Get a Buzz
Another extremely important factor that search engines consider when ranking a site is how many other Web pages on the Internet link to that site. For a Web site to be considered an authority on a particular topic, it needs a lot of related Web sites pointing to it, and the effect is greatest when those sites rank highly (see Figure 2). Of course, the obvious question here is how to get other sites to link to yours. There is no single great answer to this, although it sure helps if you've got a lot of advertising dollars to spend. Otherwise, you're stuck with gradually building a reputation and getting other sites to link to yours via trading, begging, bartering, and hard work. Sometimes sharing content with other Web sites is a good way to get them to notice you and (more importantly) provide valuable hyperlinks back to your site.
Figure 2: The Google toolbar plug-in (available for Internet Explorer and Firefox) gives a good indication of a particular Web site's ranking. This ranking is based primarily on how many other Web sites link to the site.
Creating a buzz is a great way to launch a public site on the right foot. Get the word out. Make sure all the sites that should know about your pages are aware your site is online. Post in public forums frequently, and always include a hyperlink to the site in your signature or elsewhere in the posting. Get friends and coworkers to join in, too. If you're proud of your site, make a big deal about it and see who notices.
Through some investigation, you might find some link networks related to your industry. Basically, when you join such a network you agree to provide links to other related Web sites, and they agree to link to yours as well. Varying degrees of automation are generally involved to ensure participation among members. If you go with this approach, be sure to stay with link networks within your industry. Straying into more general "link farm" networks will often have the opposite effect: it waters down the focus of your Web site in the eyes of search engines, potentially making it more difficult to find.
When you feel your site is ready, most major search engines provide a way to submit a site for indexing, which effectively queues the site for a visit from a spider. To submit a site to a search engine, visit its main search page and follow the "help" link to find the submission page. It's generally not necessary to submit a site because the spiders will eventually find it on their own, although submitting can sometimes speed up the process. Google's spiders in particular are very effective at discovering new sites without any help. Don't worry if your site has already been indexed; spiders will visit again soon to investigate content revisions.
What Not To Do
While all the previous tips provide valuable things that can be done to improve a site's search ranking, there are also some things that simply should not be done. For example, most spiders are unable to analyze images, so you shouldn't hide critical search phrases inside an image unless they are duplicated in the image's ALT attribute.
It's also advisable not to attempt to trick search engines into increasing a site's ranking. People have come up with all kinds of devious ways to hide extra keywords in HTML documents in an effort to boost profiles. Some people mistakenly think injecting a wide variety of irrelevant words into a Web site will help it to be found by a wider audience. My advice is to not get cute like this. The major search engines have seen it all before. At best, these extra words will be ignored; at worst, your entire site could end up being ignored.
Generally speaking, the more Web sites that link to your site the better. However, there are a couple of exceptions. Web sites infamous for undesirable content such as spam, warez, and other illegal activities might give your Web site a bad reputation in the eyes of search engines if they consistently link to your site. In other words, keep your nose clean so questionable sites will have little interest in linking to your content.
Complex QueryStrings can also confuse spiders. For example, do these two URLs output the same content?
http://www.SomeSite.com/ShowContent.aspx?ID=1
http://www.SomeSite.com/ShowContent.aspx?ID=2
The answer is, "it depends." As a Web developer, you likely know that the ID QueryString tacked onto the end of the URL could be mostly irrelevant, or it could completely change the page that is displayed. Spiders understandably tend to get confused by this kind of thing and don't know whether to index them as separate pages. As a result, some spiders completely ignore such pages. Because complex QueryStrings confuse spiders, they should be mostly avoided, especially for pages that are meant to be highly searchable. The Context.RewritePath method can be quite useful for providing spider-friendly URLs without having to heavily modify a preexisting architecture that relies on QueryStrings.
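Here's one way the RewritePath approach could be sketched in C# 2.0. Treat it as an assumption-laden illustration, not a recipe: the FriendlyUrls class, the "~/Content/42.aspx" URL scheme, and the mapping rule are all hypothetical. The idea is that spiders see a clean URL, while the existing ShowContent.aspx page continues to receive the ID QueryString it expects.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the URL scheme and names are illustrative only.
public static class FriendlyUrls
{
    // Matches app-relative paths such as "~/Content/42.aspx".
    private static readonly Regex ContentPattern = new Regex(
        @"^~/Content/(?<id>\d+)\.aspx$", RegexOptions.IgnoreCase);

    // Returns the internal QueryString-based path for a friendly URL,
    // or null when the requested path is not a friendly content URL.
    public static string Map(string appRelativePath)
    {
        Match match = ContentPattern.Match(appRelativePath);
        if (!match.Success) return null;
        return "~/ShowContent.aspx?ID=" + match.Groups["id"].Value;
    }
}

// Hooking it up in Global.asax (shown as a comment because it only
// compiles inside an ASP.NET application that references System.Web):
//
// protected void Application_BeginRequest(object sender, EventArgs e)
// {
//     string target = FriendlyUrls.Map(
//         Request.AppRelativeCurrentExecutionFilePath);
//     if (target != null) Context.RewritePath(target);
// }
```

I kept the .aspx extension in the friendly URL on purpose, on the assumption that the site runs with default settings in which only requests for ASP.NET file types reach the ASP.NET pipeline. Keeping the mapping rule in its own small method also makes it easy to test without spinning up a Web server.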
Private Parts
Perhaps there are parts of a Web site that should not be searched. Maybe they contain personal information or sensitive copyrighted content. The best solution is to use some kind of authentication, such as Forms Authentication or Windows Authentication. Because spiders don't have user accounts, they won't be able to access (or index) the information contained within. However, if a full-blown authentication system is overkill for your needs, there are some simple alternatives to keep specific pages away from prying spider eyes.
One solution is the ROBOTS meta tag. To prevent a page's content from being indexed, add the following meta tag to its HTML:
<meta name="ROBOTS" content="NOINDEX" />
To prevent spiders from following hyperlinks contained within the page, add this meta tag to the page's HTML:
<meta name="ROBOTS" content="NOFOLLOW" />
While this solution can be useful for protecting a page or two, it can start to become less manageable for larger numbers of pages. If entire directory trees need to be protected, then creating a robots.txt file in the web root may be a better solution because it centralizes the management of such details. To prevent the entire Web site from being indexed, the robots.txt file should contain the following text:
User-agent: *
Disallow: /
This tells all (*) spiders to ignore pages starting at the root (/) of the Web site. It's easy to be more selective about which files to exclude, such as in the following example that denies (only) Google permission to index content in the web root's subdirectory named "secure", as well as the "/data/logs" subdirectory:
User-agent: Googlebot
Disallow: /secure
Disallow: /data/logs
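To see how these rules play out, it can help to look at them from the spider's side. The following C# 2.0 sketch shows roughly how a well-behaved spider might decide whether it may fetch a path. It is a simplified reading of the file format, not any search engine's actual logic: the RobotsRules class is hypothetical, and it handles only User-agent and Disallow lines with simple prefix matching.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper showing a simplified reading of robots.txt.
public static class RobotsRules
{
    // Returns true if the named spider may fetch the given path.
    public static bool IsAllowed(string robotsTxt, string spiderName, string path)
    {
        List<string> specificRules = new List<string>(); // groups naming this spider
        List<string> wildcardRules = new List<string>(); // groups for "*"
        bool foundSpecificGroup = false;
        bool groupIsSpecific = false;
        bool groupIsWildcard = false;
        bool lastLineWasAgent = false;

        foreach (string rawLine in robotsTxt.Split('\n'))
        {
            string line = rawLine;
            int hash = line.IndexOf('#');
            if (hash >= 0) line = line.Substring(0, hash); // strip comments
            int colon = line.IndexOf(':');
            if (colon < 0) continue; // blank or malformed line

            string field = line.Substring(0, colon).Trim().ToLowerInvariant();
            string value = line.Substring(colon + 1).Trim();

            if (field == "user-agent")
            {
                // Consecutive User-agent lines share one group of rules.
                if (!lastLineWasAgent)
                {
                    groupIsSpecific = false;
                    groupIsWildcard = false;
                }
                if (value == "*")
                {
                    groupIsWildcard = true;
                }
                else if (value.Length > 0 &&
                    spiderName.IndexOf(value, StringComparison.OrdinalIgnoreCase) >= 0)
                {
                    groupIsSpecific = true;
                    foundSpecificGroup = true;
                }
                lastLineWasAgent = true;
            }
            else
            {
                // An empty Disallow value means "nothing is disallowed".
                if (field == "disallow" && value.Length > 0)
                {
                    if (groupIsSpecific) specificRules.Add(value);
                    if (groupIsWildcard) wildcardRules.Add(value);
                }
                lastLineWasAgent = false;
            }
        }

        // A group that names the spider takes precedence over the "*" group.
        List<string> rules = foundSpecificGroup ? specificRules : wildcardRules;
        foreach (string prefix in rules)
        {
            if (path.StartsWith(prefix, StringComparison.Ordinal)) return false;
        }
        return true;
    }
}
```

Given the Googlebot example above, this sketch would refuse "/secure/orders.aspx" for Googlebot but allow it for any other spider, since no "*" group is present. Note that matching is by prefix, so "Disallow: /secure" also covers a file such as "/secure-offers.aspx"; add a trailing slash if you only mean the directory.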
It's also possible to grant different levels of access to spiders from different search engines, and other advanced tricks that are beyond the scope of this article. For more information, see http://www.robotstxt.org/wc/faq.html.
Although there is currently no ratified standard that is guaranteed to ward off all search engines, most voluntarily comply with the techniques mentioned here.
Search Is King
Being easily found on the Internet is an important accomplishment for any public organization. Being able to find information can be just as important. For more details on how to retrieve and use search results programmatically, see SearchBox.
Obviously, the topic of searching and indexing the Web is far more complex than anyone could hope to cover in an article or two; otherwise, companies like Google and Yahoo wouldn't be able to rake in such enormous amounts of money from their expertise. Armed with the right knowledge, and building on the information you now have, maybe you too can scoot up to the table and grab yourself a piece of the pie.
Steve C. Orr is an MCSD and a Microsoft MVP in ASP.NET. He's been developing software solutions for leading companies in the Seattle area for more than a decade. When he's not busy designing software systems or writing about them, he can often be found loitering at local user groups and habitually lurking in the ASP.NET newsgroup. Find out more about him at http://SteveOrr.net or e-mail him at Steve@Orr.net.