
Misdirected requests - no Host header maybe?



Sorry this request for help is a bit off topic for this group but I am really 
stuck and could do with some help. If you can't help but know where I might 
be able to get help I would appreciate a pointer in the right direction.

I run a few sites off one static IP address using virtual hosting. Some sites 
(www.crazysquirrel.com, www.ruralescapes.co.uk and www.shallowsea.com) are 
Java based and use the Apache Tomcat connector. Others, such as 
blog.crazysquirrel.com, are PHP based and hosted straight out of Apache (I'm 
running Apache 2.0 on Debian). All the sites appear to work just fine. There 
don't appear to be any problems with people navigating around them. 

The problem is with search engines such as Yahoo Slurp and Googlebot. A large 
number of requests for pages that are in one of the other domains are ending 
up at blog.crazysquirrel.com. My best guess is that for some reason Slurp and 
Googlebot are making requests but leaving off the Host header. Now this 
wouldn't be completely out of spec because they are making HTTP 1.0 requests 
and as such don't require a Host header. I would have expected, however, that 
every request from them would come with one since virtual hosting is now so 
common. It briefly crossed my mind that it was simply a probe to detect 
virtual hosting, but there are far too many requests going astray (more go 
astray than the real sites actually get), so I conclude something must 
be wrong.
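
One way to check the missing-Host theory by hand is to send the same request 
twice, once with a Host header and once without, and compare what comes back. 
The following is only a rough sketch: the function names build_request and 
fetch_status_line are made up for illustration, and it assumes the server 
answers plain HTTP on port 80. Apache hands a request it cannot match to a 
name-based virtual host to the first one listed for that address, so if the 
theory is right the Host-less request should get the blog's 404.

```python
import socket


def build_request(path, host=None, version="1.0"):
    """Return the raw bytes of a GET request, with or without a Host header."""
    lines = [f"GET {path} HTTP/{version}"]
    if host is not None:
        lines.append(f"Host: {host}")
    lines.append("Connection: close")
    # A blank line terminates the header block.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def fetch_status_line(server, path, host=None, port=80, timeout=10):
    """Send the request to `server` and return the first line of the reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(build_request(path, host))
        data = b""
        # Read only until the status line is complete.
        while b"\r\n" not in data:
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.split(b"\r\n", 1)[0].decode("latin-1")
```

Calling fetch_status_line("www.shallowsea.com", "/events.html") and then the 
same call with host="www.shallowsea.com" should show whether the two requests 
are served by different virtual hosts.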

A little more digging seems to indicate that the search bots are able to load 
the first page (say http://www.shallowsea.com/index.html) but then start 
screwing it up when trying to access the links they find in that page. For 
example here is a little snippet of log file from yesterday for 
blog.crazysquirrel.com. These are page requests that should have gone to 
shallowsea.com:

66.249.66.34 - - [12/Oct/2005:14:08:47 +0100] 
"GET /events.html?change-category=7&resource-name=event HTTP/1.1" 404 209 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.66.34 - - [12/Oct/2005:14:09:14 +0100] 
"GET /links.html?change-category=51&resource-name=link HTTP/1.1" 404 208 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

In the shallowsea.com log I find this:

66.249.66.34 - - [12/Oct/2005:14:07:09 +0100] 
"GET /events.html?change-category=60&resource-name=event HTTP/1.1" 200 7073 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

66.249.66.34 - - [12/Oct/2005:14:09:50 +0100] 
"GET /links.html?change-category=54&resource-name=link HTTP/1.1" 200 7547 "-" 
"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

Note the times of these requests. It is fairly obvious that the Googlebot is 
trying to index shallowsea.com but for some reason about half the requests 
are going to the wrong domain.
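
To make the interleaving easier to see, the two access logs can be merged 
into a single timeline. This is a minimal sketch, assuming each entry sits on 
one line in Apache's combined log format (the entries above are wrapped only 
for email); parse_line, merge_logs and the labels passed in are illustrative 
names, not anything Apache provides.

```python
import re
from datetime import datetime

# Leading fields of an Apache combined-format log entry.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def parse_line(line, vhost):
    """Return (time, vhost, status, path), or None if the line does not match."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    when = datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z")
    return (when, vhost, int(m.group("status")), m.group("path"))


def merge_logs(logs):
    """Merge {vhost label: iterable of log lines} into one list sorted by time."""
    entries = []
    for vhost, lines in logs.items():
        for line in lines:
            entry = parse_line(line, vhost)
            if entry is not None:
                entries.append(entry)
    return sorted(entries)
```

Feeding it the lines from both log files and printing the result for a single 
client IP shows at a glance which requests from one crawl landed on which 
virtual host.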

Has anyone got any idea what might be going on here? I'm perfectly happy to 
accept that there is some header that I should be sending back that I am not 
but that doesn't feel like it's the problem as some requests seem to get 
through fine.

Many thanks,

Graham


