
Re: Misdirected requests - no Host header maybe?



On Fri, Oct 14, 2005 at 10:35:44AM +0100, Graham Smith wrote:
> Sorry this request for help is a bit off topic for this group but I am really 
> stuck and could do with some help. If you can't help but know where I might 
> be able to get help I would appreciate a pointer in the right direction.
> 
> I run a few sites off one static IP address using virtual hosting. Some sites 
> (www.crazysquirrel.com, www.ruralescapes.co.uk and www.shallowsea.com) are 
> Java based and use the Apache Tomcat connector. Others, such as 
> blog.crazysquirrel.com, are PHP based and hosted straight out of Apache (I'm 
> running Apache 2.0 on Debian). All the sites appear to work just fine. There 
> don't appear to be any problems with people navigating around them. 
> 
> The problem is with search engines such as Yahoo Slurp and Googlebot. A large 
> number of requests for pages that are in one of the other domains are ending 
> up at blog.crazysquirrel.com. My best guess is that for some reason Slurp and 
> Googlebot are making requests but leaving off the Host header. Now this 
> wouldn't be completely out of spec because they are making HTTP 1.0 requests 
> and as such don't require a Host header. I would have expected, however, that 
> every request from them would come with one since virtual hosting is now so 
> common. It briefly crossed my mind that it was simply a probe to detect 
> virtual hosting, but there are way too many requests going astray (more go 
> astray than the real sites actually get), so I conclude something must 
> be wrong.

If the problem is what you think it is, you might want to try the
ServerPath directive, which exists for compatibility with older browsers
that do not send a Host header, as described in
http://httpd.apache.org/docs/2.0/vhosts/name-based.html#compat
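
For illustration, here is a minimal sketch of what that section of the
docs describes, adapted to your host names. The IP address, the ServerPath
value and the DocumentRoot paths are placeholders I made up, not your real
configuration, and I have left out whatever connector directives your Java
sites need. Keep in mind that with name-based virtual hosting the first
<VirtualHost> listed is the default for any request whose Host header is
missing or matches nothing, which might explain strays landing on
blog.crazysquirrel.com if that host happens to be defined first.

```apache
# Placeholder address (assumption); use the real static IP.
NameVirtualHost 192.0.2.10

# First vhost listed: acts as the default when no Host header matches.
<VirtualHost 192.0.2.10>
    ServerName blog.crazysquirrel.com
    DocumentRoot /var/www/blog
</VirtualHost>

<VirtualHost 192.0.2.10>
    ServerName www.shallowsea.com
    # Hypothetical prefix: URLs starting with /shallowsea are served
    # from this vhost even when the client sends no Host header.
    ServerPath /shallowsea
    DocumentRoot /var/www/shallowsea
</VirtualHost>
```

For this to help, a client without a Host header has to reach the site
through a URL carrying the /shallowsea prefix, and the pages need relative
links so that such a client stays under that prefix.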

> A little more digging seems to indicate that the search bots are able to load 
> the first page (say http://www.shallowsea.com/index.html) but then start 
> screwing it up when trying to access the links they find in that page. For 
> example here is a little snippet of log file from yesterday for 
> blog.crazysquirrel.com. These are page requests that should have gone to 
> shallowsea.com:
> 
> 66.249.66.34 - - [12/Oct/2005:14:08:47 +0100] 
> "GET /events.html?change-category=7&resource-name=event HTTP/1.1" 404 209 "-" 
> "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
> 
> 66.249.66.34 - - [12/Oct/2005:14:09:14 +0100] 
> "GET /links.html?change-category=51&resource-name=link HTTP/1.1" 404 208 "-" 
> "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
> 
> In the shallowsea.com log I find this:
> 
> 66.249.66.34 - - [12/Oct/2005:14:07:09 +0100] 
> "GET /events.html?change-category=60&resource-name=event HTTP/1.1" 200 7073 
> "-" "Mozilla/5.0 (compatible; Googlebot/2.1; 
> +http://www.google.com/bot.html)"
> 
> 66.249.66.34 - - [12/Oct/2005:14:09:50 +0100] 
> "GET /links.html?change-category=54&resource-name=link HTTP/1.1" 200 7547 "-" 
> "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
> 
> Note the times of these requests. It is fairly obvious that the Googlebot is 
> trying to index shallowsea.com but for some reason about half the requests 
> are going to the wrong domain.
> 
> Has anyone got any idea what might be going on here? I'm perfectly happy to 
> accept that there is some header that I should be sending back that I am not 
> but that doesn't feel like it's the problem as some requests seem to get 
> through fine.
> 
> Many thanks,
> 
> Graham
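
Before changing anything, it may be worth confirming whether the Host
header really is missing. The entries you quote show HTTP/1.1 requests,
and HTTP/1.1 requires a Host header, so I would want to see what the bots
actually send. A rough sketch, assuming mod_log_config is loaded; the
format nickname "hostdebug" and the log file path are names I made up,
while %v (the ServerName of the vhost that served the request) and
%{Host}i (the Host request header) are standard format strings.

```apache
# Log the serving vhost next to the Host header the client sent.
# A missing header is logged as "-".
LogFormat "%v \"%{Host}i\" %h %t \"%r\" %>s %b \"%{User-Agent}i\"" hostdebug

# Hypothetical file name. A vhost with its own CustomLog does not write
# to the main server's log, so repeat this line inside each such
# <VirtualHost>.
CustomLog /var/log/apache2/host_debug.log hostdebug
```

If the misdirected requests show a Host value that differs from the vhost
that served them, the cause is more likely in how the ServerName and
ServerAlias entries match than in a missing header.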

Simo
-- 
:r ~/.signature
