On Fri, Oct 14, 2005 at 10:35:44AM +0100, Graham Smith wrote:
> Sorry this request for help is a bit off topic for this group but I am really
> stuck and could do with some help. If you can't help but know where I might
> be able to get help I would appreciate a pointer in the right direction.
>
> I run a few sites off one static IP address using virtual hosting. Some sites
> (www.crazysquirrel.com, www.ruralescapes.co.uk and www.shallowsea.com) are
> Java based and use the Apache Tomcat connector. Others, such as
> blog.crazysquirrel.com are php based and hosted straight out of Apache (I'm
> running Apache 2.0 on Debian). All the sites appear to work just fine. There
> don't appear to be any problems with people navigating around them.
>
> The problem is with search engines such as Yahoo Slurp and Googlebot. A large
> number of requests for pages that are in one of the other domains are ending
> up at blog.crazysquirrel.com. My best guess is that for some reason Slurp and
> Googlebot are making requests but leaving off the Host header. Now this
> wouldn't be completely out of spec because they are making HTTP 1.0 requests
> and as such don't require a Host header. I would have expected, however, that
> every request from them would come with one since virtual hosting is now so
> common. It briefly crossed my mind that it was simply a probe to detect
> virtual hosting but there are way too many requests going astray (more go
> astray than the real sites actually get) therefore I conclude something must
> be wrong.

If the problem is what you think it is, you might want to try the
compatibility workaround for older browsers, which uses the ServerPath
directive as described in
http://httpd.apache.org/docs/2.0/vhosts/name-based.html#compat

> A little more digging seems to indicate that the search bots are able to load
> the first page (say http://www.shallowsea.com/index.html) but then start
> screwing it up when trying to access the links they find in that page. For
> example here is a little snippet of log file from yesterday for
> blog.crazysquirrel.com. These are page requests that should have gone to
> shallowsea.com
>
> 66.249.66.34 - - [12/Oct/2005:14:08:47 +0100]
> "GET /events.html?change-category=7&resource-name=event HTTP/1.1" 404 209 "-"
> "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>
> 66.249.66.34 - - [12/Oct/2005:14:09:14 +0100]
> "GET /links.html?change-category=51&resource-name=link HTTP/1.1" 404 208 "-"
> "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>
> in the shallowsea.com log I find this:
>
> 66.249.66.34 - - [12/Oct/2005:14:07:09 +0100]
> "GET /events.html?change-category=60&resource-name=event HTTP/1.1" 200 7073
> "-" "Mozilla/5.0 (compatible; Googlebot/2.1;
> +http://www.google.com/bot.html)"
>
> 66.249.66.34 - - [12/Oct/2005:14:09:50 +0100]
> "GET /links.html?change-category=54&resource-name=link HTTP/1.1" 200 7547 "-"
> "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
>
> note the times of these requests. It is fairly obvious that the Googlebot is
> trying to index shallowsea.com but for some reason about half the requests
> are going to the wrong domain.
>
> Has anyone got any idea what might be going on here? I'm perfectly happy to
> accept that there is some header that I should be sending back that I am not
> but that doesn't feel like it's the problem as some requests seem to get
> through fine.
>
> Many thanks,
>
> Graham

Simo

--
:r ~/.signature
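
For reference, here is a rough sketch of what the ServerPath setup mentioned
above might look like on Apache 2.0. It is only an illustration and rests on
assumptions: the DocumentRoot paths and the /shallowsea prefix are made up, the
`*:80` address is a guess at how the vhosts are declared, and it supposes that
blog.crazysquirrel.com is the first vhost listed (the first name-based vhost is
the one Apache falls back to when no Host header matches).

```apache
# Assumed address; use whatever your existing NameVirtualHost line has.
NameVirtualHost *:80

# Assumed to be the first (default) vhost: requests that arrive
# without a usable Host header end up here.
<VirtualHost *:80>
    ServerName blog.crazysquirrel.com
    DocumentRoot /var/www/blog            # hypothetical path
</VirtualHost>

<VirtualHost *:80>
    ServerName www.shallowsea.com
    DocumentRoot /var/www/shallowsea      # hypothetical path
    # Any request whose URI begins with /shallowsea is served by this
    # vhost, even if the client sent no Host header.
    ServerPath /shallowsea                # hypothetical prefix
</VirtualHost>
```

For this to help, the page served by the default vhost would need links that
carry the prefix (for example http://www.shallowsea.com/shallowsea/), and the
pages of the site itself would need to use relative links so the prefix is
preserved, as the compatibility section of the vhosts documentation explains.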