Search Engines and aMember

Discussion in 'Customization & add-ons' started by bgetting, Jan 30, 2006.

  1. bgetting

    bgetting New Member

    Joined:
    Jan 28, 2006
    Messages:
    11
    I am curious if search engine spiders are able to access content that is protected with aMember.

    It appears that the system works on session variables (just a guess), and I was hoping that someone could provide me with a way of allowing the search engine spiders to index those pages.

    We have content that we want to protect but still have indexed. Thank you for any help that anyone can provide.

    Thanks.
  2. alex

    alex Administrator Staff Member

    Joined:
    Jan 24, 2004
    Messages:
    6,020
    It is possible to allow access by user-agent. The following lines are in
    amember/plugins/protect/php_include/check.inc.php

    if (preg_match('/^googlebot/i', $_SERVER['HTTP_USER_AGENT']))
        return;

    You may uncomment them, and Google will be allowed to open your pages without login.
  3. bgetting

    Excellent!

    Thanks for another speedy reply. Hopefully these are questions that other people might have in the future.
  4. bgetting

    Other agents

    I am assuming that I can copy and paste that code and change the user-agent to allow other robots. Would you mind showing examples for Yahoo, MSN, and any other popular ones? That would help me out a lot.

    Thanks.
  5. alex

    Of course you can. However, I'm not sure which user_agent values they are using.
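
    A minimal sketch of how several crawlers could be allowed with one check, assuming the tokens googlebot, slurp, and msnbot appear somewhere in the respective User-Agent headers (an assumption to verify against each search engine's own documentation). The function name is_allowed_crawler is hypothetical and not part of aMember. It uses a substring match rather than the ^ anchor, because a crawler's header may begin with another string such as Mozilla/5.0.

```php
<?php
// Hypothetical helper, not part of aMember: returns true when the
// User-Agent header contains one of the listed crawler tokens.
// The default token list is an assumption, not a verified list.
function is_allowed_crawler(string $userAgent, array $tokens = ['googlebot', 'slurp', 'msnbot']): bool
{
    foreach ($tokens as $token) {
        // stripos() is a case-insensitive substring search, so the token
        // may appear anywhere in the header, not only at the start.
        if (stripos($userAgent, $token) !== false) {
            return true;
        }
    }
    return false;
}

// Possible use in place of the single preg_match() check:
//     $ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
//     if (is_allowed_crawler($ua))
//         return;
```

    Like the original check, this trusts whatever the client sends in the header, so any visitor can pass it by changing the browser's user-agent.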
  6. bgetting

    Spiders are NOT given access...

    I did some testing on this, and it does not appear to work properly. Using Firefox, I would change my user-agent to see what would happen when trying to access protected content.

    By uncommenting the line that is included in the script, which checks for a match for "googlebot", then using "googlebot" as a user-agent, I receive the following error:

    Code:
    Warning: Invalid argument supplied for foreach() in /home/practica/public_html/membership/plugins/protect/new_rewrite/new_rewrite.inc.php on line 12
    
    Warning: Cannot modify header information - headers already sent by (output started at /home/practica/public_html/membership/plugins/protect/new_rewrite/new_rewrite.inc.php:12) in /home/practica/public_html/membership/plugins/protect/new_rewrite/login.php on line 37
    I then put in another preg_match to check for "Slurp". When I set my user-agent to "Slurp" and try to access protected content, I get an error from the browser that it is stuck in a redirect loop.

    Is there something that I can do? It seems to work fine for protection, but I absolutely NEED the spiders to be able to access that protected content. From what I can tell, it looks like it is trying to get SESSION information even when the user-agent is identified as a spider. Then again, I don't know the aMember system that well.

    Thank you for your help, I really need to get this working.
  7. bgetting

    I got it....

    In case it helps anyone, it wasn't working for me because I am using mod_rewrite to also hide dynamic URLs. The fix was to move the conditional code that checks for spider user-agents below the rewrite rule needed to hide the URLs. Then it worked fine. Using Firefox and changing the user-agent I am able to access protected content as a spider.
  8. steveg

    steveg New Member

    Joined:
    Feb 22, 2006
    Messages:
    11
    Doesn't this present one huge great big security problem? Anyone wanting to see the protected content on your site can simply switch the user agent in their browser and pretend to be a google spider.

    Does anyone else have a problem with this?

    Steve
  9. bgetting

    I haven't seen a problem

    You are right, there is a risk. For us, the value of having our protected content spidered by search engines is more important than the small number of people who might figure out what to set their user-agent to in order to gain access.

    If you really wanted to lock it down, you could modify the rewrite rule to be conditional for the IP address that is associated with each spider. I am not sure how to do this, but it would be possible to then associate the user-agent with the correct IP that each spider uses.
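
    One way this could be approached, rather than keeping a list of spider IP addresses, is a reverse DNS lookup on the visitor's IP followed by a forward lookup to confirm it. The sketch below is only that: the function names are hypothetical, the domain list passed in is an assumption to check against each search engine's documentation, and gethostbynamel() only returns IPv4 addresses.

```php
<?php
// Hypothetical helper: true when $host ends in one of the given
// domains on a dot boundary (so "evilgooglebot.com" does not pass).
function host_in_domains(string $host, array $domains): bool
{
    $host = strtolower(rtrim($host, '.'));
    foreach ($domains as $domain) {
        $suffix = '.' . strtolower($domain);
        if (substr($host, -strlen($suffix)) === $suffix) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: reverse lookup, domain check, then forward lookup.
function is_verified_crawler(string $ip, array $domains): bool
{
    // gethostbyaddr() returns the IP unchanged when the lookup fails
    // and false when the input is malformed.
    $host = gethostbyaddr($ip);
    if ($host === false || $host === $ip) {
        return false;
    }
    if (!host_in_domains($host, $domains)) {
        return false;
    }
    // The forward lookup must map the hostname back to the same IP,
    // otherwise the reverse record could have been set by anyone.
    $addresses = gethostbynamel($host);
    return $addresses !== false && in_array($ip, $addresses, true);
}

// Possible use, combined with the user-agent check:
//     if (is_verified_crawler($_SERVER['REMOTE_ADDR'], ['googlebot.com']))
//         return;
```

    Each check costs two DNS lookups, so it would make sense to run it only when the user-agent already claims to be a spider.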

    I have yet to see any abuse based on this, but I will post if it becomes an issue.
  10. teachnology

    teachnology New Member

    Joined:
    May 7, 2007
    Messages:
    14
    Sorry for pulling up an old thread.

    Let's say you enable this and let Google view your member site area. Wouldn't the cache on Google give users access to that content? I'm considering doing this.
  11. davidm1

    davidm1 aMember User & Partner

    Joined:
    May 16, 2006
    Messages:
    4,435
    Are you using a CMS system to handle your content or are you protecting basic html?

    In either case, I think you should only give the spiders "teaser text". That is a paragraph summarizing the content.

    David
  12. learnwake

    learnwake New Member

    Joined:
    Dec 19, 2007
    Messages:
    19
    I have uncommented the code as mentioned above for googlebot. I am mainly targeting Google, Yahoo, and MSN. After doing some research, I believe the Yahoo and MSN crawler names are yahoo! slurp and msnbot. So would I add this code below the googlebot code I just uncommented?

    if (preg_match('/^yahoo! slurp/i', $_SERVER['HTTP_USER_AGENT']))
        return;
    if (preg_match('/^msnbot/i', $_SERVER['HTTP_USER_AGENT']))
        return;
  13. skippybosco

    skippybosco CGI-Central Partner Staff Member

    Joined:
    Aug 22, 2006
    Messages:
    2,526
    Just to be clear, showing different text for search bots than to users is considered a bad thing and could lead to your site being removed from the search engine index.
  14. learnwake

    After doing some more digging around here on this bulletin board, I realized that the solution in this thread won't work for me because I am using the "new_rewrite" method of protection. So I've added this code to the .htaccess file in each of my protected directories. If anyone knows if I'm doing this right, I would appreciate the help...

    Under the code that says "RewriteEngine On" I put this snippet...

    #allow access for Google AdSense
    RewriteCond %{http_user_agent} ^Mediapartners-Google
    RewriteRule ^(.*)$ - [L]

    #allow access for Google
    RewriteCond %{http_user_agent} ^googlebot
    RewriteRule ^(.*)$ - [L]

    #allow access for Yahoo
    RewriteCond %{http_user_agent} ^yahoo! slurp
    RewriteRule ^(.*)$ - [L]

    #allow access for MSN
    RewriteCond %{http_user_agent} ^msnbot
    RewriteRule ^(.*)$ - [L]
  15. learnwake

    OK that broke my website. I started getting this error...

    Internal Server Error

    The server encountered an internal error or misconfiguration and was unable to complete your request.
