How to Let SEO Crawlers Bypass Cloudflare Security Rules

VegaMonika · December 26, 2024, 11:55am

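While configuring security rules for my site recently, I found that crawler filtering needed a fair amount of tuning through Cloudflare rules. To make sure SEO-related crawlers can reach the site normally while other non-essential traffic is still blocked, I put together a rule set that should suit most sites.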

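Rule goals

Allow mainstream search engine crawlers (such as Googlebot and Bingbot).
Support social media crawlers (such as FacebookExternalHit and TwitterBot).
Let common SEO tool crawlers (such as AhrefsBot and SemrushBot) through.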

Configuration steps

1. Log in to Cloudflare

Go to the Cloudflare dashboard and select your site.
Open "Security" > "WAF" (Web Application Firewall).

2. Create the rule

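Click "Create rule", choose Firewall rule (on newer dashboards this is a WAF custom rule), and add the expression using the following logic: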

Full expression (the contains operator is case-sensitive, so the User-Agent is lowercased with lower() before matching):

(cf.client.bot)
or (lower(http.user_agent) contains "duckduckgo")
or (lower(http.user_agent) contains "facebookexternalhit")
or (lower(http.user_agent) contains "feedfetcher-google")
or (lower(http.user_agent) contains "linkedinbot")
or (lower(http.user_agent) contains "mediapartners-google")
or (lower(http.user_agent) contains "msnbot")
or (lower(http.user_agent) contains "slackbot")
or (lower(http.user_agent) contains "twitterbot")
or (lower(http.user_agent) contains "ia_archive")
or (lower(http.user_agent) contains "yahoo")
or (lower(http.user_agent) contains "bingbot")
or (lower(http.user_agent) contains "googlebot")
or (lower(http.user_agent) contains "baiduspider")
or (lower(http.user_agent) contains "yandexbot")
or (lower(http.user_agent) contains "slurp")
or (lower(http.user_agent) contains "applebot")
or (lower(http.user_agent) contains "pinterestbot")
or (lower(http.user_agent) contains "ahrefsbot")
or (lower(http.user_agent) contains "semrushbot")
or (lower(http.user_agent) contains "mj12bot")
or (lower(http.user_agent) contains "dotbot")
or (lower(http.user_agent) contains "duckduckbot")
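To sanity-check the token list before deploying it, the User-Agent part of the expression can be emulated locally. The following is a minimal Python sketch, not Cloudflare's rule engine. TOKENS and matches_allowlist are hypothetical names, the comparison is assumed to run on a lowercased User-Agent, and cf.client.bot is left out because only Cloudflare can evaluate it.

```python
# Hypothetical local emulation of the User-Agent checks in the expression above.
# It does not cover cf.client.bot, which only Cloudflare can evaluate.
TOKENS = [
    "duckduckgo", "facebookexternalhit", "feedfetcher-google", "linkedinbot",
    "mediapartners-google", "msnbot", "slackbot", "twitterbot", "ia_archive",
    "yahoo", "bingbot", "googlebot", "baiduspider", "yandexbot", "slurp",
    "applebot", "pinterestbot", "ahrefsbot", "semrushbot", "mj12bot",
    "dotbot", "duckduckbot",
]


def matches_allowlist(user_agent: str) -> bool:
    """Return True if any token occurs in the lowercased User-Agent."""
    ua = user_agent.lower()
    return any(token in ua for token in TOKENS)


if __name__ == "__main__":
    samples = [
        # Crawler-style string.
        "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
        # Simplified browser-like string, used only as a negative example.
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0 Safari/537.36",
    ]
    for ua in samples:
        print(matches_allowlist(ua), ua)
```

Running the same sample strings through a case-sensitive comparison is a quick way to see which tokens depend on exact casing.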

What each allowed crawler does

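Googlebot: Google's web crawler, which indexes page content for display in search results.
Bingbot: the crawler for Microsoft's Bing search engine.
Mediapartners-Google: Google's AdSense crawler, which analyzes page content to decide which ads to show.
Feedfetcher-Google: fetches RSS or Atom feeds for Google services (for example, Google News).
DuckDuckBot: the crawler for the DuckDuckGo search engine.
YandexBot: the crawler for the Russian search engine Yandex.
Baiduspider: the crawler for the Baidu search engine.
Sogou: the crawler for the Sogou search engine. (It has no clause in the expression above; add one if you need it.)
FacebookExternalHit: Facebook's content preview tool, which fetches metadata for shared links.
TwitterBot: the crawler Twitter uses to generate link previews.
LinkedInBot: LinkedIn's content preview crawler.
PinterestBot: fetches page content for Pinterest to generate Pins.
AhrefsBot: the crawler for the Ahrefs SEO tool, used to analyze backlinks and site structure.
SemrushBot: the crawler for the Semrush SEO tool, used for SEO and traffic analysis.
MJ12bot: the crawler for the Majestic SEO tool, used for backlink data analysis.
DotBot: Moz's crawler, used for site analysis.
Slackbot: the crawler Slack uses to generate link previews in messages.
ia_archiver: the Internet Archive's crawler, used to archive web pages (Wayback Machine).
AdsBot-Google: Google's ads crawler, which checks the load speed and quality of ad landing pages. (It has no clause in the expression above either.)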

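3. Set the rule action

Action: Allow. (Allow belongs to legacy firewall rules; in the WAF custom rules that replaced them, the equivalent action is Skip.)
Rule name: anything you like, for example "SEO crawler allow rule".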

4. Save and test

After saving the rule, use a tool or curl to send requests with common crawler User-Agents and check whether the rule takes effect.
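As an alternative to running curl by hand, the check in step 4 can be scripted. This is a minimal sketch using only the Python standard library. TARGET_URL is a placeholder to replace with your own site, the User-Agent strings are representative examples, and build_request and fetch_status are hypothetical helper names.

```python
import urllib.error
import urllib.request

# Placeholder: replace with a URL on your own site.
TARGET_URL = "https://example.com/"

# Representative crawler User-Agent strings, used here only as test inputs.
CRAWLER_USER_AGENTS = {
    "Googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "bingbot": "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)",
    "AhrefsBot": "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",
}


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Create a GET request that presents the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the site answers with for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 403 are raised as HTTPError; report the code.
        return err.code


if __name__ == "__main__":
    for name, ua in CRAWLER_USER_AGENTS.items():
        print(f"{name}: HTTP {fetch_status(TARGET_URL, ua)}")
```

Because these requests only imitate the User-Agent header, they exercise the User-Agent clauses of the rule, not (cf.client.bot).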

FAQ

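Q: Why use (cf.client.bot)?
A: Cloudflare's built-in (cf.client.bot) field already covers many mainstream crawlers, so checking it first reduces the amount of extra configuration. It matches bots that Cloudflare recognizes as known good bots, whereas a User-Agent substring can be sent by any client.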

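Q: How do I check whether the rule is working?
A: Check the firewall log in the Cloudflare dashboard (Security > Events) and confirm that the rule is being matched.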

Q: How do I extend the rule?
A: If a new crawler is not being recognized, find its User-Agent in the logs and add it to the rule manually.
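When the list grows, it can be easier to keep the tokens in one place and generate the expression text than to edit it by hand. The following is a minimal sketch. build_expression is a hypothetical helper, and lowercasing each token while wrapping the field in lower() is a design choice of this sketch so that matching does not depend on casing.

```python
def build_expression(tokens: list[str]) -> str:
    """Join (cf.client.bot) with one lowercase substring check per token."""
    clauses = ["(cf.client.bot)"]
    seen = set()
    for token in tokens:
        needle = token.strip().lower()
        if not needle or needle in seen:
            continue  # skip blanks and duplicates
        if '"' in needle or "\\" in needle:
            # Keep the generated string literal simple and unambiguous.
            raise ValueError(f"unsupported character in token: {token!r}")
        seen.add(needle)
        clauses.append(f'(lower(http.user_agent) contains "{needle}")')
    return "\nor ".join(clauses)


if __name__ == "__main__":
    # Paste the printed text into the rule's expression editor.
    print(build_expression(["Googlebot", "bingbot", "Sogou"]))
```

The generated text still needs to be pasted into the expression editor and tested as in step 4.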

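I hope this post helps everyone manage crawler traffic better and improve their site's SEO and performance. If you have any questions or additions, please share them in the comments.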

honeymoose · December 26, 2024, 1:40pm

Thank you very much for such a valuable post.