What Is Internal Search Query Spam?
Internal search query spam occurs when automated bots, scrapers, or misconfigured crawlers repeatedly submit queries through your site's built-in search function — generating a continuous stream of unique URLs built around a search parameter such as ?q=, ?s=, or ?search=. On Blogger specifically, every search generates a live URL in the form:
https://www.seosiri.com/search?q=some+injected+query+string
https://www.seosiri.com/search/label/SEO?q=another+payload
Each of these URLs is technically a unique page. Most contain no original content — only a thin wrapper of your template around a filtered subset of posts or, worse, a completely empty result set. The problem compounds: automated spammers cycle through thousands of query strings, meaning a single attack session can manufacture tens of thousands of unique crawlable URLs overnight.
This is categorically different from legitimate user searches on your site. Real user searches are low-volume, contextually relevant, and can provide genuine UX value. Spam queries are high-volume, semantically meaningless, and generated with the intent to consume resources — whether to degrade a competitor's infrastructure, to test for injection vulnerabilities, or simply as collateral noise from misdirected mass crawlers.
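The URL patterns described above can be recognized programmatically, which is useful when auditing logs or crawl exports. The following is a minimal sketch, assuming the three parameter names mentioned earlier (q, s, search) and a /search path prefix; the function name isInternalSearchUrl is hypothetical and not part of any platform API.

```typescript
// Query parameters the article names as common internal-search parameters.
const SEARCH_PARAMS = ["q", "s", "search"];

// Returns true when a URL looks like an internal search result page:
// either its path is /search (or sits below it) or it carries a search parameter.
function isInternalSearchUrl(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false; // not an absolute URL, so there is nothing to classify
  }
  const onSearchPath =
    url.pathname === "/search" || url.pathname.startsWith("/search/");
  const hasSearchParam = SEARCH_PARAMS.some((name) => url.searchParams.has(name));
  return onSearchPath || hasSearchParam;
}

const sampleUrls = [
  "https://www.seosiri.com/search?q=some+injected+query+string",
  "https://www.seosiri.com/search/label/SEO?q=another+payload",
  "https://www.seosiri.com/p/services.html",
];
for (const sample of sampleUrls) {
  console.log(isInternalSearchUrl(sample), sample);
}
```

Matching on the path prefix and the parameter name separately keeps ordinary pages such as /searching-tips.html out of the result, since they neither sit under /search/ nor carry a search parameter.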
Why Does Internal Search Query Spam Happen?
Understanding the origin is essential to selecting the right mitigation layer. There are four primary attack vectors:
1. Undiscriminating Mass Crawlers
Generic crawlers that index the entire web without respecting scope follow every link they encounter, including pagination links and search result links. When they hit a search URL, they often append additional query variations, creating exponential URL multiplication. These crawlers may be operated by data aggregators, SEO tools, or grey-market intelligence services.
2. Targeted Scraping Operations
Competitors or black-hat operators deliberately hammer site search to discover your content taxonomy, internal linking structure, or keyword density patterns. The search parameter is a convenient API-like interface that returns structured, filtered content without requiring authenticated access.
3. Malicious Bot Networks
Distributed bot networks — often operating through residential IP proxies — submit spam queries with the intent to generate server load, exhaust bandwidth allocations, or trigger rate limits that degrade legitimate user experience. On shared hosting or platforms with resource caps (like Blogger's underlying infrastructure), this can have measurable performance consequences.
4. AI Training Crawlers
The newest and fastest-growing vector: large-scale AI training crawlers including GPTBot (OpenAI), ClaudeBot (Anthropic), Bytespider (ByteDance), and Applebot-Extended follow links into search result pages without understanding that those pages contain no original, citable content. They waste their crawl allocation on your domain — and potentially contaminate AI training corpora with thin, de-contextualized content attributed to your brand.
Blogger templates that use the legacy variable <data:blog.searchQuery/> in the title tag add a further problem. On search result pages, this variable can fail to render correctly in certain Blogger rendering engine versions, producing a broken or empty <title> tag. This is not just a cosmetic bug: Google has explicitly noted that missing title tags can cause it to generate its own titles, losing your brand control over SERP presentation.
Full Impact Analysis: What Search Query Spam Actually Costs You
The damage is not contained to a single channel. Internal search query spam creates cascading, interconnected harm across your technical infrastructure, analytics intelligence, brand reputation, and search performance — simultaneously.
- Bots that land on empty search result pages register immediate exits, artificially inflating bounce rate in GA4 and corrupting engagement rate benchmarks.
- Bot sessions show near-zero average engagement time, pulling down your site-wide metrics and misrepresenting content quality to stakeholders reviewing dashboards.
- Each spam query creates a unique crawlable URL. Googlebot's crawl budget for your domain is finite, so pages wasted on thin search results mean real content goes uncrawled.
- AI crawlers indexing thin search result pages dilute your brand's authoritative signal in generative engines, reducing citation probability in GEO-targeted queries.
"A sudden traffic spike from a specific region, or simultaneous spikes across multiple regions, that persists rather than subsides is a permanent diagnostic indicator of search query spam traffic. Legitimate audience growth ramps gradually and correlates with content or campaign activity; spam spikes are abrupt, geographically incoherent, and self-sustaining."
Momenul Ahmad · Founder & SEO Strategist, SEOSiri | Chatbeat-Certified AI Search Optimization Expert
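The "abrupt and persistent" pattern in the quote above can be checked against a daily session export. This is a rough sketch under stated assumptions: the 7-day baseline, the 3x multiplier, and the 3-day minimum run are illustrative thresholds chosen for the example, not values from GA4 or from the article, and findPersistentSpike is a hypothetical helper name.

```typescript
// Returns the index of the first day that starts a persistent spike, or null.
// A persistent spike is a run of `minRunDays` consecutive days that each exceed
// `factor` times the mean of the `baselineDays` days immediately before the run.
function findPersistentSpike(
  dailySessions: number[],
  baselineDays = 7,
  factor = 3,
  minRunDays = 3,
): number | null {
  for (let start = baselineDays; start + minRunDays <= dailySessions.length; start++) {
    const baseline = dailySessions.slice(start - baselineDays, start);
    const mean = baseline.reduce((sum, n) => sum + n, 0) / baselineDays;
    const run = dailySessions.slice(start, start + minRunDays);
    if (mean > 0 && run.every((n) => n > factor * mean)) {
      return start;
    }
  }
  return null;
}

// Illustrative numbers only: a stable week followed by a sustained jump.
const sustained = [100, 110, 95, 105, 100, 98, 102, 900, 950, 1000];
// A single-day blip that subsides should not be flagged.
const blip = [100, 100, 100, 100, 100, 100, 100, 900, 100, 100];

console.log(findPersistentSpike(sustained));
console.log(findPersistentSpike(blip));
```

Requiring several consecutive elevated days is what separates a self-sustaining spam pattern from a one-day campaign bump; the thresholds would need tuning against your own traffic history.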
GA4 Data Distortion: The Analytics Mismatch Problem
This is the impact that most site owners discover first — not because it is the most severe, but because it is visible in dashboards. When search spam inflates your session counts, your GA4 reports produce a fundamentally unreliable picture of site health:
- User counts are inflated — spam sessions appear as unique users if they arrive from distinct IP addresses, making growth metrics misleading.
- Traffic source attribution breaks — spam sessions arriving directly via URL manipulation show as Direct traffic, masking your actual channel performance ratios.
- Engagement rate collapses: GA4 counts a session as engaged when it lasts longer than 10 seconds, triggers a key event, or includes 2+ page views, so the rate plummets when high volumes of instant-exit bot sessions are included.
- Conversion funnel contamination — if any bot session accidentally triggers a goal event (e.g., reaching a URL pattern that matches a conversion), your conversion rate becomes statistically meaningless.
- Cohort analysis is poisoned — longitudinal audience cohorts that include bot users will show abnormal retention patterns, making it impossible to accurately measure real user lifecycle behavior.
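The engagement-rate distortion in the list above is simple arithmetic, and a worked example makes the scale visible. The sketch below uses made-up session counts purely for illustration (they are not measurements from any property) and assumes bot sessions never qualify as engaged.

```typescript
interface SessionTotals {
  sessions: number;
  engagedSessions: number;
}

// Engagement rate as GA4 defines it: engaged sessions divided by all sessions.
function engagementRate(totals: SessionTotals): number {
  return totals.sessions === 0 ? 0 : totals.engagedSessions / totals.sessions;
}

// Bot sessions exit instantly, so they add to the session count
// without adding any engaged sessions.
function withBotTraffic(real: SessionTotals, botSessions: number): SessionTotals {
  return {
    sessions: real.sessions + botSessions,
    engagedSessions: real.engagedSessions,
  };
}

// Hypothetical month: 1,000 real sessions, 600 of them engaged.
const realTraffic: SessionTotals = { sessions: 1000, engagedSessions: 600 };
// The same month with 4,000 spam sessions mixed in.
const reportedTraffic = withBotTraffic(realTraffic, 4000);

console.log(engagementRate(realTraffic));     // 0.6
console.log(engagementRate(reportedTraffic)); // 0.12
```

In this illustration the real audience behaves identically in both cases, yet the reported engagement rate falls from 60% to 12% because the denominator quintupled.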
Brand Visual Impression Damage
This is the underappreciated dimension of search query spam — its direct effect on how your brand is perceived, both by human visitors and by algorithmic systems that evaluate brand signals.
Empty search result pages are brand anti-patterns. When a legitimate user occasionally triggers a search that returns no results, that is acceptable if the experience is graceful. When bots generate thousands of these pages and Google indexes some of them, those empty-result pages can appear in SERPs for branded or navigational queries. A user searching for your brand name and landing on an empty, template-wrapper page with no content receives the worst possible first impression: a site that appears broken, thin, and untrustworthy.
Broken title tags amplify the damage. The Blogger template bug (<data:blog.searchQuery/>) means indexed search pages may appear in Google's index with malformed or empty titles. Google then auto-generates a title from on-page content — which, on an empty result page, may be nothing but navigation elements. The SERP result looks unprofessional, reducing click-through rate even when the page accidentally ranks.
Core Web Vitals contamination. If bot traffic to search pages is captured in the Chrome User Experience Report (CrUX), which powers Google's real-world CWV data, your domain's performance profile may include these pages. Thin search result pages that load fast but deliver zero value do not help — and if bot sessions trigger unusual load patterns, they can skew field data.
Google Crawl Budget Waste: The Compound SEO Cost
Google allocates a crawl budget to every domain — a combination of crawl rate limit (how fast Googlebot can crawl without stressing your server) and crawl demand (how much Google wants to crawl based on perceived freshness and authority). Spam-generated search URLs directly degrade both dimensions:
- Crawl rate suppression: If your server responds slowly to the flood of spam-generated requests (even if Google is not generating them — the server load from spambots makes legitimate Googlebot crawls slower), Google's adaptive crawl rate algorithm may reduce its crawl frequency.
- Crawl demand dilution: Google uses crawl demand signals including link equity, freshness, and indexed page quality. Thousands of thin search pages with no inbound links or engagement signals tell Google that a large portion of your site is low-value — suppressing crawl demand for your genuinely valuable content.
- Index bloat: If search pages are not properly blocked and get indexed, they consume index slots. Google has confirmed that index bloat — having many low-quality pages indexed — can suppress the overall indexing priority of high-quality pages on the same domain.
Learn how Googlebot, GPTBot, and Bing's AI crawler allocate discovery budgets and how your technical architecture — including robots.txt, sitemap structure, and internal linking — directly controls which content gets surfaced in both traditional SERPs and generative AI responses. Read the SEOSiri Automated Sitemap → LLMs.txt Loop Architecture →
Technical Solutions: The Complete Mitigation Stack
Effective mitigation requires defense in depth — no single layer is sufficient. The following stack addresses the problem at every level from server edge to template code.
Layer 1: robots.txt — Crawl Directive Control
The robots.txt file is your first-line crawl control mechanism. For Blogger, your robots.txt is managed under Settings → Crawlers and indexing → Custom robots.txt (older dashboards label this section Search Preferences). A properly configured file should include:
User-agent: *
Disallow: /search
Disallow: /search/
Disallow: /?q=
Disallow: /search?q=
Disallow: /search/label/*?q=
# Block known AI training crawlers from all search pages
User-agent: GPTBot
Disallow: /search
Disallow: /search/
User-agent: ClaudeBot
Disallow: /search
Disallow: /search/
User-agent: Bytespider
Disallow: /search
Disallow: /search/
User-agent: CCBot
Disallow: /search
Disallow: /search/
User-agent: Applebot-Extended
Disallow: /search
Disallow: /search/
# Googlebot follows this group instead of the * group, so /search is repeated here
User-agent: Googlebot
Disallow: /search
Allow: /
Sitemap: https://www.seosiri.com/sitemap.xml
# llms.txt is not a sitemap format, so it is not declared with a Sitemap line
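Before publishing a robots.txt like the one above, it helps to check how a compliant crawler would read it. The following is a simplified sketch of the matching logic (most specific user-agent group, longest matching rule, Allow winning ties, with * and $ wildcards). It is an approximation written for this example, not a full RFC 9309 implementation: it does not merge repeated groups for the same agent or handle percent-encoding. The names parseRobots and isAllowed are hypothetical.

```typescript
type Rule = { allow: boolean; pattern: string };
type Group = { agents: string[]; rules: Rule[] };

// Split a robots.txt file into user-agent groups with their Allow/Disallow rules.
function parseRobots(text: string): Group[] {
  const groups: Group[] = [];
  let current: Group | null = null;
  let lastWasAgent = false;
  for (const rawLine of text.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // drop comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();
    if (field === "user-agent") {
      // Consecutive User-agent lines share one group; otherwise start a new one.
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === "allow" || field === "disallow") && current) {
      if (value !== "") current.rules.push({ allow: field === "allow", pattern: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// Translate a robots.txt path pattern (* wildcard, optional trailing $) to a RegExp.
function patternToRegex(pattern: string): RegExp {
  const anchored = pattern.endsWith("$");
  const body = (anchored ? pattern.slice(0, -1) : pattern)
    .split("*")
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp("^" + body + (anchored ? "$" : ""));
}

// Decide whether `userAgent` may fetch `pathAndQuery` under the parsed rules.
function isAllowed(groups: Group[], userAgent: string, pathAndQuery: string): boolean {
  const ua = userAgent.toLowerCase();
  const group =
    groups.find((g) => g.agents.includes(ua)) ?? groups.find((g) => g.agents.includes("*"));
  if (!group) return true;
  let best: Rule | null = null;
  for (const rule of group.rules) {
    if (!patternToRegex(rule.pattern).test(pathAndQuery)) continue;
    const longer = !best || rule.pattern.length > best.pattern.length;
    const tieWonByAllow = !!best && rule.pattern.length === best.pattern.length && rule.allow;
    if (longer || tieWonByAllow) best = rule;
  }
  return best ? best.allow : true;
}

// A trimmed copy of the rules above, used only to exercise the checker.
const robotsSample = `
User-agent: *
Disallow: /search
Disallow: /?q=

User-agent: Googlebot
Disallow: /search
Allow: /
`;
const parsed = parseRobots(robotsSample);
console.log(isAllowed(parsed, "Googlebot", "/search?q=spam"));
console.log(isAllowed(parsed, "Googlebot", "/2026/05/automated-sitemap-llms-txt-loop.html"));
```

Running a handful of real spam URLs and real article URLs through a checker like this catches the common mistake of a rule that is broader or narrower than intended.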
Layer 2: XML Sitemap — Signal Purity
Your XML sitemap communicates to search engines which URLs you consider canonical and indexing-worthy. Search result pages should never appear in your sitemap. For Blogger, the platform auto-generates a sitemap at /sitemap.xml — validate it regularly in Google Search Console's Sitemap report.
Key sitemap hygiene rules for Blogger:
- Submit https://www.seosiri.com/sitemap.xml as your primary sitemap in GSC.
- Verify that no URLs containing ?q=, /search, or label-filtered search combinations appear in the sitemap report.
- If using a custom Blogger sitemap template, add explicit exclusion logic for the b:if cond='data:blog.pageType == "index" and data:blog.searchQuery' condition.
- Monitor the "Excluded" report in GSC's Pages section monthly. A spike in "Crawled - currently not indexed" entries with /search paths confirms that Googlebot is actively crawling spam URLs.
The SEOSiri Automated Sitemap Loop ensures your sitemap and LLMs.txt stay synchronized — so as new content publishes, both your XML sitemap for Googlebot and your LLMs.txt for AI crawlers update automatically. Explore the full architecture →
Layer 3: LLMs.txt — AI Crawler Grounding Control
LLMs.txt is a proposed convention for communicating your content architecture to AI training and retrieval crawlers. Unlike robots.txt, which uses a Disallow/Allow binary, LLMs.txt provides semantic context about what your content represents and how it should be used in AI-generated responses.
For internal search query spam specifically, your LLMs.txt must explicitly exclude search result pages from the content surface area you expose to AI systems:
# SEOSiri LLMs.txt — AI Crawler Grounding Manifest
# https://www.seosiri.com/p/llmstxt.html
# Generated: 2026-06-28
# Site Identity
> SEOSiri is a B2B Digital Engineering Consultancy and AI Search Intelligence
> platform specializing in GEO, AEO, Technical SEO, and B2B Digital PR.
# Authorized AI-readable content
## Core Technical Guides
- https://www.seosiri.com/2026/05/automated-sitemap-llms-txt-loop.html
- https://www.seosiri.com/2026/06/web-architecture-seo-aeo-geo-ai-shift.html
- https://www.seosiri.com/2026/06/automated-llmstxt-centralized-sitemap-guide.html
## Services & Methodology
- https://www.seosiri.com/p/services.html
- https://www.seosiri.com/p/about.html
# Explicitly excluded from AI training and retrieval
## Search result pages — contain no original content
- https://www.seosiri.com/search/*
- https://www.seosiri.com/search?*
# Citation & Usage Policy
> Content may be cited with attribution to SEOSiri (seosiri.com).
> Do not reproduce full article text. Link to canonical source.
> Contact: info@seosiri.com
This LLMs.txt configuration signals to AI systems that your search pages are not citable content, which helps protect your Share of Authority (SoA) in generative engine outputs. Support is voluntary and still uneven across crawlers, but AI systems that do read LLMs.txt (including retrieval-augmented generation pipelines) can prioritize your canonical articles over thin search result pages when generating responses about your brand's topics.
Learn how to build, host, and maintain a production-grade LLMs.txt that maximizes your brand's Share of Authority in ChatGPT, Claude, Perplexity, and Google AI Overviews. Read the complete LLMs.txt guide →
Layer 4: Cloudflare WAF — Active Edge Enforcement
Where robots.txt advises, Cloudflare's Web Application Firewall enforces. WAF rules fire before a request reaches your origin server, consuming zero of your hosting resources. For Blogger sites proxied through Cloudflare (via a custom domain with Cloudflare as the DNS provider), this is the most powerful active mitigation layer available.
Rule 1: Challenge High-Threat-Score Search Queries
(http.request.uri.path contains "/search" and cf.threat_score gt 10)
Action: Managed Challenge
Rule 2: Block Known Bad Bot User-Agents on Search Paths
(http.request.uri.path contains "/search" and
http.user_agent contains "bot" and
not http.user_agent contains "Googlebot" and
not http.user_agent contains "bingbot")
Action: Block
Rule 3: Rate-Limit Search Requests Per IP
Rate Limit Rule:
Path: /search*
Threshold: 10 requests per 60 seconds per IP
Action: Block for 3600 seconds
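To make the behavior of Rule 3 concrete, here is a small sketch of a fixed-window limiter with the same numbers (10 requests per 60 seconds per IP, then a 3600-second block). It illustrates the counting logic only and is not how Cloudflare implements rate limiting internally; the class name SearchRateLimiter is hypothetical, and the clock is passed in as a parameter so the behavior can be tested deterministically.

```typescript
interface LimiterEntry {
  windowStart: number;  // ms timestamp when the current counting window began
  count: number;        // requests seen in the current window
  blockedUntil: number; // ms timestamp until which the IP stays blocked
}

class SearchRateLimiter {
  private entries = new Map<string, LimiterEntry>();
  private limit: number;
  private windowMs: number;
  private blockMs: number;

  constructor(limit = 10, windowMs = 60_000, blockMs = 3_600_000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.blockMs = blockMs;
  }

  // Returns true if the request may proceed, false if it should be rejected.
  allow(ip: string, now: number): boolean {
    let entry = this.entries.get(ip);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // Start a fresh window but carry over any block that is still active.
      entry = { windowStart: now, count: 0, blockedUntil: entry?.blockedUntil ?? 0 };
      this.entries.set(ip, entry);
    }
    if (now < entry.blockedUntil) return false;
    entry.count++;
    if (entry.count > this.limit) {
      entry.blockedUntil = now + this.blockMs;
      return false;
    }
    return true;
  }
}

// Usage: one request per second from a single (documentation-range) IP.
const limiter = new SearchRateLimiter();
for (let second = 0; second <= 10; second++) {
  console.log(second, limiter.allow("198.51.100.7", second * 1000));
}
```

The long block duration is the important design choice: a bot that trips the threshold once loses an hour, while a human who searches a few times a minute never reaches it.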
| Defense Layer | Stops Legitimate Bots | Stops Malicious Bots | Server Load | Complexity |
|---|---|---|---|---|
| robots.txt | Yes | No | None | Low |
| XML Sitemap exclusion | Partial | No | None | Low |
| LLMs.txt exclusion | AI crawlers | No | None | Low |
| Cloudflare WAF | Yes | Yes | Zero (edge) | Medium |
| GA4 Internal Filter | N/A — analytics only | Post-hoc cleanup | None | Low |
Layer 5: Blogger Template Fix — The Root Variable Error
Before any of the above layers, address the Blogger template bug that makes search pages even more damaging when they are crawled. In your Blogger theme XML, locate your <title> tag section. It likely contains:
<!-- BROKEN: legacy Blogger variable -->
<b:if cond='data:blog.pageType == "index"'>
<title><data:blog.pageTitle/></title>
<b:else/>
<b:if cond='data:blog.pageType == "searchQuery"'>
<title>Search results for: <data:blog.searchQuery/> | <data:blog.title/></title>
</b:if>
</b:if>
Replace with:
<!-- FIXED: current Blogger rendering engine variable -->
<b:if cond='data:blog.pageType == "index"'>
<title><data:blog.pageTitle/></title>
<b:else/>
<b:if cond='data:blog.pageType == "searchQuery"'>
<title>Search results for: <data:view.search.query/> | <data:blog.title/></title>
</b:if>
</b:if>
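Additionally, add a noindex meta tag for all search result pages so that any crawler that fetches them despite your robots.txt is told not to index them. Note that Googlebot reads this tag only on URLs it is allowed to crawl, so for Googlebot the robots.txt rule takes effect first: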
<b:if cond='data:blog.pageType == "searchQuery"'>
<meta content='noindex,nofollow' name='robots'/>
</b:if>
Layer 6: GA4 Configuration — Restoring Data Integrity
Even after blocking spam at the edge, your historical GA4 data will be contaminated. Forward-looking data hygiene requires proper GA4 configuration:
- Define Internal Traffic: Admin → Data Streams → [Your Stream] → Configure Tag Settings → Define Internal Traffic. Add all IP ranges associated with your office, team, and development environments.
- Activate Internal Traffic Filter: Admin → Data Filters → Create Filter → Internal Traffic → Activate (not just "Testing"). Until activated, the filter collects data but does not exclude it from reports.
- Create a Search Parameter Exclusion Filter: In GA4 Explorations, build a Custom Segment that excludes sessions where page_location contains /search?q=. Save this as your default exploration segment for all performance analyses.
- Annotate Your Historical Data: Use GA4's Annotations feature to mark the date you implemented spam blocking. This creates a visual reference point when comparing pre- and post-fix metrics, which is essential for stakeholder reporting.
- Audit Monthly in GSC: In Google Search Console, navigate to Pages → "Crawled - currently not indexed" and filter for paths containing /search. Declining counts confirm your blocking is working.
Expected outcome once the full stack is in place: declining crawl counts for /search paths; a measurable reduction in server request volume; and clean, reliable analytics data that actually reflects real user behavior.
Platform-Specific Protection: WordPress, Shopify, and Custom-Built Websites
The mitigation principles established above apply universally, but the implementation syntax, file locations, and tooling differ across platforms. The following covers the four most common environments beyond Blogger: WordPress, Shopify, custom PHP/Node.js sites, and headless/JAMstack architectures.
WordPress — Search Spam Protection Stack
WordPress uses ?s= as its native search parameter. The default WP search generates a unique URL for every query: https://example.com/?s=query+string. On high-traffic sites, this becomes a significant spam surface.
Step 1 — robots.txt via Yoast SEO or Rank Math:
# WordPress robots.txt additions
User-agent: *
Disallow: /?s=
Disallow: /search/
Disallow: /?s=*
User-agent: GPTBot
Disallow: /?s=
Disallow: /
User-agent: Bytespider
Disallow: /
Step 2 — noindex search pages via Yoast: In Yoast SEO → Search Appearance → Archives → Search Pages: set to No Index. In Rank Math: Titles & Meta → Search Results → Robots: noindex, nofollow. This prevents any search result pages from being indexed even if a bot bypasses robots.txt.
Step 3 — Disable search entirely (if not needed for UX):
// Add to functions.php or a site-specific plugin
add_action( 'template_redirect', function() {
if ( is_search() ) {
wp_redirect( home_url( '/' ), 301 );
exit;
}
});
Step 4 — .htaccess search lockdown (Apache):
# Deny every ?s= search request except those from an allow-listed IP.
# mod_rewrite cannot count requests, so this is a hard block rather than a rate limit.
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteCond %{QUERY_STRING} (^|&)s= [NC]
RewriteCond %{REMOTE_ADDR} !^(your\.office\.ip)$
RewriteRule ^ - [F,L]
</IfModule>
Step 5 — Plugin options: WP Cerber Security and Wordfence both offer bot-specific rate limiting and search parameter blocking without requiring manual htaccess edits. For Cloudflare-proxied WordPress, apply the same WAF rules documented in Layer 4 above.
- Exclude ?s= parameter URLs from your XML sitemap in Yoast → SEO → XML Sitemaps settings.
- In GA4, create a custom channel grouping that identifies page_location containing ?s= as a "Bot/Spam" channel for easy filtering in reports.
- If using WooCommerce, also protect ?post_type=product&s= search URLs, since product search is a popular secondary attack vector.
Shopify — Search Spam Protection Stack
Shopify's native search generates URLs at /search?q= and /search?q=*&type=product. Because Shopify manages robots.txt automatically via a liquid template, the configuration pathway is different from self-hosted platforms.
Step 1 — Customize robots.txt.liquid (Shopify 2.0 themes):
{% comment %} In robots.txt.liquid: output Shopify's default rules first, then add your own {% endcomment %}
{% for group in robots.default_groups %}
  {{- group.user_agent }}
  {%- for rule in group.rules -%}
    {{ rule }}
  {%- endfor -%}
  {%- if group.sitemap != blank -%}
    {{ group.sitemap }}
  {%- endif -%}
{% endfor %}

User-agent: GPTBot
Disallow: /search
Disallow: /search/

User-agent: Bytespider
Disallow: /search
Disallow: /
Step 2 — Cloudflare WAF (if proxied): The same WAF expression rules from Layer 4 above apply directly. For stores not using Cloudflare, Shopify's built-in bot protection under Online Store → Preferences → Bot Protection provides baseline coverage.
Step 3 — noindex via theme liquid: In your theme.liquid layout, which holds the <head> section for every template, add:
{% if template.name == 'search' %}
<meta name="robots" content="noindex, nofollow">
{% endif %}
Step 4 — GA4 filter for Shopify: Shopify's native analytics already separates bot traffic in most cases. In GA4, additionally create an audience exclusion filter for page_location containing /search?q= to remove any leaked bot sessions from your reports.
- Shopify's auto-generated sitemap (/sitemap.xml) does not include search URLs by default. Verify this periodically in Google Search Console.
- For Shopify Plus stores with custom storefronts, treat search endpoints as custom PHP/Node.js environments and implement server-side rate limiting accordingly.
- Check the Shopify Analytics → Acquisition → Sessions by referrer report for unusual direct-traffic spikes correlating with search URL patterns.
Custom PHP Sites — Search Spam Protection Stack
Custom PHP applications have the most flexibility — and the highest responsibility. Search spam protection must be implemented at the application layer, server config layer, and optionally the CDN/proxy layer.
Step 1 — Server-level rate limiting in Nginx:
# nginx.conf: rate limit search requests
# limit_req_zone belongs in the http {} context
limit_req_zone $binary_remote_addr zone=search_limit:10m rate=5r/m;

# The directives below belong inside your server {} context
location ~ ^/search {
    limit_req zone=search_limit burst=10 nodelay;
    limit_req_status 429;
    # Your existing PHP handler
    try_files $uri $uri/ /index.php?$args;
}

# Block common spam user-agent patterns
if ($http_user_agent ~* "(semrush|ahrefsbot|mj12bot|dotbot|sogou)") {
    return 403;
}
Step 2 — Application-layer validation in PHP:
<?php
// In your search handler
$query = trim($_GET['q'] ?? '');
// Reject queries that look like spam payloads
if (strlen($query) > 200 || preg_match('/[<>"\'%;()&+\\\\]/', $query)) {
http_response_code(400);
exit('Invalid search query.');
}
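// Rate limit by IP using APCu (fixed 60-second window)
$ip = $_SERVER['REMOTE_ADDR'];
$key = 'search_rate_' . md5($ip);
// Create the counter only if it does not exist yet, so the 60-second TTL is not extended on every hit
apcu_add($key, 0, 60);
// Atomic increment of the counter for this IP
$hits = apcu_inc($key);
if ($hits !== false && $hits > 10) {
    http_response_code(429);
    exit('Too many search requests.');
}
// Result: at most 10 searches per minute per IP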
Step 3 — robots.txt (served dynamically or as static file):
User-agent: *
Disallow: /search
Disallow: /search/
Disallow: /?q=
Disallow: /?s=
Disallow: /?search=
Step 4 — Meta noindex via PHP:
<?php if (isset($_GET['q']) || isset($_GET['s']) || isset($_GET['search'])): ?>
<meta name="robots" content="noindex, nofollow">
<?php endif; ?>
- Log all search queries server-side with IP and user-agent. Review weekly for anomalous patterns — sudden spikes from a single IP CIDR block are the most common spam signature.
- Implement CSRF tokens on search forms to prevent automated direct-URL submission. This does not block crawlers hitting search URLs directly but raises the barrier for form-based spam tools.
- Consider honeypot fields: a hidden input in the search form that real browsers leave empty. Bots that submit the form with the honeypot field filled get silently discarded.
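The weekly log review suggested above (looking for spikes from a single IP CIDR block) can be partly automated. The sketch below groups logged search requests by IPv4 /24 block and reports the blocks that exceed a threshold. The log entry shape, the /24 granularity, and the threshold are assumptions made for this example; IPv6 addresses are simply skipped.

```typescript
interface SearchLogEntry {
  ip: string;
  userAgent: string;
  query: string;
}

// Collapse an IPv4 address to its /24 block, e.g. "203.0.113.7" -> "203.0.113.0/24".
// Returns null for anything that is not a well-formed dotted-quad address.
function toSlash24(ip: string): string | null {
  const parts = ip.split(".");
  const valid =
    parts.length === 4 && parts.every((p) => /^\d{1,3}$/.test(p) && Number(p) <= 255);
  return valid ? `${parts[0]}.${parts[1]}.${parts[2]}.0/24` : null;
}

// Count search requests per /24 block and keep the blocks at or above `threshold`.
function noisyBlocks(entries: SearchLogEntry[], threshold: number): Map<string, number> {
  const counts = new Map<string, number>();
  for (const entry of entries) {
    const block = toSlash24(entry.ip);
    if (block === null) continue; // IPv6 and malformed addresses are ignored in this sketch
    counts.set(block, (counts.get(block) ?? 0) + 1);
  }
  return new Map(Array.from(counts).filter(([, hits]) => hits >= threshold));
}

// Illustrative log lines using documentation-reserved IP ranges.
const weeklyLog: SearchLogEntry[] = [
  { ip: "203.0.113.7", userAgent: "bot-a", query: "xk2j9" },
  { ip: "203.0.113.21", userAgent: "bot-a", query: "qq81z" },
  { ip: "203.0.113.200", userAgent: "bot-a", query: "m0pl4" },
  { ip: "198.51.100.4", userAgent: "Mozilla/5.0", query: "technical seo audit" },
];
console.log(noisyBlocks(weeklyLog, 3));
```

Blocks surfaced this way are candidates for review rather than automatic bans, since residential proxies and shared networks can place real visitors in the same range as a bot.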
Node.js / Next.js — Search Spam Protection Stack
Node.js and Next.js applications handle search queries through API routes or server-side rendering. Protection is implemented at the middleware layer, which runs before any rendering or database query occurs.
Step 1 — Next.js middleware rate limiting:
// middleware.ts
import { NextRequest, NextResponse } from 'next/server';

// In-memory counters: suitable for a single long-lived process only
const requestCounts = new Map<string, { count: number; reset: number }>();

export function middleware(req: NextRequest) {
  const url = req.nextUrl;
  // Only apply to search routes (page, API route, or any URL carrying ?q=)
  const isSearch =
    url.pathname.startsWith('/search') ||
    url.pathname.startsWith('/api/search') ||
    url.searchParams.has('q');
  if (!isSearch) {
    return NextResponse.next();
  }
  // x-forwarded-for can hold a comma-separated chain; the first entry is the client
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0].trim() ?? 'unknown';
  const now = Date.now();
  const windowMs = 60_000; // 1 minute
  const limit = 10;
  const record = requestCounts.get(ip);
  if (!record || now > record.reset) {
    requestCounts.set(ip, { count: 1, reset: now + windowMs });
    return NextResponse.next();
  }
  if (record.count >= limit) {
    return new NextResponse('Too Many Requests', { status: 429 });
  }
  record.count++;
  return NextResponse.next();
}

export const config = { matcher: ['/search/:path*', '/api/search/:path*'] };
Step 2 — noindex headers via Next.js:
// app/search/page.tsx
export async function generateMetadata({ searchParams }) {
return {
robots: { index: false, follow: false },
title: `Search results for: ${searchParams.q ?? ''}`
};
}
Step 3 — robots.txt via Next.js app router:
// app/robots.ts
export default function robots() {
return {
rules: [
{ userAgent: '*', disallow: ['/search', '/search/'] },
{ userAgent: 'GPTBot', disallow: ['/'] },
{ userAgent: 'Bytespider', disallow: ['/'] },
],
sitemap: 'https://yourdomain.com/sitemap.xml',
};
}
- For production Next.js, use Redis-backed rate limiting (e.g., @upstash/ratelimit) instead of in-memory Maps, because in-memory state does not persist across serverless function invocations.
- Exclude all /search* paths from your sitemap.ts generation logic explicitly, as dynamic route sitemaps can inadvertently include search URLs if not filtered.
- Use Vercel's Edge Config or Cloudflare Workers KV to share block lists across edge nodes for globally consistent bot blocking.
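The sitemap exclusion mentioned in the list above can be a single filter applied before URLs reach your sitemap generator. This is a minimal sketch assuming your sitemap logic already produces a list of absolute URLs; the helper name excludeSearchUrls and the yourdomain.com URLs are placeholders.

```typescript
// Drop internal search URLs from a list of absolute URLs before they are
// written to the sitemap. A URL is excluded when its path is /search, sits
// below /search/, or carries a ?q= parameter.
function excludeSearchUrls(urls: string[]): string[] {
  return urls.filter((raw) => {
    const url = new URL(raw);
    const isSearchPath =
      url.pathname === "/search" || url.pathname.startsWith("/search/");
    return !isSearchPath && !url.searchParams.has("q");
  });
}

// Placeholder URLs, standing in for whatever your route discovery returns.
const discoveredUrls = [
  "https://yourdomain.com/",
  "https://yourdomain.com/blog/technical-seo-audit",
  "https://yourdomain.com/search?q=spam+payload",
  "https://yourdomain.com/search/label/seo",
  "https://yourdomain.com/blog?q=injected",
];
console.log(excludeSearchUrls(discoveredUrls));
```

Filtering at the point where the list is built, rather than relying on each route to opt out, means a newly added dynamic route cannot leak search URLs into the sitemap by accident.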
JAMstack / Headless CMS — Search Spam Protection Stack
Headless architectures (Contentful + Next.js, Sanity + SvelteKit, Strapi + Nuxt, etc.) typically implement search via client-side JavaScript calling a search API (Algolia, Typesense, Elasticsearch). This shifts the attack surface from server-rendered URLs to API endpoints.
Step 1 — Protect the search API endpoint, not just the UI:
# Cloudflare WAF — protect headless search API
# Rule: Rate limit /api/search or Algolia proxy endpoint
(http.request.uri.path contains "/api/search" and
cf.threat_score gt 5)
Action: Managed Challenge
# Separate rule for high-volume API key abuse
(http.request.uri.path contains "/api/search" and
http.request.method eq "GET" and
not http.request.headers["authorization"][0] matches "Bearer .+")
Action: Block
Step 2 — Algolia search-specific bot protection: Rate limits such as maxQueriesPerIPPerHour and maxHitsPerQuery are settings of an Algolia API key. Configure them on the Search-Only key you use as the parent; secured keys generated from it inherit the parent's restrictions:
// When generating your public search key
const publicKey = client.generateSecuredApiKey(
  'YourSearchOnlyApiKey',
  {
    filters: 'published:true',
    hitsPerPage: 20,
    // Confine the key to the production index; rate limits come from the parent key
    restrictIndices: ['prod_content'],
  }
);
Step 3 — Netlify / Vercel edge function protection:
// netlify/edge-functions/search-protect.ts
export default async (req: Request) => {
const url = new URL(req.url);
if (url.searchParams.has('q') || url.pathname.includes('/search')) {
const ua = req.headers.get('user-agent') ?? '';
const spamBots = ['GPTBot', 'Bytespider', 'CCBot', 'DotBot'];
if (spamBots.some(b => ua.includes(b))) {
return new Response('Forbidden', { status: 403 });
}
}
};
export const config = { path: ['/search', '/search/*', '/*'] };
- Static site generators (Hugo, Jekyll, Eleventy) using client-side search (Lunr.js, Pagefind) do not create search URLs by default — your primary risk is the JS search index file being repeatedly downloaded. Cache-control headers and Cloudflare caching prevent this from becoming a bandwidth issue.
- Add Disallow: /search-index.json and similar search index files to robots.txt so AI training crawlers do not harvest your entire content corpus as a single downloadable artifact.
- In your LLMs.txt, explicitly list your canonical content URLs rather than allowing AI crawlers to discover content via the search API or JS index files.
Comparison Table: Before vs. After Mitigation
Struggling with Search Query Spam, Bot Traffic, or Analytics Data Corruption?
SEOSiri provides end-to-end technical SEO and GEO remediation for B2B organizations across every platform — Blogger, WordPress, Shopify, custom Node.js and PHP sites, and headless architectures. Our engagements go beyond configuration guides: we audit your live server logs, identify active bot networks, implement defense-in-depth protection stacks, restore GA4 data integrity, and rebuild your AI crawler grounding architecture for sustainable Share of Authority in generative engine responses.
Brand Authority and AI Search: The GEO Dimension
Internal search query spam has a dimension that traditional SEO guides miss entirely: its impact on how generative AI systems perceive and cite your brand. This is now a primary concern in Generative Engine Optimization (GEO) — the discipline of ensuring your brand is accurately, authoritatively, and frequently cited in AI-generated search responses from systems like Google AI Overviews, ChatGPT, Claude, and Perplexity.
AI citation engines evaluate Share of Authority (SoA) — a composite signal of how consistently your domain produces authoritative, well-structured, non-duplicative content within a given topic area. When AI crawlers index hundreds or thousands of thin search result pages from your domain, those pages dilute your SoA score within the AI's internal relevance model for your topic cluster.
Practically: if GPTBot indexes 2,000 seosiri.com/search?q=... pages that each contain the same template wrapper with varying, low-quality query-filtered content, those pages teach the AI model that a significant portion of SEOSiri's content is thin and repetitive — even if your actual editorial content is high-quality and authoritative.
Authoritative References
The following sources provide foundational technical context for the issues and solutions covered in this guide:
- Google Search Central: Managing Crawl Budget for Large Sites — Google's official documentation on how crawl budget is allocated and what wastes it.
- Google Search Central: Introduction to robots.txt — Canonical reference for robots.txt syntax and crawl directive behavior.
- Cloudflare: WAF Custom Rules Documentation — Complete reference for building expression-based WAF rules at the network edge.
- Google Analytics Help: Filter Internal Traffic in GA4 — Official guidance on configuring internal traffic exclusion in GA4 property settings.
- Unify data with segments — SEOSiri Guide on Beyond the data silo on GA4
- LLMs.txt Standard: Official Specification — The emerging community standard for AI crawler content grounding and citation control.
Frequently Asked Questions
Replace <data:blog.searchQuery/> with <data:view.search.query/> inside your Blogger theme's title tag section. Access this via Theme → Edit HTML → locate your <title> block and find the searchQuery page type condition. This prevents the template rendering engine from outputting a broken or empty title tag on search result pages, which is a direct on-page SEO and brand UX defect.
Block ?q= search parameter URLs via both robots.txt Disallow rules and your LLMs.txt file. Internal search result pages contain no original, citable content. Their inclusion in AI training corpora reduces your brand's Share of Authority (SoA) in generative engine responses and dilutes the topical authority signals that determine how often your domain gets cited in AI-generated answers.
In GA4 Explorations, build a custom segment that excludes sessions where page_location contains your search parameter (/search?q=). Save this segment and apply it to all regular reporting to get clean engagement metrics going forward.
For WordPress: add Disallow: /?s= to robots.txt, set search pages to noindex via Yoast or Rank Math, and disable search with a redirect in functions.php if search is not a core UX requirement. For Shopify: customize robots.txt.liquid to block AI crawlers from /search, add a noindex meta tag for the search template, and activate Shopify's native bot protection. For custom PHP or Node.js sites: implement rate limiting at the Nginx or middleware layer, sanitize and validate the query parameter in application code to reject malformed payloads, and return noindex, nofollow meta tags on all search responses. Across all platforms, Cloudflare WAF provides the most consistent and powerful active enforcement layer regardless of the underlying CMS or framework.
Implementation Priority: Where to Start
If you are discovering this problem for the first time, prioritize in this order:
- Fix the Blogger template variable (data:view.search.query) and add the noindex meta for search pages. This takes 10 minutes and stops indexed damage immediately.
- Update robots.txt to Disallow /search for all crawlers and add specific AI crawler user-agent blocks. This controls cooperative crawlers and is your first cleanup signal to Google.
- Audit XML sitemap in Google Search Console to confirm zero search parameter URLs are submitted. Request indexing removal for any already indexed.
- Deploy Cloudflare WAF rules for active enforcement against non-cooperative bots. This is the only layer that actually stops spam at the source.
- Configure GA4 filters to restore data integrity. Apply annotations to your GA4 property to mark the remediation date for accurate before/after comparisons.
- Publish or update LLMs.txt to explicitly exclude search pages from AI crawler grounding — protecting your Share of Authority in generative engine responses long-term.