XML Sitemaps: The SEO Feature Most Websites Get Wrong

📅 May 16, 2026 ⏱️ 9 min read ✍️ By Lu Shen

I audited a client's sitemap last month and found 340 URLs. Their site had 87 pages. The other 253 were dead links, redirect chains, and pages that hadn't existed in two years. They were paying an SEO agency $2,000/month, and nobody had looked at the sitemap in over a year.

This isn't unusual. Most websites have sitemaps, but most of those sitemaps are garbage. They contain errors that don't just waste Googlebot's time — they actively train search engines to distrust your signals. A bad sitemap is worse than no sitemap at all.

What a Sitemap Actually Does

A sitemap is a file that tells search engines which pages on your site you want indexed. That's it. It's not a guarantee of indexing, it's not a ranking factor, and it doesn't replace internal linking. It's a suggestion — like leaving a map on your doorstep for a delivery driver.
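To make that concrete, here is a minimal sketch of what the file actually contains, generated with Python's standard library. The function name build_sitemap and the example.com URLs are my own placeholders; the only required pieces are the urlset root in the sitemaps.org namespace and one url/loc pair per page.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return sitemap XML (as a string) listing the given absolute URLs."""
    ET.register_namespace("", SITEMAP_NS)  # emit xmlns="..." with no prefix
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for url in urls:
        entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body

# Placeholder URLs for illustration only.
xml_text = build_sitemap([
    "https://example.com/",
    "https://example.com/blog/first-post",
])
print(xml_text)
```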

Search engines can discover pages by crawling links, and they'll do that regardless of whether you have a sitemap. But sitemaps become valuable when:

Think of the sitemap as crawl efficiency, not crawl necessity. It helps search engines find your content faster and understand your priorities.

The Six Sitemap Mistakes I See Everywhere

1. Including Non-Indexable URLs

This is the most common and most damaging mistake. Your sitemap should only contain URLs that you want indexed and that can actually be indexed. Yet I routinely see sitemaps that include dead links, URLs that redirect, and pages whose canonical tag points somewhere else.

Every non-indexable URL in your sitemap is a broken promise to search engines. You're saying "this page exists and matters," and then the crawler finds it doesn't. Do this enough times and Google starts treating your sitemap as unreliable. They'll still crawl it, but with less urgency and less trust.
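If you already have crawl data, the filter is only a few lines. This is a rough sketch, not a full crawler: the Page record and its fields (status, noindex, canonical) are hypothetical names for whatever your crawl tool exports.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    """One crawled URL. Field names are assumptions for this sketch."""
    url: str
    status: int                      # final HTTP status, redirects not followed
    noindex: bool = False            # page carries a noindex directive
    canonical: Optional[str] = None  # href of the canonical tag, if any

def belongs_in_sitemap(page: Page) -> bool:
    """Keep only URLs that return 200, are indexable, and are self-canonical."""
    if page.status != 200:
        return False
    if page.noindex:
        return False
    if page.canonical is not None and page.canonical != page.url:
        return False
    return True

# Placeholder crawl results for illustration.
crawled = [
    Page("https://example.com/pricing", 200, canonical="https://example.com/pricing"),
    Page("https://example.com/old-page", 404),
    Page("https://example.com/promo", 301),
    Page("https://example.com/thanks", 200, noindex=True),
    Page("https://example.com/pricing?ref=nav", 200, canonical="https://example.com/pricing"),
]
keep = [p.url for p in crawled if belongs_in_sitemap(p)]
print(keep)
```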

2. Stale or Inaccurate lastmod Dates

The lastmod tag tells search engines when a page was last modified. It's supposed to help them prioritize which pages to recrawl. But here's what most CMS platforms do: they set lastmod to the current date for every page, every time the sitemap is generated. Even if the page hasn't changed in months.

This is the boy who cried wolf of SEO. If every page says it was modified today, the signal is meaningless. Search engines learn to ignore your lastmod values entirely, which means when you actually do update important content, they don't prioritize recrawling it.

The fix: only update lastmod when the main content of the page has genuinely changed. Updated a typo? Probably not worth updating lastmod. Rewrote half the article? Definitely update it.
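One way to automate that judgment is to compare the old and new main content and only bump lastmod when enough of it changed. The sketch below is an assumption-heavy illustration: the 5% threshold is a number I picked, not a standard, and it expects you to store the previous version of each page's main text.

```python
from datetime import date
from difflib import SequenceMatcher

# Assumed cutoff: at least 5% of the words must differ to count as an update.
CHANGE_THRESHOLD = 0.05

def changed_fraction(old_text: str, new_text: str) -> float:
    """Rough share of the content that differs, compared word by word."""
    old_words, new_words = old_text.split(), new_text.split()
    return 1.0 - SequenceMatcher(None, old_words, new_words).ratio()

def next_lastmod(old_text: str, new_text: str, old_lastmod: str, today: date) -> str:
    """Return today's date only if the main content changed substantially."""
    if changed_fraction(old_text, new_text) >= CHANGE_THRESHOLD:
        return today.isoformat()
    return old_lastmod

# Placeholder content: 40 distinct words.
original = " ".join(f"w{i}" for i in range(40))
typo_fix = original.replace("w10 ", "w10x ")                    # one word changed
rewrite = " ".join(original.split()[:20] + ["new"] * 20)        # half rewritten

today = date(2026, 5, 16)
print(next_lastmod(original, typo_fix, "2026-01-10", today))    # keeps old date
print(next_lastmod(original, rewrite, "2026-01-10", today))     # bumps to today
```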

3. Not Using Sitemap Indexes for Large Sites

A single sitemap file can contain at most 50,000 URLs and must be under 50MB uncompressed. If your site exceeds these limits, you need a sitemap index file that points to multiple sitemap files.

But even for sites well under 50,000 URLs, splitting your sitemap by section makes sense. Having separate sitemaps for blog posts, product pages, and category pages lets you monitor indexing rates for each section independently in Google Search Console. If your product sitemap has a 30% indexing rate but your blog sitemap has 90%, you know exactly where the problem is.
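Here's a sketch of splitting by section and tying the pieces together with an index file. The sitemap-{section}-{n}.xml naming scheme and the example.com URLs are my own placeholders; the sitemapindex, sitemap, and loc elements come from the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # protocol limit per sitemap file

def chunk(urls, size=MAX_URLS_PER_FILE):
    """Split a list of URLs into pieces of at most `size` entries."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

def build_sitemap_index(sitemap_urls):
    """Return sitemap index XML pointing at each child sitemap file."""
    ET.register_namespace("", SITEMAP_NS)
    index = ET.Element(f"{{{SITEMAP_NS}}}sitemapindex")
    for url in sitemap_urls:
        entry = ET.SubElement(index, f"{{{SITEMAP_NS}}}sitemap")
        ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
    return ET.tostring(index, encoding="unicode")

# Placeholder sections; real lists would come from your CMS or database.
sections = {
    "blog": [f"https://example.com/blog/{i}" for i in range(3)],
    "products": [f"https://example.com/products/{i}" for i in range(5)],
}

child_files = []
for name, urls in sections.items():
    for n, part in enumerate(chunk(urls), start=1):
        # Each `part` would be written out as its own sitemap file here.
        child_files.append(f"https://example.com/sitemap-{name}-{n}.xml")

index_xml = build_sitemap_index(child_files)
print(index_xml)
```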

4. Missing or Wrong Canonical URLs

If a page in your sitemap has a canonical tag pointing to a different URL, you're sending mixed signals. The sitemap says "index this URL," but the canonical tag says "the canonical version is over there." Search engines have to guess which one to trust, and they don't always guess right.

Rule: every URL in your sitemap should be self-canonical. The URL in the sitemap, the canonical tag on the page, and the URL the user sees in their browser should all match.
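Checking this by hand gets old fast, so here's a small sketch using Python's built-in HTML parser. The class and function names are mine, and the comparison is a strict string match, so a trailing slash or http/https difference counts as a mismatch. I also treat a page with no canonical tag as a failure, which is a choice you may want to relax.

```python
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> in a page."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        rel_values = (attr.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical = attr.get("href")

def is_self_canonical(sitemap_url: str, html: str) -> bool:
    """True only if the page's canonical tag exactly matches the sitemap URL."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonical == sitemap_url

# Placeholder HTML for illustration.
page_html = '<html><head><link rel="canonical" href="https://example.com/pricing"></head></html>'
print(is_self_canonical("https://example.com/pricing", page_html))
print(is_self_canonical("https://example.com/pricing?ref=nav", page_html))
```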

5. Forgetting to Submit the Sitemap

Having a sitemap at /sitemap.xml is good. Referencing it in robots.txt is better. But actually submitting it through Google Search Console and Bing Webmaster Tools is the most direct way to get it processed promptly, and it gives you a report to check afterwards.

I've seen sites where the sitemap existed for months but was never submitted. Google eventually found it through the robots.txt reference, but it took weeks longer than it should have. For a new site trying to get indexed, those weeks matter.
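Submission itself happens in the search engines' dashboards, but you can at least verify the robots.txt reference with a script. This sketch parses a sample robots.txt with Python's standard urllib.robotparser (the site_maps method needs Python 3.8 or newer); the file contents are a made-up example.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration. The Sitemap line takes an absolute URL.
robots_txt = """\
User-agent: *
Disallow: /cart/

Sitemap: https://example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# site_maps() returns a list of declared sitemap URLs, or None if there are none.
declared = parser.site_maps()
if declared:
    print("Sitemaps declared in robots.txt:", declared)
else:
    print("No Sitemap line found in robots.txt")
```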

6. Never Updating the Sitemap

Static sitemaps for dynamic sites are a problem. If you publish new blog posts, add products, or create landing pages, your sitemap needs to reflect those changes. A sitemap that was accurate in January but hasn't been updated since is doing nobody any favors.

The ideal setup is a dynamically generated sitemap that updates automatically when content changes. Most CMS plugins handle this. If you're running a custom setup, you'll need to build it yourself or use a generator.

Best Practices That Actually Matter

Let me cut through the noise and give you the sitemap practices that make a real difference:

Include only canonical, indexable, 200-status URLs. This one rule eliminates 80% of sitemap problems. Audit your sitemap regularly and remove anything that doesn't meet all three criteria.

Keep it under 10,000 URLs per file even though the limit is 50,000. Smaller files process faster, and it's easier to diagnose problems when things go wrong.

Use accurate lastmod dates. If you can't automate accurate lastmod, don't include the tag at all. Omitting lastmod is better than lying about it.

Set reasonable priority values or just leave them out. The priority tag is relative (it compares pages on your site to each other, not to other sites), and Google has said it ignores the value anyway. If you do use it, make sure your homepage isn't the same priority as every other page.

Use hreflang attributes for multilingual sites. If you serve content in multiple languages, include the hreflang annotations in your sitemap. It's cleaner than adding them to every page's HTML and keeps all your internationalization signals in one place.
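For reference, this is roughly what hreflang annotations look like when generated in code. It's a sketch under my own naming (build_hreflang_sitemap, the example.com URLs): each language version gets its own url entry, and every entry lists all alternates, itself included, as xhtml:link elements.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"

def build_hreflang_sitemap(groups):
    """groups: list of dicts mapping a language code to that version's URL."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("xhtml", XHTML_NS)
    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for alternates in groups:
        for url in alternates.values():
            entry = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
            ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = url
            # Every version lists the full set of alternates, including itself.
            for lang, href in alternates.items():
                ET.SubElement(
                    entry,
                    f"{{{XHTML_NS}}}link",
                    {"rel": "alternate", "hreflang": lang, "href": href},
                )
    return ET.tostring(urlset, encoding="unicode")

# Placeholder URLs for one page available in two languages.
hreflang_xml = build_hreflang_sitemap([
    {"en": "https://example.com/en/pricing", "de": "https://example.com/de/preise"},
])
print(hreflang_xml)
```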

Compress your sitemap with gzip. Sitemaps are highly repetitive text, so they compress well: a 10MB file often shrinks to around 1MB. Faster downloads mean faster processing by search engines. Just reference it as sitemap.xml.gz in your robots.txt.
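If you want to see the effect on your own data, a quick sketch with Python's gzip module is enough. The sample sitemap here is synthetic, so the ratio it prints is only indicative; real files will vary with URL length and how many optional tags you include.

```python
import gzip

def gzip_sitemap(xml_text: str) -> bytes:
    """Return the gzip-compressed bytes of a sitemap."""
    return gzip.compress(xml_text.encode("utf-8"))

# Synthetic sitemap with placeholder URLs, just to measure compression.
entries = "".join(
    f"<url><loc>https://example.com/products/item-{i}</loc></url>"
    for i in range(5000)
)
sample = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    + entries
    + "</urlset>"
)

raw = sample.encode("utf-8")
packed = gzip_sitemap(sample)
print(f"uncompressed: {len(raw)} bytes, gzipped: {len(packed)} bytes")

# Writing `packed` to disk as sitemap.xml.gz is all that's left to do.
```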

How to Audit Your Sitemap

Here's my quick audit process that catches most problems:

  1. Crawl your sitemap URLs and check for 4xx/5xx status codes and redirects. Every redirect or error wastes crawl budget.
  2. Cross-reference with Google Search Console. Check the sitemap report for warnings and errors. Google is surprisingly explicit about what's wrong.
  3. Compare sitemap URL count to actual pages. If your sitemap has 5,000 URLs but your site has 500 pages, something is very wrong.
  4. Check canonical alignment. Pull the canonical tag from each page in your sitemap and verify it matches the sitemap URL.
  5. Validate the XML. Missing closing tags, unescaped ampersands, and encoding errors can make the entire file unparseable.

I do this audit at least quarterly for every site I manage. It takes about 30 minutes and always surfaces something.
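Steps 1, 3, and 5 of that process are easy to script. The sketch below is deliberately offline: fetch_status is a hypothetical callable you would back with a real HTTP client that does not follow redirects, and here it's stubbed with a dictionary so the example runs without a network.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def audit_sitemap(xml_text, fetch_status):
    """Validate the XML, count URLs, and list every URL that isn't a plain 200.

    fetch_status: callable taking a URL and returning its HTTP status code.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # Unescaped ampersands and missing closing tags end up here.
        return {"valid_xml": False, "error": str(exc), "urls": 0, "problems": []}

    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]
    problems = [(url, fetch_status(url)) for url in urls]
    problems = [(url, status) for url, status in problems if status != 200]
    return {"valid_xml": True, "error": None, "urls": len(urls), "problems": problems}

# Placeholder sitemap and stubbed status codes for illustration.
good_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/old</loc></url>"
    "</urlset>"
)
stub_statuses = {"https://example.com/": 200, "https://example.com/old": 301}
report = audit_sitemap(good_xml, stub_statuses.get)
print(report)

# An unescaped ampersand makes the whole file unparseable.
broken_xml = good_xml.replace("/old", "/search?a=1&b=2")
print(audit_sitemap(broken_xml, stub_statuses.get)["valid_xml"])
```

Comparing the reported URL count against the number of pages you know the site has covers step 3.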

The Image and Video Sitemap Opportunity

Most people stop at the basic XML sitemap. But if your site has significant image or video content, extended sitemaps can get you extra visibility in Google Images and video carousels.

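Image sitemaps let you list every image attached to a page using image:loc entries, which helps Google find images it might otherwise miss, such as ones loaded by JavaScript. Note that Google deprecated the older title, caption, geographic location, and license tags in 2022, so there's no need to fill those in. For e-commerce sites, this can directly impact product discovery.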

Video sitemaps are even more impactful. They let you specify a thumbnail, duration, and even whether the video is family-friendly. Google uses this data for video rich results, which can dramatically increase your click-through rates from search.
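Here's a sketch of a single video entry. The tag names follow Google's video sitemap extension as I understand it (thumbnail_loc, title, description, content_loc, duration in seconds, family_friendly as yes/no); the function name and all URLs are placeholders, so check the current documentation before relying on it.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"

def video_entry(page_url, title, description, thumbnail_url, content_url,
                duration_seconds, family_friendly=True):
    """Build one <url> element carrying a <video:video> block."""
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("video", VIDEO_NS)
    entry = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(entry, f"{{{SITEMAP_NS}}}loc").text = page_url
    video = ET.SubElement(entry, f"{{{VIDEO_NS}}}video")
    fields = {
        "thumbnail_loc": thumbnail_url,
        "title": title,
        "description": description,
        "content_loc": content_url,
        "duration": str(duration_seconds),
        "family_friendly": "yes" if family_friendly else "no",
    }
    for tag, value in fields.items():
        ET.SubElement(video, f"{{{VIDEO_NS}}}{tag}").text = value
    return entry

# Placeholder values for illustration.
entry = video_entry(
    page_url="https://example.com/guides/sitemap-audit",
    title="How to audit a sitemap",
    description="A walkthrough of a quarterly sitemap audit.",
    thumbnail_url="https://example.com/thumbs/sitemap-audit.jpg",
    content_url="https://example.com/video/sitemap-audit.mp4",
    duration_seconds=272,
)
print(ET.tostring(entry, encoding="unicode"))
```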

The Takeaway

A sitemap is not a set-it-and-forget-it thing. It's a living document that reflects the current state of your site. When it's accurate, it helps search engines discover and prioritize your content. When it's stale or error-ridden, it trains search engines to ignore you.

The bar isn't high — most sitemaps I audit are in rough shape — so getting yours right puts you ahead of the majority. Clean out the dead URLs, fix the lastmod dates, align your canonicals, and submit it properly. That's 90% of the battle.

If you need to generate a clean sitemap from scratch, I built a sitemap generator that crawls your site, filters out non-indexable pages, and produces valid XML with proper lastmod dates. It also validates your existing sitemap and highlights errors. Way faster than doing it by hand.