How to crawl a legacy site.

Post by **kexej28769@nongnue** » Thu Feb 13, 2025 7:14 am

Crawl the old website so that you have a copy of all URLs, page titles, metadata, headers, redirects, broken links, etc. Regardless of which crawler you prefer (see Appendix), make sure that the crawl is not too restrictive. Before crawling the legacy site, pay close attention to the crawler's settings and consider whether you need to:

Ignore robots.txt (if a critical part is accidentally blocked)
Follow internal "nofollow" links (so that the crawler reaches more pages)
Crawl all subdomains (depending on scope)
Crawl outside the start folder (depending on scope)
Change the user agent to Googlebot (desktop)
Change the user agent to Googlebot (smartphone)
Pro tip: Keep a copy of the old site's crawl data (on file or in the cloud) for several months after the migration is complete, just in case you need any of the old site data after the new site goes live.
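The scope settings listed above can be sketched in Python. This is only an illustration: `CrawlSettings`, its field names, and `in_scope` are hypothetical and do not correspond to any particular crawler's options, and the subdomain check simply assumes that a leading "www." can be dropped to find the site's root host.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class CrawlSettings:
    # Hypothetical field names; map them to your own crawler's options.
    ignore_robots_txt: bool = True
    follow_internal_nofollow: bool = True
    crawl_all_subdomains: bool = True
    crawl_outside_start_folder: bool = True
    user_agent: str = "placeholder-user-agent"  # set to the UA string you need


def in_scope(url: str, start_url: str, settings: CrawlSettings) -> bool:
    """Decide whether a discovered URL falls inside the crawl scope."""
    target, start = urlparse(url), urlparse(start_url)
    target_host = (target.hostname or "").lower()
    start_host = (start.hostname or "").lower()

    # Assumption: the root host is the start host without a leading "www.".
    root = start_host[4:] if start_host.startswith("www.") else start_host
    same_host = target_host == start_host
    on_site = target_host == root or target_host.endswith("." + root)
    if not same_host and not (settings.crawl_all_subdomains and on_site):
        return False

    if settings.crawl_outside_start_folder:
        return True

    # Restrict the crawl to the folder that contains the start URL.
    path = start.path
    folder = path if path.endswith("/") else path.rsplit("/", 1)[0] + "/"
    return (target.path or "/").startswith(folder)
```

With the permissive defaults, a URL on blog.example.com is in scope for a crawl that starts at www.example.com; turning off `crawl_all_subdomains` and `crawl_outside_start_folder` narrows the crawl to the start host and folder.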

Non-indexable pages typically include:

3xx, 4xx, and 5xx pages (e.g. redirects, pages not found, server errors, etc.)
Soft 404s (pages with no content that return a 200 server response instead of a 404)
Canonicalized pages (other than those with self-referencing canonical URLs)
Pages with a meta robots noindex directive
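Soft 404s are the hardest of these to spot because the status code looks healthy. One rough way to flag candidates is to count the visible words on each page that returns a 200. The sketch below assumes that approach; the function names and the 50-word threshold are arbitrary choices for illustration, not a standard, and flagged pages still need a manual check.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def visible_word_count(html: str) -> int:
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return len(re.findall(r"\w+", " ".join(parser.parts)))


def looks_like_soft_404(status_code: int, html: str, min_words: int = 50) -> bool:
    # Heuristic only: a 200 response with very little visible text.
    # min_words is an assumed threshold; tune it for the site being crawled.
    return status_code == 200 and visible_word_count(html) < min_words
```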
What is a site migration?
Site migration is a term widely used by SEO professionals to describe any event in which a website makes substantial changes in areas that can significantly impact search engine visibility — typically changes to the site's location, platform, structure, content, design, or UX.

Google's documentation on site migrations does not cover them in depth and downplays the fact that they often result in significant traffic and revenue loss, which can last from a few weeks to several months, depending on the extent to which search engine ranking signals are affected and on how long it takes the affected business to roll out a successful recovery plan.

Site migration examples.
The following section discusses what successful and failed site migrations look like and explains why it is 100% possible to come out of a site migration without any significant losses.

How to identify indexable pages.
Once the crawl is complete, work on identifying the indexable pages of the legacy site. These are HTML pages with the following characteristics:

Return a 200 server response.
Have either no canonical tag or a self-referencing canonical URL.
Have no meta robots noindex directive.
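The three checks above are easy to apply to exported crawl data. Here is a minimal sketch, assuming each crawled page has been loaded into a record like the hypothetical `CrawledPage` below; the field names are made up for illustration and will differ from your crawler's export columns.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawledPage:
    # Hypothetical record; rename the fields to match your crawl export.
    url: str
    status_code: int
    content_type: str
    canonical_url: Optional[str] = None  # None when there is no canonical tag
    meta_robots: str = ""                # e.g. "noindex, follow"


def _normalize(url: str) -> str:
    # Assumption: URLs that differ only by a trailing slash are the same page.
    return url.strip().rstrip("/")


def is_indexable(page: CrawledPage) -> bool:
    if page.status_code != 200:
        return False
    if "text/html" not in page.content_type.lower():
        return False
    # A canonical pointing at a different URL means the page is canonicalized.
    if page.canonical_url and _normalize(page.canonical_url) != _normalize(page.url):
        return False
    directives = {d.strip().lower() for d in page.meta_robots.split(",")}
    return "noindex" not in directives
```

Filtering the crawl with something like `[p for p in pages if is_indexable(p)]` then gives the list of legacy URLs to carry into the rest of the migration work.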