Limit Wget Recursive Download to Same Domain on Ubuntu
This article provides a concise guide on configuring the wget command-line utility within Ubuntu to recursively download website content while restricting the process to a single domain. You will learn the specific flags required to stop wget from traversing links that point to external servers, ensuring your local mirror contains only the data from your target site.
To prevent wget from following links to other domains, use the --domains flag alongside the recursive option. During a recursive crawl wget stays on the starting host by default, but explicitly naming the allowed domain makes the restriction unambiguous and keeps it in force even if you later enable host spanning.
Use the following command structure in your Ubuntu terminal:
wget --recursive --domains=example.com http://example.com

In this command, the --recursive (or -r) flag tells wget to follow links found within the HTML pages. The --domains (or -D) flag restricts the crawl to the specified domain name. Replace example.com with the actual domain you intend to download.
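If the site serves content from subdomains (for example a www. or static. host), the default behavior of staying on the starting host would skip them. A hedged sketch, assuming the site and its subdomains all share the example.com suffix: combining --span-hosts with --domains lets wget leave the starting host while still rejecting anything outside that suffix.

```shell
# Sketch: mirror example.com and its subdomains (hostnames are placeholders).
# --span-hosts allows wget to leave the starting host; --domains then
# limits the hosts it may span to names ending in example.com.
wget --recursive \
     --span-hosts \
     --domains=example.com \
     http://example.com/
```

Because --domains matches hostname suffixes, www.example.com and static.example.com would both be followed, while cdn.example.net would be rejected.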
For a more robust download that includes all necessary assets like images and CSS without leaving the domain, combine additional flags. The following example ensures all page requisites are met while staying within the domain boundaries:
wget --recursive --domains=example.com --page-requisites --no-parent http://example.com

The --page-requisites flag downloads all files needed to display each page properly, and --no-parent prevents wget from ascending into the parent directory hierarchy. Used together, these options create a contained mirror of the website on your Ubuntu system without fetching data from third-party domains.
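For a complete offline copy, a few more standard wget options are worth combining with the flags above. A sketch, with flag values as suggestions rather than requirements: --level=inf lifts the default recursion depth of 5, --convert-links rewrites links so the saved pages browse correctly offline, and --wait adds a pause between requests to be polite to the server.

```shell
# Sketch of a fuller single-domain mirror; adjust the domain and URL
# for your target site. --level=inf removes the depth limit,
# --convert-links makes the local copy navigable offline, and
# --wait=1 pauses one second between requests.
wget --recursive --level=inf \
     --page-requisites --convert-links --no-parent \
     --domains=example.com --wait=1 \
     http://example.com/
```

The resulting files land under a directory named after the host (example.com/ by default), ready to open in a browser.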