
Limit Wget Recursive Download to Same Domain on Ubuntu

This article is a concise guide to configuring the wget command-line utility on Ubuntu to recursively download website content while restricting the process to a single domain. You will learn the specific flags that stop wget from traversing links to external servers, so your local mirror contains only data from your target site.

To prevent wget from following links to other domains, use the --domains flag alongside the recursive option. Wget's recursive crawl already stays on the starting host by default, but explicitly listing the allowed domain makes the restriction visible and keeps it in force if you later enable host spanning.

Use the following command structure in your Ubuntu terminal:

wget --recursive --domains=example.com http://example.com

In this command, the --recursive (or -r) flag tells wget to follow links found within the HTML pages it downloads. The --domains (or -D) flag restricts crawling to the listed domain names. Replace example.com with the actual domain you intend to download.
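The same command can be written with short options. If the site serves content from subdomains, you also need --span-hosts (-H), because -D only filters hosts once spanning is enabled; per the wget manual, a -D entry such as example.com then also matches subdomains like images.example.com. The following is a sketch, with the subdomain scenario assumed for illustration:

# Short-option equivalent of the command above
wget -r -D example.com http://example.com

# Allow subdomains too: -H permits leaving the starting host,
# while -D confines the crawl to example.com and its subdomains
wget -rH -Dexample.com http://example.com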

For a more robust download that includes all necessary assets like images and CSS without leaving the domain, combine additional flags. The following example ensures all page requisites are met while staying within the domain boundaries:

wget --recursive --domains=example.com --page-requisites --no-parent http://example.com

The --page-requisites flag downloads the files needed to render each page properly, such as images and stylesheets, and --no-parent stops wget from ascending into directories above the starting URL. Together these options create a contained mirror of the website on your Ubuntu system without fetching data from third-party domains.
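
If you want the local copy to be browsable offline, a few more standard wget options can be layered on top. The following is a sketch rather than a required recipe; the one-second delay is an arbitrary politeness setting:

# Self-contained, offline-browsable mirror of example.com
# --convert-links    rewrites links so they work from the local copy
# --adjust-extension saves pages with matching file extensions
# --wait=1           pauses one second between requests to be polite
wget --recursive --domains=example.com --page-requisites --no-parent \
     --convert-links --adjust-extension --wait=1 http://example.com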