How to Download a Complete Website Using Wget on Ubuntu

This article shows how to download entire websites using wget, a powerful command-line utility that comes pre-installed in many Linux distributions, including Ubuntu. You will learn how to use wget effectively for offline browsing and archiving, so that you capture every part of a website you need.

In this guide, we’ll cover everything from the basic installation (if required), the configuration options available with wget, strategies for downloading entire websites or specific sections, troubleshooting common issues, and tips on managing downloaded content. By the end of this article, you should be able to confidently use wget to download complete websites for offline review or backup.

What is Wget?

Wget (the name combines "World Wide Web" and "get") is a free software utility that retrieves files from the web using the HTTP, HTTPS, and FTP protocols. It operates non-interactively and can run in the background, making it perfect for automating tasks like downloading websites. Recursive downloading is not enabled by default, but once you learn how to set the relevant options, you’ll find that wget is an incredibly useful tool.

Installing Wget on Ubuntu

If you are using a fresh installation or have somehow uninstalled wget, installing it again is straightforward. Open your terminal and run the following command:

sudo apt-get update && sudo apt-get install wget -y

This command refreshes your package index first so that the latest available version of wget is installed. If you encounter any prompts, simply follow them to complete the installation process.
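To confirm that wget is available, you can print its version; the exact output will vary with your Ubuntu release:

wget --version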

Basic Usage of Wget

Once installed, downloading a file using wget is simple. The basic syntax for downloading from HTTP/HTTPS and FTP URLs looks like this:

wget [options] URL

For instance, to download a single HTML page named “index.html” from the server at http://example.com/, you would use the following command:

wget http://example.com/index.html

If wget is not installed on your system, Ubuntu’s command-not-found helper will usually suggest the package to install; it will not install wget or download the file for you automatically.

Downloading Multiple Files

You can download multiple files at once by specifying each URL separately:

wget http://example.com/file1.html http://example.com/file2.html

The URLs are simply separated by spaces. Alternatively, put the URLs in a text file, one per line, and pass that file to wget with the -i option. This is especially useful for downloading many files from a directory listing.
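As a minimal sketch, assuming a file named urls.txt (a name chosen here only for illustration) that contains one URL per line:

wget -i urls.txt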

Recursive Download Using Wget

The real power of wget comes to light when you want to download not just a single webpage but every linked resource on that page and all its subpages recursively. This capability allows you to make complete backups or offline copies of websites.

To achieve this, use the -r (recursive) option, optionally combined with -l (level) to control how deep wget follows links into the website’s directory structure:

wget -r http://example.com/

This command will download all pages and files linked from the homepage, following links up to wget’s default depth of five levels. Setting the maximum depth explicitly with -l ensures you don’t accidentally download entire web servers or get lost in infinite loops of self-referential content.
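For example, to limit the recursion to three levels (both the depth and the URL are illustrative):

wget -r -l 3 http://example.com/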

Additional Options for Recursive Downloads

  • Mirror an entire site: Use -m (mirror), which is shorthand for -r -N -l inf --no-remove-listing; it enables infinite-depth recursion and timestamping and keeps FTP directory listings.

  • Include specific file types: To limit the scope of your download (e.g., downloading only image files), combine wget with a -A option:

    wget -r -l 2 -A jpg,png http://example.com/gallery/

    This command restricts downloads to JPEG and PNG images up to two levels deep.

  • Ignore robots.txt: By default, wget respects the Robots Exclusion Standard (robots.txt). To bypass it, use -e robots=off (a combined example follows this list).
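Putting two of these options together, the following sketch mirrors a site while ignoring robots.txt (the URL is a placeholder; only bypass robots.txt on sites you have permission to crawl):

wget -m -e robots=off http://example.com/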

Handling SSL Certificates

When downloading HTTPS sites, you might run into certificate issues if the site’s security isn’t properly configured. Use --no-check-certificate to ignore these warnings:

wget --no-check-certificate -r https://example.com/

However, please note that disabling certificate checks can expose your system to man-in-the-middle attacks.

Managing Downloaded Content

After downloading a complete site using wget, you’ll likely want to organize or inspect the files. Here are some tips:

  • Remove unnecessary files: Use find and other Unix commands to prune unwanted files, such as pages saved with query strings in their names (for example index.html?page=2); see the sketch after this list.

  • Organize content: You might want to move specific file types into separate folders for easier browsing.

    mkdir -p images && mv *.jpg images/
  • View the website offline: With a web server like Apache or Nginx installed locally, you can serve these files and view them in your browser. Alternatively, use python3 -m http.server to quickly set up a local HTTP server.
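A minimal sketch of this cleanup-and-preview workflow, assuming the mirror was saved under a directory named example.com (adjust the paths and port to match your setup):

cd example.com
# delete pages whose names contain a query string, e.g. index.html?page=2
find . -name '*.html\?*' -delete
# serve the mirror locally at http://localhost:8000/
python3 -m http.server 8000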

Troubleshooting Common Issues

Infinite Loops

One common pitfall is creating infinite loops by following links that lead back to the main page or other parts of the site being mirrored. Use -np (no-parent) and careful planning of depth levels to avoid this problem.
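For example, to mirror only the /docs/ section of a site without climbing back up into its parent directories (the path and depth are illustrative):

wget -r -np -l 2 http://example.com/docs/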

403 Forbidden Errors

Some sites block automated tools such as wget based on the request’s user-agent string or the client’s IP address. You can often work around user-agent blocks by setting a custom user agent:

wget -r --user-agent="Mozilla/5.0" http://example.com/

Conclusion

This article has walked through the versatile capabilities of wget for downloading and mirroring websites on Ubuntu. With its robust command-line interface and numerous options, wget is an indispensable tool in any Linux user’s arsenal. Whether you’re looking to create a local backup or simply explore a website without internet access, mastering wget will greatly enhance your ability to manage web content effectively.
