Filter Wget Downloads by Content-Type on Ubuntu Linux
This article explains how to use the wget command on Ubuntu to filter downloads by file type. You will learn how to accept or reject content during retrieval so that only the files you want are saved to your system.
wget does not filter on the HTTP Content-Type header. Instead, it
matches file name suffixes or patterns, which usually correspond to the
content type. These filters apply to the files wget discovers during
recursive retrieval, enabled with -r. To download only specific file
types, use the --accept flag (short form -A) followed by a
comma-separated list of suffixes. For example, to download only PDF and
PNG files, run the following command:
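wget -r --accept=pdf,png https://example.com/files

To exclude specific file types instead, use the --reject flag (short
form -R). This is useful when you want to download everything except
large archives or HTML pages. During a recursive download, wget still
fetches rejected HTML pages so that it can follow their links, then
deletes them. The command below downloads all files except those ending
in zip or html: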
wget -r --reject=zip,html https://example.com/files

When both flags are given, wget saves a file only if it matches the
accept list and does not match the reject list. An accept list on its
own already excludes every other file type, so no catch-all reject
pattern is needed. During a recursive download, wget still traverses
links but saves only the files that match your criteria. The following
example recursively downloads a site while saving only image files:
wget -r --accept=jpg,jpeg,png,gif https://example.com/images

If you need more control than suffix matching provides, use
--accept-regex or --reject-regex. Each takes a regular expression that
is matched against the complete URL rather than the file name alone,
which gives finer control over which URLs are followed and downloaded.
Always test your filters on a small subset of data before running large
recursive downloads to prevent unintended data retrieval.
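
Because wget matches names and URLs rather than the Content-Type header, a file's reported MIME type can be checked with a separate header request before downloading. The sketch below is one possible way to do that in a shell script. It relies on wget's --spider and --server-response options, which request a URL without saving it and print the response headers. The function names extract_content_type and download_if_type are hypothetical helpers, not part of wget. Some servers answer header-only requests differently from full downloads, so treat this as a starting point rather than a complete solution.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of wget). Only the wget options used
# below, --spider and --server-response, are real wget features.

# Read HTTP response headers on stdin and print the MIME type from the
# last Content-Type header. When redirects are followed, the last header
# belongs to the final response. Parameters such as "; charset=UTF-8"
# are dropped and the value is lowercased.
extract_content_type() {
  awk '
    tolower($0) ~ /^[ \t]*content-type:/ {
      value = $0
      sub(/^[^:]*:[ \t]*/, "", value)   # drop the header name
      sub(/[ \t]*;.*$/, "", value)      # drop parameters
      sub(/[ \t\r]+$/, "", value)       # drop trailing whitespace and CR
      last = tolower(value)
    }
    END { if (last != "") print last }
  '
}

# Download URL $1 only if the server reports MIME type $2.
# This makes one header-only request before the real download.
download_if_type() {
  local url="$1" wanted="$2" actual
  # wget prints the response headers on stderr, hence 2>&1.
  actual=$(wget --spider --server-response "$url" 2>&1 | extract_content_type)
  if [ "$actual" = "$wanted" ]; then
    wget "$url"
  else
    echo "skipping $url: Content-Type is '${actual:-unknown}'" >&2
    return 1
  fi
}
```

After sourcing the script, a call such as download_if_type https://example.com/files/report.pdf application/pdf (the file name here is illustrative) would save the file only when the server reports it as a PDF.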