How to Download and Extract Links from XML on Ubuntu

This tutorial provides a step-by-step method for downloading XML files and extracting hyperlinks using the Ubuntu command line. It covers essential tools like wget for retrieval and xmllint or grep for parsing content directly within the terminal. By following these instructions, you can automate data collection tasks efficiently on your Linux system.

Install Necessary Tools

Before you begin, make sure the required utilities are installed. Open a terminal, refresh the package list, then install wget for downloading and libxml2-utils, the package that provides xmllint, for parsing XML. Run the following commands:

sudo apt update
sudo apt install wget libxml2-utils
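To confirm the installation worked, you can check that both programs are on your PATH. The following is a minimal sketch; check_tools is a hypothetical helper name introduced here for illustration, not part of either package.

```shell
# Hypothetical helper: report every named tool that is not on PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    # Return 0 only if every tool was found.
    return "$missing"
}

check_tools wget xmllint && echo "all tools found"
```

If either name is reported as missing, rerun the install command above before continuing.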

Download the XML File

Use wget to fetch the XML file from the target URL. This command saves the file to your current directory. Replace the URL below with the actual link to the XML resource you wish to analyze:

wget https://example.com/data.xml

By default, wget names the local file after the last part of the URL (data.xml in this example). To save it under a different name, use the -O flag followed by the filename you want.
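As an illustration of the -O flag, the sketch below wraps the download in a small function that also checks the result. The function name fetch_xml and the output name sitemap.xml are assumptions made for this example; the URL is the same placeholder used above.

```shell
# Hypothetical helper: download the URL in $1 to the local filename in $2.
fetch_xml() {
    # -q silences progress output; -O writes the response body to "$2".
    wget -q -O "$2" "$1" || { echo "download failed: $1" >&2; return 1; }
    # With -O the file can be created even when nothing useful was saved,
    # so make sure it is not empty.
    [ -s "$2" ] || { echo "empty file: $2" >&2; return 1; }
}

# Example call (commented out because the URL is only a placeholder):
# fetch_xml https://example.com/data.xml sitemap.xml
```

Checking the exit status and the file size before parsing avoids running the extraction steps on an empty or missing file.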

Extract Links with xmllint

The most reliable way to extract specific nodes from an XML file is to run an XPath query with xmllint. Which element holds the links depends on the document; in a sitemap, each <url> entry stores its address in a <loc> child. To print the text content of every <loc> element, use:

xmllint --xpath "//loc/text()" data.xml

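If the document declares a default namespace, as standard sitemaps do, the query //loc matches nothing, because an unprefixed name in XPath selects only elements that belong to no namespace. In that case, match on the local name instead, for example with //*[local-name()='loc']/text(). Because xmllint parses the file as XML, this method returns only the content of the elements you select and reports an error if the file is malformed.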
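The namespace case can be sketched as follows. The file sample-sitemap.xml and its two URLs are made up for this example; the query matches elements by local name so that the sitemap namespace does not get in the way.

```shell
# Hypothetical sample: a two-entry sitemap in the standard sitemap namespace.
cat > sample-sitemap.xml <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>
EOF

# local-name() compares only the element name and ignores its namespace.
links=$(xmllint --xpath "//*[local-name()='loc']/text()" sample-sitemap.xml)
printf '%s\n' "$links"
```

Recent versions of xmllint print each matched text node on its own line. Some older versions print them back to back with no separator, so if the output runs together, consider upgrading libxml2-utils or falling back to a pattern-based approach.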

Extract Links with grep

For a quicker, less strict approach, you can use grep to search for patterns that look like URLs. This is useful if the XML is inconsistent or malformed, or if you need a rapid extraction without installing a parser. The following command searches for strings that begin with http:// or https://:

grep -oE 'https?://[^"<[:space:]]+' data.xml

This command prints every string that starts with http:// or https:// and runs up to the next quotation mark, angle bracket, or whitespace character, so it lists links found in attribute values as well as in element text. Because grep does not parse the XML, it also reports namespace URIs, and it leaves entities such as &amp; undecoded.
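A pattern search often returns the same link several times. The sketch below uses a slightly stricter pattern that ends each match at a quotation mark, an angle bracket, or whitespace, then removes duplicates and saves the result. The file names sample-feed.xml and links.txt and the sample content are assumptions made for this example.

```shell
# Hypothetical sample with links in both attributes and element text.
cat > sample-feed.xml <<'EOF'
<feed>
  <entry><link href="https://example.com/a"/></entry>
  <entry><id>https://example.com/b</id></entry>
  <entry><link href="https://example.com/a"/></entry>
</feed>
EOF

# Stop each match at a quote, an angle bracket, or whitespace,
# then drop duplicates and save the result.
grep -oE 'https?://[^"<[:space:]]+' sample-feed.xml | sort -u > links.txt
cat links.txt
```

Here links.txt ends up with two lines, one per distinct URL, even though the first link appears twice in the sample.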