How to Parse Meta Tags for Links in Ubuntu
This tutorial demonstrates how to download web pages and extract URL data from meta tags using Ubuntu. We will cover using command-line utilities for quick extraction and Python scripts for reliable parsing. By following these steps, you can automate the retrieval of metadata links such as canonical URLs or Open Graph properties directly from your terminal.
Prerequisites
Ensure your Ubuntu system is updated and has the necessary tools
installed. You will need curl for downloading files and
python3 with pip for advanced parsing. Open
your terminal and run the following commands:
sudo apt update
sudo apt install curl python3-pip -y
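pip3 install beautifulsoup4 requests
On Ubuntu 23.04 and later, pip may refuse to install system-wide with an "externally-managed-environment" error. In that case, install the packaged versions instead:
sudo apt install python3-bs4 python3-requests -y
Method 1: Quick Extraction with Curl and Grep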
For simple tasks, you can download the HTML and filter specific meta tags using standard text processing tools. This method is fast but less reliable for complex HTML structures.
- Download the page content to a file named page.html:
  curl -o page.html https://example.com
- Extract the URL from a meta tag (here, the Open Graph og:url property) using grep:
  grep -oP 'meta property="og:url" content="\K[^"]*' page.html
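This command searches for the Open Graph URL property and prints only
the link content. The -P flag enables Perl-compatible regular expressions,
and \K discards everything matched before it, so only the URL is printed.
The pattern matches only when property comes directly before content, both
values are double-quoted, and the tag sits on a single line. You can adjust
the search pattern to match other meta tags like name="twitter:image".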
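To make the reliability caveat concrete, the same pattern can be tried with Python's re module against two hand-written snippets. This is only a minimal sketch: the sample HTML strings and the variable names are invented for illustration, and a capture group stands in for \K, which Python's re module does not support.

```python
import re

# Same pattern as the grep command above; the capture group replaces \K.
PATTERN = re.compile(r'meta property="og:url" content="([^"]*)')

# Hypothetical snippets: same meaning, different attribute order.
property_first = '<meta property="og:url" content="https://example.com/a">'
content_first = '<meta content="https://example.com/a" property="og:url">'

# The first snippet matches; the second yields an empty list because
# the pattern expects property to come before content.
print(PATTERN.findall(property_first))
print(PATTERN.findall(content_first))
```

Both snippets are valid HTML describing the same tag, which is why a real parser is the safer choice once pages vary in how they are written.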
Method 2: Robust Parsing with Python
For accurate parsing, use a Python script with the Beautiful Soup library. This handles malformed HTML and ensures you extract the correct attributes.
Create a new file named parse_meta.py:
nano parse_meta.py
Paste the following code into the file:
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'

# Download the page; fail early on HTTP errors or a stalled connection
response = requests.get(url, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, 'html.parser')

# Find all meta tags with a content attribute containing a link
meta_tags = soup.find_all('meta', attrs={'content': True})

for tag in meta_tags:
    content = tag.get('content')
    if content.startswith('http'):
        # Label each URL with the tag's property or name attribute
        print(f"{tag.get('property') or tag.get('name')}: {content}")
Run the script:
python3 parse_meta.py
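This script downloads the page into memory, parses the HTML structure, and prints the content of any meta tag whose value begins with http, which covers both http:// and https:// URLs. This method is well suited to scraping multiple pages or handling markup that varies in attribute order and quoting. It does not execute JavaScript, so meta tags injected by client-side scripts will not appear in the output.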
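The same idea can be sketched without third-party packages, which is handy for testing the logic on an HTML string before fetching real pages. The example below swaps Beautiful Soup for the standard library's html.parser.HTMLParser; it is an illustration under stated assumptions, not a replacement for the script above. The names MetaLinkParser and extract_meta_links, the sample markup, and the rule that a value starting with "/" counts as a relative link are all assumptions made for this sketch. It also picks up <link rel="canonical">, since canonical URLs live in a link element and not in a meta tag.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MetaLinkParser(HTMLParser):
    """Collect URL-like values from <meta> tags and <link rel="canonical">."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []  # list of (label, absolute_url) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # attribute names arrive lowercased
        if tag == "meta":
            label = attrs.get("property") or attrs.get("name")
            content = attrs.get("content") or ""
            # Assumption: values starting with "http" or "/" are links.
            if label and content.startswith(("http", "/")):
                self.links.append((label, urljoin(self.base_url, content)))
        elif tag == "link":
            rel_values = (attrs.get("rel") or "").lower().split()
            href = attrs.get("href")
            if "canonical" in rel_values and href:
                self.links.append(("canonical", urljoin(self.base_url, href)))


def extract_meta_links(html, base_url):
    """Return (label, absolute_url) pairs found in the given HTML string."""
    parser = MetaLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample markup, including reordered attributes and relative URLs.
sample = """
<head>
  <link rel="canonical" href="/post/1">
  <meta content="https://example.com/post/1" property="og:url">
  <meta name="twitter:image" content="/img/cover.png">
  <meta name="description" content="Not a link">
</head>
"""

for label, url in extract_meta_links(sample, "https://example.com"):
    print(f"{label}: {url}")
```

Resolving each value with urljoin against the page URL means relative paths come out as absolute links, which matters when the results from several pages are collected in one list.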