How to Parse Meta Tags for Links in Ubuntu

This tutorial demonstrates how to download web pages and extract URLs from their meta tags on Ubuntu. We cover command-line utilities for quick extraction and a Python script for more reliable parsing. By following these steps, you can automate the retrieval of metadata links, such as Open Graph URLs (og:url) or Twitter card images, directly from your terminal.

Prerequisites

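Ensure your Ubuntu system is up to date and has the necessary tools installed. You will need curl to download pages and python3 with pip to install the parsing libraries. Open a terminal and run the following commands:

sudo apt update
sudo apt install -y curl python3-pip
pip3 install beautifulsoup4 requests

On Ubuntu 23.04 and later, pip refuses to install into the system Python and reports an externally-managed-environment error. In that case, install the packaged versions with sudo apt install -y python3-bs4 python3-requests, or run pip inside a virtual environment.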

Method 1: Quick Extraction with Curl and Grep

For simple tasks, you can download the HTML and filter specific meta tags with standard text-processing tools. This method is fast, but it depends on the exact attribute order and quoting in the markup, so it is less reliable for complex or inconsistent HTML.

  1. Download the page content to a file named page.html:

    curl -o page.html https://example.com

  2. Extract the Open Graph URL from its meta tag using grep:

    grep -oP 'meta property="og:url" content="\K[^"]*' page.html

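This command searches for the Open Graph URL property. The -o option prints only the matched text, and \K in the Perl-compatible pattern (-P) discards everything matched before it, so the output is just the URL from the content attribute. You can adjust the pattern to match other meta tags, such as name="twitter:image".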
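
To make the pattern's assumptions concrete, here is a minimal Python sketch of roughly the same match, with the attribute name and value as parameters. The helper name find_meta_content and the sample markup are illustrative assumptions, not part of the original commands. Like the grep pattern, it expects double-quoted attributes with the identifying attribute directly before content, so it silently skips tags written in another order.

```python
import re


def find_meta_content(html, attr, value):
    """Return content values of meta tags where attr="value" directly precedes content.

    Hypothetical helper that approximates the grep pattern; it shares the
    same limitations (double quotes only, fixed attribute order).
    """
    pattern = re.compile(
        r'meta\s+' + re.escape(attr) + r'="' + re.escape(value) + r'"\s+content="([^"]*)"'
    )
    return pattern.findall(html)


# Illustrative sample markup, not taken from a real page.
sample = (
    '<meta property="og:url" content="https://example.com/page">\n'
    '<meta name="twitter:image" content="https://example.com/img.png">\n'
    '<meta content="https://example.com/other" property="og:url">\n'
)

print(find_meta_content(sample, "property", "og:url"))
print(find_meta_content(sample, "name", "twitter:image"))
```

The third sample tag is not matched because its content attribute comes first, which is the kind of variation that makes pattern matching on raw HTML fragile.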

Method 2: Robust Parsing with Python

For accurate parsing, use a Python script with the Beautiful Soup library. It tolerates malformed HTML and reads attributes by name, regardless of their order or quoting.

  1. Create a new file named parse_meta.py:

    nano parse_meta.py

  2. Paste the following code into the file:

    import requests
    from bs4 import BeautifulSoup

    url = 'https://example.com'
    # Fail fast on network problems and HTTP error statuses
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find all meta tags that have a content attribute
    meta_tags = soup.find_all('meta', attrs={'content': True})

    for tag in meta_tags:
        content = tag.get('content').strip()
        # Keep only values that look like absolute http(s) URLs
        if content.startswith(('http://', 'https://')):
            print(f"{tag.get('property') or tag.get('name')}: {content}")

  3. Run the script:

    python3 parse_meta.py

This script downloads the page into memory, parses the HTML structure, and prints the content of any meta tag whose value begins with http. This method is well suited to scraping multiple pages or handling inconsistent markup. It does not execute JavaScript, so meta tags that a page injects client-side will not appear in the output.
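
For multiple pages, the script's logic can be wrapped in a function that takes HTML text and returns the matching tags, which keeps downloading and parsing separate. The sketch below is one possible way to do that; the function name extract_meta_links and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def extract_meta_links(html):
    """Map each meta tag's property/name to its content when the content is an http(s) URL.

    Hypothetical helper; if a key appears more than once, the last value wins.
    """
    soup = BeautifulSoup(html, "html.parser")
    links = {}
    for tag in soup.find_all("meta", attrs={"content": True}):
        key = tag.get("property") or tag.get("name")
        content = tag["content"].strip()
        if key and content.startswith(("http://", "https://")):
            links[key] = content
    return links


# Illustrative sample markup; in practice, pass response.text from requests.get().
sample = """
<html><head>
<meta charset="utf-8">
<meta name="description" content="Not a link">
<meta content="https://example.com/page" property="og:url">
<meta name="twitter:image" content="https://example.com/img.png">
</head></html>
"""

for key, value in extract_meta_links(sample).items():
    print(f"{key}: {value}")
```

To process several pages, call the function once per URL inside a loop, for example extract_meta_links(requests.get(url, timeout=10).text).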