How to parse dynamic HTML content using Python

In the previous tutorial we learned how to parse HTML in Python. In this Python tutorial we are going to learn how to parse dynamic HTML content generated by JavaScript, jQuery, Ajax, Angular or other dynamic page technologies.

What’s the problem with parsing dynamic HTML content in Python and in general?

The problem is that when you request the contents of an HTML page, you get the HTML, CSS and scripts returned by the server. If the page is dynamic, what you get is little more than a couple of scripts that are meant to be interpreted by your browser, which in turn eventually displays the HTML content to the user.
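
To make the problem concrete, here is a minimal sketch that parses the kind of raw response a dynamic page might return. The markup, the `results` id and the `DivTextCollector` class are made up for illustration, and only the standard library is used so that it runs without extra packages.

```python
from html.parser import HTMLParser

# Hypothetical raw response of a dynamic page: the container is empty and
# a script is expected to fill it in later, inside a browser.
RAW_HTML = """
<html><body>
<div id="results"></div>
<script>
document.getElementById("results").innerHTML = "<p>Item 1</p>";
</script>
</body></html>
"""

class DivTextCollector(HTMLParser):
    """Collects the text found inside <div id="results">."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "results":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        if self.inside:
            self.text.append(data.strip())

collector = DivTextCollector()
collector.feed(RAW_HTML)
static_text = "".join(collector.text)

# The script is never executed by a plain parser, so the div stays empty.
print(repr(static_text))
```

The parser sees the script only as text, so the content it would have produced never shows up in the parsed document.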

That leads us to the idea that we should first render the page and then grab its HTML. Rendering can also take a while: the content is sometimes quite “heavy” and needs time to load, so we have to wait for it.

So, along with pure Python we should use some kind of UI component, in particular a web view or some kind of web frame.

One option is to use Qt for Python and handle its page rendering events; another (which I honestly prefer) is to use Selenium for Python.

So, let’s get down to writing some code, but before that let’s outline an approach.

Open a web view with the URL.
Wait until the page is loaded. Often the criterion here is that a div of some class (or id) has loaded.
Grab the rendered HTML.
Process it further using Beautiful Soup.
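
The "wait until the page is loaded" step boils down to polling a condition until it holds or a timeout expires. The sketch below illustrates that idea in plain Python. `wait_until` and `block_is_loaded` are hypothetical names, and this is not how Selenium implements its waits, only the general pattern.

```python
import time

def wait_until(condition, timeout, poll_interval=0.1):
    """Call `condition` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)

# Hypothetical stand-in for "is the div there yet?": it succeeds on the third check.
attempts = {"count": 0}

def block_is_loaded():
    attempts["count"] += 1
    return attempts["count"] >= 3

wait_until(block_is_loaded, timeout=2, poll_interval=0.01)
print(attempts["count"])
```

In a real scraper the condition would ask the browser whether the element exists, and the timeout plays the role of the delay we allow for rendering.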

You will need ChromeDriver (the WebDriver for Chrome) to run the web view.

You will also have to install Selenium, as well as the libraries from the previous tutorial:

pip install selenium

So here is the Python code to parse dynamic content:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By

# url - the URL to fetch dynamic content from.
# delay - maximum number of seconds to wait for the block to appear.
# block_name - id of the element whose presence signals that the page has loaded.
def fetchHtmlForThePage(url, delay, block_name):
    # supply the local path of the web driver.
    # in this example we use ChromeDriver; Selenium 4 takes the path via a Service object
    browser = webdriver.Chrome(service=Service('/Applications/chromedriver'))
    try:
        # open the browser with the URL
        # a browser window will appear for a little while
        browser.get(url)
        try:
            # wait until the element you're looking for is present
            element_present = EC.presence_of_element_located((By.ID, block_name))
            WebDriverWait(browser, delay).until(element_present)
        # if it does not appear within `delay` seconds, report it and carry on
        except TimeoutException:
            print("Loading took too much time!")

        # grab the rendered HTML (it may be incomplete after a timeout)
        return browser.page_source
    finally:
        # close the browser, even if something above raised an exception
        browser.quit()

# call the fetching function we created
# (url and path are placeholders that you define yourself)
html = fetchHtmlForThePage(url, 5, 're-Searchresult')
# grab HTML document
soup = BeautifulSoup(html, "html.parser")
# process it further as you wish.....
# .....
# processFetchedUrls stands for your own processing function
processFetchedUrls(soup, path)
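
As one possible way to "process it further", the sketch below pulls the links out of the block we waited for. The sample markup is made up and stands in for the string returned by the browser; only the `re-Searchresult` id comes from the call above, and what you actually extract depends on your page.

```python
from bs4 import BeautifulSoup

# Hypothetical rendered HTML, standing in for the string returned by the browser.
rendered_html = """
<html><body>
<div id="re-Searchresult">
  <a href="/item/1">First item</a>
  <a href="/item/2">Second item</a>
</div>
<a href="/about">About</a>
</body></html>
"""

soup = BeautifulSoup(rendered_html, "html.parser")

# Look up the block by id; it is None if the page never finished loading.
block = soup.find(id="re-Searchresult")

# Collect the href of every link inside the block, ignoring links elsewhere.
links = [a["href"] for a in block.find_all("a", href=True)] if block else []
print(links)
```

Checking that the block exists before using it matters here, because after a timeout the fetching function still returns whatever HTML the browser had at that moment.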

So that is how to parse dynamic HTML content generated by JavaScript with the help of Python.

Visit us to get help with your Python challenge, or let us know if we can help you with your digital needs.
