parsing - Cyber Whale

# Import Selenium components and Beautiful Soup
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By

# url - the URL to fetch dynamic content from
# delay - number of seconds to wait for the page to load
# block_name - id of the element whose presence signals that the page has loaded
def fetchHtmlForThePage(url, delay, block_name):
    # Supply the local path of the web driver; this example uses ChromeDriver.
    # (Recent Selenium 4 releases take a Service object here instead of a path string.)
    browser = webdriver.Chrome('/Applications/chromedriver')
    # Open the URL; a browser window will appear for a little while
    browser.get(url)
    try:
        # Wait until the element we are looking for is present
        element_present = EC.presence_of_element_located((By.ID, block_name))
        WebDriverWait(browser, delay).until(element_present)
    except TimeoutException:
        # The element did not appear within the delay
        print("Loading took too much time!")
    # Grab the rendered HTML
    html = browser.page_source
    # Close the browser
    browser.quit()
    return html

# Call the fetching function we created
html = fetchHtmlForThePage(url, 5, 're-Searchresult')
# Parse the HTML document
soup = BeautifulSoup(html, 'html.parser')
# Process it further as you wish...
processFetchedUrls(soup, path)

In this tutorial we will look at how to extract email addresses from HTML pages using Python.

Python is a scripting language that is easy to get started with and well suited to tasks like parsing emails.

So let’s walk through how the parsing works:

1. Initialize a queue of URLs. The first item is the initial URL.
2. Initialize a set of already visited URLs to avoid repetitions.
3. Take the next URL from the queue and start parsing it.
4. Add the URL to the set of processed URLs.
5. Fetch the whole HTML and search it for an email pattern using a regex.
6. If one or more emails were found, write them to CSV.
7. Loop through the <a> tags found on the page.
8. Check whether each link URL is relative or absolute.
9. Check whether the URL is already in the processed URLs set. If not, add it to the processing queue.
10. Repeat from step 3.
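
The queue-and-set part of these steps can be tried without any network access. The snippet below is only a minimal sketch: the `LINKS` dictionary is a hypothetical stand-in for the links found on each page, and `crawl_order` is an illustrative name that does not appear in the script further down.

```python
from collections import deque

# Hypothetical link map standing in for real pages: page -> links found on it.
LINKS = {
    "/": ["/about", "/contact"],
    "/about": ["/", "/contact"],
    "/contact": ["/about", "/team"],
    "/team": [],
}

def crawl_order(start):
    """Return pages in the order a queue-based crawl would visit them."""
    queue = deque([start])   # URLs waiting to be processed
    seen = {start}           # URLs already queued or processed
    order = []
    while queue:
        url = queue.popleft()        # take the oldest URL first
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:     # skip anything queued or visited before
                seen.add(link)
                queue.append(link)
    return order

print(crawl_order("/"))
```

This sketch marks a URL as seen at the moment it is queued, so a single set lookup is enough to avoid repetitions. That is one possible variation on the steps above, not the only way to do it.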

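Before launching the script, don’t forget to install the required libraries.

From the command line, run:

pip install requests

pip install beautifulsoup4

The csv, re and collections modules, as well as the URL-parsing module (urlparse in Python 2, urllib.parse in Python 3), ship with Python and do not need to be installed.

Once you have the libraries installed, you can try the script.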

from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re
import csv

# initialize the CSV writer and filename
cw = csv.writer(open("Singa.csv", 'a', newline=''), delimiter=',')
# a queue of URLs to crawl, starting with the initial URL
new_urls = deque(['https://foundersgrid.com/50-singapore-startups/'])

# a set of URLs that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process URLs one by one until we exhaust the queue
while len(new_urls):

    # take the first URL from the queue
    url = new_urls.popleft()
    # mark it as visited by adding it to the processed URLs
    processed_urls.add(url)

    # extract the base URL and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # get the URL's content and handle exceptions, if any
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # skip pages with errors
        continue

    # extract all email addresses, keeping only the ones not seen on earlier pages
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I)) - emails
    emails.update(new_emails)
    print(new_emails)
    # write the new emails to CSV
    # alternatively, you can write the whole emails set to CSV after parsing
    for em in new_emails:
        cw.writerow([em])

    # create a Beautiful Soup object as a representation of the HTML page
    soup = BeautifulSoup(response.text, 'html.parser')

    # walk through all anchors
    for anchor in soup.find_all("a"):
        # extract the link URL from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new URL to the queue if it was neither enqueued nor processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)
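
Relative links can also be resolved with the standard library instead of string concatenation. The following is a hedged sketch rather than the script's own method: it assumes Python 3 and uses `urllib.parse.urljoin`, while `resolve_link` and the example.com URLs are made up for illustration.

```python
from urllib.parse import urljoin, urlsplit

def resolve_link(page_url, href):
    """Return an absolute http(s) URL for href, or None if it is not crawlable."""
    if not href:
        return None
    # urljoin handles '/x', 'x', '../x' and full URLs alike
    absolute = urljoin(page_url, href)
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None  # skip mailto: and other non-web schemes
    return absolute

# Hypothetical page URL used only for this example
page = "https://example.com/blog/post.html"
print(resolve_link(page, "/about"))
print(resolve_link(page, "next.html"))
print(resolve_link(page, "mailto:someone@example.com"))
```

Filtering on the scheme keeps links such as mailto: addresses out of the queue, so the crawler would not try to request them.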

As you can see, parsing emails in Python is a rather simple task.
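
The extraction step can be tried on its own, without crawling anything. The snippet below is a minimal sketch that applies the email pattern from the script to a made-up HTML string and writes the result to an in-memory buffer instead of a file; `extract_emails`, the sample markup and the example.com addresses are assumptions for illustration only.

```python
import csv
import io
import re

# The email pattern from the script, compiled once; re.I ignores case.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.I)

def extract_emails(html):
    """Return the unique email-like strings in html, sorted for stable output."""
    return sorted(set(EMAIL_RE.findall(html)))

# Made-up sample markup; the same address appears twice on purpose.
sample = ('<p>Write to <a href="mailto:info@example.com">info@example.com</a> '
          'or Sales@Example.org.</p>')
found = extract_emails(sample)
print(found)

# Write one email per row, here into a string buffer rather than a CSV file.
buffer = io.StringIO()
writer = csv.writer(buffer)
for email in found:
    writer.writerow([email])
print(buffer.getvalue())
```

Collecting matches in a set removes the duplicate that comes from the address appearing both in the href attribute and in the link text.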

If you have any questions about this tutorial, you can contact us at hello@cyberwhale.tech

Also, if you need assistance with data collection or any other digital service, please let us know.

Don’t forget to share the tutorial and visit us at https://cyberwhale.tech

PS. In the next tutorial we will discuss how to parse dynamic HTML content using Python.
