In this tutorial we will see how to extract email addresses from HTML pages using Python.
Python is a scripting language that is easy to get started with and is well suited to tasks like parsing emails.
So let’s outline how the parsing works:
- Initialize a queue of URLs. The first item will be the initial URL.
- Initialize a set of already visited URLs to avoid repetitions.
- Start parsing the current URL from the queue.
- Add the URL to processed URLs.
- Fetch the whole HTML and search it for an email pattern using a regex.
- If one or more emails were found, write them to a CSV file.
- Loop through the <a> tags found on the page.
- Check if each URL is relative or absolute.
- Check if the URL is already in the processed URLs set. If not, add it to the processing queue.
- Repeat from the third step (parsing the next URL from the queue) until the queue is empty.
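The steps above can be tried out without touching the network. What follows is a minimal sketch, assuming a hypothetical in-memory dictionary `PAGES` that maps URLs to HTML strings in place of real HTTP requests, and a simple regex for `href` attributes in place of an HTML parser. The names `PAGES`, `crawl`, `EMAIL_RE`, `HREF_RE` and the example.com addresses are made up for illustration.

```python
import re
from collections import deque
from urllib.parse import urljoin

# Hypothetical stand-in for the web: URL -> HTML (no real requests are made).
PAGES = {
    "http://example.com/":
        '<a href="/team">Team</a> <a href="about.html">About</a> info@example.com',
    "http://example.com/team":
        '<a href="/">Home</a> alice@example.com, bob@example.com',
    "http://example.com/about.html":
        '<a href="http://example.com/team">Team</a> info@example.com',
}

EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.I)
HREF_RE = re.compile(r'href="([^"]*)"', re.I)


def crawl(start_url, pages):
    """Breadth-first walk over `pages`, collecting emails along the way."""
    queue = deque([start_url])   # URLs waiting to be processed
    processed = set()            # URLs already visited
    emails = set()               # every email found so far

    while queue:
        url = queue.popleft()    # take the next URL from the front
        processed.add(url)

        html = pages.get(url)
        if html is None:         # unknown page: skip it, like a failed request
            continue

        emails.update(EMAIL_RE.findall(html))

        for href in HREF_RE.findall(html):
            link = urljoin(url, href)  # turns relative links into absolute ones
            if link not in processed and link not in queue:
                queue.append(link)

    return emails, processed


emails, processed = crawl("http://example.com/", PAGES)
print(sorted(emails))
```

The queue gives a breadth-first order, and the set of processed URLs is what stops the loop from revisiting pages that link back to each other.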
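Before launching the script, don’t forget to install the required libraries.
From the command line, run:
pip install requests
pip install beautifulsoup4
The URL-parsing module (urlparse in Python 2, urllib.parse in Python 3) and csv are part of the Python standard library, so they do not need to be installed with pip.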
Once the libraries are installed, you can run the script:
from bs4 import BeautifulSoup
import requests
import requests.exceptions
from urllib.parse import urlsplit
from collections import deque
import re
import csv

# initialize the CSV writer and filename
cw = csv.writer(open("Singa.csv", "a", newline=""), delimiter=",")

# a queue of URLs to crawl, seeded with the start URL
new_urls = deque(['https://foundersgrid.com/50-singapore-startups/'])

# a set of URLs that we have already crawled
processed_urls = set()

# a set of crawled emails
emails = set()

# process URLs one by one until we exhaust the queue
while len(new_urls):
    # take the next URL from the front of the queue
    url = new_urls.popleft()
    # mark it as visited by adding it to the processed URLs
    processed_urls.add(url)

    # extract the base URL and path to resolve relative links
    parts = urlsplit(url)
    base_url = "{0.scheme}://{0.netloc}".format(parts)
    path = url[:url.rfind('/')+1] if '/' in parts.path else url

    # get the URL's content, handling exceptions if any
    try:
        response = requests.get(url)
    except (requests.exceptions.MissingSchema, requests.exceptions.ConnectionError):
        # skip pages with errors
        continue

    # extract all email addresses and add them to the resulting set
    new_emails = set(re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", response.text, re.I))
    emails.update(new_emails)
    print(new_emails)

    # write the new emails to CSV
    # alternatively, you can write the emails set to CSV after parsing
    for em in new_emails:
        cw.writerow([em])

    # create a Beautiful Soup object as a representation of the HTML page
    soup = BeautifulSoup(response.text, "html.parser")

    # walk through all anchors
    for anchor in soup.find_all("a"):
        # extract the link URL from the anchor
        link = anchor.attrs["href"] if "href" in anchor.attrs else ''
        # resolve relative links
        if link.startswith('/'):
            link = base_url + link
        elif not link.startswith('http'):
            link = path + link
        # add the new URL to the queue if it was neither enqueued nor processed yet
        if link not in new_urls and link not in processed_urls:
            new_urls.append(link)
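The script resolves relative links by hand with string concatenation. As an alternative sketch (not what the script above does), the standard library function `urllib.parse.urljoin` covers the same cases and also handles `../` segments, while `urldefrag` strips `#fragment` parts so the same page is not queued twice. The helper name `resolve` and the sample URLs are assumptions for illustration, and the sketch targets Python 3.

```python
from urllib.parse import urljoin, urldefrag


def resolve(page_url, href):
    """Return an absolute http(s) URL for href, or None if it should be skipped."""
    # urljoin resolves href against the page it was found on;
    # urldefrag then drops any "#fragment" suffix.
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    # Skip links such as mailto: or javascript: that cannot be crawled.
    if not absolute.startswith(("http://", "https://")):
        return None
    return absolute


page = "https://example.com/blog/post.html"
print(resolve(page, "/contact"))    # https://example.com/contact
print(resolve(page, "team.html"))   # https://example.com/blog/team.html
print(resolve(page, "../about"))    # https://example.com/about
print(resolve(page, "mailto:someone@example.com"))  # None
```

Inside the crawl loop, a `None` result would simply mean the link is not added to the queue.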
As you can see, parsing emails in Python is a rather simple task.
If you have any questions on this tutorial, you can contact us at [email protected]
Also, if you need assistance with data collection or any other digital service, please let us know.
Don’t forget to share the tutorial and visit us at https://cyberwhale.tech
PS. In the next tutorial we will discuss how to parse dynamic HTML content using Python.