email content extraction

Postby ominousbarkingdog » Thu Sep 15, 2016 9:45 am

I've been trying to find a way to filter out and extract specific content from a series of emails, spanning over some years.

The emails all have the following characteristics in common:

Sender
First 5 words of the subject line

The text I want to extract always starts with the same string, and USUALLY (I'd guess between 90% and 97% of the time) ends on the second period after that string. I'd be willing to forgo the sections that have more periods if it lets me skip the rest of the useless text in the emails.
The text in between includes special characters. I'm not sure what the encoding is. If necessary, how do I find out?
The very last line of the section always starts with the same string, too.

The emails are HTML, and it's a Gmail account.
The emails are not stored locally.

Is there a way to do this?

Thank you in advance.

Re: email content extraction

Postby Ofnuts » Thu Sep 15, 2016 11:52 am

Use a regular expression, something like:
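Code:
import re

# Raw string so the backslashes reach the regex engine untouched.
# MARKER stands in for the real start string.
regex = re.compile(r'^(MARKER[^.]+\.([^.]+\.)?)')

tests = ['No marker', 'MARKER only', 'MARKER And dot.', 'MARKER and. two dots.', 'MARKER and. two dots. And more stuff.']

for s in tests:
    m = regex.match(s)
    if not m:
        print('*****', s)  # no marker, or no dot after it
    else:
        print('>>>>>', s, '->', m.group(1))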

Yields:
Code:
***** No marker
***** MARKER only
>>>>> MARKER And dot. -> MARKER And dot.
>>>>> MARKER and. two dots. -> MARKER and. two dots.
>>>>> MARKER and. two dots. And more stuff. -> MARKER and. two dots.
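In a real email body the marker will probably not sit at the very start of the string, so the anchored `match` above would find nothing. A minimal sketch of one way around that, assuming the body is already available as a plain `str`: drop the `^` and use `search`. The name `extract_section` and the sample text are made up for illustration, and `MARKER` is still a placeholder for the real start string.

```python
import re

# Same idea as the pattern above, but without the ^ anchor, so the marker
# may appear anywhere in the text. MARKER is a placeholder.
pattern = re.compile(r'MARKER[^.]+\.(?:[^.]+\.)?')

def extract_section(body):
    """Return the text from the marker up to its second dot, or None."""
    m = pattern.search(body)
    return m.group(0) if m else None

# Hypothetical body: preamble, then the section, then text to discard.
body = "Hello,\nsome preamble. MARKER first sentence. Second\nsentence. Trailing junk."
print(repr(extract_section(body)))
```

Because `[^.]` also matches line breaks, a section that wraps over several lines is still picked up. Sections with more than two periods get cut short at the second one, which is the trade-off accepted in the question.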

The regex in slow-mo:
  • ^: From the beginning
  • (: Open capture group (this will be group(1) in a match)
  • MARKER: The marker string (replace with actual)
  • [^.]+: Followed by one or more characters that are not dots (line breaks count too)
  • \.: And a dot
  • ([^.]+\.)?: And optionally (the "?"), another run of "one or more characters that are not dots, plus a dot"
  • ): Close the initial capture group
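On the encoding question: an email normally declares its charset in the `Content-Type` header of each part, and the standard `email` package can read it and decode the body for you. The following is only a sketch under assumptions: it takes the raw bytes of one message (fetching them from Gmail, for example over IMAP with the standard `imaplib` module, is left out), prefers the HTML part, and strips tags with `html.parser`. The helper names `TextExtractor`, `html_to_text` and `body_text`, the sample message, and the `MARKER` placeholder are all made up for illustration.

```python
import re
from email import message_from_bytes, policy
from email.message import EmailMessage
from html.parser import HTMLParser

# Unanchored variant of the pattern; MARKER is a placeholder.
pattern = re.compile(r'MARKER[^.]+\.(?:[^.]+\.)?')

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document and drop the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Text nodes are simply concatenated; no spacing is added between tags.
    return ''.join(parser.chunks)

def body_text(raw_bytes):
    """Return (declared charset, body as text) for one raw message."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=('html', 'plain'))
    if part is None:
        return None, None
    charset = part.get_content_charset()   # e.g. 'utf-8', or None if undeclared
    content = part.get_content()           # str, decoded using that charset
    if part.get_content_subtype() == 'html':
        content = html_to_text(content)
    return charset, content

# Hypothetical message built in memory so the sketch runs without a mailbox.
sample = EmailMessage()
sample['From'] = 'sender@example.com'
sample['Subject'] = 'Weekly report for the team'
sample.set_content(
    '<html><body><p>Intro.</p>'
    '<p>MARKER caf\u00e9 menu. Second sentence. Rest.</p></body></html>',
    subtype='html', charset='utf-8')

charset, text = body_text(sample.as_bytes())
found = pattern.search(text)
print(charset, '->', found.group(0) if found else None)
```

Filtering on the sender and the first words of the subject could be done on `msg['From']` and `msg['Subject']` before looking at the body at all.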