Post by ominousbarkingdog » Thu Sep 15, 2016 9:45 am

I've been trying to find a way to filter out and extract specific content from a series of emails, spanning over some years.

The emails all have the following characteristics in common:

First 5 words of the subject line

The text I want to extract always starts with the same string, and USUALLY (I'd guess between 90% and 97% of the time) ends on the second period after the string starts. I'd be willing to forgo the sections that have more periods if I can skip the rest of the useless text in the emails.
The text in between includes special characters. I'm not sure what the encoding is. If necessary, how do I find out?
The very last line of the section always starts with the same string, too.

The emails are HTML, and it's a Gmail account.
The emails are not stored locally.

Is there a way to do this?

Thank you in advance.
Post by Ofnuts » Thu Sep 15, 2016 11:52 am

Use a regular expression, something like:
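import re

# Marker, text up to the first dot, and optionally text up to a second dot
pattern=r'^(MARKER[^.]+\.([^.]+\.)?)'

tests=['No marker','MARKER only','MARKER And dot.','MARKER and. two dots.','MARKER and. two dots. And more stuff.']

for s in tests:
    m=re.match(pattern,s)
    if not m:
        print('***** '+s)  # no match
    else:
        print('>>>>> '+s+' -> '+m.group(1))

Output: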
***** No marker
***** MARKER only
>>>>> MARKER And dot. -> MARKER And dot.
>>>>> MARKER and. two dots. -> MARKER and. two dots.
>>>>> MARKER and. two dots. And more stuff. -> MARKER and. two dots.

The regex in slow-mo:
  • ^: From the beginning
  • (: Open capture group (this will be group(1) in a match)
  • MARKER: The marker string (replace with actual)
  • [^.]+: Followed by any non-zero number of characters that are not dots
  • \.: And a dot
  • ([^.]+\.)?: And optionally (the "?"), another set of "non-zero number of characters that are not dots plus a dot"
  • ): Close the initial capture group
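
In a real email the marker will rarely be at the very start of the body, so the ^ anchor with re.match would miss it, and the body has to be decoded before the regex can see the special characters. Here is a rough sketch of both steps, assuming each message is already available as raw bytes (getting them out of Gmail is not covered here). The helper names decode_html_part and extract_section, the sample message, and the utf-8 fallback are all made up for illustration; 'MARKER' still stands in for the real start string.

```python
import email
import re
from email.mime.text import MIMEText


def decode_html_part(raw_bytes):
    """Return (charset, text) for the first text/html part, or (None, None)."""
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_type() == 'text/html':
            # The declared charset answers "what is the encoding?";
            # falling back to utf-8 is an assumption, not a rule.
            charset = part.get_content_charset() or 'utf-8'
            payload = part.get_payload(decode=True)  # bytes, transfer-decoded
            return charset, payload.decode(charset, errors='replace')
    return None, None


def extract_section(text, marker):
    """Return marker plus text up to the second dot (or the first, if only one)."""
    pattern = re.escape(marker) + r'[^.]+\.(?:[^.]+\.)?'
    m = re.search(pattern, text)  # search, so the marker may sit mid-body
    return m.group(0) if m else None


# Made-up message standing in for one downloaded email.
sample = MIMEText(u'<p>Intro. MARKER caf\u00e9 one. Part two. Footer.</p>',
                  'html', 'utf-8')
charset, text = decode_html_part(sample.as_bytes())
section = extract_section(text, 'MARKER')
print(charset)
print(section)
```

Two caveats: any HTML tags inside the section stay in the extracted text, and every dot counts, so a decimal number or an abbreviation inside the section will end the match early.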
