lxml.html confusion

Postby metulburr » Fri Apr 18, 2014 8:16 pm

This is a spin-off of someone else's question, but specifically about lxml.html, as I am now confused about how to acquire the info needed with this library.

the website
http://m.funtweets.com/random

the code i came up with:
Code: Select all
import lxml.html

html = lxml.html.parse('http://m.funtweets.com/random')
users = html.xpath('//div[@class="tweet"]/a[@class="tweet-user-link"]')

for user in users:
    print(user.text_content())


the html (i think)
Code: Select all
<div class="tweet-text">
            <a class="name" href="http://funtweets.com/u/MikeJeffJordan"><span>@</span>MikeJeffJordan</a><br>
            My local grocery store has a special deal going on at the self scan aisle, buy one get like 30 free.
         </div>


I am not sure how to acquire the text that the user says. Also, how would you acquire not just the first post, but the first few posts (without re-parsing the website 3 times to get them)?

I have tried xpath strings:
Code: Select all
//div[@class="tweet-text"]

Code: Select all
//div[@class="tweet"]/div[@class="tweet-text"]

Code: Select all
//div[@class="tweet"]/[@class="tweet-text"]

Code: Select all
//div[@class="tweet"]/a[@class="tweet-user-link"]

Code: Select all
//div[@class="tweet"]

with the last one being the only way to grab the actual content, but I can only grab the username and the content together, not just the content.

Re: lxml.html confusion

Postby stranac » Fri Apr 18, 2014 8:29 pm

Did you check that the html you're getting is the html you expected?
The site seems to be giving different html to my browser and to lxml (probably user agent based)

Re: lxml.html confusion

Postby metulburr » Fri Apr 18, 2014 8:37 pm

Did you check that the html you're getting is the html you expected?

it looks to be what i expected

Re: lxml.html confusion

Postby stranac » Fri Apr 18, 2014 8:50 pm

This works for me (html I get by lxml.html.parse doesn't have "tweet-text" divs, so using requests with user agent):
Code: Select all
>>> import requests
>>> import lxml.html
>>> r = requests.get('http://funtweets.com/random', headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0'})
>>> doc = lxml.html.fromstring(r.content)
>>> for tweet in doc.xpath('//div[@class="tweet-text"]'):
...     ''.join(tweet.xpath('./text()')).strip()
...
"Hey Bradley Cooper's eyes: the most beautiful sky imaginable called - it wants it's color back"
'My clients have a 86% survival rate, which makes me an above-average babysitter.'
'I bought my kids electric toothbrushes because it was taking too long to splatter toothpaste all over the bathroom w/the regular toothbrush.'
"let's head over to the barber shop and make hair angels on the floor"
'Whenever someone holds my baby & he makes even a tiny peep, I yell "WHAT DID YOU DO TO MY BABY WHY ARE YOU BURNING MY BABY!?"'
"Based on the novel 'Push Notifications' by iSapphire"

Re: lxml.html confusion

Postby igotapochahontas » Fri Apr 18, 2014 11:35 pm

That looks like it would work... except I don't have the "requests" module. I do have a version of lxml that might work, but I can't import requests, obviously... and this is why I've been working on this script for over a week...
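
If requests isn't installed, the same User-Agent header can probably be sent with the standard library instead. A minimal sketch, assuming Python 3: build_request and fetch_html are made-up helper names, and the User-Agent string is simply the one from the requests example. Nothing is downloaded here; the snippet only builds the request and shows the header it would send.

```python
import urllib.request

# Assumption: the same browser-like User-Agent string as in the requests example.
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; rv:28.0) Gecko/20100101 Firefox/28.0'


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that identifies itself with the given User-Agent."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_html(url, user_agent=USER_AGENT, timeout=10):
    """Download the page body as bytes (not called here; needs network access)."""
    with urllib.request.urlopen(build_request(url, user_agent), timeout=timeout) as resp:
        return resp.read()


req = build_request('http://funtweets.com/random')
print(req.get_header('User-agent'))  # urllib stores the header name as 'User-agent'
```

The bytes returned by fetch_html could then be handed to lxml.html.fromstring, the same way r.content is used above.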

Re: lxml.html confusion

Postby stranac » Fri Apr 18, 2014 11:54 pm

You shouldn't need requests actually.
I just did some testing again, and it seems lxml gets correct html now.
I have no idea what was going on when I originally tested...

So yeah, if you have lxml, this should work.
If not, xml.etree.ElementTree has a pretty similar API.
I'm pretty sure BeautifulSoup can do it too, but I consider BS API insanely bad.
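
To illustrate the ElementTree remark: elements there also carry .text and .tail, so the text after the link can be read the same way. A rough sketch on a made-up, well-formed snippet (the username and tweet are invented). One caveat worth stating: ElementTree only parses well-formed XML, so it may fail on a real-world HTML page that lxml.html would accept.

```python
import xml.etree.ElementTree as ET

# Made-up sample shaped like one tweet block; it must be well-formed XML.
SAMPLE = """<div class="tweet">
  <a href="#" class="tweet-user-link"><b><span>@</span>example_user</b></a> Example tweet text.
</div>"""

div = ET.fromstring(SAMPLE)
link = div.find('a')                      # first <a> child of the div

user = ''.join(link.itertext()).strip()   # all text inside the link
text = (link.tail or '').strip()          # text that follows </a>

print(user)
print(text)
```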

Re: lxml.html confusion

Postby snippsat » Sat Apr 19, 2014 12:34 am

Yes, it works fine with lxml; just to show another way with xpath.
'//div[@class="tweet-text"]' also works fine.
Code: Select all
from lxml.html import parse

html = parse('http://m.funtweets.com/random')
for tweet in range(3, 11):
    user = html.xpath('//html/body/div[{}]/a[1]/b/text()'.format(tweet))
    text = html.xpath('//html/body/div[{}]/text()'.format(tweet))
    print('@{}\n{}\n'.format(''.join(user).strip(), ''.join(text).strip()))

Output:
Code: Select all
@Trick_or_tweet
I need to do just one more beheading & this will be the best New Year's revolution, ever!

@FilthyRichmond
I know they don't recommend ibuprofen during pregnancy but I needed something for the hangovers.

@TequilaTears
I just had an AMAZING salad at McDonalds. The toppings I chose were 4 big macs & 10 chicken mc nuggets with 9 sweet & sour packs as dressing

@sbellelauren
what do you mean you can't deliver pizza to a pillow fort

@Kyle_Lippert
I just saw an Asian chick with big boobs and a booty. I took a pic so if any of you have Mythbuster's email hit me up.

@edjusttweeted
I think we should find time today to send a friend request to Myspace Tom on Facebook; he was there for us when we didn't have any friends.

@gotmyhairdid
Oh vajazzled is definitely going on my bucket list. I'll pity the fool that has to jazzle my vag.

@robdelaney
Just saw a great panel at Comic-Con, “How to Talk to a Human Woman.”

Re: lxml.html confusion

Postby igotapochahontas » Sat Apr 19, 2014 1:55 am

Tried it and it said "xml is not defined". So frustrating. Note: I used
import xml.etree.cElementTree as etree because the lxml documentation said to do that to use lxml. Annoying......

Re: lxml.html confusion

Postby snippsat » Sat Apr 19, 2014 11:12 am

Tried it and it said "xml is not defined". So frustrating. Note: I used
import xml.etree.cElementTree as etree because the lxml documentation said to do that to use lxml. Annoying......

Not checked, but as said in the other post, I doubt that lxml works on Android (SL4A).
There are of course more ways to solve this; here is a madder and not advisable way with regex (because HTML and regex are not best friends):
Code: Select all
import re
import urllib2

page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
user = re.compile(r'<span>@</span>(\w+)')
text = re.compile(r"</b></a> (\w.*)")
user_lst =[match.group(1) for match in re.finditer(user, page)]
text_lst =[match.group(1) for match in re.finditer(text, page)]
for _user, _text in zip(user_lst, text_lst):
    print '@{}\n{}\n'.format(_user,_text)

Output:
Code: Select all
@bazecraze
Quentin Tarantino always looks like he walked through a car wash.

@TheBosha
My Memory Lane is now mostly traffic cones.

@kellyoxford
The way you feel when your phone dies is exactly how Cinderella felt at midnight.

@thesulk
I go to the gym so infrequently that I still call it the James.

@shanethevein
I don't know what's more disturbing? My son reading a billboard that says "LIVE NUDE GIRLS" or him asking if there's dead ones.

@rik_ee_rik
I have beiber fever; every time i hear about him i get sick.

@ruthakers
Man outside walmart is asking for donations for the drug and alcohol outreach program You mean there's people who don't have access to them?

@Im_Tricia
I wish making friends didn't involve talking to strangers.
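
The same regex idea can be sketched in Python 3 against an inline sample instead of the live site. The markup below is made up to resemble the page. One pattern captures the name and the text together, because two separate patterns joined with zip may drift out of step when a tweet does not start with a word character (the (\w.*) pattern would skip it). It is still a regex hack: text containing a "<" would be cut short.

```python
import re

# Made-up markup in the shape of the mobile page's tweet blocks.
PAGE = (
    '<div class="tweet"><a href="#" class="tweet-user-link">'
    '<b><span>@</span>first_user</b></a> "Quoted" tweets start with punctuation.\n</div>\n'
    '<div class="tweet"><a href="#" class="tweet-user-link">'
    '<b><span>@</span>second_user</b></a> A plain tweet.\n</div>\n'
)

# One pattern captures the name and the text that follows it, so they stay paired.
TWEET = re.compile(r'<span>@</span>(\w+)</b></a>\s*([^<]+)')

for user, text in TWEET.findall(PAGE):
    print('@{0}\n{1}\n'.format(user, text.strip()))
```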

Re: lxml.html confusion

Postby igotapochahontas » Mon Apr 21, 2014 5:01 pm

Hahaha. This happened:
Code: Select all
 ValueError: zero length field name in format
 

Wow. How about saving the soup to a string and then slicing it? I will post that as a separate
forum topic later unless that's a bad idea. Note: when I try to save the soup as a
string and then slice the string, I get told it's an "unhashable type". Lol. I suck at this.

Re: lxml.html confusion

Postby stranac » Mon Apr 21, 2014 5:30 pm

igotapochahontas wrote:
Code: Select all
 ValueError: zero length field name in format
 

That means you're running a version of Python older than 2.7, where format fields have to be numbered explicitly.
Try changing that last line to:
Code: Select all
    print '@{0}\n{1}\n'.format(_user,_text)

Also, post entire error tracebacks, not just one line from them.
igotapochahontas wrote:Wow. How about saving the soup to a string and then slicing it? I will post that as a separate
forum topic later unless that's a bad idea.

Awful idea.

Re: lxml.html confusion

Postby igotapochahontas » Tue Apr 22, 2014 12:03 am

hey, that worked! Thanks y'all. What should I read up on to understand why that worked?
I'm extremely new to this so I know next to nothing about python.

Re: lxml.html confusion

Postby stranac » Tue Apr 22, 2014 1:24 am

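This I guess: http://docs.python.org/2.7/whatsnew/2.7.html
It's just one of those things that was added in newer versions of Python: automatically numbered {} fields in str.format() arrived in 2.7, so older versions need explicit indices like {0}.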
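
A small sketch of the difference, using made-up values for the username and text. On Python 2.7+ or 3.1+ both spellings give the same string; on 2.6 only the explicitly numbered one is accepted.

```python
user, text = 'example_user', 'Example tweet text.'

# Explicit indices: accepted by Python 2.6 as well as newer versions.
explicit = '@{0}\n{1}\n'.format(user, text)

# Automatic numbering: needs Python 2.7+ or 3.1+; on 2.6 this line raises
# "ValueError: zero length field name in format".
automatic = '@{}\n{}\n'.format(user, text)

print(explicit == automatic)
```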

Re: lxml.html confusion

Postby 7stud » Tue Apr 22, 2014 2:26 am

I am not sure how to acquire the text that the user says?


Here is the html I'm seeing:

<div class="tweet">
<a href="http://m.funtweets.com/u/teen_news69" class="tweet-user-link"><img src="http://pbs.twimg.com/profile_images/419918079557386241/NdbRlmep_normal.jpeg"><b><span>@</span>teen_news69</b></a> PISSED: teen gets fed up with teacher "can i use the bathroom?" "i don't know, CAN you?" *takes deep breath* *pisses all over teachers desk*


Rule 1: All text in an html document (including whitespace, e.g. a newline after the end of a tag) is contained in a "Text Node", which means it's as if the text is inside an invisible <text> tag.
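
Rule 1 can be checked offline with the standard library's xml.dom.minidom, which exposes text nodes explicitly. This is only an illustrative sketch on a made-up, well-formed snippet shaped like the html above (the username and tweet are invented, and the <img> tag is left out); it lists each child of the div and whether it is a text node or an element.

```python
from xml.dom import minidom

# Made-up, well-formed sample shaped like the html above.
SAMPLE = """<div class="tweet">
<a href="#" class="tweet-user-link"><b><span>@</span>example_user</b></a> Example tweet text.
</div>"""

div = minidom.parseString(SAMPLE).documentElement

# List every child of the div and say whether it is text or an element.
for index, node in enumerate(div.childNodes):
    kind = 'text' if node.nodeType == node.TEXT_NODE else 'element'
    print(index, kind, repr(node.toxml()))

user_said = div.childNodes[2].data.strip()   # third child: the tweet text
print(user_said)
```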

The text you are after is contained inside the 'tweet' <div>. The <a> tag is the first child of the 'tweet' <div>... Wrong! There is whitespace after the opening 'tweet' <div> tag, which is contained inside an invisible <text> tag, and that <text> tag is the first child of the 'tweet' <div>. The second child is the <a> tag, and the third child is the invisible <text> tag that contains the text that the user said. So, you are after the third child of the 'tweet' <div>:

Code: Select all
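from bs4 import BeautifulSoup
import urllib.request as ur

html = ur.urlopen("http://m.funtweets.com/random").read()
soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly to avoid a bs4 warning

tweet_divs = soup.find_all('div', attrs={'class': 'tweet'})

for div in tweet_divs:
    children = div.contents   # [whitespace text, <a> tag, tweet text]
    user_said = children[2]   # third child: the text after the <a> tag
    print(user_said.string)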

--output:--
 Born on February 29th of a leap year, I can't legally drink till I'm 84.
         
 You're about as useful as closed captioning in a porno.
         
 I get my exercise by running and jumping over the light beam before my garage door closes on me.
         
 Hey movie villains - make a bomb where the wires are all one color.
         
 Now that Justin Timberlake is married he might as well take sexy back and exchange it for some sweatpants and a recliner.
         
 E-cigs are fedoras for your mouth
         
 I wish there were a way to find hot singles in my area.
         
 "This movie is so awful & unfunny I refuse to air it EVEN on a Saturday afternoon."-  No one at Comedy Central ever




I'm pretty sure BeautifulSoup can do it too, but I consider BS API insanely bad.

Not I. And BeautifulSoup has good docs. And lxml can be a real bear to install.

...but specifically about lxml.html

Well, then let's use some lxml.html methods rather than xpath(which is not specific to lxml.html):

Code: Select all
import lxml.html

tree = lxml.html.parse('http://m.funtweets.com/random')
html_tag = tree.getroot()

tweet_divs = html_tag.cssselect('div.tweet')  #target divs have the attribute: class="tweet"

for div in tweet_divs:
    print(div[0].tail)  #lxml has no separate Text Nodes: text is stored in the .text and .tail
                        #attributes of elements, so the whitespace after <div> ends up in div.text.
                        #A tag can be treated as a list of its child elements.
                        #The first child of the div tag, div[0], is the <a> tag.
                        #tail gives the text after the end of a tag, up to the start of the next tag.
                       

--output:--
 90% of the economy is just women giving each other useless gifts.
         
 Caterpillars: Neither cats NOR pillars.
         
 I think the world of you! (Polluted, poor, generally prone to disaster.)
         
 If you're in Los Angeles and lost your wallet near the Starbucks on Melrose I found your wallet but not the $58 inside it.

...

Or, you could go straight to the <a> tag and get its tail.
Or, you could use a combination of both(which makes it more likely you'll get the correct <a> tags):
Code: Select all
tweet_links = html_tag.cssselect('div.tweet a.tweet-user-link')  #<a> tags with class="tweet-user-link", which are inside
                                                                 #divs with class="tweet"

for link in tweet_links:
    print(link.tail)
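
Putting the thread's pieces together for lxml: one parse, one xpath call, then the username from text_content() and the tweet from .tail, which also covers the original question about getting the first few posts without re-parsing. This is a sketch against made-up inline markup shaped like the html quoted above (the real page may differ or may have changed); extract_tweets and its limit parameter are hypothetical names.

```python
import lxml.html

# Made-up markup shaped like the page's tweet blocks.
SAMPLE = """
<html><body>
<div class="tweet">
  <a href="#" class="tweet-user-link"><b><span>@</span>first_user</b></a> First made-up tweet.
</div>
<div class="tweet">
  <a href="#" class="tweet-user-link"><b><span>@</span>second_user</b></a> Second made-up tweet.
</div>
</body></html>
"""


def extract_tweets(markup, limit=None):
    """Return (username, tweet text) pairs from a single parse of the markup."""
    doc = lxml.html.fromstring(markup)
    links = doc.xpath('//div[@class="tweet"]/a[@class="tweet-user-link"]')
    pairs = []
    for link in links[:limit]:            # limit=None keeps every match
        user = link.text_content().strip()
        text = (link.tail or '').strip()  # tail can be None when nothing follows </a>
        pairs.append((user, text))
    return pairs


for user, text in extract_tweets(SAMPLE, limit=2):
    print('{0}\n{1}\n'.format(user, text))
```

Against the live page, the same xpath could be run on the root returned by lxml.html.parse(...).getroot(), as in the earlier posts.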

Re: lxml.html confusion

Postby igotapochahontas » Mon Jul 07, 2014 6:23 am

None of these examples are working on my phone. The way that was working doesn't work anymore either. Annoying...