First python exercise

Postby Nevik34 » Fri Nov 01, 2013 10:40 pm

Hello everyone,

Upon learning about Python's uses in screen scraping, I decided to give it a shot on a conceptually simple task. My overall goal is to query a web page that returns search results and pull two specific pieces of information from each listing: the name of the item and the price at which it currently starts.

Now, I have run through multiple websites and guides and gotten the general Python syntax down (I believe), and have tried numerous methods of parsing. As far as I can tell, the page in question is what is called 'broken HTML' (so neither clean XML nor clean HTML). In addition, I know *exactly* where the information I want sits when I read the page source; however, I can't get the code to see it. Listed below is a quick example of what I currently have. I have resorted to using BeautifulSoup because of what I have read about its suitability for small data sets.

Code: Select all
import requests
result = requests.get("http://steamcommunity.com/market/search?q=demeter")
c = result.content

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(c)

table = soup.findAll('div', "market_listing_row market_recent_listing_row")
print(table)


This returns one page of listings for a particular item, including the section that holds the information we want. My question specifically is how to isolate the "Starting at" information from the following:
Code: Select all
<div class="market_listing_right_cell market_listing_num_listings">
<span>
<span class="market_listing_num_listings_qty">107</span>
<br />
            Starting at:<br />
            &#36;0.32         </span>
</div>


This probably seems rather simple to everyone but me, but I have gotten complete tunnel vision on this task and am beginning to bang my head on the desk.

Any help you can offer would be appreciated.
Regards,
Nevik34
Last edited by Yoriz on Fri Nov 01, 2013 11:14 pm, edited 1 time in total.
Reason: First post lock

Re: First python exercise

Postby Somelauw » Fri Nov 01, 2013 11:28 pm

Do you mean like this?

Code: Select all
>>> import bs4
>>> b = bs4.BeautifulSoup("""
... <div class="market_listing_right_cell market_listing_num_listings">
... <span>
... <span class="market_listing_num_listings_qty">107</span>
... <br />
...             Starting at:<br />
...             $0.32         </span>
... </div>
... """)
>>> b.text[b.text.index("$"):].strip()
u'$0.32'
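
The same idea can be sketched against the specific cell rather than the whole text. This is only a sketch, assuming Python 3 and the `bs4` package (BeautifulSoup 4) instead of the older `BeautifulSoup` import from the first post. The variable names and the regular expression are illustrative choices and do not come from the posts above. It looks the `div` up by one of its classes, then pulls out the quantity and the number after the dollar sign:

```python
import re

from bs4 import BeautifulSoup

html = """
<div class="market_listing_right_cell market_listing_num_listings">
<span>
<span class="market_listing_num_listings_qty">107</span>
<br />
            Starting at:<br />
            &#36;0.32         </span>
</div>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

# class_ matches an element when any one of its classes equals the given name
cell = soup.find("div", class_="market_listing_num_listings")
text = cell.get_text()  # the &#36; entity has already been decoded to "$"

# A dollar sign followed by digits, optionally with a decimal part
match = re.search(r"\$\s*(\d+(?:\.\d+)?)", text)
price = match.group(1) if match else None

quantity = cell.find("span", class_="market_listing_num_listings_qty").get_text().strip()

print(quantity)
print(price)
```

Searching inside the cell keeps the match from picking up a dollar sign elsewhere on the page, which a plain `index("$")` over the full page text could do.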

Re: First python exercise

Postby Nevik34 » Sat Nov 02, 2013 9:46 pm

Perhaps, but I'm not quite sure what you've done here. The output I am going for is something like

Item name
0.32

This is only the first step. Once I've figured out how to pull the information, I would like to use regular expressions to evaluate the item and price. If the price is below a certain amount, my next task will be to submit a form using Python to validate and purchase the item. Ultimately, I want to create something just shy of a bot that can purchase things for me based on price.
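
The "evaluate the price" step might look roughly like the sketch below, assuming Python 3. The function names `parse_price` and `is_bargain`, the sample text, and the 0.50 limit are all hypothetical and only there for illustration. The price is parsed into a `Decimal` so that comparing money values does not run into floating-point rounding:

```python
import re
from decimal import Decimal


def parse_price(text):
    """Return the first dollar amount found in text as a Decimal, or None."""
    match = re.search(r"\$\s*(\d+(?:\.\d{1,2})?)", text)
    return Decimal(match.group(1)) if match else None


def is_bargain(text, limit):
    """True when a price was found and it is strictly below limit."""
    price = parse_price(text)
    return price is not None and price < limit


# Sample text shaped like the contents of the "Starting at" cell
listing_text = "107\n\n            Starting at:\n            $0.32         "

print(parse_price(listing_text))
print(is_bargain(listing_text, Decimal("0.50")))
```

Returning `None` when no price is found lets the caller tell "no price on the page" apart from "price too high".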

But... baby steps. This is where I am now.

It seems you have passed in the block of tags that contains what I'm looking for and stripped text out of it up to a certain point. Can you explain the syntax a little? That's more or less what I'm trying to accomplish; I just don't understand it.

Regards,
Nevik34

Re: First python exercise

Postby Somelauw » Sat Nov 02, 2013 10:36 pm

Sure
Code: Select all
>>> # this is the full text
>>> b.text
u'... \n... \n... 107\n... \n...             Starting at:\n...             $0.32         \n... \n... '
>>> # It can be printed using print
>>> print(b.text)
...
...
... 107
...
...             Starting at:
...             $0.32         
...
>>> # We are only interested in the part after the dollar sign, so we first search for the dollar
>>> b.text.index("$")
68
>>> # We can use slicing to get only the text after position 68
>>> b.text[68:]
u'$0.32         \n... \n... '
>>> # strip() removes whitespace (spaces and newlines) from both ends of the text.
>>> b.text[68:].strip()
u'$0.32         \n... \n...'
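
The three steps above (`index`, slicing, `strip`) can also be tried on a plain string without BeautifulSoup. This is a minimal sketch assuming Python 3; the sample string is made up to resemble the cell text. In the session above, the string evidently contains literal `...` characters, presumably pasted in along with the interpreter prompts, which would explain why `strip()` leaves `\n... \n...` at the end: `strip()` only removes whitespace, and the dots are not whitespace.

```python
# Made-up sample resembling the text of the "Starting at" cell
text = "\n107\n\n            Starting at:\n            $0.32         \n"

position = text.index("$")  # index of the first "$"; raises ValueError if absent
tail = text[position:]      # everything from the "$" to the end of the string
price = tail.strip()        # remove whitespace from both ends

print(position)
print(repr(tail))
print(repr(price))
```

With only whitespace after the number, the final value is the bare `$0.32`, matching the result in the first reply.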

Re: First python exercise

Postby Nevik34 » Sun Nov 03, 2013 7:23 pm

Okay - taking what you've shown me, I've updated my code to get something close to my specifications. It builds a string containing all of the information we want and pulls substrings out of it:

Code: Select all
## Attempts to get the elements using BeautifulSoup

import requests
result = requests.get("http://steamcommunity.com/market/search?q=graphite awp")
c = result.content

from BeautifulSoup import BeautifulSoup
from bs4 import UnicodeDammit
#dammit = UnicodeDammit("Sacr\xc3\xa9 bleu!")
#print dammit.unicode_markup
soup = BeautifulSoup(c)
table = soup.findAll('div', "market_listing_row market_recent_listing_row")

for i in range(0,len(table)):
    b = unicode(table[i].text)
    start = b.index(";")
    finish = b.index("Counter")
    string = b[start+1:finish].strip()
    price = ""
    item = ""
    for j in range(0,len(string)):
        if(string[j].isdigit()):
            price = (price + string[j])
        elif (string[j] == "."):
            price = (price + string[j])
        elif string[j].isspace():
            item = (item + string[j])
        elif string[j] == "u'\u2122'":
            continue
        else:
            item = (item + string[j])
    print item
    print price


This results in:
Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)]
Type "help", "copyright", "credits" or "license" for more information.
[evaluate Parsing exercise.py]
AWP | Graphite (Factory New)
15.99
AWP | Graphite (Minimal Wear)
15.50
Traceback (most recent call last):
File "C:\Program Files (x86)\Wing IDE 101 4.1\src\debug\tserver\_sandbox.py", line 54, in <module>
File "C:\Program Files (x86)\Wing IDE 101 4.1\bin\2.7\src\debug\tserver\dbgutils.py", line 1499, in write
File "C:\Python27\Lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u2122' in position 8: character maps to <undefined>


I'm very close, as you can see; I just cannot currently handle the "TM" character that is part of the results on the page. To see what I am talking about, visit this page and scroll to the results near the bottom.

Any help with encoding, decoding, or skipping the text being parsed here would be appreciated.
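
Two things seem to be going on in the post above. First, the comparison `string[j] == "u'\u2122'"` checks a single character against a multi-character string literal, so it can never match; the intended value was probably the one-character `u'\u2122'`. Second, the traceback comes from `print`, because the cp437 console encoding has no trademark sign. A possible way around both, sketched for Python 3, is below. The function names and the sample listing text are hypothetical, and the real page text may be laid out differently:

```python
TRADEMARK = "\u2122"  # the trademark sign


def split_item_and_price(listing):
    """Split listing text into (item, price), dropping the trademark sign.

    Follows the loop in the post above: digits and dots go to the price,
    everything else goes to the item name.
    """
    cleaned = listing.replace(TRADEMARK, "")
    price = "".join(ch for ch in cleaned if ch.isdigit() or ch == ".")
    item = "".join(ch for ch in cleaned if not (ch.isdigit() or ch == ".")).strip()
    return item, price


def console_safe(text, encoding="cp437"):
    """Replace characters the target encoding cannot represent with '?'."""
    return text.encode(encoding, "replace").decode(encoding)


# Hypothetical listing text containing the trademark sign
sample = "StatTrak\u2122 AWP | Graphite (Factory New) 15.99"

item, price = split_item_and_price(sample)
print(item)
print(price)
print(console_safe("StatTrak\u2122 AWP"))
```

Removing the character up front is the simplest fix when the symbol is not needed. `console_safe` is the alternative when the name should stay intact in the data and only the printed form has to be adjusted. Note that an item name that itself contains digits or dots would be split incorrectly by this approach, just as in the original loop.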

Regards,
Nevik34