BeautifulSoup extract string from href


Post by metulburr » Mon Apr 22, 2013 3:52 pm

I am not sure why I am getting the whole <a> tag, as I have tried index['href'], index.text, and index.string. What is weird is that when I print the value I get the 1.png string I want, but when I pass it to urljoin to download it, somehow the <a> tag still shows up in the error, even though it did not in the print.

I have tried:
Code: Select all
u = urllib.parse.urljoin(url,index['href'])

and
Code: Select all
u = urllib.parse.urljoin(url,index.text)

and
Code: Select all
u = urllib.parse.urljoin(url,index.string)


Code: Select all
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import shutil
import os


url = 'http://littlealchemy.com/img/base/'
req = urllib.request.urlopen(url)
html = req.read().decode()

soup = BeautifulSoup(html)
print(html)
lister = soup.findAll('a')

for index in lister:
   try:
      u = urllib.parse.urljoin(url,index['href'])
      f = urllib.request.urlopen(u)
      print('downloading {}'.format(u))
      with open(index,'wb') as lf:
         shutil.copyfileobj(f,lf)
         
   except Exception as e:
      print(e)


But regardless, I get this output:
...
invalid file: <a href="95.png">95.png</a>
downloading http://littlealchemy.com/img/base/96.png
invalid file: <a href="96.png">96.png</a>
downloading http://littlealchemy.com/img/base/97.png
invalid file: <a href="97.png">97.png</a>
downloading http://littlealchemy.com/img/base/98.png
invalid file: <a href="98.png">98.png</a>
downloading http://littlealchemy.com/img/base/99.png
invalid file: <a href="99.png">99.png</a>
New Users, Read This
version Python 3.3.2 and 2.7.5, tkinter 8.5, pyqt 4.8.4, pygame 1.9.2 pre
OS Ubuntu 14.04, Arch Linux, Gentoo, Windows 7/8
https://github.com/metulburr

Post by stranac » Mon Apr 22, 2013 4:19 pm

This is the line that's raising the error:
Code: Select all
      with open(index,'wb') as lf:

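I don't think you wanted to pass index as the filename. It is the whole <a> Tag object, not a string, which is why the tag shows up in the error message. Use index['href'] there as well.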

Also, don't catch Exception, catch specific exceptions instead.
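
Both points can be sketched together. This is only a rough illustration, not the code anyone in this thread ran: download() and its dest_dir parameter are made-up names, and it takes the href as a plain string so it works without bs4. In the loop from the question you would call it as download(url, index['href']).

```python
import os
import shutil
import urllib.error
import urllib.parse
import urllib.request


def download(base_url, href, dest_dir='.'):
    """Fetch base_url + href and save it under the href's file name.

    Hypothetical helper: href is the string from index['href'],
    not index itself (the whole <a> Tag).
    """
    target = urllib.parse.urljoin(base_url, href)
    # Build the local file name from the URL path, e.g. '96.png'
    name = os.path.basename(urllib.parse.urlparse(target).path)
    filename = os.path.join(dest_dir, name)
    try:
        with urllib.request.urlopen(target) as response, \
                open(filename, 'wb') as out:
            shutil.copyfileobj(response, out)
    except (urllib.error.URLError, OSError) as exc:
        # URLError covers failed requests, OSError covers file problems
        print('could not download {}: {}'.format(target, exc))
        return None
    return filename
```

It returns the saved path, or None if the request or the write failed, so one bad link does not stop the rest of the loop.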

Post by snippsat » Mon Apr 22, 2013 5:04 pm

Code: Select all
lister = soup.findAll('a')

You are using the new BeautifulSoup (bs4), where it's called soup.find_all('a') (a better name, as it follows PEP 8 advice).

Here is something that downloads all the images to my HDD.
I use urlretrieve(), so I don't need shutil.
Also, lister is a terrible variable name ;)
Code: Select all
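import urllib.request
import urllib.parse
from bs4 import BeautifulSoup


url = 'http://littlealchemy.com/img/base/'
req = urllib.request.urlopen(url)
# Name the parser explicitly so bs4 does not have to pick one
soup = BeautifulSoup(req, 'html.parser')
links = soup.find_all('a')
for img_link in links:
    # Skip links that are not images
    if img_link['href'].endswith('.png'):
        # req.geturl() is the URL of the listing page that was opened
        img_name = '{}{}'.format(req.geturl(), img_link['href'])
        # Save the image under its own name in the current directory
        urllib.request.urlretrieve(img_name, img_link['href'])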
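
One note on building the image URL: gluing the base URL and the href together with format() works here because every href in this listing is a bare file name like 96.png. If a page had other kinds of links, urljoin (as in the first post) would be the safer choice. A small sketch to show the difference; the second and third hrefs are made up for illustration and are not from the site:

```python
import urllib.parse

base = 'http://littlealchemy.com/img/base/'

# Bare file name: urljoin and plain concatenation give the same URL
joined_plain = urllib.parse.urljoin(base, '96.png')
print(joined_plain)     # http://littlealchemy.com/img/base/96.png

# Root-relative href (hypothetical): urljoin replaces the whole path
joined_root = urllib.parse.urljoin(base, '/img/other/96.png')
print(joined_root)      # http://littlealchemy.com/img/other/96.png

# Absolute href (hypothetical): urljoin returns it unchanged
joined_abs = urllib.parse.urljoin(base, 'http://example.com/a.png')
print(joined_abs)       # http://example.com/a.png
```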

Post by metulburr » Mon Apr 22, 2013 6:39 pm

Oh, I am a dumbass, sending the index as the filename.

Yeah, it wasn't great code; it was literally to save time instead of manually downloading each one. I made a quick fix before you both posted and just used a for loop with range(400), since the filenames were integers, lol.

Actually, now on the subject: urlretrieve(). For some reason I thought that wouldn't grab images, which is why I used shutil. I was thinking it was mainly used to download zipped archives.

Post by snippsat » Mon Apr 22, 2013 7:43 pm

metulburr wrote: "Actually, now on the subject: urlretrieve(). For some reason I thought that wouldn't grab images"

urlretrieve() will grab any file type.
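
A quick way to convince yourself without touching the network: point urlretrieve() at a file:// URL holding arbitrary binary data and check that the copy is byte for byte the same. This is just a sketch; fetch(), source.bin and copy.bin are names made up for the demo.

```python
import os
import pathlib
import tempfile
import urllib.request


def fetch(url, dest_path):
    """Save url to dest_path with urlretrieve; return the size in bytes."""
    saved_path, _headers = urllib.request.urlretrieve(url, dest_path)
    return os.path.getsize(saved_path)


# Demo without the network: expose a binary file through a file:// URL
workdir = pathlib.Path(tempfile.mkdtemp())
source = workdir / 'source.bin'
source.write_bytes(bytes(range(256)) * 100)   # arbitrary binary payload
size = fetch(source.as_uri(), str(workdir / 'copy.bin'))
print(size)   # 25600
```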

The reason to use shutil.copyfileobj() is for downloading large files, because it copies in chunks (16 KB by default, see length=16*1024 in the signature) instead of reading the whole file into memory at once.
The source code of shutil.copyfileobj():
Code: Select all
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)

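You can write the same loop yourself with urllib.
But because most images are small, the chunk size is not important.
Code: Select all
import urllib.request

# Example values; any URL and local file name will do
url = 'http://littlealchemy.com/img/base/99.png'
filename = '99.png'
CHUNK = 16 * 1024
with urllib.request.urlopen(url) as req, open(filename, 'wb') as fp:
    while True:
        # Read at most CHUNK bytes; an empty result means end of data
        chunk = req.read(CHUNK)
        if not chunk:
            break
        fp.write(chunk)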
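
To see how the chunking behaves, here is the same loop wrapped in a function and run on an in-memory stream in place of a real download. It is only a sketch: copy_in_chunks() and the 40000-byte payload are made up for the demo, and io.BytesIO stands in for the response object.

```python
import io


def copy_in_chunks(src, dst, chunk_size=16 * 1024):
    """Copy src to dst chunk by chunk; return (bytes_copied, read_calls)."""
    total = 0
    reads = 0
    while True:
        chunk = src.read(chunk_size)
        reads += 1
        if not chunk:
            # An empty read signals the end of the stream
            break
        dst.write(chunk)
        total += len(chunk)
    return total, reads


payload = b'x' * 40000        # stand-in for a downloaded image
source = io.BytesIO(payload)
target = io.BytesIO()
copied, reads = copy_in_chunks(source, target)
print(copied, reads)          # 40000 4
```

Three reads carry data (16384 + 16384 + 7232 bytes) and the fourth comes back empty and ends the loop, so at most one chunk is held in memory at a time.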