read whole website (i.e. not just one webpage) urllib2

Post by nico82 » Fri Apr 05, 2013 12:27 pm

Hello all!

I am trying to write a Python program that reads a whole website.
Searching Google, I only found explanations of how to read a single webpage with urllib2.
Here is my code:

Code:
import urllib2
# Build an opener that sends a browser-like User-agent header
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
# Fetch a single page and read its HTML source into a string
infile = opener.open('http://ru.wikipedia.org/wiki/%D0%97%D0%B0%D0%B2%D0%BE%D0%B4_%D0%B8%D0%BC%D0%B5%D0%BD%D0%B8_%D0%9C%D0%B0%D0%BB%D1%8B%D1%88%D0%B5%D0%B2%D0%B0')
page = infile.read()


Now, if I want to read the whole of Wikipedia, for example, how should I proceed?

Not only http://en.wikipedia.org, but every webpage whose address starts with http://en.wikipedia.org/blablabla....
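
Put differently, this is a crawl: start from one page, collect its links, and keep following only those whose address starts with a given prefix. Here is a minimal sketch of that idea, assuming Python 3 (where urllib.request takes the place of urllib2). The names LinkCollector, fetch_page and crawl, and the max_pages cap, are made up for illustration; this is one possible approach, not a finished crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import Request, urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_page(url):
    """Download one page, sending a browser-like User-agent header."""
    request = Request(url, headers={"User-agent": "Mozilla/5.0"})
    with urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, prefix, fetch=fetch_page, max_pages=50):
    """Breadth-first crawl that only follows URLs starting with `prefix`."""
    pages = {}                    # url -> HTML source
    queue = deque([start_url])    # URLs waiting to be fetched
    seen = {start_url}            # URLs already queued, to avoid loops
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch(url)
        except OSError:           # urllib's URLError/HTTPError derive from OSError
            continue
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            # Resolve relative links and drop any "#fragment" part
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute.startswith(prefix) and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```

A call such as crawl('http://en.wikipedia.org/wiki/Main_Page', 'http://en.wikipedia.org/') would return a dict mapping each visited address to its HTML source. The max_pages cap is there because a site the size of Wikipedia cannot realistically be read one request at a time.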

Thanks a lot for your attention, everyone!
nico82
 
Posts: 2
Joined: Fri Apr 05, 2013 12:13 pm

Re: read whole website (i.e. not just one webpage) urllib2

Post by setrofim » Fri Apr 05, 2013 1:28 pm

Try Scrapy.
setrofim
 
Posts: 288
Joined: Mon Mar 04, 2013 7:52 pm

Re: read whole website (i.e. not just one webpage) urllib2

Post by micseydel » Fri Apr 05, 2013 9:36 pm

I use mechanize and lxml for my scraping. setrofim, do you have a Scrapy tutorial you recommend? I remember briefly checking it out and being excited about it, then getting massively confused and abandoning it.
Join the #python-forum IRC channel on irc.freenode.net!

Please do not PM members regarding questions which are meant to be discussed publicly. The point of the forum is so that others can benefit from it. We don't want to help you over PMs or emails.
micseydel
 
Posts: 1443
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA

Re: read whole website (i.e. not just one webpage) urllib2

Post by setrofim » Sat Apr 06, 2013 2:14 pm

micseydel wrote: setrofim, do you have a Scrapy tutorial you recommend?

Couldn't find anything I'd recommend by Googling, so I wrote one. Let me know if it makes sense, or if I should add/change something.
setrofim
 
Posts: 288
Joined: Mon Mar 04, 2013 7:52 pm

Re: read whole website (i.e. not just one webpage) urllib2

Post by micseydel » Sun Apr 07, 2013 1:12 am

Holy crap! Kudos, and thanks, setrofim! If I weren't insanely busy I'd look it over right this minute.
micseydel
 
Posts: 1443
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA

Re: read whole website (i.e. not just one webpage) urllib2

Post by nico82 » Tue Apr 16, 2013 2:01 pm

OK, in the end I just decided to read the HTML source code and pick out the HTML parts from it. That was enough for what I wanted to do.
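
For anyone landing here later, picking parts out of HTML source can be done with the standard library's html.parser. The sketch below is only an illustration in Python 3: it assumes the parts of interest are headings and paragraphs, which is a guess and not necessarily what was extracted in this case, and PartExtractor and extract_parts are made-up names.

```python
from html.parser import HTMLParser


class PartExtractor(HTMLParser):
    """Collects the text inside each occurrence of the wanted tags."""

    def __init__(self, wanted=("h1", "h2", "p")):
        super().__init__()
        self.wanted = set(wanted)
        self.parts = []         # (tag, text) pairs in document order
        self._open_tag = None   # wanted tag currently being read, if any
        self._chunks = []       # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        # Start recording only at the outermost wanted tag
        if self._open_tag is None and tag in self.wanted:
            self._open_tag = tag
            self._chunks = []

    def handle_data(self, data):
        if self._open_tag is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == self._open_tag:
            # Join the pieces and collapse runs of whitespace
            text = " ".join("".join(self._chunks).split())
            self.parts.append((tag, text))
            self._open_tag = None


def extract_parts(html, wanted=("h1", "h2", "p")):
    """Return a list of (tag, text) pairs for the wanted tags in `html`."""
    parser = PartExtractor(wanted)
    parser.feed(html)
    parser.close()
    return parser.parts
```

Feeding it the string read from a page, e.g. extract_parts(page_source), gives the heading and paragraph texts in order. Text inside nested tags such as bold or links is kept as part of the enclosing element.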
nico82
 
Posts: 2
Joined: Fri Apr 05, 2013 12:13 pm
