Check webpages automatically?

Postby Nicked » Mon Apr 29, 2013 2:23 pm

Hi,

I want to start learning and using Python, but I have a task at work that I need to complete before I've learnt it, and I'm not too keen on doing it manually :)

Symantec is adding new locations for their CRL handling, so there are more than 100 new URLs.
My task is to check that I can access each of them from my company, i.e. that they aren't blocked in the firewall.

I was thinking a python script would be the way to go here;

A test page looks like this one, and I have more than 100 to check: http://2.22.133.163/test.html

I need to check whether I get output like the text below, which indicates success, or whether I get something like HTTP 404 Not Found instead.

"Success! You have reached the test page for CRL access.
If you can see this message, your firewall is configured appropriately to allow CRL access. "

How can I do this with Python? Maybe someone has found something similar which I could reuse?

/As fresh as they can get on Python :)

Re: Check webpages automatically?

Postby metulburr » Mon Apr 29, 2013 4:54 pm

Well, with something that small you could do it with the urllib module alone. However, if you wanted to check a longer, more complex HTML page with multiple tags, then you would want something like BeautifulSoup.
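If you did go that route, a rough sketch might look like this (assumes the third-party bs4 package is installed; the URL is just the test page from your post):

Code: Select all
import urllib.request
from bs4 import BeautifulSoup

url = 'http://2.22.133.163/test.html'

# grab the page and parse it
html = urllib.request.urlopen(url).read().decode()
soup = BeautifulSoup(html, 'html.parser')

# get_text() strips the tags so you only search the visible text
if 'Success' in soup.get_text():
    print('CRL test page reachable')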

But for this, use urlopen() to grab the HTML (since it's so small) and check for one word, or several words, in the HTML string. Then add a try/except to catch a 404.

If you swap which url variable is commented out, you'll see the output for each case: a 404 for the first URL, and the HTML output for the second.

python3.x
Code: Select all
import urllib.request
import urllib.error   # HTTPError lives here

url = 'http://2.22.133.163/something_not_here.html'
#url = 'http://2.22.133.163/test.html'

try:
    req = urllib.request.urlopen(url)
    html = req.read().decode()   # response body comes back as bytes
    print(html)
except urllib.error.HTTPError as e:
    print(e)   # e.g. "HTTP Error 404: Not Found"



python2.7+
Code: Select all
import urllib2

url = 'http://2.22.133.163/something_not_here.html'
#url = 'http://2.22.133.163/test.html'

try:
    req = urllib2.urlopen(url)
    html = req.read().decode()   # decode the raw bytes to a unicode string
    print(html)
except urllib2.HTTPError as e:
    print(e)   # e.g. "HTTP Error 404: Not Found"


There is also urlopen().getcode(), which returns the status code.
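For example, a quick sketch (same test URL as above):

Code: Select all
import urllib.request

# getcode() returns the HTTP status code of a successful response, e.g. 200
req = urllib.request.urlopen('http://2.22.133.163/test.html')
print(req.getcode())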

EDIT:
Actually, a better method would be to check the individual codes, so you can do different things based on whether it's a 404, a 403, a 401, or whatever, for example.

python3.x
Code: Select all
import urllib.request
import urllib.error   # HTTPError lives here

valid_urls = ['http://2.22.133.163/test.html', 'http://www.google.com', 'http://www.metulburr.com/cgi-bin/']

# build a few URLs that don't exist, to trigger 404s
url_list = []
for ind in range(3):
    url_list.append('http://2.22.133.163/test{}.html'.format(ind))

url_list += valid_urls

for u in url_list:
    try:
        req = urllib.request.urlopen(u)
        html = req.read().decode()
        stat = req.status   # status code of the successful response, e.g. 200
    except urllib.error.HTTPError as e:
        if e.code == 404:
            stat = 404
        elif e.code == 403:
            stat = 403
        else:
            stat = 'Some other code'

    print('{} returned: {}'.format(u, stat))
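
And just to tie it back to your actual task, here is a rough sketch that loops over a list of URLs and looks for the "Success" text on each page (the urls list below is only a placeholder; you'd fill it with your real CRL test pages):

Code: Select all
import urllib.request
import urllib.error

# placeholder list; replace with the real CRL test page URLs
urls = ['http://2.22.133.163/test.html']

for u in urls:
    try:
        html = urllib.request.urlopen(u).read().decode()
    except urllib.error.HTTPError as e:
        print('{}: HTTP error {}'.format(u, e.code))
    except urllib.error.URLError as e:
        print('{}: could not connect ({})'.format(u, e.reason))
    else:
        if 'Success' in html:
            print('{}: OK, test page reached'.format(u))
        else:
            print('{}: reachable, but no success message'.format(u))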

Re: Check webpages automatically?

Postby Nicked » Tue Apr 30, 2013 11:40 am

Thanks a lot!

I will give it a go.

