Read a web page as plain text

Post by Sudheshna » Thu Dec 05, 2013 5:41 pm

I tried accessing the web page http://www.abc.edu/journals/2013/12/paper1/fulltext, but I am only able to get the contents of http://www.abc.edu, not the content of that exact page.

I tried something like this:
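Code:
import nltk
from urllib import urlopen  # assumed import; Python 2

urlHTML = "http://www.abc.edu/journals/2013/12/paper1/fulltext"
raw = urlopen(urlHTML).read()        # raw HTML of the response
notraw = nltk.clean_html(raw)        # strip the markup (NLTK 2.x)
tokens = nltk.word_tokenize(notraw)  # split the plain text into tokens
print tokens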


and also:
Code:
>>> page = myopener.open("http://www.abc.edu/journals/2013/12/paper1/fulltext")
>>> text = page.read()
>>> page.close()
>>> soup = BeautifulSoup(text)
>>> text


Both of them read only the content of http://www.abc.edu, not the exact page.

What am I missing here? Please help me out.
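
One possible cause of this symptom, not confirmed anywhere in this thread, is that the server redirects the request (for example to a home or login page), so the body that comes back is not the page that was asked for. The sketch below is one way to check that and to get plain text without nltk.clean_html, which is no longer implemented in NLTK 3. It assumes Python 3 and only the standard library, rather than the Python 2 code above. The names TextExtractor, html_to_text and fetch are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch(url):
    """Return (final_url, html). final_url differs from url after a redirect."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
        return response.url, html
```

Calling fetch() on the article URL and comparing the returned final_url with the requested one shows whether a redirect happened. If the two differ, the site may require a login, cookies, or a particular User-Agent header before it serves the full text. Passing the returned html to html_to_text() then gives the plain text of whatever page was actually served.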
Last edited by micseydel on Thu Dec 05, 2013 6:26 pm, edited 1 time in total.
Reason: Code tags, locked OP.

Re: Read a web page as plain text

Post by tnknepp » Thu Dec 05, 2013 6:46 pm

Python: 2.7 via Anaconda
Numpy: 1.7
Pandas: 0.11
OS: Windows 7
IDE: Spyder/IPython