
Please help me find Beautiful Soup Scraping tutorials?

Postby gsmtts » Sun Jul 28, 2013 11:00 pm

Hi all,

I am quite a newbie (I think that's the term?) when it comes to Python and web scraping; about all I can do in Python right now is run code in the terminal. I am, however, very well versed in statistical packages/languages (SAS/Stata/SQL) -- not sure how much that helps here. I have been tasked with scraping a website, specifically pages like http://ipr.etsi.org/IPRDetails.aspx?IPRD_ID=5&IPRD_TYPE_ID=2&MODE=2#, and pulling a bunch of information off the "IPR information statement and licensing declaration" and "IPR information statement annex" tabs.

I would love to learn how to do this using Beautiful Soup, which I hear is a great tool for this kind of task. My end goal is code that loops through all the URLs on that site by varying the "IPRD_ID" part of the URL from 1 through N. I would like my final output to look like the Excel sheet I have attached here, but I am very comfortable using Stata, SAS, Excel, etc. to reshape the data myself, so if I could get the data out of Beautiful Soup in almost any form I could put it into the format you see in the Excel sheet.

I realize it is probably a lot to ask someone to write this entire program for me (again, I have NO idea about BS right now :? ), but I would love it if someone could get me started by showing how to extract a few of the fields I have listed here; then maybe I could write the rest of the code by mimicking that.

Primarily, I would need to know how to (I have tried to sketch the rough shape of this just below the list):
(1) Extract the headings/fields selected in the Excel sheet from the "IPR information statement and licensing declaration" tab, making sure each one stays connected to its "disclosure number"
(2) Extract a dummy (yes/no, 1/0, etc.) from the selectors that indicate whether the organization is the proprietor and/or is prepared to grant a license
(3) Extract only the information regarding the "basis" patent (the family members are much less important, and it seems like it would be very difficult to get them anyway), but separately and marked accordingly for each disclosure number (on this page there are 5, but there could really be anywhere from 1 to N disclosure numbers on any individual page)
(4) Loop through pages where the "IPRD_ID" part of the URL goes from 1 through N
(5) Output all of this to CSV or some other format that I can access
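
To make it concrete, here is the rough shape I have in mind, pieced together from generic examples I found while Googling. I have NOT run this, and the names in it ("disclosure-row", the output filename, N = 100) are placeholders I made up rather than anything I checked against the real pages:

Code:
import csv
import requests
from bs4 import BeautifulSoup

# The detail pages only differ by the IPRD_ID parameter in the URL
BASE = "http://ipr.etsi.org/IPRDetails.aspx?IPRD_ID={}&IPRD_TYPE_ID=2&MODE=2"

with open("declarations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for iprd_id in range(1, 101):  # point (4): IPRD_ID from 1 through N, here N = 100
        page = requests.get(BASE.format(iprd_id))
        soup = BeautifulSoup(page.text, "html.parser")
        # points (1)-(3): "disclosure-row" is a made-up class name; the real
        # ids/classes would have to come from viewing the page source in a browser
        for row in soup.find_all("tr", class_="disclosure-row"):
            cells = [td.get_text(strip=True) for td in row.find_all("td")]
            if cells:
                # point (5): one CSV line per table row, tagged with the page it came from
                writer.writerow([iprd_id] + cells)

Is that roughly the right idea, or am I way off?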

I would GREATLY appreciate any help and will be forever in your debt! And if anyone ever needs econometric help, let me know -- this could be symbiotic!

Thanks so much for looking!
gsmtts
 
Posts: 1
Joined: Sun Jul 28, 2013 10:32 pm

Re: Please help me find Beautiful Soup Scraping tutorials?

Postby Yoriz » Sun Jul 28, 2013 11:13 pm

New Users, Read This
Join the #python-forum IRC channel on irc.freenode.net!
Spam topic disapproval technician
Windows7, Python 2.7.4., WxPython 2.9.5.0., some Python 3.3
Yoriz
 
Posts: 723
Joined: Fri Feb 08, 2013 1:35 am
Location: UK

Re: Please help me find Beautiful Soup Scraping tutorials?

Postby metulburr » Sun Jul 28, 2013 11:23 pm

(4) Loop through pages where the "IPRD_ID" part of the URL goes from 1 through N

I did this with BS at one point to extract data from a free sampler site:
https://github.com/metulburr/sampler/bl ... sampler.py

But check Yoriz's links, and Google for BeautifulSoup how-tos -- you'll come up with a ton of results for the info you want.
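
To get a feel for it, something along these lines is the bare-bones pattern (untested against that ETSI page -- the idea is just to dump every table cell so you can see what is actually there, then narrow down to the ids/classes you need):

Code:
import requests
from bs4 import BeautifulSoup

# the example page from the first post
url = "http://ipr.etsi.org/IPRDetails.aspx?IPRD_ID=5&IPRD_TYPE_ID=2&MODE=2"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# print every non-empty table cell, then switch to find()/find_all()
# with specific ids or classes once you know what to look for
for td in soup.find_all("td"):
    text = td.get_text(strip=True)
    if text:
        print(text)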
New Users, Read This
OS Ubuntu 14.04, Arch Linux, Gentoo, Windows 7/8
https://github.com/metulburr
steam
metulburr
 
Posts: 1300
Joined: Thu Feb 07, 2013 4:47 pm
Location: Elmira, NY

