Extract separate a text with Scrapy


Post by floriano » Fri Aug 02, 2013 6:43 am

This is the source code of a page on a website, http://www.example.com, and I want to extract every "THIS IS A TEXT" from it with a Scrapy crawler.

<table id="content">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<tr>
<td>
<table>
<tr>
<td colspan="5" style="text-align:left;padding-left:4px;" class="category"> <img src="http://www.example.com/images/menu.gif">
THIS IS A TEXT </td>
</tr>
<tr>
<td class="date" colspan="5">THIS IS A TEXT</td>
</tr>
<tr>
<td style="test-align:left;width:40px;">THIS IS A TEXT</td>
<td style="padding-right:4px; width:180px;text-align:right">
THIS IS A TEXT </td>
<td style="width:40px;text-align:center"> <nobr><a id="I1" name="I1" href="javascript:MoreInformation(1,'1141','1563513','TT','home');">
THIS IS A TEXT</a></nobr>
</td>
<td style="padding-left:5px; width:180px;text-align:left">
THIS IS A TEXT </td>
<td style="width:40px;text-align:center"></td>
</tr>
<tr>
<td style="test-align:left;width:40px;">THIS IS A TEXT </td>
<td style="padding-right:4px; width:180px;text-align:right">
THIS IS A TEXT </td>
<td style="width:40px;text-align:center"> THIS IS A TEXT </td>
<td style="padding-left:5px; width:180px;text-align:left">
THIS IS A TEXT </td>
<td style="width:40px;text-align:center"></td>
</tr>


This is my scrapy_project.py.
I tried to extract everything from the td elements with rows = hxs.select('.//td'), but I don't know how to extract each "THIS IS A TEXT" separately.
I also get this in the result: u'\n\t\t\t\t\t\t\t\t. Can someone help me, please?

Code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from dirbot.items import Website


class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["example.com"]
    start_urls = [
        "http://www.example.com/",
        "",
    ]

    def parse(self, response):
       
        hxs = HtmlXPathSelector(response)
        rows = hxs.select('//table[@id="content"]//table/tr')
        items = []

        for row in rows:
            item = Website()
            item ["name"] = row.select("td[1]/text()").extract()
            item ["description"] = row.select("td[0]/a/nobr/text()").extract()
            item ["job"] = row.select("td[2]/text()").extract()
            items.append(item)

        return items

Another question: how can I eliminate this: u'\n\t\t\t\t\t\t\t\t

Re: Extract separate a text with Scrapy

Post by setrofim » Fri Aug 02, 2013 7:25 am

You can strip() whitespace from a string, e.g.
Code:
            item ["name"] = row.select("td[1]/text()").extract().strip()

Re: Extract separate a text with Scrapy

Post by floriano » Fri Aug 02, 2013 7:38 am

setrofim wrote: You can strip() whitespace from a string, e.g.
Code:
            item ["name"] = row.select("td[1]/text()").extract().strip()


Thanks, I tried that, but I get an error.

This is the traceback:
    item ["name"] = row.select("td[1]/text()").extract().strip()
    exceptions.AttributeError: 'list' object has no attribute 'strip'
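
The AttributeError above arises because extract() returns a list of strings rather than a single string, so strip() has to be applied to each element. A minimal sketch of one way to do that follows; the helper name clean_texts and the sample list are hypothetical stand-ins, not real output from the page.

```python
def clean_texts(extracted):
    """Strip surrounding whitespace from each string and drop empty results."""
    stripped = (text.strip() for text in extracted)
    return [text for text in stripped if text]


# Hypothetical stand-in for what row.select("td[1]/text()").extract() returns:
# a list of unicode strings, some of which are only tabs and newlines.
extracted = [u"\n\t\t\t\t\t\t\t\tTHIS IS A TEXT ", u"\n\t\t\t\t\t\t\t\t"]

cleaned = clean_texts(extracted)
print(cleaned)
```

In the spider, the same helper could wrap each call, for example item["name"] = clean_texts(row.select("td[1]/text()").extract()). This also drops the whitespace-only u'\n\t\t\t\t\t\t\t\t' entries.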

Re: Extract separate a text with Scrapy

Post by manojg » Fri Aug 02, 2013 5:47 pm

Use code that is as simple as possible. Here is a version that finds the index of the text in each line.
Code:
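#!/usr/bin/python

text_to_find = 'this is a text'

# Read every line of the saved file; the file is closed automatically.
with open('text.txt') as text_file:
    lines = text_file.readlines()

for line in lines:
    line = line.strip()
    # find() returns -1 when the text is not on this line.
    index = line.find(text_to_find)
    if index >= 0:
        print(line[index:index + len(text_to_find)])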


Cheers,
Attachment: text.txt (an example text file to search for the phrase, 5.12 KiB)

Re: Extract separate a text with Scrapy

Post by stranac » Fri Aug 02, 2013 7:06 pm

@manojg: There is such a thing as too simple. There are a huge number of cases where that code won't be good enough.

The best way to do this is probably using xpath and item loaders.
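
The path-based idea can be sketched without Scrapy, using the limited XPath support in Python's standard xml.etree.ElementTree. This is only an illustration under stated assumptions: SAMPLE is a simplified, well-formed stand-in for the inner table (the real page is not valid XML, so Scrapy's HTML selectors would be needed there), the helper names are hypothetical, and item loaders are not shown.

```python
import xml.etree.ElementTree as ET

# Assumption: a simplified, well-formed excerpt of the inner table. Unclosed
# tags such as <img> and <meta> were left out so the XML parser accepts it.
SAMPLE = """
<table>
  <tr>
    <td class="category">
        THIS IS A TEXT </td>
  </tr>
  <tr>
    <td>THIS IS A TEXT</td>
    <td>
        THIS IS A TEXT </td>
    <td> <nobr><a id="I1" name="I1" href="#">
        THIS IS A TEXT</a></nobr>
    </td>
    <td></td>
  </tr>
</table>
"""


def cell_text(cell):
    """Join all text inside a cell, nested tags included, and collapse whitespace."""
    return " ".join("".join(cell.itertext()).split())


def rows_as_text(markup):
    """Return one list of non-empty cell strings per table row."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.findall(".//tr"):
        cells = [cell_text(td) for td in tr.findall("td")]
        rows.append([text for text in cells if text])
    return rows


for row in rows_as_text(SAMPLE):
    print(row)
```

Two details in the original spider are worth checking against this. XPath positions start at 1, so td[0] never matches anything. And in the page source the link sits inside nobr (td/nobr/a), not the other way round as in "td[0]/a/nobr/text()".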

Re: Extract separate a text with Scrapy

Post by floriano » Sat Aug 03, 2013 8:20 am

I want to extract from a website. What you describe, @manojg, extracts from a page stored on my own server (HTML or txt). Is that what you suggested? Can your approach also extract from a website? I am a beginner and still learning.

Maybe you have a solution for me. Thank you.

Re: Extract separate a text with Scrapy

Post by manojg » Sat Aug 03, 2013 3:38 pm

floriano wrote: I want to extract from a website. What you describe, @manojg, extracts from a page stored on my own server (HTML or txt). Is that what you suggested? Can your approach also extract from a website? I am a beginner and still learning.

Maybe you have a solution for me. Thank you.


What if you save the webpage to a text file first?

