Processing html text nodes with scrapy and XPath

By : analina
Source: Stackoverflow.com
Question

I'm using scrapy to process documents like this one:

...
<div class="contents">
    some text
    <ol>
        <li>
            more text
        </li>
        ...
    </ol>
</div>
...

I want to collect all the text inside the contents area into a string. I also need the '1., 2., 3. ...' numbering from the <li> elements, so my result should be 'some text 1. more text...'

So, I'm looping over <div class="contents">'s children

for n in response.xpath('//div[@class="contents"]/node()'):
    if n.xpath('self::ol'):
        result += process_list(n)
    else:
        result += n.extract()

If n is an ordered list, I loop over its elements and add a number to each li/text() (in process_list()). If n is a text node itself, I just read its value. However, 'some text' doesn't seem to be part of the node set, since the loop never enters the else branch. My result is '1. more text'

Finding text nodes relative to their parent node works:

response.xpath('//div[@class="contents"]//text()')

finds all the text, but this way I can't add the list item numbers.

What am I doing wrong and is there a better way to achieve my task?



Answers

Scrapy's Selectors use lxml under the hood, but lxml doesn't work with XPath calls on text nodes.

>>> import scrapy
>>> s = scrapy.Selector(text='''<div class="contents">
...     some text
...     <ol>
...         <li>
...             more text
...         </li>
...         ...
...     </ol>
... </div>''')
>>> s.xpath('.//div[@class="contents"]/node()')
[<Selector xpath='.//div[@class="contents"]/node()' data='\n    some text\n    '>, <Selector xpath='.//div[@class="contents"]/node()' data='<ol>\n        <li>\n            more text\n'>, <Selector xpath='.//div[@class="contents"]/node()' data='\n'>]
>>> for n in s.xpath('.//div[@class="contents"]/node()'):
...     print(n.xpath('self::ol'))
... 
[]
[<Selector xpath='self::ol' data='<ol>\n        <li>\n            more text\n'>]
[]

But you can hack into the underlying lxml object to test its type for a text node (it's "hidden" in the .root attribute of each Scrapy Selector):

>>> for n in s.xpath('.//div[@class="contents"]/node()'):
...     print([type(n.root), n.root])
... 
[<class 'str'>, '\n    some text\n    ']
[<class 'lxml.etree._Element'>, <Element ol at 0x7fa020f2f9c8>]
[<class 'str'>, '\n']
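
Putting that .root type check to use, here is a minimal sketch of the whole extraction, written against plain lxml (which Scrapy wraps): text nodes come back as str subclasses, so isinstance(n, str) distinguishes them from elements. The process_list() helper and the sample markup are assumptions modeled on the question, not the asker's actual code:

```python
from lxml import html

doc = html.fromstring('''<div class="contents">
    some text
    <ol>
        <li>more text</li>
        <li>even more</li>
    </ol>
</div>''')

def process_list(ol):
    # Number each <li>, mimicking the hypothetical process_list() from the question
    return ' '.join('%d. %s' % (i, li.text_content().strip())
                    for i, li in enumerate(ol.xpath('li'), start=1))

result = []
for n in doc.xpath('//div[@class="contents"]/node()'):
    if isinstance(n, str):       # text node: lxml returns a str subclass
        if n.strip():            # skip whitespace-only nodes
            result.append(n.strip())
    elif n.tag == 'ol':
        result.append(process_list(n))

print(' '.join(result))  # some text 1. more text 2. even more
```

In Scrapy the same dispatch works with isinstance(n.root, str) inside the loop over response.xpath('//div[@class="contents"]/node()').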

An alternative is to use an HTML-to-text conversion library like html2text:

>>> import html2text
>>> html2text.html2text("""<div class="contents">
...     some text
...     <ol>
...         <li>
...             more text
...         </li>
...         ...
...     </ol>
... </div>""")
'some text\n\n  1. more text \n...\n\n'


If n is not an ol element, self::ol yields an empty node set. What is n.xpath(...) supposed to return when the result of the expression is an empty node set?

An empty node set is "falsy" in XPath, but you're not evaluating it as a boolean in XPath, only in Python. Is an empty node set falsy in Python?

If that's the problem, you could fix it by changing the if statement to

if n.xpath('boolean(self::ol)'):

or

if n.xpath('count(self::ol) > 0'):
By : LarsH

