Python, lxml and removing the outer tag from lxml.html.tostring(el) output

I am using the code below to get all of the HTML content of a section so that I can save it to a database:

el = doc.get_element_by_id('productDescription') 
lxml.html.tostring(el)

The product description is wrapped in a tag that looks like this:

<div id='productDescription'> 

    <THE HTML CODE I WANT> 

</div>

The code works well and gives me all of the HTML, but how do I remove the outer layer, i.e. the opening <div id='productDescription'> tag and the closing </div> tag?

2012-02-14 Tampa

You can convert each child to a string individually:

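# el.text is None when the element has no leading text
text = el.text or ''
# encoding='unicode' returns str (not bytes); each child's tail text is included by default
text += ''.join(lxml.html.tostring(child, encoding='unicode') for child in el.iterchildren())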

Or, in an even more hackish way:

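# Note: this modifies el in place
el.attrib.clear()
el.tag = '|||'  # unusual placeholder tag name
# encoding='unicode' returns str; with_tail=False leaves out text after the closing tag
text = lxml.html.tostring(el, encoding='unicode', with_tail=False)
assert text.startswith('<'+el.tag+'>') and text.endswith('</'+el.tag+'>')
text = text[len('<'+el.tag+'>'):-len('</'+el.tag+'>')]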

2012-02-14 19:24:56 jfs

If your productDescription div contains mixed text/element content, e.g.

<div id='productDescription'> 
    the 
    <b> html code </b> 
    i want 
</div>

you can get the content (as a string) by traversing the result of xpath('node()'):

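s = ''
for node in el.xpath('node()'):
    if isinstance(node, str):  # text node (use basestring instead of str on Python 2)
        s += node
    else:
        # tail text arrives as its own text node, so leave it out here
        s += lxml.html.tostring(node, encoding='unicode', with_tail=False)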

2012-02-15 14:07:32 mykhal

What is 'basestring'? – nHaskins

Here is a function that does what you want.

def strip_outer(xml): 
    """ 
    >>> xml = '''<math xmlns="http://www.w3.org/1998/Math/MathML" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/1998/Math/MathML   http://www.w3.org/Math/XMLSchema/mathml2/mathml2.xsd"> 
    ... <mrow> 
    ...  <msup> 
    ...  <mi>x</mi> 
    ...  <mn>2</mn> 
    ...  </msup> 
    ...  <mo> + </mo> 
    ...  <mi>x</mi> 
    ... </mrow> 
    ... </math>''' 
    >>> so = strip_outer(xml) 
    >>> so.splitlines()[0]=='<mrow>' 
    True 

    """ 
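    # Rename the default namespace: otherwise lxml puts <math> in that namespace,
    # where the bare tag name 'math' passed to strip_tags() does not match
    xml = xml.replace('xmlns=', 'xmlns:x=')
    # strip_tags() cannot strip the root element, so add a temporary one
    xml = '<root>\n' + xml + '\n</root>'
    rx = lxml.etree.XML(xml)
    lxml.etree.strip_tags(rx, 'math')  # drop <math> with all its attributes, keep its content
    uc = lxml.etree.tounicode(rx)
    uc = u'\n'.join(uc.splitlines()[1:-1])  # remove the temporary <root> again
    return uc.strip()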

2013-04-20 16:22:12

Use a regexp.

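import re
from lxml.html import tostring

def strip_outer_tag(html_fragment):
    # Capture everything between the leading opening tag and the trailing closing tag
    outer_tag = re.compile(r'^<[^>]+>(.*?)</[^>]+>$', re.DOTALL)
    return outer_tag.search(html_fragment).group(1)

# encoding='unicode' makes tostring return str; on Python 3 the str pattern above needs it
html_fragment = strip_outer_tag(tostring(el, encoding='unicode'))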

2017-04-02 00:52:57 bl79
