Filename : /usr/lib64/python3.6/site-packages/lxml/html/__pycache__/clean.cpython-36.opt-1.pyc
A cleanup tool for HTML.

Removes unwanted tags and content. See the `Cleaner` class for details.

Names referenced at module level: urlsplit, etree, defs, fromstring, XHTML_NAMESPACE, xhtml_to_html, _transform_result

Names listed by the module: clean_html, clean, Cleaner, autolink, autolink_html, word_break, word_break_html

String constants (regular expressions and XPath expressions, as they appear in the dump):

    expression\s*\(.*?\)
    @\s*import
    </?[a-zA-Z]+|\son[a-zA-Z]+\s*=
    ^:(javascript|jscript|livescript|vbscript|data|about|mocha):
    (xml|svg)
    [\s\x00-\x08\x0B\x0C\x0E-\x19]+
    \[if[\s\n\r]+.*?][\s\n\r]*>
    descendant-or-self::*[@style]
    descendant-or-self::a [normalize-space(@href) and substring(normalize-space(@href),1,1) != '#'] |descendant-or-self::x:a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#']

Source path recorded in the bytecode: /usr/lib64/python3.6/clean.py

_is_javascript_scheme(s): references _find_image_dataurls, _is_unsafe_image_type, _is_possibly_malicious_scheme

class Cleaner

Instances clean the document of each of the possible offending elements. The cleaning is controlled by attributes; you can override attributes in a subclass, or set them in the constructor.

``scripts``:
    Removes any ``<script>`` tags.

``javascript``:
    Removes any Javascript, like an ``onclick`` attribute. Also removes stylesheets as they could contain Javascript.

``comments``:
    Removes any comments.

``style``:
    Removes any style tags.

``inline_style``:
    Removes any style attributes. Defaults to the value of the ``style`` option.

``links``:
    Removes any ``<link>`` tags.

``meta``:
    Removes any ``<meta>`` tags.

``page_structure``:
    Structural parts of a page: ``<head>``, ``<html>``, ``<title>``.

``processing_instructions``:
    Removes any processing instructions.

``embedded``:
    Removes any embedded objects (flash, iframes).

``frames``:
    Removes any frame-related tags.

``forms``:
    Removes any form tags.

``annoying_tags``:
    Tags that aren't *wrong*, but are annoying: ``<blink>`` and ``<marquee>``.

``remove_tags``:
    A list of tags to remove. Only the tags will be removed; their content will get pulled up into the parent tag.

``kill_tags``:
    A list of tags to kill. Killing also removes the tag's content, i.e. the whole subtree, not just the tag itself.

``allow_tags``:
    A list of tags to include (default include all).

``remove_unknown_tags``:
    Remove any tags that aren't standard parts of HTML.

``safe_attrs_only``:
    If true, only include 'safe' attributes (specifically the list from the feedparser HTML sanitisation web site).

``safe_attrs``:
    A set of attribute names to override the default list of attributes considered 'safe' (when safe_attrs_only=True).

``add_nofollow``:
    If true, then any <a> tags will have ``rel="nofollow"`` added to them.

``host_whitelist``:
    A list or set of hosts that you can use for embedded content (for content like ``<object>``, ``<link rel="stylesheet">``, etc). You can also implement/override the method ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to implement more complex rules for what can be embedded. Anything that passes this test will be shown, regardless of the value of (for instance) ``embedded``.

    Note that this parameter might not work as intended if you do not make the links absolute before doing the cleaning.

    Note that you may also need to set ``whitelist_tags``.

``whitelist_tags``:
    A set of tags that can be included with ``host_whitelist``.
    The default is ``iframe`` and ``embed``; you may wish to include other tags like ``script``, or you may want to implement ``allow_embedded_url`` for more control. Set to None to include all tags.

This modifies the document *in place*.

Cleaner.__init__(self, **kw), referencing items, hasattr, TypeError, setattr, inline_style, style:

    Unknown parameter: %s=%r

Docstring of the method that performs the cleaning:

    Cleans the document.

Error message in the same method:

    It does not make sense to pass in both allow_tags and remove_unknown_tags
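
The docstring says that options can be overridden in a subclass or set in the constructor, and the constructor's bytecode references `hasattr`, `setattr`, `TypeError`, `inline_style` and `style` together with the message `Unknown parameter: %s=%r`. The following is a minimal sketch of that keyword-handling pattern, not the library's own code: `MiniCleaner`, `StripStyles` and the default values shown are hypothetical stand-ins chosen for illustration, and nothing here touches an HTML document.

```python
class MiniCleaner:
    """Hypothetical stand-in that only models option handling."""

    # Class-level defaults (illustrative values); a subclass may override them.
    style = False
    inline_style = None  # None means "follow the value of ``style``"
    comments = True

    def __init__(self, **kw):
        for name, value in kw.items():
            # Only names that already exist as attributes are accepted.
            if not hasattr(self, name):
                raise TypeError("Unknown parameter: %s=%r" % (name, value))
            setattr(self, name, value)
        # ``inline_style`` falls back to ``style`` unless given explicitly.
        if self.inline_style is None and "inline_style" not in kw:
            self.inline_style = self.style


class StripStyles(MiniCleaner):
    # Overriding an option in a subclass instead of in the constructor.
    style = True


default = MiniCleaner()
strict = MiniCleaner(style=True)
mixed = MiniCleaner(style=True, inline_style=False)
sub = StripStyles()
```

Under these assumptions, `strict` and `sub` both end up with `inline_style` equal to `True`, while `mixed` keeps the explicit `False`. A misspelled option such as `MiniCleaner(colour=1)` fails immediately with a `TypeError` rather than being silently ignored.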