How to recognize HTML escape characters in code

2019-09-08 Zhengzhou Website Construction

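Occasionally you will see sequences such as &#39; in your data. They take one of two forms:

Starting with &#, followed by a run of digits, and ending with ;
Starting with &, followed by a run of characters, and ending with ;

For example, the most common one is &nbsp;, or its equivalent &#160;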
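
The two forms described above can be spotted with a regular expression. The following is only a minimal sketch: the class name EntityFinder and its methods are hypothetical, and it covers just the decimal numeric form and the named form mentioned here (HTML also allows a hexadecimal form such as &#x27;, which this sketch ignores).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class EntityFinder {
    // Form 1: "&#" + decimal digits + ";"
    // Form 2: "&"  + a name (letters, then letters or digits) + ";"
    private static final Pattern ENTITY =
            Pattern.compile("&(?:#([0-9]+)|([A-Za-z][A-Za-z0-9]*));");

    /** Returns every substring of text that has the shape of an HTML entity. */
    static List<String> findEntities(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = ENTITY.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    /** True when the entity is of the numeric form (starts with "&#"). */
    static boolean isNumeric(String entity) {
        return entity.startsWith("&#");
    }
}
```

Note that a match only says the text has the shape of an entity. Whether a name such as &foo; is actually defined still has to be checked against a lookup table.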

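When a browser encounters these escape sequences, it converts them back to the original characters. But how can code recognize them? org.apache.commons.lang.StringEscapeUtils.unescapeHtml is a good illustration.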

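In the first case, where the middle part is a number, the number (a Unicode code point) is converted directly to a char.
In the second case, where the middle part is a name, the only option is a lookup table: find the number that corresponds to the name in the table, then convert that number to a char. A look at the code makes this clear.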
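
The two cases above can be sketched as follows. This is an illustration under stated assumptions, not the Commons Lang implementation: the class EntityDecoder is hypothetical, its lookup table holds only five entries for brevity, and code points above 0xFFFF (which do not fit in a single char) are left untouched.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical decoder written for illustration only.
class EntityDecoder {
    // Tiny sample lookup table: entity name -> code point.
    private static final Map<String, Integer> NAME_TO_CODE = new HashMap<>();
    static {
        NAME_TO_CODE.put("quot", 34);
        NAME_TO_CODE.put("amp", 38);
        NAME_TO_CODE.put("lt", 60);
        NAME_TO_CODE.put("gt", 62);
        NAME_TO_CODE.put("nbsp", 160);
    }

    /** Replaces recognized entities with their characters; leaves the rest as-is. */
    static String unescape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int semi = (c == '&') ? text.indexOf(';', i + 1) : -1;
            Integer code = (semi > i + 1) ? resolve(text.substring(i + 1, semi)) : null;
            if (code == null) {
                out.append(c);          // not an entity: copy the character
                i++;
            } else {
                out.append((char) code.intValue());
                i = semi + 1;           // skip past the closing ';'
            }
        }
        return out.toString();
    }

    /** Case 1: "#digits" is parsed as a number. Case 2: a name is looked up. */
    private static Integer resolve(String body) {
        if (body.matches("#[0-9]+")) {
            try {
                int value = Integer.parseInt(body.substring(1));
                return value <= 0xFFFF ? value : null;
            } catch (NumberFormatException e) {
                return null;            // too many digits to fit in an int
            }
        }
        return NAME_TO_CODE.get(body);  // null when the name is unknown
    }
}
```

With this sketch, &nbsp; and &#160; both decode to the same character (U+00A0), while an unknown name such as &unknown; is passed through unchanged.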

First, look at how HTML40 is defined.

The code is as follows:


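// Build the HTML 4.0 entity table once, when the class is loaded.
static {
    HTML40 = new Entities();
    fillWithHtml40Entities(HTML40);
}

// The HTML 4.0 table is the union of the three arrays below.
static void fillWithHtml40Entities(Entities entities) {
    entities.addEntities(BASIC_ARRAY);
    entities.addEntities(ISO8859_1_ARRAY);
    entities.addEntities(HTML40_ARRAY);
}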
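
To make the role of addEntities concrete, here is a simplified stand-in for such a table. It is a sketch under assumptions, not the library's Entities class: SimpleEntities, entityValue, entityName and sample are hypothetical names, and the sample data contains only a handful of {name, number} pairs in the same shape as the arrays discussed in this article.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for an entity table.
class SimpleEntities {
    private final Map<String, Integer> nameToValue = new HashMap<>();
    private final Map<Integer, String> valueToName = new HashMap<>();

    /** Loads {name, decimal code point} pairs into both lookup directions. */
    void addEntities(String[][] entityArray) {
        for (String[] pair : entityArray) {
            int value = Integer.parseInt(pair[1]);
            nameToValue.put(pair[0], value);
            valueToName.put(value, pair[0]);
        }
    }

    /** Returns the code point for an entity name, or -1 when it is unknown. */
    int entityValue(String name) {
        Integer value = nameToValue.get(name);
        return value == null ? -1 : value;
    }

    /** Returns the entity name for a code point, or null when there is none. */
    String entityName(int value) {
        return valueToName.get(value);
    }

    /** Builds a small table from two sample arrays, one call per array. */
    static SimpleEntities sample() {
        SimpleEntities e = new SimpleEntities();
        e.addEntities(new String[][] {{"quot", "34"}, {"amp", "38"}, {"lt", "60"}, {"gt", "62"}});
        e.addEntities(new String[][] {{"nbsp", "160"}, {"copy", "169"}});
        return e;
    }
}
```

Keeping the data as string pairs and merging several arrays into one table is what lets the three arrays stay separate in the source while unescaping only ever consults a single lookup.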


Next, look at what BASIC_ARRAY, ISO8859_1_ARRAY and HTML40_ARRAY each contain.

BASIC_ARRAY 

The code is as follows:


private static final String[][] BASIC_ARRAY = {{"quot", "34"}, // " - double-quote 
{"amp", "38"}, // & - ampersand 
{"lt", "60"}, // < - less-than 
{"gt", "62"}, // > - greater-than 
}; 


ISO8859_1_ARRAY 

The code is as follows:


static final String[][] ISO8859_1_ARRAY = {{"nbsp", "160"}, // non-breaking space 
{"iexcl", "161"}, // inverted exclamation mark 
{"cent", "162"}, // cent sign 
{"pound", "163"}, // pound sign 
{"curren", "164"}, // currency sign 
{"yen", "165"}, // yen sign = yuan sign 
{"brvbar", "166"}, // broken bar = broken vertical bar 
{"sect", "167"}, // section sign 
{"uml", "168"}, // diaeresis = spacing diaeresis 
{"copy", "169"}, // © - copyright sign
{"ordf", "170"}, // feminine ordinal indicator 
{"laquo", "171"}, // left-pointing double angle quotation mark = left pointing guillemet 
{"not", "172"}, // not sign 
{"shy", "173"}, // soft hyphen = discretionary hyphen 
{"reg", "174"}, // ® - registered trademark sign
{"macr", "175"}, // macron = spacing macron = overline = APL overbar 
{"deg", "176"}, // degree sign 
{"plusmn", "177"}, // plus-minus sign = plus-or-minus sign 
{"sup2", "178"}, // superscript two = superscript digit two = squared 
{"sup3", "179"}, // superscript three = superscript digit three = cubed 
{"acute", "180"}, // acute accent = spacing acute 
{"micro", "181"}, // micro sign 
{"para", "182"}, // pilcrow sign = paragraph sign 
{"middot", "183"}, // middle dot = Georgian comma = Greek middle dot 
{"cedil", "184"}, // cedilla = spacing cedilla 
{"sup1", "185"}, // superscript one = superscript digit one 
{"ordm", "186"}, // masculine ordinal indicator 
{"raquo", "187"}, // right-pointing double angle quotation mark = right pointing guillemet 
{"frac14", "188"}, // vulgar fraction one quarter = fraction one quarter 
{"frac12", "189"}, // vulgar fraction one half = fraction one half 
{"frac34", "190"}, // vulgar fraction three quarters = fraction three quarters 
{"iquest", "191"}, // inverted question mark = turned question mark 
{"Agrave", "192"}, // À - uppercase A, grave accent
{"Aacute", "193"}, // Á - uppercase A, acute accent
{"Acirc", "194"}, // Â - uppercase A, circumflex accent
{"Atilde", "195"}, // Ã - uppercase A, tilde
{"Auml", "196"}, // Ä - uppercase A, umlaut
{"Aring", "197"}, // Å - uppercase A, ring
{"AElig", "198"}, // Æ - uppercase AE
{"Ccedil", "199"}, // Ç - uppercase C, cedilla
{"Egrave", "200"}, // È - uppercase E, grave accent
{"Eacute", "201"}, // É - uppercase E, acute accent
{"Ecirc", "202"}, // Ê - uppercase E, circumflex accent
{"Euml", "203"}, // Ë - uppercase E, umlaut
{"Igrave", "204"}, // Ì - uppercase I, grave accent
{"Iacute", "205"}, // Í - uppercase I, acute accent
{"Icirc", "206"}, // Î - uppercase I, circumflex accent
{"Iuml", "207"}, // Ï - uppercase I, umlaut
{"ETH", "208"}, // Ð - uppercase Eth, Icelandic
{"Ntilde", "209"}, // Ñ - uppercase N, tilde
{"Ograve", "210"}, // Ò - uppercase O, grave accent
{"Oacute", "211"}, // Ó - uppercase O, acute accent
{"Ocirc", "212"}, // Ô - uppercase O, circumflex accent
{"Otilde", "213"}, // Õ - uppercase O, tilde
{"Ouml", "214"}, // Ö - uppercase O, umlaut
{"times", "215"}, // multiplication sign
{"Oslash", "216"}, // Ø - uppercase O, slash
{"Ugrave", "217"}, // Ù - uppercase U, grave accent
{"Uacute", "218"}, // Ú - uppercase U, acute accent
{"Ucirc", "219"}, // Û - uppercase U, circumflex accent
{"Uuml", "220"}, // Ü - uppercase U, umlaut
{"Yacute", "221"}, // Ý - uppercase Y, acute accent
{"THORN", "222"}, // Þ - uppercase THORN, Icelandic
{"szlig", "223"}, // ß - lowercase sharps, German
{"agrave", "224"}, // à - lowercase a, grave accent
{"aacute", "225"}, // á - lowercase a, acute accent
{"acirc", "226"}, // â - lowercase a, circumflex accent
{"atilde", "227"}, // ã - lowercase a, tilde
{"auml", "228"}, // ä - lowercase a, umlaut
{"aring", "229"}, // å - lowercase a, ring
{"aelig", "230"}, // æ - lowercase ae
{"ccedil", "231"}, // ç - lowercase c, cedilla
{"egrave", "232"}, // è - lowercase e, grave accent
{"eacute", "233"}, // é - lowercase e, acute accent
{"ecirc", "234"}, // ê - lowercase e, circumflex accent
{"euml", "235"}, // ë - lowercase e, umlaut
{"igrave", "236"}, // ì - lowercase i, grave accent
{"iacute", "237"}, // í - lowercase i, acute accent
{"icirc", "238"}, // î - lowercase i, circumflex accent
{"iuml", "239"}, // ï - lowercase i, umlaut
{"eth", "240"}, // ð - lowercase eth, Icelandic
{"ntilde", "241"}, // ñ - lowercase n, tilde
{"ograve", "242"}, // ò - lowercase o, grave accent
{"oacute", "243"}, // ó - lowercase o, acute accent
{"ocirc", "244"}, // ô - lowercase o, circumflex accent
{"otilde", "245"}, // õ - lowercase o, tilde
{"ouml", "246"}, // ö - lowercase o, umlaut
{"divide", "247"}, // division sign
{"oslash", "248"}, // ø - lowercase o, slash
{"ugrave", "249"}, // ù - lowercase u, grave accent
{"uacute", "250"}, // ú - lowercase u, acute accent
{"ucirc", "251"}, // û - lowercase u, circumflex accent
{"uuml", "252"}, // ü - lowercase u, umlaut
{"yacute", "253"}, // ý - lowercase y, acute accent
{"thorn", "254"}, // þ - lowercase thorn, Icelandic
{"yuml", "255"}, // ÿ - lowercase y, umlaut
}; 


HTML40_ARRAY 

The code is as follows:

