|
|
Finds Markup Language chunks matching exp.
What is an ml-expression?
Simply a regular expression with some extra information about markup.
Grammar:
-
MLEX := MREGEX | MTAGEX | MLEX MLEX | ''
-
MTAGEX := '{'REGEX'}' | '<'REGEX'>'
-
MREGEX := '['REGEX']' | REGEX
-
REGEX := regular expression
Examples:
-
".*<b>([0-9]*(Kb|Mb))</b>"
This matches a generic size in bold.
-
".*<(b|i)>([0-9]*(Kb|Mb))</(b|i)>"
This matches a generic size in bold or italics. Note that it does not check that the opening and closing tags agree, so it also matches text that opens with a b and closes with a /i.
-
"a<b>[c]{d}e{f}[g]<h>"
This matches abdefgh, abeh and other strings formed by treating the tags/strings between {} and [] as optional.
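The caveat in the second example above (opening and closing tags are matched independently) can be illustrated with a short Python sketch. This is not the library's code: Python's `re` module stands in for whatever regex engine the library uses, the helper name `matches` is hypothetical, and the sketch assumes each regex must match its whole token.

```python
import re

# Token-level regexes taken from the second example. Each tag token is
# matched on its own, so nothing ties the closing tag to the opening one.
OPEN_TAG = re.compile(r"<(b|i)>")
SIZE = re.compile(r"([0-9]*(Kb|Mb))")
CLOSE_TAG = re.compile(r"</(b|i)>")


def matches(tokens):
    """Return True if the three tokens match tag, size, tag in order."""
    patterns = (OPEN_TAG, SIZE, CLOSE_TAG)
    if len(tokens) != len(patterns):
        return False
    return all(p.fullmatch(t) is not None for p, t in zip(patterns, tokens))


print(matches(["<b>", "120Kb", "</b>"]))  # True
print(matches(["<b>", "120Kb", "</i>"]))  # True: mismatched tags still match
print(matches(["<u>", "120Kb", "</u>"]))  # False
```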
Limitations:
-
You can use regular expressions inside tags or outside tags, but a regular expression cannot span tags. For example, it is impossible to specify an arbitrary number of
"<b>".
-
A string, say a non-optional MREGEX, can't start with
[ since it is reserved for optional strings. You must put the expression in round brackets to avoid this.
-
The parser is not really smart. It always alternates a string with a tag, so an expression such as
"<a><b>" is interpreted as this sequence of tokens: "","<a>","","<b>".
What is an ml-get-expression?
It is the counterpart of an ml-expression. It selects which tokens are of interest and which are not.
Grammar:
-
MLGEX := REGGEX TAGGEX | MLGEX MLGEX | ''
-
TAGGEX := '<'EX'>' | '{'EX'}'
-
REGGEX := EX | '['EX']'
-
EX := 'X' | 'O'
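Read literally, the grammar above describes a possibly empty sequence of string/tag pairs, each slot holding X or O. A minimal Python sketch of a validator derived from that reading follows; the name `is_valid_mlgex` is hypothetical and this is not the library's own parser.

```python
import re

# MLGEX  := REGGEX TAGGEX | MLGEX MLGEX | ''
# REGGEX := EX | '[' EX ']'
# TAGGEX := '<' EX '>' | '{' EX '}'
# EX     := 'X' | 'O'
# The grammar reduces to zero or more (REGGEX TAGGEX) pairs.
MLGEX = re.compile(r"(?:(?:[XO]|\[[XO]\])(?:<[XO]>|\{[XO]\}))*")


def is_valid_mlgex(ret):
    """Return True if ret is a well-formed ml-get-expression."""
    return MLGEX.fullmatch(ret) is not None


print(is_valid_mlgex("O<O>O<X>X<O>"))  # True
print(is_valid_mlgex("[X]{O}"))        # True
print(is_valid_mlgex("O<O>O"))         # False: trailing string without a tag
```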
Example:
-
If the ml-expression is
".*<b>.*<.*img.*src.*>.*</b>"
and the ml-get-expression is "O<O>O<X>X<O>"
and the data is "<tt><b><img src="nice.jpg">hello</b>"
mlmatch returns a list of length 2 (that is, the number of "X" entries): the first element is "img src="nice.jpg"" and the second is "hello".
Remember that if an optional string/tag is used in the ml-expression, the corresponding optional string/tag signature must be used in the ml-get-expression.
A short explanation of how the engine works (considering the previous example):
-
tokenize the strings:
-
"<tt><b><img src="nice.jpg">hello</b>" becomes "","<tt>","","<b>","","<img src="nice.jpg">","hello","</b>"
-
".*<b>.*<.*img.*src.*>.*</b>" becomes ".*","<b>",".*","<.*img.*src.*>",".*","</b>"
-
"O<O>O<X>X<O>" becomes "O","<O>","O","<X>","X","<O>"
-
The ml-expression matches the data exactly starting from the third token, since each regexp matches the corresponding token. So we obtain this sub-list of tokens:
"","<b>","","<img src="nice.jpg">","hello","</b>"
-
The sub-list has the same length as the ret expression, and by selecting only the tokens with a corresponding
X we obtain {"img src="nice.jpg"","hello"}.
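The steps above can be sketched in Python. This is an illustration under stated assumptions, not the library's implementation: `tokenize` and `ml_match` are hypothetical names, Python's `re` replaces the library's regex engine, each regex must match its whole token, only mandatory `<...>` tags are handled (no `{}` or `[]` optionals), trailing text after the last tag is kept only when non-empty, and every aligned window that matches is reported.

```python
import re

TAG = re.compile(r"<[^>]*>")


def tokenize(text):
    """Split text into alternating string and tag tokens."""
    tokens, pos = [], 0
    for m in TAG.finditer(text):
        tokens.append(text[pos:m.start()])  # string before the tag, may be ""
        tokens.append(m.group(0))           # the tag itself
        pos = m.end()
    if pos < len(text):                     # assumption: keep trailing text
        tokens.append(text[pos:])
    return tokens


def ml_match(data, exp, ret):
    """Return one list of selected chunks per matching window of data."""
    data_t, exp_t, ret_t = tokenize(data), tokenize(exp), tokenize(ret)
    if len(exp_t) != len(ret_t):
        raise ValueError("exp and ret must have the same number of tokens")
    results = []
    if not exp_t:
        return results
    # Step by 2 so string tokens stay aligned with string patterns.
    for start in range(0, len(data_t) - len(exp_t) + 1, 2):
        window = data_t[start:start + len(exp_t)]
        if all(re.fullmatch(p, t) for p, t in zip(exp_t, window)):
            # Keep tokens marked X; drop the angle brackets of tag tokens.
            results.append([t[1:-1] if r.startswith("<") else t
                            for t, r in zip(window, ret_t) if "X" in r])
    return results


data = '<tt><b><img src="nice.jpg">hello</b>'
print(tokenize(data))
print(ml_match(data, ".*<b>.*<.*img.*src.*>.*</b>", "O<O>O<X>X<O>"))
# [['img src="nice.jpg"', 'hello']]
```

The outer list holds one entry per match and the inner list holds the chunks selected by X, which mirrors the "list of lists" return shape described for mlmatch.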
Notes:
-
data, exp and ret MUST be modifiable. They will not be altered, but they may be written to during processing.
- Parameters:
-
| data | is a Markup Language file, such as an HTML page (must be modifiable) |
| exp | is the ml-expression (must be modifiable) |
| ret | is the ml-get-expression (must be modifiable) |
- Returns:
- a list of lists of chunk_t
|