NUTCH-2522 Bidirectional URL exemption filter #290
okedoki wants to merge 9 commits into apache:master from
sebastian-nagel
left a comment
Clean PR with correct code format and documentation! Thanks, @okedoki!
Afaics, the implementation does the following:
- take the lowercased host part of both from and to URL
- match all regex rules defined in the rules files and remove the matched part
- finally, if from and to host are equal return true => URL is accepted ("exempted" from ignore external host exclusion)
Is this correct?
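The three steps listed above can be sketched as follows. This is only an illustration of the behaviour as described in this comment, not the code from the PR: the class name `BidirectionalExemptionSketch`, the methods `reduceHost` and `isExempted`, and the single `^www\.` rule are all hypothetical.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of the described behaviour; not the PR's actual class.
public class BidirectionalExemptionSketch {

  // Lowercase the host of the URL and remove every part matched by a rule.
  static String reduceHost(String url, List<Pattern> rules) {
    String host = URI.create(url).getHost().toLowerCase(Locale.ROOT);
    for (Pattern rule : rules) {
      host = rule.matcher(host).replaceAll("");
    }
    return host;
  }

  // The link is exempted if both reduced hosts are equal.
  static boolean isExempted(String fromUrl, String toUrl, List<Pattern> rules) {
    return reduceHost(fromUrl, rules).equals(reduceHost(toUrl, rules));
  }

  public static void main(String[] args) {
    // Assumed rule for illustration: strip a leading "www."
    List<Pattern> rules = List.of(Pattern.compile("^www\\."));
    // prints "true": www.example.com and example.com reduce to the same host
    System.out.println(isExempted("http://www.example.com/a", "http://example.com/b", rules));
    // prints "false": example.com and example.org stay different
    System.out.println(isExempted("http://example.com/a", "http://example.org/b", rules));
  }
}
```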
Wouldn't a different rule file format be more suitable?
- the leading +/- is not used
- don't know whether this makes sense, but the rules could also define the replacement string, possibly including references to captured groups, cf. the file format used by urlnormalizer-regex
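To make the second suggestion concrete, here is a minimal sketch of a rule that carries its own replacement string with a reference to a captured group. Everything in it is an assumption for illustration: the class `HostRewriteRuleSketch`, its fields, and the example pattern are hypothetical and do not come from the PR or from urlnormalizer-regex.

```java
import java.util.regex.Pattern;

// Hypothetical rule: a pattern plus a substitution that may reference groups ($1, $2, ...).
public class HostRewriteRuleSketch {
  final Pattern pattern;
  final String substitution;

  HostRewriteRuleSketch(String regex, String substitution) {
    this.pattern = Pattern.compile(regex);
    this.substitution = substitution;
  }

  // Rewrite the host; hosts that do not match are returned unchanged.
  String apply(String host) {
    return pattern.matcher(host).replaceAll(substitution);
  }

  public static void main(String[] args) {
    // Assumed example: drop a leading "www." or "m." and keep the rest via group 1.
    HostRewriteRuleSketch rule = new HostRewriteRuleSketch("^(?:www|m)\\.(.+)$", "$1");
    System.out.println(rule.apply("m.example.com")); // prints "example.com"
    System.out.println(rule.apply("example.com"));   // prints "example.com" (no match, unchanged)
  }
}
```

With such a pair, the from and to hosts would be compared after rewriting rather than after plain removal of the matched part.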
@@ -0,0 +1,33 @@
# Licensed to the Apache Software Foundation (ASF) under one or more
Configuration files should be added as *.template; they are "instantiated" (copied) during the first compilation. Users can then modify the content without conflicts or undesired overwrites.
# Example 1:
#----------
# To exempt urls ending with image extensions, uncomment the below line
Description does not fit the following line/rule.
# Format :
#--------
# The format is same same as `regex-urlfilter.txt`.
# Each non-comment, non-blank line contains a regular expression
The description does not match the implementation.
# Example 1:
#----------
# To exempt urls ending with image extensions, uncomment the below line
-(www.)
Why does the rule start with +-? The dot is not escaped, so the rule would also apply to wwwfinder, catching wwwf.
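A quick check of the unescaped dot with `java.util.regex`, assuming (as in the summary at the top of this review) that the matched part is removed from the host. The class `UnescapedDotDemo` and the helper `strip` are hypothetical names used only for this demonstration.

```java
import java.util.regex.Pattern;

// Hypothetical demo of how "(www.)" behaves compared with "(www\.)".
public class UnescapedDotDemo {

  // Remove every match of the regex from the host.
  static String strip(String regex, String host) {
    return Pattern.compile(regex).matcher(host).replaceAll("");
  }

  public static void main(String[] args) {
    // Unescaped dot matches any character, so "wwwf" is removed.
    System.out.println(strip("(www.)", "wwwfinder.com"));     // prints "inder.com"
    // Escaped dot only matches a literal ".", so the host is untouched.
    System.out.println(strip("(www\\.)", "wwwfinder.com"));   // prints "wwwfinder.com"
    // The intended case still works with the escaped dot.
    System.out.println(strip("(www\\.)", "www.example.com")); // prints "example.com"
  }
}
```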
@sebastian-nagel The usage is correct. At the moment we apply the same regex to both the input and output URL and check whether they match each other. In the future it can be improved with two separate regexes for input and output.