NUTCH-2522 Bidirectional URL exemption filter by okedoki · Pull Request #290 · apache/nutch

okedoki · 2018-03-06T09:17:15Z

No description provided.

sebastian-nagel

Clean PR with correct code format and documentation! Thanks, @okedoki!

Afaics, the implementation does the following:

- take the lowercased host part of both the from and the to URL
- match all regex rules defined in the rules file and remove the matched part
- finally, if the from and to hosts are equal, return true => the URL is accepted ("exempted" from the ignore-external-host exclusion)

Is this correct?
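The steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from the PR: the class `HostExemptionSketch`, the method `isExempted`, and the example rule `^www\.` are all hypothetical names chosen here.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch; names and structure are assumptions, not the plugin's API.
class HostExemptionSketch {

    // Returns true if both hosts are equal after all rule matches are removed.
    static boolean isExempted(String fromUrl, String toUrl, List<Pattern> rules) {
        // Step 1: take the lowercased host part of both URLs.
        String fromHost = URI.create(fromUrl).getHost().toLowerCase(Locale.ROOT);
        String toHost = URI.create(toUrl).getHost().toLowerCase(Locale.ROOT);

        // Step 2: remove whatever each rule matches, on both hosts.
        for (Pattern rule : rules) {
            fromHost = rule.matcher(fromHost).replaceAll("");
            toHost = rule.matcher(toHost).replaceAll("");
        }

        // Step 3: the link is exempted only if the remaining hosts are identical.
        return fromHost.equals(toHost);
    }

    public static void main(String[] args) {
        // Assumed example rule: strip a leading "www." label.
        List<Pattern> rules = Arrays.asList(Pattern.compile("^www\\."));

        System.out.println(isExempted(
            "http://www.example.com/a", "http://example.com/b", rules)); // true
        System.out.println(isExempted(
            "http://www.example.com/a", "http://example.org/b", rules)); // false
    }
}
```

Because the same rules are applied to both sides, the check is symmetric: a link from example.com to www.example.com is treated the same as the reverse direction.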

Wouldn't a different rule file format be more suitable?

- the leading +/- is not used
- I don't know whether this makes sense, but the file could also define the replacement string, possibly including references to captured groups, cf. the file format used by urlnormalizer-regex

sebastian-nagel · 2018-03-06T09:32:11Z

conf/db-ignore-external-exemptions-bidirectional.txt

@@ -0,0 +1,33 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more


Configuration files should be added as *.template. They are "instantiated" (copied) during the first compilation. Users can then modify the content without conflicts and undesired overwrites.

sebastian-nagel · 2018-03-06T09:32:36Z

conf/db-ignore-external-exemptions-bidirectional.txt

+
+# Example 1:
+#----------
+# To exempt urls ending with image extensions, uncomment the below line


Description does not fit the following line/rule.

sebastian-nagel · 2018-03-06T09:37:32Z

conf/db-ignore-external-exemptions-bidirectional.txt

+# Format :
+#--------
+# The format is same same as `regex-urlfilter.txt`.
+# Each non-comment, non-blank line contains a regular expression


The description does not match the implementation.

sebastian-nagel · 2018-03-06T09:43:48Z

conf/db-ignore-external-exemptions-bidirectional.txt

+# Example 1:
+#----------
+# To exempt urls ending with image extensions, uncomment the below line
+-(www.)


Why does the rule start with +-? The dot is not escaped, so the rule would also apply to wwwfinder, matching wwwf.
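To illustrate the escaping concern, here is a small sketch using `java.util.regex` directly. The class `DotEscapeSketch`, the helper `strip`, and the host names are hypothetical; the anchored, escaped pattern `^www\.` is only one possible fix, not necessarily the one the PR should adopt.

```java
import java.util.regex.Pattern;

// Hypothetical demo of escaped vs. unescaped dot; not code from the PR.
class DotEscapeSketch {

    // Removes every match of the regex from the host.
    static String strip(String regex, String host) {
        return Pattern.compile(regex).matcher(host).replaceAll("");
    }

    public static void main(String[] args) {
        // Unescaped dot matches any character, so "wwwf" is removed.
        System.out.println(strip("(www.)", "wwwfinder.example.com"));  // inder.example.com

        // Escaped dot plus anchor leaves the unrelated host untouched.
        System.out.println(strip("^www\\.", "wwwfinder.example.com")); // wwwfinder.example.com

        // The intended case still works.
        System.out.println(strip("^www\\.", "www.example.com"));       // example.com
    }
}
```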

okedoki · 2018-03-16T17:54:56Z

@sebastian-nagel
Thanks for the suggestion to use urlnormalizer-regex. I rewrote the plugin based on this approach (it now makes sense to refactor urlnormalizer-regex and this plugin to share the same code base).

Your description of the usage is correct: at the moment we apply the same regex to both the input and the output URL and check whether the results match each other.

In the future this could be improved by using two separate regexes, one for the input and one for the output URL.
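A rough sketch of that idea, with a pattern and a substitution applied to both URLs before comparing them, might look like the following. Everything here is an assumption for illustration: the `Rule` class, the method `matchesAfterRewrite`, and the example rule are hypothetical and do not reproduce the plugin's actual classes or rule file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of "same regex on both URLs, then compare".
class BidirectionalRuleSketch {

    // A pattern plus a substitution string that may reference captured groups.
    static final class Rule {
        final Pattern pattern;
        final String substitution;

        Rule(String regex, String substitution) {
            this.pattern = Pattern.compile(regex);
            this.substitution = substitution;
        }

        String apply(String url) {
            return pattern.matcher(url).replaceAll(substitution);
        }
    }

    // Applies every rule to both URLs and reports whether the results are equal.
    static boolean matchesAfterRewrite(String fromUrl, String toUrl, List<Rule> rules) {
        String from = fromUrl;
        String to = toUrl;
        for (Rule rule : rules) {
            from = rule.apply(from);
            to = rule.apply(to);
        }
        return from.equals(to);
    }

    public static void main(String[] args) {
        // Assumed rule: keep scheme and host, dropping an optional "www." and the path.
        List<Rule> rules = Arrays.asList(
            new Rule("^(https?://)(?:www\\.)?([^/]+).*$", "$1$2"));

        System.out.println(matchesAfterRewrite(
            "http://www.example.com/a", "http://example.com/b", rules)); // true
        System.out.println(matchesAfterRewrite(
            "http://www.example.com/a", "http://example.org/b", rules)); // false
    }
}
```

Supporting two separate regexes would amount to passing one rule list for the from URL and another for the to URL instead of sharing a single list.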

okedoki and others added 5 commits January 22, 2018 09:44

Merge branch 'master', remote branch 'origin'

a898b5c

Merge remote branch 'origin/master'

07cfff4

Merge remote branch 'origin/master'

8d5b917

bidirectionexemptionurlfilter added

3977018

added db-ignore-external-exemptions-bidirectional.txt in conf folder

bbe5409

sebastian-nagel requested changes Mar 6, 2018

okedoki added 2 commits March 7, 2018 14:26

fixed name of jar for bidirectional exemption url filter

3b07d03

refactoring of bidirectional exemption filter, inspired by urlnormali…

20c46ed

…zer-regex

okedoki added 2 commits March 21, 2018 13:43

added super setup of config for BidirectionalExemptionUrlFilter

375bc56

FileReader was replaced with Reader in urlfilter-ignoreexempt-bidirec…

010c0a2

…tionla

lewismc changed the title ~~NUTCH-2522~~ NUTCH-2522 Bidirectional URL exemption filter Jan 14, 2021

okedoki wants to merge 9 commits into apache:master from okedoki:NUTCH-2522
