Here is a handy Linux command that extracts only URLs / website addresses (including the http:// prefix) from the output of another command.
This is how I extract only the URLs from a web page:
curl --silent http://domain.tld/page.htm | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?'
RESULT:
http://domain1.tld
http://www.domain2.tld/page.html
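To check the pattern without fetching a live page, the same grep can be fed a made-up test string with echo (the string below is just an example; the flags, per the grep man page: -a treats binary input as text, -h suppresses file name prefixes, -o prints only the matched part, -P enables Perl-compatible regexes):
echo 'see http://domain1.tld and http://www.domain2.tld/page.html here' | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?'
This prints exactly the two URLs shown in the result above, one per line.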
If I want to trim away http://www. and http://, I extend the command:
curl --silent http://domain.tld/page.htm | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?' | sed -e "s/http:\/\/www\.//g" | sed -e "s/http:\/\///g"
RESULT:
domain1.tld
domain2.tld/page.html
(trim away http://www. first, then http://; in the opposite order, http://www.domain2.tld would have its http:// stripped first and the leftover www. would never match)
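The two sed passes can also be collapsed into a single sed call (a minimal variant: | is used as the substitution delimiter so the slashes need no escaping, and ^ anchors each substitution to the start of the line, which is safe because grep -o prints one URL per line):
curl --silent http://domain.tld/page.htm | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?' | sed -e 's|^http://www\.||' -e 's|^http://||'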
To remove duplicate URLs, append " | sort -u" to the end of the command. Note that this also sorts the URLs, which may be unwanted.
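If the sorting is unwanted, a common alternative is to deduplicate with awk, which keeps the original order (a sketch built on the command above; seen is an associative array counting occurrences of each line, so only the first copy of each URL is printed):
curl --silent http://domain.tld/page.htm | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?' | sed -e 's|^http://www\.||' -e 's|^http://||' | awk '!seen[$0]++'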