Fli
05-31-2015, 10:43 AM
Here is a handy Linux command that extracts only the URLs / website addresses (including the http:// prefix) from the output of another command.
This is how I extract only the URLs from a web page:
curl --silent http://domain.tld/page.htm | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?'
RESULT:
http://domain1.tld
http://www.domain2.tld/page.html
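To try the pattern without fetching anything, here is a minimal sketch that wraps the same grep in a shell function and feeds it a made-up HTML snippet. The function name extract_urls and the sample HTML are my own placeholders, not part of the original command, and it assumes GNU grep built with PCRE support (the -P option).

```shell
# Hypothetical helper: reads text on stdin, prints each matched URL on its own line.
# Assumes GNU grep with -P (PCRE) support.
extract_urls() {
    grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?'
}

# Made-up sample input standing in for a downloaded page.
printf '%s\n' \
    '<a href="http://domain1.tld">one</a>' \
    '<a href="http://www.domain2.tld/page.html">two</a>' |
    extract_urls
```

Because "s" is inside the character class that follows "http", the same pattern should also pick up https:// addresses.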
If I want to trim away http://www. and http://, I extend the command:
curl --silent http://domain.tld/page.htm | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?' | sed -e "s/http:\/\/www\.//g" | sed -e "s/http:\/\///g"
RESULT:
domain1.tld
domain2.tld/page.html
(The first sed trims away http://www., then the second trims any remaining http://.)
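The two trimming steps can probably be folded into one sed call. This is only a sketch: the function name strip_prefix is hypothetical, it assumes one URL per line (which is what grep -o produces), and unlike the command above it also strips https://, which is my own addition.

```shell
# Hypothetical helper: strips a leading http:// or https://, plus an optional "www.".
# Assumes one URL per line on stdin, as produced by grep -o.
strip_prefix() {
    sed -E 's#^https?://(www\.)?##'
}

# Sample input taken from the RESULT lines above.
printf '%s\n' 'http://domain1.tld' 'http://www.domain2.tld/page.html' | strip_prefix
```

Using # as the sed delimiter avoids having to escape every slash in the URL prefix.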
To remove duplicate URLs, add " | sort -u" to the end of the command. Note that this also sorts the URLs, which may be unwanted.
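If the sorting is unwanted, one alternative (not part of the original command) is an awk filter that drops repeated lines but keeps the first occurrence of each in its original position. The function name dedupe_keep_order is a placeholder of mine.

```shell
# Hypothetical helper: prints each distinct line once, in order of first appearance.
# awk keeps a counter per line; a line is printed only while its counter is still 0.
dedupe_keep_order() {
    awk '!seen[$0]++'
}

# Sample input with one repeated entry.
printf '%s\n' 'domain2.tld/page.html' 'domain1.tld' 'domain2.tld/page.html' | dedupe_keep_order
```

To use it, append " | awk '!seen[$0]++'" to the end of the command instead of " | sort -u".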