Sed, how to extract URL out of Linux command output



Fli
05-31-2015, 10:43 AM
Here is a handy Linux command that extracts only the URLs / website addresses (including the http:// prefix) from the output of another command.

This is how I extract only the URLs from a web page:

curl --silent http://domain.tld/page.htm | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?'

RESULT:
http://domain1.tld
http://www.domain2.tld/page.html
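
To try the pattern without fetching a live page, it can be fed some sample text on stdin. This is only a minimal sketch: the helper name extract_urls and the sample HTML are made up for illustration, and it assumes a GNU grep built with -P (PCRE) support.

```shell
# Same pattern as in the command above, kept in a variable for reuse.
url_re='http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?'

# extract_urls (hypothetical helper name): read text on stdin,
# print each matched URL on its own line.
extract_urls() {
    grep -ahoP "$url_re"
}

# Made-up sample input standing in for the curl output.
printf '%s\n' \
    '<a href="http://domain1.tld">one</a>' \
    '<a href="http://www.domain2.tld/page.html">two</a>' | extract_urls
```

Because the character class after "http" also accepts the letter s, the same pattern picks up https:// addresses as well.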

If I want to trim away http://www. and http://, I extend the command:

curl --silent http://domain.tld/page.htm | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?' | sed -e 's/http:\/\/www\.//g' -e 's/http:\/\///g'

RESULT:
domain1.tld
domain2.tld/page.html

(Order matters: trim http://www. first, then http://; otherwise the www. would be left behind.)
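
The two substitutions can probably be folded into one. The sketch below is a variation, not the command from this post: it assumes a sed that supports -E (extended regular expressions), the helper name strip_scheme is made up, and stripping https:// is an addition of mine. It anchors at the start of the line, which fits here because grep -o prints one URL per line.

```shell
# strip_scheme (hypothetical helper name): remove a leading http:// or
# https:// and an optional "www." from each input line.
# '#' is used as the s-command delimiter so the slashes need no escaping.
strip_scheme() {
    sed -E 's#^https?://(www\.)?##'
}

# Made-up sample input standing in for the grep output.
printf '%s\n' \
    'http://domain1.tld' \
    'https://www.domain2.tld/page.html' | strip_scheme
```

With this variant the order of the two cases no longer has to be managed by hand, since the optional group handles "www." in the same pass.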

To remove duplicate URLs, append " | sort -u" to the end of the command. Note that this also sorts the URLs, which may be unwanted.
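
If the original order should be kept, one common alternative to sort -u is a small awk filter that prints a line only the first time it is seen. This is a sketch rather than part of the original command; the helper name dedupe_keep_order and the sample lines are made up.

```shell
# dedupe_keep_order (hypothetical helper name): print each distinct line
# once, in order of first appearance. seen[$0]++ is 0 (false) the first
# time a line occurs, so !seen[$0]++ is true only for that occurrence.
dedupe_keep_order() {
    awk '!seen[$0]++'
}

# Made-up sample input with one duplicate.
printf '%s\n' \
    'domain2.tld/page.html' \
    'domain1.tld' \
    'domain2.tld/page.html' | dedupe_keep_order
```

Appending " | awk '!seen[$0]++'" to the pipeline in place of " | sort -u" should give the same effect inline.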