Linux, how to extract URLs out of webpage / RSS feed / sitemap



Fli
05-31-2015, 09:03 AM
I wanted to monitor the Artists Against 419 RSS feed (http://www.aa419.org/rss.php) for newly added scam sites, so I wrote a Linux bash script to extract the scam domains.

This is an educational tutorial on how to extract any type of content from a webpage. If you only want to extract URLs, read this simpler tutorial (http://internetlifeforum.com/showthread.php?t=3468).

Here is the output of that feed:


# curl http://www.aa419.org/rss.php
<?xml version="1.0" ?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
<channel>
<title>Artist Against 419</title>
<link>http://www.aa419.org/</link>
<description>The 20 latest fake banks from Artist Against 419</description>
<language>en-us</language>
<item>
<title>ATD Logistics</title>
<link>http://db.aa419.org/fakebanksview.php?key=100984</link>
<guid isPermaLink="true">http://db.aa419.org/fakebanksview.php?key=100984</guid>
<description>URL: http://www.logisticss.net
Status: active
</description>
</item>
<item>
<title>Trans Cargo International Shipping TCS Inc</title>
<link>http://db.aa419.org/fakebanksview.php?key=100983</link>
<guid isPermaLink="true">http://db.aa419.org/fakebanksview.php?key=100983</guid>
<description>URL: http://www.tcs-shipping.com
Status: active
</description>
</item>
...

I want to extract the scam site domains. The line that interests me is:
<description>URL: http://www.*

So I use curl to fetch the feed and grep to keep only those lines:

curl --silent http://www.aa419.org/rss.php | grep "URL:"

"--silent" suppresses curl's progress meter and error messages, which would otherwise clutter the terminal when the output is piped.

result:

<description>URL: http://www.somedomain1.tld
<description>URL: http://www.somedomain2.tld
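To try the grep step without hitting the live feed, a small excerpt can stand in for the curl output. This is only a sketch: sample_feed is a made-up helper name, and its two items are copied from the feed output quoted above.

```shell
# Hypothetical helper: prints a two-item excerpt shaped like the feed above.
sample_feed() {
cat <<'EOF'
<item>
<title>ATD Logistics</title>
<description>URL: http://www.logisticss.net
Status: active
</description>
</item>
<item>
<title>Trans Cargo International Shipping TCS Inc</title>
<description>URL: http://www.tcs-shipping.com
Status: active
</description>
</item>
EOF
}

# Keep only the lines that carry the scam site URL.
sample_feed | grep "URL:"
# prints:
# <description>URL: http://www.logisticss.net
# <description>URL: http://www.tcs-shipping.com
```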

The URLs have a space in front of them, so I delete all spaces with the tr command:

curl --silent http://www.aa419.org/rss.php | grep "URL:" | tr -d " "

result:

<description>URL:http://www.somedomain1.tld
<description>URL:http://www.somedomain2.tld
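A quick way to see what tr -d does is to feed it a single line. The sketch below uses a placeholder line in the same shape as the results above. Keep in mind that tr -d " " deletes every space on the line, not only the one after "URL:", which is fine here because the URLs themselves contain no spaces.

```shell
# Placeholder line in the same shape as the grep result.
line='<description>URL: http://www.somedomain1.tld'

# tr -d " " deletes every space character it reads from stdin.
printf '%s\n' "$line" | tr -d " "
# prints: <description>URL:http://www.somedomain1.tld
```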

Then I want to get rid of "<description>URL:http://www." so that I am left with plain domains, since those are what I want to search for in my Apache config:

curl --silent http://www.aa419.org/rss.php | grep "URL:" | tr -d " " | sed -e "s|<description>URL:http://www\.||"

result:

somedomain1.tld
somedomain2.tld
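The grep, tr and sed steps can also be folded into one sed call. The sketch below is one possible variant, not the only way to do it: extract_domains is a made-up helper name, and it assumes a sed that understands -E (GNU and BSD sed both do). It also accepts https URLs and hosts without "www.", which the feed excerpt above does not show but which may turn up.

```shell
# Hypothetical helper: reads feed text on stdin, prints one bare domain per line.
extract_domains() {
    # -n plus the trailing p prints only lines where the substitution matched.
    # Group 1 is the optional "www." prefix, group 2 is the host name that is kept.
    sed -n -E 's|.*<description>URL: *https?://(www\.)?([^/ ]*).*|\2|p'
}

# Offline demo on two lines taken from the feed output.
printf '%s\n' \
    '<description>URL: http://www.logisticss.net' \
    'Status: active' \
    '<description>URL: http://www.tcs-shipping.com' | extract_domains
# prints:
# logisticss.net
# tcs-shipping.com
```

With the live feed it would be used as: curl --silent http://www.aa419.org/rss.php | extract_domains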

I can now write the scam domains to a temporary file by adding " > filename" to the end of the command.
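Once the domains are in a file, checking a config against them could look like the sketch below. The helper name find_listed_domains and the throwaway files are assumptions for illustration. With real data the first argument would be the file written with " > filename" and the second the Apache config, whose path differs between distributions.

```shell
# Hypothetical helper: print the lines of a config file ($2) that mention
# any domain listed one per line in a file ($1).
find_listed_domains() {
    # -F: treat each domain as a fixed string, so dots are literal
    # -i: domain names are case-insensitive
    # -f: read the search patterns from the file
    grep -i -F -f "$1" "$2"
}

# Offline demo with throwaway files instead of the live feed and a real config.
domains=$(mktemp)
config=$(mktemp)
printf '%s\n' 'logisticss.net' 'tcs-shipping.com' > "$domains"
printf '%s\n' 'ServerName example.org' 'ServerAlias www.tcs-shipping.com' > "$config"

find_listed_domains "$domains" "$config"
# prints: ServerAlias www.tcs-shipping.com

rm -f "$domains" "$config"
```

Note that an empty line in the domain file would make grep match every line of the config, so it is worth making sure the file has no blank lines.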