
Extract URLs from Google search results page (SERP) on Linux, into file, on screen



Fli
06-17-2015, 05:47 PM
Hello,

bash is nice for automating tasks. I want to share a script that extracts the 10 result URLs from a Google search results page. It can be very useful.


interpret="Shaun Baker feat Maloy (2009)"
song="Give"
# replace spaces with plus signs for the Google search query
interpretquery=${interpret// /+}
songquery=${song// /+}

curl -sLA "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12" --connect-timeout 5 --max-time 10 "http://www.google.com/search?q=%2Bintitle:%22Index+Of+%2F%22+$interpretquery+$songquery+mp3" | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?' | grep -v "google"
# curl -s is silent (no progress meter, no error messages). curl -L follows a possible redirect on the Google domain. curl -A sets the User-Agent string, so the value must not start with "User-Agent: ".

The core of the curl command is the Google search URL:
http://www.google.com/search?q=search+phrase+here
I set two variables in it for my purpose, interpret and song, but you can simplify it by setting one variable only:


# set the search phrase
searchphrase="search phrase here"
# replace spaces with + signs to turn the search phrase into a Google query
searchphrasequery=${searchphrase// /+}
curl -sLA "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12" --connect-timeout 5 --max-time 10 "http://www.google.com/search?q=$searchphrasequery" | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?' | grep -v "google"
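
Replacing spaces with plus signs, as the scripts above do, only covers spaces. A phrase that contains characters such as &, + or parentheses may need real percent-encoding before it goes into the q= parameter. Here is a minimal sketch of that; the function name urlencode_query is my own invention (it is not part of the original script), and it assumes the phrase is plain ASCII:

```shell
# Hypothetical helper, not from the original post: percent-encode an ASCII
# search phrase for use as the value of the q= parameter.
# The body runs in a subshell ( ... ), so its variables do not leak out.
urlencode_query() (
    LC_ALL=C                      # make the bracket ranges below byte-based
    rest=$1
    out=
    while [ -n "$rest" ]; do
        next=${rest#?}            # everything after the first character
        ch=${rest%"$next"}        # the first character itself
        rest=$next
        case $ch in
            [a-zA-Z0-9.~_-]) out=$out$ch ;;                        # unreserved (RFC 3986): keep
            ' ')             out=$out+ ;;                          # space: plus sign, as in the script
            *)               out=$out$(printf '%%%02X' "'$ch") ;;  # anything else: %XX
        esac
    done
    printf '%s\n' "$out"
)

query=$(urlencode_query 'Shaun Baker feat Maloy (2009)')
printf '%s\n' "$query"    # prints Shaun+Baker+feat+Maloy+%282009%29
```

The result can then be used in place of the ${variable// /+} expansion, for example "http://www.google.com/search?q=$query".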

If you don't want to use it in a script and just need a one-time list of URLs from the SERP in a file, run this Linux command:



curl -sLA "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12" --connect-timeout 5 --max-time 10 "http://www.google.com/search?q=search+phrase+here" | grep -ahoP 'http[-a-zA-Z0-9@:%_\+.~#?&//=]{2,256}\.[a-z]{2,4}\b(\/[-a-zA-Z0-9@:%_\+.~#?&//=]*)?' | grep -v "google" > filename

Replace "search+phrase+here" with the phrase you are searching for, and "filename" with the name of the output file.

---------

If you are getting malformed URLs out of the above commands, try this alternative command to extract the SERP links:


lynx --dump "http://www.google.com/search?q=search+phrase+here" | grep -o '?q=http.*&sa' | awk -F'?q=|&sa' '{print $2}' | grep -v "google"
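
The lynx command above relies on each result link having the form ...?q=TARGET&sa=... in the dump. Because '.*' is greedy, two such links on one line would be merged into a single match. A possible variation, as a sketch only: the function name extract_serp_targets is hypothetical, it reads the dump text on standard input so it can be tried without touching the network, and it also drops duplicate URLs:

```shell
# Hypothetical helper, not from the original post: read lynx dump text on
# stdin and print each redirect target once, in first-seen order.
extract_serp_targets() {
    grep -o '?q=http[^&]*&sa' |             # [^&]* stops at the first &, one match per link
        sed -e 's/^?q=//' -e 's/&sa$//' |   # strip the ?q= prefix and the &sa suffix
        grep -v 'google' |                  # same filter as the original command
        awk '!seen[$0]++'                   # print a line only the first time it appears
}

# Offline example with made-up dump lines:
printf '%s\n' \
    ' 1. http://www.google.com/url?q=http://example.com/a&sa=U  2. http://www.google.com/url?q=https://example.org/b/&sa=U' \
    ' 3. http://www.google.com/url?q=http://example.com/a&sa=U' |
    extract_serp_targets
```

With a live page it would be used as: lynx --dump "http://www.google.com/search?q=search+phrase+here" | extract_serp_targets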

To instead extract the links from the pages found on the Google SERP (download every page listed on the SERP and extract the links from each of them):


for url in $(lynx --dump "http://www.google.com/search?q=search+phrase+here" | grep -o '?q=http.*&sa' | awk -F'?q=|&sa' '{print $2}' | grep -v "google");do lynx --dump "$url" | awk '/(http|https):\/\// {print $2}';done
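
The awk filter in the loop above prints the second field of every line that mentions http:// or https://, which can also pick up words from the page text. A stricter variant, sketched here under the assumption that lynx lists links in its References section as a number followed by the URL (for example "   1. http://..."); the function name extract_dump_links is hypothetical:

```shell
# Hypothetical helper, not from the original post: read a lynx dump on stdin
# and print only the URLs from numbered reference lines such as "  1. http://...".
extract_dump_links() {
    awk '$1 ~ /^[0-9]+\.$/ && $2 ~ /^https?:\/\// { print $2 }'
}

# Offline example with made-up dump lines:
printf '%s\n' \
    'Visit our site today, see http://example.com for details' \
    'References' \
    '   1. http://example.com/one' \
    '   2. https://example.org/two' \
    '   3. mailto:me@example.com' |
    extract_dump_links
```

Inside the loop it would replace the awk call: lynx --dump "$url" | extract_dump_links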