Possible to dump AJAX content from webpage? - javascript

I would like to dump all the names on this page and all the remaining 146 pages.
The red/orange previous/next buttons seem to use JavaScript and fetch the names via AJAX.
Question
Is it possible to write a script to crawl the 146 pages and dump the names?
Do Perl modules exist for this kind of thing?

You can use WWW::Mechanize or another crawler for this. Web::Scraper might also be a good idea.
use Web::Scraper;
use URI;
use Data::Dump;

# First, create your scraper block
my $scraper = scraper {
    # grab the text nodes from all elements with class type_firstname
    # (that way you could also classify them by type)
    process ".type_firstname", "list[]" => 'TEXT';
};

my @names;
foreach my $page ( 1 .. 146 ) {
    # Fetch the page (add page number param)
    my $res = $scraper->scrape( URI->new("http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/godkendtefornavne/drengenavne/?tx_lfnamelists_pi2[gotopage]=" . $page) );
    # add them to our list of names
    push @names, $_ for @{ $res->{list} };
}
dd \@names;
It will give you a very long list with all the names. Running it may take some time. Try with 1..1 first.

In general, try using WWW::Mechanize::Firefox which will essentially remote-control Firefox.
For that particular page though, you can just use something as simple as HTTP::Tiny.
Just make POST requests to the URL and pass the parameter tx_lfnamelists_pi2[gotopage] from 1 to 146.
Example at http://hackst.com/#4sslc for page #30.
Moral of the story: always look in Chrome's Network tab and see what requests the web page makes.
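A rough Python sketch of that approach (the answers above use Perl; requests and BeautifulSoup are swapped in here as assumptions, along with the type_firstname selector from the first answer):

import requests
from bs4 import BeautifulSoup

URL = ("http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/"
       "godkendtefornavne/drengenavne/")

names = []
for page in range(1, 147):
    # Pass the paging parameter seen in the browser's Network tab
    resp = requests.post(URL, data={"tx_lfnamelists_pi2[gotopage]": page})
    soup = BeautifulSoup(resp.text, "html.parser")
    # The Web::Scraper answer targets elements with class "type_firstname"
    names.extend(el.get_text(strip=True) for el in soup.select(".type_firstname"))

print(len(names))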

Related

scrapy + selenium: <a> tag has no href, but content is loaded by javascript

I'm almost there with my first attempt at using Scrapy and Selenium to collect data from a website with JavaScript-loaded content.
Here is my code:
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.webdriver.common.by import By
import time

class FreePlayersSpider(scrapy.Spider):
    name = 'free_players'
    allowed_domains = ['www.forge-db.com']
    start_urls = ['https://www.forge-db.com/fr/fr11/players/?server=fr11']
    driver = {}

    def __init__(self):
        self.driver = webdriver.Chrome('/home/alain/Documents/repository/web/foe-python/chromedriver')
        self.driver.get('https://forge-db.com/fr/fr11/players/?server=fr11')

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        #time.sleep(1)
        sel = Selector(text=self.driver.page_source)
        players = sel.xpath('.//table/tbody/tr')
        for player in players:
            joueur = player.xpath('.//td[3]/a/text()').get()
            guilde = player.xpath('.//td[4]/a/text()').get()
            yield {
                'player': joueur,
                'guild': guilde
            }
        next_page_btn = self.driver.find_element_by_xpath('//a[@class="paginate_button next"]')
        if next_page_btn:
            time.sleep(2)
            next_page_btn.click()
            yield scrapy.Request(url=self.start_urls, callback=self.parse)
        # Close the selenium driver, so in fact it closes the testing browser
        self.driver.quit()

    def parse_players(self):
        pass
I want to collect the user names and their guilds and output them to a CSV file.
For now my issue is how to proceed to the NEXT PAGE and parse the content loaded by JavaScript again.
Even if I'm able to simulate a click on the NEXT tag, I'm not 100% sure the code will work through all the pages, and I'm not able to parse the new content using the same function.
Any idea how I could solve this issue?
Thanks.
Instead of using Selenium, you should try to recreate the request that updates the table. If you look closely under Chrome's devtools, you can see that the request is made with parameters and a response is sent back with the data in a nicely structured format.
Please see here regarding dynamic content in Scrapy. As it explains, the first step is to ask: is it necessary to recreate browser activity, or can I get the information I need by reverse engineering the HTTP requests? Sometimes the information is hidden within <script></script> tags and you can use some regex or string methods to get what you want. Rendering the page and then driving browser activity should be thought of as a last resort.
Before I go into some background on reverse engineering the requests: the website you're trying to get information from only requires reverse engineering the HTTP requests.
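As a side note on the <script> case mentioned above, a minimal sketch of that technique (the variable name and embedded JSON here are purely hypothetical):

import json
import re

# Hypothetical page snippet with data embedded in a script tag
html = '<script>var tableData = {"rows": [["Alice", "GuildA"]]};</script>'

match = re.search(r'var tableData = (\{.*?\});', html)
if match:
    data = json.loads(match.group(1))
    print(data["rows"])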
Reverse Engineering HTTP requests in Scrapy
For the website itself, we can use Chrome's devtools by right-clicking the page and choosing Inspect. Clicking the Network tab lets you see all the requests the browser makes to render the page. In this case you want to see what happens when you click next.
Image1: here
Here you can see all the requests made when you click next on the page. I always look for the biggest sized response as that'll most likely have your data.
Image2: here
Here you can see the request headers/params etc... things you need to make a proper HTTP request. We can see that the referring URL is actually getplayers.php with all the params to get the next page added on. If you scroll down you can see all the same parameters it sends to getplayers.php. Keep this in mind, sometimes we need to send headers, cookies and parameters.
Image3: here
Here is the preview of the data we would get back from the server if we make the correct request; it's in a nice neat format, which is great for scraping.
Now you could copy the headers, parameters and cookies here into Scrapy, but it's always worth checking first whether simply making an HTTP request to the URL gets you the data you want; if so, that is the simplest way.
In this case it's true, and in fact you get all the data back in a nice neat format.
Code example
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['forge-db.com']

    def start_requests(self):
        url = 'https://www.forge-db.com/fr/fr11/getPlayers.php?'
        yield scrapy.Request(url=url)

    def parse(self, response):
        for row in response.json()['data']:
            yield {'name': row[2], 'guild': row[3]}
Settings
In settings.py, you need to set ROBOTSTXT_OBEY = False. The site doesn't want you to access this data, so we need to set it to False. Be careful: you could end up getting banned from the server.
I would also suggest a couple of other settings to be respectful and to cache the results, so that if you want to play around with this large dataset you don't hammer the server.
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 3
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
Comments on the code
We make a request to https://www.forge-db.com/fr/fr11/getPlayers.php? and if you were to print the response you would get all the data from the table; it's quite a lot. It looks like it's in JSON format, so we use Scrapy's new response.json() feature to convert it into a Python dictionary; be sure you have an up-to-date Scrapy version to take advantage of this. Otherwise you could use the json library that Python provides to do the same thing.
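If your Scrapy version is too old for response.json(), a fallback along these lines should behave the same (a sketch, not taken from the original answer):

import json

def parse(self, response):
    # json.loads on the raw body does the same job as response.json()
    data = json.loads(response.text)
    for row in data['data']:
        yield {'name': row[2], 'guild': row[3]}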
You have to look at the preview data a bit here, but the individual rows are within response.json()['data'][i], where i is the row index. The name and guild are within response.json()['data'][i][2] and response.json()['data'][i][3], so we loop over every row in response.json()['data'] and grab the name and guild.
If the data weren't as structured as it is here and needed modifying, I would strongly urge you to use Items or ItemLoaders to define the fields you then output. You can modify the extracted data more easily with ItemLoaders, and you can handle duplicate items etc. using a pipeline. These are just some thoughts for the future; I almost never yield a plain dictionary when extracting data, particularly for large datasets.
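For reference, a minimal sketch of what the Item-based variant might look like (the field names are taken from the dictionary above; this is not part of the original answer):

import scrapy

class PlayerItem(scrapy.Item):
    # Declared fields let pipelines and exporters validate and process the data
    name = scrapy.Field()
    guild = scrapy.Field()

# inside TestSpider.parse():
#     for row in response.json()['data']:
#         yield PlayerItem(name=row[2], guild=row[3])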

Selenium/Beautiful Soup scraper failing after looping through one page (Javascript)

I'm attempting to scrape data on food seasonality from the Seasonal Food Guide but hitting a snag. The site has a fairly simple URL structure:
https://www.seasonalfoodguide.org/produce_name/state_name
I've been able to use Selenium and Beautiful Soup to successfully scrape the seasonality information from one page, but on subsequent loops the section of text I'm looking for doesn't actually load so I get AttributeError: 'NoneType' object has no attribute 'text'. I know it's because months_list_raw is coming back empty due to the fact that the 'wheel-months-list' portion of the page isn't loading on the second loop. Code is below. Any ideas?
for ingredient in produce_list:
    for state in state_list:
        # grab page content
        search_url = 'https://www.seasonalfoodguide.org/{}/{}'.format(ingredient, state)
        driver.get(search_url)
        page_soup = soup(driver.page_source, 'lxml')
        # grab list of months
        months_list_raw = page_soup.find('p', {'id': 'wheel-months-list'})
        months_list = months_list_raw.text
The page is being rendered on the client side, which means when you open the page, another request is being made to a backend server to fetch the data based on your selected filters. So the issue is that when you open the page and read the HTML, the content is not fully loaded yet. The simplest thing you could do is sleep for some time after opening the page with Selenium in order to wait for it to fully load. I've tested your code by throwing in time.sleep(3) after the driver.get(search_url) and it worked fine.
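Concretely, the pause goes right after the page load (a minimal sketch; three seconds is simply the value reported to work above, and an explicit WebDriverWait would be a more robust alternative):

import time

driver.get(search_url)
time.sleep(3)  # give the client-side rendering time to finish
page_soup = soup(driver.page_source, 'lxml')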
To prevent the error from occurring and to continue with your loop, you need to check whether the months_list_raw element is None. It seems like some of the produce pages do not have any data for some states, so you will need to handle that in your program however you want.
for ingredient in produce_list:
    for state in state_list:
        # grab page content
        search_url = 'https://www.seasonalfoodguide.org/{}/{}'.format(ingredient, state)
        driver.get(search_url)
        page_soup = soup(driver.page_source, 'lxml')
        # grab list of months
        months_list_raw = page_soup.find('p', {'id': 'wheel-months-list'})
        if months_list_raw is not None:
            months_list = months_list_raw.text
        else:
            # Handle case where ingredient/state data doesn't exist
            pass

Perl Mechanize with Javascript

I started working with Perl's WWW::Mechanize and took on a task to automate, but got stuck with JavaScript on the website.
The website I am trying my code on has JavaScript-based navigation (the URL remains the same) between menu sections.
Take a look here
The code I have written so far gets me the link which redirects to the menu as shown in the image.
$url="https://my.testingurl.com/nav/";
my $mech=WWW::Mechanize->new(autocheck => 1);
$mech->get($url);
$mech->form_name("LoginForm");
$mech->field('UserName','username');
$mech->field('UserPassword','password');
$mech->submit_form();
my $page=$mech->content;
if($page =~ /<meta\s+http-equiv="refresh"\s+content="\d+;\s*url=([^"+]*)"/mi)
{$url=$1 }
$mech->get($url);
print Dumper $mech->find_link(text_regex=>qr/View Results/);
and this is the output.
$VAR1 = bless( [
          '#',
          'View Results',
          undef,
          'a',
          bless( do{\(my $o = 'https://my.testingurl.com/nav/')}, 'URI::https' ),
          {
            'onclick' => 'PageActionGet(\'ChangePage\',\'ResultsSection\',\'\',\'\', true)',
            'href' => '#'
          }
        ], 'WWW::Mechanize::Link' );
Now I am clueless about how to proceed with clicking the link shown in the output and doing the same with another part of the navigation.
Please help.
You can't. WWW::Mechanize doesn't support JavaScript.
WWW::Mechanize doesn't support JavaScript. This leaves you with two basic options:
Reverse engineer the JavaScript, scrape any data you need out of it with Mechanize, then trigger any HTTP interactions yourself. In this case, it might involve extracting the "ResultsSection" string and matching it to some data from elsewhere in the page (or possibly an external JavaScript file).
Switch to a different tool which does support JavaScript (such as Selenium).
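As a rough illustration of the second option, here is a hedged sketch in Python/Selenium rather than Perl; it reuses the form field names and link text from the question, but the real page will need its own selectors and waits:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://my.testingurl.com/nav/")

# Log in using the same field names as the Mechanize script above
driver.find_element(By.NAME, "UserName").send_keys("username")
driver.find_element(By.NAME, "UserPassword").send_keys("password")
driver.find_element(By.NAME, "LoginForm").submit()

# The JavaScript-driven link can simply be clicked; the browser runs PageActionGet for us
driver.find_element(By.LINK_TEXT, "View Results").click()
print(driver.page_source)

driver.quit()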

How to display 100K XML based DOM data on a JSP client page?

I need some help in understanding what to do next.
I need to write a web based search function to find medical records from an XML file.
The operator can enter either part or all of a patient name and Hit Search on the JSP web page.
The server is supposed to then return a list of possible patient names, with the option for the operator to go to the next page until a possible patient is found. They can then select the person and view more details.
On the Server side
I have an XML file with about 100,000 records. There are five different types of records in the file. (This is roughly about 20,000 x 5 = 100,000).
I have a java class to source the xml file and create a DOM to traverse the data elements found on the file.
-- XML File Begin
100k - XML file outline
<hospital>
  <infant key="infant/0002DC15" diagtype="general entry" mdate="2015-02-18">
    <patient>James Holt</patient>
    <physician>Michael Cheng</physician>
    <physician>David Long</physician>
    <diagnosisCode>IDC9</diagnosisCode>
    ..
  </infant>
  <injury key="injury/0002IC15" diagtype="general entry" mdate="2015-03-14">
    <patient>Sara Lee</patient>
    <physician>Michael Cheng</physician>
    <diagnosisCode>IEC9</diagnosisCode>
    ..
  </injury>
  <terminal key="terminal/00X2IC15" diagtype="terminal entry" mdate="2015-05-14">
    <patient>Jason Man</patient>
    <physician>John Hoskin</physician>
    <diagnosisCode>FEC9</diagnosisCode>
    <diagnosisCode>FXC9</diagnosisCode>
    ..
  </terminal>
  <aged key= xxxx ... >
    ...
  </aged>
  <sickness key= xxxx ... >
    ...
  </sickness>
</hospital>
Approx. 5 x 20,000 = 100K records.
Key and patient are the only mandatory fields. The rest of the elements are optional or repeating.
-- XML File End
Here is where I need help
Once I have the DOM how do I go forward in letting the client know what was found in the XML file?
Do I create a MAP to hold the element node links and then forward say 50 links at a time to the JSP and then wait to send some more links when the user hits next page?
Is there an automated way of displaying the links, via JavaScript, jQuery, or XSLT, or do I just create an HTML table and place the patient links inside the rows? Is there some rendering-specific thing I have to do in order to display the data depending on the browser used by the client?
Any guidance, tutorials, examples or books I can refer to would be greatly appreciated.
Thank you.
I don't know an automatic way to match the type in jQuery, but you can test the attributes, e.g. verify whether a non-optional attribute is present on the object:
// Non optional Infant attribute
if (obj.nonOptionalAttribute) {
    // handle Infant object
}
Or you may add an attribute to differentiate the types (something like a String or int attribute to test in your Javascript).
if (obj.type == 'infant') {
    // handle Infant object
}
@John.west,
You can try to bind the XML to a list of objects (something like Injury implements MyXmlNodeMapping, Terminal implements MyXmlNodeMapping, Infant implements MyXmlNodeMapping, and so on, and keep them in a List) so you can iterate and search by value on the back end; or you can pass the XML file to JavaScript (if you are using jQuery, you can use a get or a post with the result type defined as XML) and iterate over the objects to find what the user is looking for.
Your choice may come down to whether you prefer to spend processor time on the server side or on the client side.

Scraping with Nokogiri and Ruby before and after JavaScript changes the value

I have a program that scrapes values from https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj
My current code is:
doc = Nokogiri::HTML(open(source_url))
puts doc.css('span.indexDate').text
date = doc.css('span.indexDate').text
date = Date.parse(date)
puts date
values = doc.css('table#CdsIndexTable td.col2 span')
puts values
This scrapes the date and the values of the second column from the "CDS Indexes" table correctly, which is fine. Now I want to scrape similar values from the "Bond Indexes" table, which is where I am facing the problem.
I can see a JavaScript function changes it without reloading the page and without changing the URL. The difference between these two tables is that their IDs are different, which is exactly as it should be. But unfortunately, when I try:
values = doc.css('table#BondIndexTable')
puts values
I get nothing from the Bond Indexes table. But I do get values from the CDS Indexes table if I use:
values = doc.css('table#CdsIndexTable')
puts values
How can I get the values from both tables?
You can use Capybara with the Poltergeist driver to execute the Javascript and format the page. Poltergeist is a wrapper for the PhantomJS headless browser. Here's an example of how you can do it:
require 'rubygems'
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

Capybara.default_driver = :poltergeist
Capybara.run_server = false

module GetPrice
  class WebScraper
    include Capybara::DSL

    def get_page_data(url)
      visit(url)
      doc = Nokogiri::HTML(page.html)
      doc.css('td.col2 span')
    end
  end
end

scraper = GetPrice::WebScraper.new
puts scraper.get_page_data('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj').map(&:text).inspect
Visit here for a complete example using Amazon.com:
https://github.com/wakproductions/amazon_get_price/blob/master/getprice.rb
If you don't want to use PhantomJS, you can also use the network sniffer in the Firefox or Chrome development tools, and you will see that the HTML table data is returned by a JavaScript POST request to the server.
Then, instead of opening the original page URL with Nokogiri, you'd run this POST from your Ruby script and parse and interpret that data. It looks like it's just JSON data with HTML embedded in it. You could extract the HTML and feed that to Nokogiri.
It requires a bit of extra detective work, but I've used this method many times with JavaScript web pages and scraping. It works OK for most simple tasks, but it requires a bit of digging into the inner workings of the page and network traffic.
Here's an example of the JSON data from the Javascript POST request:
Bonds:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ
CDS:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=cds&ClientCode=WSJ
Here's the quick and dirty solution just so you get an idea. This will grab the cookie from the initial page and use it in the request to get the JSON data, then parse the JSON data and feed the extracted HTML to Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'json'
# Open the initial page to grab the cookie from it
p1 = open('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj')
# Save the cookie
cookie = p1.meta['set-cookie'].split('; ',2)[0]
# Open the JSON data page using our cookie we just obtained
p2 = open('https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ',
'Cookie' => cookie)
# Get the raw JSON
json = p2.read
# Parse it
data = JSON.parse(json)
# Feed the html portion to Nokogiri
doc = Nokogiri.parse(data['html'])
# Extract the values
values = doc.css('td.col2 span')
puts values.map(&:text).inspect
=> ["0.02%", "0.02%", "n.a.", "-0.03%", "0.02%", "0.04%",
"0.01%", "0.02%", "0.08%", "-0.01%", "0.03%", "0.01%", "0.05%", "0.04%"]
PhantomJS is a headless browser with a JavaScript API. Since you need to run the scripts on the page you are scraping, a browser will do that for you; and PhantomJS will allow you to manipulate and scrape the page after the script execution.
