Is it possible to extract data from these websites that don't output data in the HTML source code? - javascript

Many years ago I used to use Perl and Python to crawl through some websites by looking at data in the HTML source code.
Now I would like to do another personal project that involves extracting numerical data from:
Table elements on this PredictIt Website
Individual graph elements (x and y for each) on this PredictWise Website
Individual graph elements (x and y for each) on this Five Thirty Eight Website
None of these web pages' HTML source code contains the numerical data. Is there a way to extract this data? If so, where?
I feel like there must be a way, because this is all front-end information that the browser needs in order to render the charts and graphs.
(I can't find raw data provided to developers on these web pages, so I guess I have to extract the data myself.)

The table elements on the first link are indeed readable from the rendered HTML. If using Chrome, right click on the text and choose "Inspect." The Chrome debugger will show you the exact HTML element that contains the data.
The other links are more difficult. I don't see a way to view the data in the raw HTML, but on the second link I can see the JSON data that supplies the graphs, served from the server. You may be able to parse that for your project.
The data look like this:
{"id":"1687","name":"Hawaii Caucus - DEM","notes":"","suppress_timestamp":"0","header":["Outcome","PredictWise","Derived Betfair Price","Betfair Back","Betfair Lay","Pollster","Derived PredictIt"],"default_sort":"2","default_sort_dir":"desc","shade_cols":["1"],"history":[{"timestamp":"03-17-2016 1:03PM","table":[["Hillary Clinton","43 %",null,null,null,null,"$ 0.425"],["Bernie Sanders","57 %",null,null,null,null,"$ 0.570"]]},...
Open the Chrome debugger on that website and go to the Network tab. From there, look for requests for "table_xxxx.json". You can see the URL for requesting the data, and the raw data returned from the server.
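For example, here's a minimal Node.js sketch of fetching and parsing one of those feeds. The host and path are assumptions for illustration; substitute the exact "table_xxxx.json" URL you observe in the Network tab:

// Fetch the JSON feed behind a chart and print each history snapshot.
// NOTE: this URL is hypothetical; copy the real one from the Network tab.
const https = require('https');

https.get('https://www.predictwise.com/web/table_1687.json', (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const table = JSON.parse(body);
    // Per the sample above, "history" holds timestamped rows that
    // line up with the "header" columns.
    for (const snapshot of table.history) {
      console.log(snapshot.timestamp, snapshot.table);
    }
  });
});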
Hope this helps!

Related

Converting Excel Data to a Chart in HTML dynamically

Is it possible to upload an Excel document with varying ranges of data, and have that data dynamically displayed in a basic form of chart (bar, pie, etc.) on our company website?
After doing some research, I figured the only two possible ways to do something like this are to use a very complicated macro in VBA, or a JavaScript parser to read the data and display it. The data that will eventually go in here will include sensitive information, so I cannot use Google Charts or anything like that.
This problem has to be divided into two parts.
One - Gather and process the information needed to display the chart.
Two - This is the easiest part: displaying a chart in HTML. For this, you can use the www.c3js.org JavaScript library.
Regarding part one, it depends on which technology your website is built in.
For example, if it is in PHP, you will need to find a PHP library that can read and parse Excel files.
Then you have to create a service on your website where the data is going to be provided. For example:
www.yourcompany.com/provideChartData.php
You can format the response as JSON.
Once you have solved that, you only have to call the service from your page, and the data will be displayed dynamically. You can call it using the jQuery library for JavaScript ($.post("www.yourcompany.com/provideChartData.php", function (data) { code to display chart ... }))
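As a rough sketch of that last step, assuming jQuery and c3 are loaded on the page, and assuming the service returns a shape like { labels: [...], values: [...] } (the response shape is an illustration, not something c3 or the service mandates):

// Ask the server-side service for the parsed spreadsheet data as JSON,
// then hand it to c3 to draw a bar chart.
$.post('provideChartData.php', function (data) {
  // assumed response: { labels: ['Q1', 'Q2'], values: [30, 45] }
  c3.generate({
    bindto: '#chart',
    data: {
      columns: [['sales'].concat(data.values)], // ['sales', 30, 45]
      type: 'bar'
    },
    axis: {
      x: { type: 'category', categories: data.labels }
    }
  });
}, 'json');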
There is no real easy way to do this that I have found. I have had to manually parse these things in the past but there are some libraries out there for node that might help you.
https://www.npmjs.com/package/node-xlsx
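A minimal sketch of what reading a workbook with node-xlsx can look like (the file name is just an example; on older versions of the package, parse lives on the module itself rather than on .default):

// Parse a workbook into plain arrays with node-xlsx.
const xlsx = require('node-xlsx').default;

const sheets = xlsx.parse('data.xlsx'); // [{ name, data }, ...]
for (const sheet of sheets) {
  console.log(sheet.name);
  for (const row of sheet.data) {
    console.log(row); // each row is an array of cell values
  }
}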
You can also export from Excel as CSV. When you do this, be sure to set the custom separator to something other than ',' and you should be fine to import it into a large array and get the data/charts you need.
https://github.com/wdavidw/node-csv
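For the CSV route, a small sketch using the csv-parse module from that project (the ';' delimiter matches the custom-separator advice above; adjust it to whatever you export with):

// Read a semicolon-separated export into an array of row arrays.
const fs = require('fs');
const { parse } = require('csv-parse');

const input = fs.readFileSync('data.csv', 'utf8');
parse(input, { delimiter: ';' }, (err, rows) => {
  if (err) throw err;
  console.log(rows); // one inner array per spreadsheet row
});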
Hope that helps.

How to query a website that (I think) uses javascript in data table?

I use Excel to query web pages for data to analyze sports stats (NBA, MLB). Most of the queries have no problem bringing in the data from the web page, but NBA.com gives me a lot of issues. A specific example is http://stats.nba.com/league/player/#!/advanced/ - when I make a web query in Excel, none of the actual player data imports. My question is: is there a way for me to get this data dynamically? Also, even if the data did import, NBA.com splits it into multiple pages that cannot be combined into one big table. Can I get all the data at once, without copying and pasting from the 9+ different table "rows"?

Admin Panel like Custom CMS Framework

I want to create a framework, like an admin panel, which can control almost all aspects of what is shown on the frontend.
For example (the most basic case): suppose the links to be shown in a navigation area are passed from the server, with their order, URLs, etc.
The whole aim is to save time on tedious tasks. You can just start creating menus and assigning pages to them: give the URL and the actual files to be rendered (in the case of static files), or, for dynamic pages, the appropriate file.
And all of this is fully manageable server-side, using different portlet-like components.
So the basic roadmap is to have areas like:
Header Area - which can contain logos, links, etc.
Navigation Area - which can contain links and submenus.
Content Area - this is where the tricky part is: it has zones (Left, Center & Right), each of which holds the order in which its content has to be displayed. So when, someday, we want to change the way the articles appear on the page, we can do so easily, without any deployments. These zones can contain any number of internal elements, like a word cloud or an advertisement area.
Footer Area - again, similar to the Header Area.
There is currently an existing framework that uses XSLT files to pull data from the server side.
For example: if there's a grid, there will be a tag for it embedded in the XSLT file. Whatever the source of the data, we serialize it as XML and feed it to the XSLT file; the HTML is derived from this and appended to the layer in the page.
The problems with this approach are:
The XSLT conversion occurs on the server side, so the server is responsible for getting the data, running the XSLT transform, and appending the generated HTML to the layer div. In my view, firstly, this isn't the server's concern; secondly, for larger applications it might be slower.
Debugging isn't possible for the XSLT transformation, so whenever we face problems with the data it's always a bit of a trial-and-error exercise.
Maintaining it is a tedious job, i.e. styling changes and other such work.
Adding dynamic values is hard: JavaScript can't easily be used with this, and we can't use jQuery or any other libraries, since all of this happens on the server.
For now, what I have thought about is using a Templating - JavaScript - JSON combination in place of XSLT; the work will be offloaded to the client, and rendering will take place there accordingly.
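A minimal sketch of that client-side direction, using jQuery for the fetch (the /api/navigation endpoint and the item fields are hypothetical, just to show the shape of the idea):

// Render a navigation area from server-provided JSON on the client,
// instead of transforming XML with XSLT on the server.
$.getJSON('/api/navigation', function (items) {
  var html = items
    .sort(function (a, b) { return a.order - b.order; })
    .map(function (item) {
      return '<li><a href="' + item.url + '">' + item.title + '</a></li>';
    })
    .join('');
  $('#navigation-area').html('<ul>' + html + '</ul>');
});

A real implementation would use a templating library rather than string concatenation, but the division of labor is the point: the server only serves JSON, and the client renders it.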
What could be other ways to implement a custom CMS?
What could be the problems with a JavaScript-based approach?
Are there any existing frameworks for similar usage?

read information off website and store in excel file

I am trying to build an application that, when provided with a .txt file filled with ISBN numbers, will visit the isbn.nu page for each ISBN by simply appending the ISBN to the URL: www.isbn.nu/<your ISBN number>.
After pulling up the page, I want to scan it for information about the book, and store that in an Excel file.
I was thinking about creating a file stream of the URL in Java, but I am not really sure how to extract the information from the HTML page. Storing the information will be done using the JExcel Java package.
My best guess would be using JavaScript to extract the information, but I don't know how to call JavaScript from my Java program.
Is my idea plausible? If not, what do you suggest I do?
My goal: retrieve information from an HTML page and store it in an Excel file, for each ISBN in a text file. There can be any number of ISBNs in a text file.
This isn't homework btw, I am simply doing this for an organization that donates books to Sudan. Currently they have 5 people cataloging these books manually and I am one of them.
Jsoup is a useful tool for parsing a web page and getting data from it. You can do it in Java and it's pretty easy.
You can parse the text file, build the URL as a string, fetch the page with Jsoup, then use Jsoup to parse out the information using the HTML tags on the page. Then you can store it however you want. You really don't need to use JavaScript at all if you're more comfortable with Java.
Example for reading a page and parsing it with Jsoup:
// Fetch and parse the page into a DOM-like Document
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
// Select elements with a CSS selector: links in bold text in "In the news"
Elements newsHeadlines = doc.select("#mp-itn b a");
Use a div into which you load your link (there's an example of how to do that here: http://api.jquery.com/load/).
After the load is complete, you can check the names of the divs or spans used in the web page and get their content with val (http://api.jquery.com/val/) or text (http://api.jquery.com/text/).
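A rough sketch of that approach (note that load only works for same-origin pages because of the browser's same-origin policy; the URL and the .book-title selector here are placeholders):

// Load the target page into a div, then read values out of it.
$('#result').load('/books/12345.html', function () {
  var title = $('#result .book-title').text(); // placeholder selector
  console.log(title);
});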
Here is text from the main page of www.isbn.nu:
Please note that isbn.nu is designed for manual searching by individuals. It is not intended as an information resource for automated retrieval, nor as a research tool for companies. isbn.nu reserves the right to deny access based on excessive requests.
Why not just use the free Google Books API, which returns book details in XML format? There are many classes available in Java to parse XML feeds, and it would make your life much easier.
See http://code.google.com/apis/books/ for more info.
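For instance, a minimal Node.js sketch against the current v1 of the Books API, which returns JSON rather than the XML feeds of the older GData API linked above (the ISBN is just an example):

// Look up one ISBN and print the title and authors.
const https = require('https');

const isbn = '9780131103627'; // example ISBN
https.get('https://www.googleapis.com/books/v1/volumes?q=isbn:' + isbn, (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const result = JSON.parse(body);
    if (result.totalItems > 0) {
      const info = result.items[0].volumeInfo;
      console.log(info.title, '-', (info.authors || []).join(', '));
    }
  });
});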
Here are the steps needed:
Create a cURL request (you can use multiple cURL requests)
Get the body data
Parse the data
Create the Excel file
You can read HTML information using this guide.
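Sketching those steps in Node.js instead of cURL (cheerio is a common HTML parser for Node; the h1 selector is a placeholder - inspect the actual page to find the right one, and note the site's warning about automated retrieval quoted above):

// One ISBN end-to-end: fetch the page, parse it, append a CSV row.
// (A CSV opens in Excel; a real .xls writer could replace that step.)
const https = require('https');
const fs = require('fs');
const cheerio = require('cheerio'); // npm install cheerio

https.get('https://isbn.nu/9780131103627', (res) => {
  let html = '';
  res.on('data', (chunk) => { html += chunk; });
  res.on('end', () => {
    const $ = cheerio.load(html);
    const title = $('h1').first().text().trim(); // placeholder selector
    fs.appendFileSync('books.csv', '9780131103627,"' + title + '"\n');
  });
});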
A simple solution might be to use a Google Docs spreadsheet function like ImportXML(URL,path-expression).
More information and examples here:
http://www.seerinteractive.com/blog/importxml-cookbook/
http://www.distilled.net/blog/distilled/guide-to-google-docs-importxml/
http://blog.ouseful.info/2008/10/14/data-scraping-wikipedia-with-google-spreadsheets/
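For example, a formula along these lines (the XPath is hypothetical; adjust it to the structure of the page you're scraping):

=ImportXML("http://www.isbn.nu/9780131103627", "//h1")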

what language do i use to write a webpage in to automatically update from a database of sorts?

I have what I consider a bit of a tricky question. I am currently working on quite a large spreadsheet (266 rows with 70 columns, and it's only going to get bigger) that is a database of sorts, and I want to move it out of Excel and onto an intranet page. I am currently writing it in a combination of HTML and JavaScript for functionality, but it is becoming very hard to ensure that the data is in the right place. I am wondering if there is a way to save the Excel spreadsheet in a certain format (like CSV or XML) and then write a program (on an HTML page) that would display all of the information in a table automatically. Is this even possible?
Unfortunately I do not have access to a server to help with this; it all needs to be coded in the page itself.
Thank you for all your input, guys and gals.
Based on your comment, a normalized database for this type of thing would look like this:
table `workers`
- id
- name
- ...
table `trainings`
- id
- title
- description
- ...
table `workers_in_training`
- worker_id
- training_id
This allows you to create a logical matrix as well, without the need to change the schema (i.e. keep adding columns) for each new training/worker. Of course, this realistically requires a database server of some sort and knowledge of a server-side programming language (PHP, Python, Ruby, C#, anything). If you don't have that, an Access database/app may be an acceptable compromise. Doing it all in JavaScript is certainly interesting, but is an idea you should abandon as early as possible.
Given your constraints, I would save the Excel spreadsheet as a CSV and put it in the same location as your HTML file, then use AJAX to fetch the contents of the CSV and dynamically generate a HTML table based on the contents.
Look here for how to fetch a URL's contents using AJAX (jQuery library): http://api.jquery.com/jQuery.get/
After fetching the URL content, you will have the CSV as a big string in a JavaScript variable. I'll let you have the fun of figuring out how to parse it :-)
Once you know how to parse your CSV string into rows and columns, look here for how to generate an HTML table dynamically using the jQuery library: Building an HTML table on the fly using jQuery
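Putting the pieces together, here's a minimal sketch, assuming a simple CSV with no quoted commas (real CSV parsing needs more care) and a #container element on the page:

// Fetch the CSV sitting next to the page, split it into rows and
// cells, and build a table. Naive split: assumes no quoted commas.
$.get('data.csv', function (csv) {
  var $table = $('<table>');
  csv.trim().split('\n').forEach(function (line) {
    var $row = $('<tr>');
    line.split(',').forEach(function (cell) {
      $row.append($('<td>').text(cell));
    });
    $table.append($row);
  });
  $('#container').append($table);
});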
