Can I use Node.js packages, such as request, to scrape the AngularJS response on this WA government website? - javascript

I'm trying to go to the WA Secretary of State Corporations website (https://ccfs.sos.wa.gov/#/AdvancedSearch) to scrape data on newly incorporated companies. All of this data is publicly available.
I filter the data by setting Business Type to WA PROFIT CORPORATION (towards bottom), Business Status to ACTIVE, and any random 30 day window for Start Date and End Date for the Date of Incorporation date range. I then click Search.
The first thing I notice is that there is no query string, so the DB isn't accessible via a query string. So I opened Chrome DevTools and went to the Network tab. If you refresh the page you'll notice an AngularJS XHR request that loads under the Name GetAdvanceBusinessSearchList.
If I Preview this file, all of the data I need is neatly structured in JSON format. If I try opening the file in another tab to see the query string I receive an error "The requested resource does not support http method 'GET'".
I've tried accessing the data using the Node Request module. I've tried both GET requests and POST requests. I assumed POST was the correct route once I received the GET error mentioned above. When I fired off my POST request I also included some Form Data that I found in the Dev Tools, but the response I received said that the endpoint doesn't support multipart/form-data.
I've also tried using the Puppeteer module, and I can get to the search results, but then because the content is loaded in using an Angular file, none of the HTML elements have IDs and it becomes a sloppy mess trying to mine all of the data.
const request = require('request');
request.get('https://cfda.sos.wa.gov/api/BusinessSearch/GetAdvanceBusinessSearchList', (err, res, body) => { console.log(body) });
My goal is to get access to the JSON structured data that can be found by previewing the GetAdvanceBusinessSearchList file in the Network tab of the Chrome Dev Tools once you've submitted a search.
Any help would be hugely appreciated.

This worked for me:
curl 'https://cfda.sos.wa.gov/api/BusinessSearch/GetAdvanceBusinessSearchList' -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:68.0) Gecko/20100101 Firefox/68.0' -H 'Accept: application/json, text/plain, */*' -H 'Accept-Language: en-US,en;q=0.8,es-AR;q=0.5,es;q=0.3' --compressed -H 'Referer: https://ccfs.sos.wa.gov/' -H 'Content-Type: application/x-www-form-urlencoded; charset=utf-8' -H 'Origin: https://ccfs.sos.wa.gov' -H 'Connection: keep-alive' --data 'Type=Agent&BusinessStatusID=0&SearchEntityName=&SearchType=&BusinessTypeID=0&AgentName=&PrincipalName=&StartDateOfIncorporation=&EndDateOfIncorporation=&ExpirationDate=&IsSearch=true&IsShowAdvanceSearch=true&&&AgentAddress%5BIsAddressSame%5D=false&AgentAddress%5BIsValidAddress%5D=false&AgentAddress%5BisUserNonCommercialRegisteredAgent%5D=false&AgentAddress%5BIsInvalidState%5D=false&AgentAddress%5BbaseEntity%5D%5BFilerID%5D=0&AgentAddress%5BbaseEntity%5D%5BUserID%5D=0&AgentAddress%5BbaseEntity%5D%5BCreatedBy%5D=0&&AgentAddress%5BbaseEntity%5D%5BModifiedBy%5D=0&&AgentAddress%5BFullAddress%5D=%2C%20WA%2C%20USA&AgentAddress%5BID%5D=0&&&&AgentAddress%5BState%5D=WA&&AgentAddress%5BCountry%5D=USA&&&&&&&&PrincipalAddress%5BIsAddressSame%5D=false&PrincipalAddress%5BIsValidAddress%5D=false&PrincipalAddress%5BisUserNonCommercialRegisteredAgent%5D=false&PrincipalAddress%5BIsInvalidState%5D=false&PrincipalAddress%5BbaseEntity%5D%5BFilerID%5D=0&PrincipalAddress%5BbaseEntity%5D%5BUserID%5D=0&PrincipalAddress%5BbaseEntity%5D%5BCreatedBy%5D=0&&PrincipalAddress%5BbaseEntity%5D%5BModifiedBy%5D=0&&PrincipalAddress%5BFullAddress%5D=%2C%20WA%2C%20USA&PrincipalAddress%5BID%5D=0&&&&PrincipalAddress%5BState%5D=&&PrincipalAddress%5BCountry%5D=USA&&&&&&PageID=1&PageCount=25'
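The same request can be sketched in Node.js. This is only a minimal sketch, assuming Node 18+ (for the built-in fetch) and assuming the server accepts the handful of fields shown here rather than the full form captured in the curl command; buildSearchRequest and search are hypothetical helper names, and the numeric IDs for a particular Business Type or Business Status would have to be read from the browser's Network tab. The key point is that the body is sent as application/x-www-form-urlencoded, not multipart/form-data.

```javascript
const SEARCH_URL = 'https://cfda.sos.wa.gov/api/BusinessSearch/GetAdvanceBusinessSearchList';

// Build the URL and fetch options for a form-encoded POST.
// Field names are copied from the curl command above; which of them the
// server actually requires is an assumption.
function buildSearchRequest(overrides = {}) {
  const fields = {
    Type: 'Agent',
    BusinessStatusID: '0',
    BusinessTypeID: '0',
    StartDateOfIncorporation: '',
    EndDateOfIncorporation: '',
    IsSearch: 'true',
    IsShowAdvanceSearch: 'true',
    PageID: '1',
    PageCount: '25',
    ...overrides,
  };
  return {
    url: SEARCH_URL,
    options: {
      method: 'POST',
      headers: {
        'Accept': 'application/json, text/plain, */*',
        'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8',
      },
      // URLSearchParams serializes the fields as key=value pairs joined by "&".
      body: new URLSearchParams(fields).toString(),
    },
  };
}

// Send the request and parse the JSON response.
async function search(overrides) {
  const { url, options } = buildSearchRequest(overrides);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`Search failed with HTTP ${res.status}`);
  return res.json();
}
```

Calling search({ PageID: '2' }) would then request the second page of results, if the endpoint behaves as it did for the curl command.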

Related

Azure Translation API - Throttling client requests

I'm trying to throttle the number of requests a client can make to my translator service which uses Azure Translation API.
The following link from Microsoft describes how to limit requests, but it's not clear where in the request this throttling information should be added. I assume the request headers?
https://learn.microsoft.com/en-us/azure/api-management/api-management-sample-flexible-throttling
Here is the curl. Note the rate limiting headers at the end. Is this the way to do it?
# Pass secret key and region using headers to a custom endpoint
curl -X POST " my-ch-n.cognitiveservices.azure.com/translator/text/v3.0/translate?to=fr" \
-H "Ocp-Apim-Subscription-Key: xxx" \
-H "Ocp-Apim-Subscription-Region: switzerlandnorth" \
-H "Content-Type: application/json" \
-H "rate-limit-by-key: calls=10 renewal-period=60 counter-key=1.1.1.1" \
-d "[{'Text':'Hello'}]" -v
The link you've shared is for API Management, a managed API gateway available on Azure. The idea is to create "products" and let your users subscribe to them. That way you can track the requests and throttle them with a rate limit policy (the one described in the link you've shared). The policy is configured on the gateway, not sent by the client as a request header.
If needed, this quick video shows the functionality in use:
https://www.youtube.com/watch?v=dbF7uVkGOw0

Javascript: How to Upload a large file over HTTP using Transfer encoding header

We have a third-party API for uploading files that requires the Transfer-Encoding header to be set to chunked, but the header is ignored if I set it manually using xhr.setRequestHeader. After investigating further, we found that the user agent is responsible for setting this header, but it seems the user agent only sets the Content-Length header.
If we upload the file using the following curl command, it works fine.
curl -X POST -H 'Transfer-Encoding: chunked' -H 'content-type: text/csv' -H 'filename: us-500.csv' -T './Downloads/us-500.csv' http://serverapi:8090/upload
Can someone please help me understand whether there is any other way to upload a large file using the Transfer-Encoding header?
You're not allowed to set that header, as it's controlled by the user agent.
For the full set of headers, see 4.6.2 The setRequestHeader() method from W3C XMLHttpRequest Level 1 and note that Transfer-Encoding is one of the headers that are controlled by the user agent to let it control those aspects of transport.
Accept-Charset
Accept-Encoding
Access-Control-Request-Headers
Access-Control-Request-Method
Connection
Content-Length
Cookie
Cookie2
Date
DNT
Expect
Host
Keep-Alive
Origin
Referer
TE
Trailer
Transfer-Encoding
Upgrade
User-Agent
Via
There is a similar list in the WhatWG Fetch API Living Standard. https://fetch.spec.whatwg.org/#terminology-headers
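To see ahead of time whether a header would be dropped, the list above can be turned into a small check. This is a sketch only: isForbiddenHeader is a hypothetical helper, the names are copied from the list quoted above, and the exact set a given browser enforces may differ from it. Header names are case-insensitive, and names starting with Proxy- or Sec- are forbidden as well.

```javascript
// Header names copied from the XMLHttpRequest Level 1 list quoted above,
// lowercased so the lookup is case-insensitive.
const FORBIDDEN_HEADERS = new Set([
  'accept-charset', 'accept-encoding', 'access-control-request-headers',
  'access-control-request-method', 'connection', 'content-length',
  'cookie', 'cookie2', 'date', 'dnt', 'expect', 'host', 'keep-alive',
  'origin', 'referer', 'te', 'trailer', 'transfer-encoding', 'upgrade',
  'user-agent', 'via',
]);

// Returns true when the name is on the list above or uses one of the
// reserved prefixes, i.e. when script code is not expected to control it.
function isForbiddenHeader(name) {
  const lower = name.trim().toLowerCase();
  return (
    FORBIDDEN_HEADERS.has(lower) ||
    lower.startsWith('proxy-') ||
    lower.startsWith('sec-')
  );
}
```

With this check, Transfer-Encoding is reported as forbidden, while the content-type and filename headers from the curl command are not, so those two can still be set from script.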

Converting cURL to angular 2 HTTP POST

I want to convert this cURL command to an Angular 2 POST request:
curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -H "Authorization: Basic cGJob2xlOmlJelVNR3o4" -H "Origin: http://localhost:4200/form" -H "Postman-Token: fbf7ede1-4648-a330-14ee-85e6c29ee80d" -d 'content=Queue: tsi-testdesk' "https://testdesk.ebi.ac.uk/REST/1.0/ticket/new?user=USER&pass=PASS"
Here is the code I wrote, but it's not working.
addForm(form: Form): Observable<Form> {
  console.log(" SUBMITTING FORM");
  let headers = new Headers();
  this.loginService.writeAuthToHeaders(headers);
  // JSON.stringify(headers);
  // headers.append('Accept', 'application/x-www-form-urlencoded');
  headers.append('Content-Type', 'application/x-www-form-urlencoded');
  // let text = JSON.stringify(form)
  let content = ('content:Queue: tsi-testdesk');
  console.log(content);
  return this.http.post('https://testdesk.ebi.ac.uk/REST/1.0/ticket/new?user='+this.credentialsService.getUsername()+'&pass='+this.credentialsService.getPassword(), content, { headers: headers })
    // .map(response => <Form>response.json())
    .catch(this.handleError);
}
It gives me a preflight response failure error, but the request works fine with cURL as well as Postman. I also don't have access to the server side; I am contacting it through the API.
CORS is a policy that is enforced by the web browser. Ultimately, it is up to the browser, whether or not it will allow a cross-origin request. In the case of cURL or Postman, there is no browser, there is no current HOST, so there is not even the concept of a cross-origin request. Technically Postman is a Chrome extension, but it is not at all the same thing as loading a web page and making cross-origin requests.
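Many public-facing APIs already have CORS enabled, but a failed preflight means the browser did not receive the Access-Control-Allow-* headers it needs from the server being called. Since you cannot change that server, the usual workaround is to send the request from your own web server, where CORS does not apply, and have your Angular app call your server instead.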
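Independent of CORS, it may help to compare the request body with the curl command: curl sends content=Queue: tsi-testdesk, while the code in the question sends content:Queue: tsi-testdesk. The following is a minimal sketch of building the form-encoded body in plain JavaScript; buildTicketBody is a hypothetical helper, and whether the server is strict about the encoding is an assumption.

```javascript
// Build an application/x-www-form-urlencoded body equivalent to
// curl's -d 'content=Queue: <queue>'.
// URLSearchParams percent-encodes the colon and turns the space into "+",
// which a form decoder on the server reads back as "Queue: <queue>".
function buildTicketBody(queue) {
  return new URLSearchParams({ content: `Queue: ${queue}` }).toString();
}
```

The resulting string can be passed as the second argument of this.http.post in place of the content variable.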

Curl to Javascript

I am making a Chrome Extension that talks to a website via an api. I want it to pass information about a current tab to my website via a cors request.
I have a POST api request already working. It looks like this:
...
var url = "https://webiste.com/api/v1/users/sendInfo"
...
xhr.send(JSON.stringify({user_name:user_name, password:password, info:info}));
Its corresponding curl statement is something like this:
curl -X POST https://website.com/api/v1/users/sendInfo -d '{ "username": "username", "password": "password", "info": "Lots of info" }' --header "Content-type: application/json"
But, this is not as secure as we want. I was told to mirror the curl command below:
curl --basic -u username:password <request url> -d '{ "info": "Lots of info" }'
But, one cannot just write curl into javascript.
If someone could either supply javascript that acts like this curl statement or explain exactly what is going on in that basic option of the curl script I think that I could progress from there.
The curl command is setting a basic Authorization header. This can be done in JavaScript like
var url = "https://webiste.com/api/v1/users/sendInfo",
username = "...",
password = "...";
xhr.open('POST', url, true, username, password);
xhr.send(...);
This encodes the username/password using base 64, and sets the Authorization header.
Edit As arcyqwerty mentioned, this is no more secure than sending username/password in the request body JSON. The advantage of using the basic authentication approach is that it's a standard way of specifying user credentials which integrates well with many back-ends. If you need security, make sure to send your data over HTTPS.
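If the header needs to be sent with the first request rather than left to the browser, it can also be built by hand. This is a minimal sketch, assuming btoa is available (browsers, extensions, and Node 16+) and that the credentials contain only Latin-1 characters, since btoa throws on anything else; basicAuthHeader is a hypothetical helper name.

```javascript
// Build the value of an HTTP Basic Authorization header:
// "Basic " followed by base64("username:password").
function basicAuthHeader(username, password) {
  return 'Basic ' + btoa(`${username}:${password}`);
}

// Usage with XMLHttpRequest (not executed here):
//   xhr.open('POST', url, true);
//   xhr.setRequestHeader('Authorization', basicAuthHeader(username, password));
//   xhr.setRequestHeader('Content-Type', 'application/json');
//   xhr.send(JSON.stringify({ info: info }));
```

As noted above, base64 is an encoding and not encryption, so this is only as safe as the HTTPS connection it travels over.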
curl is the curl binary which fetches URLs.
--basic tells curl to use "HTTP Basic Authentication"
-u username:password tells curl to supply the given username/password for the authentication. This authentication information is base64 encoded in the request. Note the emphasis on encoded, which is different from encrypted. HTTP basic auth is not secure (although it can be made more secure by using an HTTPS channel).
-d tells curl to send the following as the data for the request
You may be able to specify HTTP basic authentication in your request by making the request to https://username:password@website.com/api/v1/users/sendInfo

Post to AmazonAWS Kinesis Stream Using Query String Authentication

I've got the interesting (and theoretically impossible) task of getting AmazonAWS Kinesis analytics from IE 8 and 9. According to Amazon's own SDK, this is not possible since XDomainRequest does not allow custom headers. Contrary to this statement, however, AmazonAWS allows you to authenticate using query string parameters. My goal was to write a shim for XMLHttpRequest which utilized the XDomainRequest object and converted all Amazon headers into query string parameters.
The actual implementation turned out to be much more difficult than I would have liked. Since Amazon's query string authentication only uses the "host" for SignedHeaders (whereas the AmazonAWS SDK was attempting to use host, date, and target) I had to re-compute the signature. This meant CryptoJS and lots of experimentation to get everything working.
After 4 hours of receiving "Computed signature did not match", I finally started getting a different error code: Unable to determine service/operation name to be authorized
Googling this error was not very helpful: anything from a typo to an extra new-line character to using a datestamp instead of a version number. However I tried everything and nothing helped.
Below is an example cURL request and the return value:
curl -H "Content-Type:text/plain" --data "{\"Data\":\"VALID BASE64 DATA\",\"PartitionKey\":\"PARTITION\",\"StreamName\":\"STREAM\"}" "https://kinesis.us-east-1.amazonaws.com/?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAJMAGAYBGGRNZQI4A/20140723/us-east-1/kinesis/aws4_request&X-Amz-Date=20140723T153144Z&X-Amz-SignedHeaders=host&X-Amz-Target=Kinesis_20131202.PutRecord&X-Amz-User-Agent=aws-sdk-js/2.0.0&X-Amz-Signature=VALID_SIGNATURE"
Return:
<AccessDeniedException>
<Message>Unable to determine service/operation name to be authorized</Message>
</AccessDeniedException>
I've tried appending Action and Version parameters (noting that the Version should be in YYYY-MM-DD format as opposed to YYYYMMDD) and this didn't help. I also tried escaping all of my / characters or escaping all of my . characters (or both).
For comparison, here's the same request through Google Chrome using headers instead of a query string:
Remote Address:176.32.102.203:443
Request URL:https://kinesis.us-east-1.amazonaws.com/
Request Method:POST
Status Code:200 OK
Request Headers
Accept:*/*
Accept-Encoding:gzip,deflate,sdch
Accept-Language:en-US,en;q=0.8
Authorization:AWS4-HMAC-SHA256 Credential=AKIAJMAGAYBGGRNZQI4A/20140723/us-east-1/kinesis/aws4_request, SignedHeaders=host;x-amz-date;x-amz-target, Signature=OMITTED
Cache-Control:no-cache
Connection:keep-alive
Content-Length:3236
Content-Type:application/x-amz-json-1.1
Host:kinesis.us-east-1.amazonaws.com
Origin:OMITTED
Pragma:no-cache
Referer:OMITTED
User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36
X-Amz-Date:20140723T145554Z
X-Amz-Target:Kinesis_20131202.PutRecord
X-Amz-User-Agent:aws-sdk-js/2.0.0
Request Payload
OMITTED (because it's long)
Response:
{"SequenceNumber":"49540780386103606919741841581837328106424971136629473281","ShardId":"shardId-000000000000"}
Does anyone know what I'm doing wrong, and why I can't communicate with Kinesis?
Going through and cleaning up some old questions that never got answers.
Per Michael - sqlbot:
Plan B is to send your request to your application server and proxy it to kinesis
This turned out to be the only solution for communicating with Kinesis. I set up a proxy that let me pass the custom headers as a query string; it then recombobulated them into headers and sent the request onward.
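The header-rebuilding step of such a proxy could look roughly like the sketch below. It is not the original implementation: queryToHeaders is a hypothetical helper, it assumes the client puts every header it wants forwarded into the query string under the header's own name, and it leaves out request signing and the actual forwarding to Kinesis.

```javascript
// Turn the query string of an incoming proxy request back into headers.
// requestUrl is the path plus query as a Node http server exposes it in
// req.url; the base URL is a placeholder that only makes parsing possible.
function queryToHeaders(requestUrl) {
  const url = new URL(requestUrl, 'http://proxy.invalid');
  const headers = {};
  for (const [name, value] of url.searchParams) {
    headers[name] = value;
  }
  return { path: url.pathname, headers };
}
```

The proxy would then send the original request body to https://kinesis.us-east-1.amazonaws.com/ with these headers attached, which is something XDomainRequest cannot do on its own.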
