How to get html string representation for all types of Dom nodes?

How to get html string representation for all types of Dom nodes? - javascript

question
How to get html string representation for all types of Dom nodes?
(We know there are many node types,
from https://developer.mozilla.org/en-US/docs/Web/API/Node/nodeType (& https://developer.mozilla.org/en-US/docs/Web/API/TreeWalker ))
(I know how to get the html string representation for Node.TEXT_NODE & Node.ELEMENT_NODE,
but what about the other special the node types?)
question-example
For example,
I want to create a treeWalker, with NodeFilter.SHOW_ALL,
on elt_main -- which may contain any type of node,
and loop through all the nodes inside, and get the html string representation for each of the node.
(Again,) Please make sure there is no missing of any data.
ie: When the treeWalker is looped through, every single html string in that elt_main should be included in arr_html_EltMain
(if redundant/repeated info are included, its fine,
but if you have a way to remove those redundant info, its better.)
(for clarity of what I mean by "no missing of any data", please refer to Does `elt.outerHTML` really captures the whole html string representation accuratly? (a sub question that branches out))
#code::
const elt_main = $('')[0];
const treeWalker = document.createTreeWalker(elt_main, NodeFilter.SHOW_ALL);
const arr_html_EltMain = [];
let html_curr;
let node_curr;
while ((node_curr = treeWalker.nextNode())) {
const nt = node_curr.nodeType;
if (nt === Node.TEXT_NODE) {
html_curr = (/** #type {Text} */ (node_curr)).textContent;
arr_html_EltMain.push(html_curr);
} else if (nt === Node.ELEMENT_NODE) {
html_curr = (/** #type {Element} */ (node_curr)).outerHTML;
arr_html_EltMain.push(html_curr);
} else if (nt === Node.COMMENT_NODE) {
html_curr = undefined; // << what property should I refer to? // (same for below ones)
} else if (nt === Node.ATTRIBUTE_NODE) {
} else if (nt === Node.CDATA_SECTION_NODE) {
} else if (nt === Node.PROCESSING_INSTRUCTION_NODE) {
} else if (nt === Node.DOCUMENT_NODE) {
} else if (nt === Node.DOCUMENT_TYPE_NODE) {
} else if (nt === Node.DOCUMENT_FRAGMENT_NODE) {
} else {
throw new TypeError('No other nodeType');
}
}

Related

JavaScript getting innerHtml for all text nodes in nested elements [duplicate]

<div class="title">
I am text node
<a class="edit">Edit</a>
</div>
I wish to get the "I am text node", do not wish to remove the "edit" tag, and need a cross browser solution.

var text = $(".title").contents().filter(function() {
return this.nodeType == Node.TEXT_NODE;
}).text();
This gets the contents of the selected element, and applies a filter function to it. The filter function returns only text nodes (i.e. those nodes with nodeType == Node.TEXT_NODE).

You can get the nodeValue of the first childNode using
$('.title')[0].childNodes[0].nodeValue
http://jsfiddle.net/TU4FB/

Another native JS solution that can be useful for "complex" or deeply nested elements is to use NodeIterator. Put NodeFilter.SHOW_TEXT as the second argument ("whatToShow"), and iterate over just the text node children of the element.
var root = document.querySelector('p'),
iter = document.createNodeIterator(root, NodeFilter.SHOW_TEXT),
textnode;
// print all text nodes
while (textnode = iter.nextNode()) {
console.log(textnode.textContent)
}
<p>
<br>some text<br>123
</p>
You can also use TreeWalker. The difference between the two is that NodeIterator is a simple linear iterator, while TreeWalker allows you to navigate via siblings and ancestors as well.

ES6 version that return the first #text node content
const extract = (node) => {
const text = [...node.childNodes].find(child => child.nodeType === Node.TEXT_NODE);
return text && text.textContent.trim();
}

If you mean get the value of the first text node in the element, this code will work:
var oDiv = document.getElementById("MyDiv");
var firstText = "";
for (var i = 0; i < oDiv.childNodes.length; i++) {
var curNode = oDiv.childNodes[i];
if (curNode.nodeName === "#text") {
firstText = curNode.nodeValue;
break;
}
}
You can see this in action here: http://jsfiddle.net/ZkjZJ/

Pure JavaScript: Minimalist
First off, always keep this in mind when looking for text in the DOM.
MDN - Whitespace in the DOM
This issue will make you pay attention to the structure of your XML / HTML.
In this pure JavaScript example, I account for the possibility of multiple text nodes that could be interleaved with other kinds of nodes. However, initially, I do not pass judgment on whitespace, leaving that filtering task to other code.
In this version, I pass a NodeList in from the calling / client code.
/**
* Gets strings from text nodes. Minimalist. Non-robust. Pre-test loop version.
* Generic, cross platform solution. No string filtering or conditioning.
*
* #author Anthony Rutledge
* #param nodeList The child nodes of a Node, as in node.childNodes.
* #param target A positive whole number >= 1
* #return String The text you targeted.
*/
function getText(nodeList, target)
{
var trueTarget = target - 1,
length = nodeList.length; // Because you may have many child nodes.
for (var i = 0; i < length; i++) {
if ((nodeList[i].nodeType === Node.TEXT_NODE) && (i === trueTarget)) {
return nodeList[i].nodeValue; // Done! No need to keep going.
}
}
return null;
}
Of course, by testing node.hasChildNodes() first, there would be no need to use a pre-test for loop.
/**
* Gets strings from text nodes. Minimalist. Non-robust. Post-test loop version.
* Generic, cross platform solution. No string filtering or conditioning.
*
* #author Anthony Rutledge
* #param nodeList The child nodes of a Node, as in node.childNodes.
* #param target A positive whole number >= 1
* #return String The text you targeted.
*/
function getText(nodeList, target)
{
var trueTarget = target - 1,
length = nodeList.length,
i = 0;
do {
if ((nodeList[i].nodeType === Node.TEXT_NODE) && (i === trueTarget)) {
return nodeList[i].nodeValue; // Done! No need to keep going.
}
i++;
} while (i < length);
return null;
}
Pure JavaScript: Robust
Here the function getTextById() uses two helper functions: getStringsFromChildren() and filterWhitespaceLines().
getStringsFromChildren()
/**
* Collects strings from child text nodes.
* Generic, cross platform solution. No string filtering or conditioning.
*
* #author Anthony Rutledge
* #version 7.0
* #param parentNode An instance of the Node interface, such as an Element. object.
* #return Array of strings, or null.
* #throws TypeError if the parentNode is not a Node object.
*/
function getStringsFromChildren(parentNode)
{
var strings = [],
nodeList,
length,
i = 0;
if (!parentNode instanceof Node) {
throw new TypeError("The parentNode parameter expects an instance of a Node.");
}
if (!parentNode.hasChildNodes()) {
return null; // We are done. Node may resemble <element></element>
}
nodeList = parentNode.childNodes;
length = nodeList.length;
do {
if ((nodeList[i].nodeType === Node.TEXT_NODE)) {
strings.push(nodeList[i].nodeValue);
}
i++;
} while (i < length);
if (strings.length > 0) {
return strings;
}
return null;
}
filterWhitespaceLines()
/**
* Filters an array of strings to remove whitespace lines.
* Generic, cross platform solution.
*
* #author Anthony Rutledge
* #version 6.0
* #param textArray a String associated with the id attribute of an Element.
* #return Array of strings that are not lines of whitespace, or null.
* #throws TypeError if the textArray param is not of type Array.
*/
function filterWhitespaceLines(textArray)
{
var filteredArray = [],
whitespaceLine = /(?:^\s+$)/; // Non-capturing Regular Expression.
if (!textArray instanceof Array) {
throw new TypeError("The textArray parameter expects an instance of a Array.");
}
for (var i = 0; i < textArray.length; i++) {
if (!whitespaceLine.test(textArray[i])) { // If it is not a line of whitespace.
filteredArray.push(textArray[i].trim()); // Trimming here is fine.
}
}
if (filteredArray.length > 0) {
return filteredArray ; // Leave selecting and joining strings for a specific implementation.
}
return null; // No text to return.
}
getTextById()
/**
* Gets strings from text nodes. Robust.
* Generic, cross platform solution.
*
* #author Anthony Rutledge
* #version 6.0
* #param id A String associated with the id property of an Element.
* #return Array of strings, or null.
* #throws TypeError if the id param is not of type String.
* #throws TypeError if the id param cannot be used to find a node by id.
*/
function getTextById(id)
{
var textArray = null; // The hopeful output.
var idDatatype = typeof id; // Only used in an TypeError message.
var node; // The parent node being examined.
try {
if (idDatatype !== "string") {
throw new TypeError("The id argument must be of type String! Got " + idDatatype);
}
node = document.getElementById(id);
if (node === null) {
throw new TypeError("No element found with the id: " + id);
}
textArray = getStringsFromChildren(node);
if (textArray === null) {
return null; // No text nodes found. Example: <element></element>
}
textArray = filterWhitespaceLines(textArray);
if (textArray.length > 0) {
return textArray; // Leave selecting and joining strings for a specific implementation.
}
} catch (e) {
console.log(e.message);
}
return null; // No text to return.
}
Next, the return value (Array, or null) is sent to the client code where it should be handled. Hopefully, the array should have string elements of real text, not lines of whitespace.
Empty strings ("") are not returned because you need a text node to properly indicate the presence of valid text. Returning ("") may give the false impression that a text node exists, leading someone to assume that they can alter the text by changing the value of .nodeValue. This is false, because a text node does not exist in the case of an empty string.
Example 1:
<p id="bio"></p> <!-- There is no text node here. Return null. -->
Example 2:
<p id="bio">
</p> <!-- There are at least two text nodes ("\n"), here. -->
The problem comes in when you want to make your HTML easy to read by spacing it out. Now, even though there is no human readable valid text, there are still text nodes with newline ("\n") characters in their .nodeValue properties.
Humans see examples one and two as functionally equivalent--empty elements waiting to be filled. The DOM is different than human reasoning. This is why the getStringsFromChildren() function must determine if text nodes exist and gather the .nodeValue values into an array.
for (var i = 0; i < length; i++) {
if (nodeList[i].nodeType === Node.TEXT_NODE) {
textNodes.push(nodeList[i].nodeValue);
}
}
In example two, two text nodes do exist and getStringFromChildren() will return the .nodeValue of both of them ("\n"). However, filterWhitespaceLines() uses a regular expression to filter out lines of pure whitespace characters.
Is returning null instead of newline ("\n") characters a form of lying to the client / calling code? In human terms, no. In DOM terms, yes. However, the issue here is getting text, not editing it. There is no human text to return to the calling code.
One can never know how many newline characters might appear in someone's HTML. Creating a counter that looks for the "second" newline character is unreliable. It might not exist.
Of course, further down the line, the issue of editing text in an empty <p></p> element with extra whitespace (example 2) might mean destroying (maybe, skipping) all but one text node between a paragraph's tags to ensure the element contains precisely what it is supposed to display.
Regardless, except for cases where you are doing something extraordinary, you will need a way to determine which text node's .nodeValue property has the true, human readable text that you want to edit. filterWhitespaceLines gets us half way there.
var whitespaceLine = /(?:^\s+$)/; // Non-capturing Regular Expression.
for (var i = 0; i < filteredTextArray.length; i++) {
if (!whitespaceLine.test(textArray[i])) { // If it is not a line of whitespace.
filteredTextArray.push(textArray[i].trim()); // Trimming here is fine.
}
}
At this point you may have output that looks like this:
["Dealing with text nodes is fun.", "Some people just use jQuery."]
There is no guarantee that these two strings are adjacent to each other in the DOM, so joining them with .join() might make an unnatural composite. Instead, in the code that calls getTextById(), you need to chose which string you want to work with.
Test the output.
try {
var strings = getTextById("bio");
if (strings === null) {
// Do something.
} else if (strings.length === 1) {
// Do something with strings[0]
} else { // Could be another else if
// Do something. It all depends on the context.
}
} catch (e) {
console.log(e.message);
}
One could add .trim() inside of getStringsFromChildren() to get rid of leading and trailing whitespace (or to turn a bunch of spaces into a zero length string (""), but how can you know a priori what every application may need to have happen to the text (string) once it is found? You don't, so leave that to a specific implementation, and let getStringsFromChildren() be generic.
There may be times when this level of specificity (the target and such) is not required. That is great. Use a simple solution in those cases. However, a generalized algorithm enables you to accommodate simple and complex situations.

.text() - for jquery
$('.title').clone() //clone the element
.children() //select all the children
.remove() //remove all the children
.end() //again go back to selected element
.text(); //get the text of element

This will ignore the whitespace as well so, your never got the Blank textNodes..code using core Javascript.
var oDiv = document.getElementById("MyDiv");
var firstText = "";
for (var i = 0; i < oDiv.childNodes.length; i++) {
var curNode = oDiv.childNodes[i];
whitespace = /^\s*$/;
if (curNode.nodeName === "#text" && !(whitespace.test(curNode.nodeValue))) {
firstText = curNode.nodeValue;
break;
}
}
Check it on jsfiddle : - http://jsfiddle.net/webx/ZhLep/

Simply via Vanilla JavaScript:
const el = document.querySelector('.title');
const text = el.firstChild.textContent.trim();

You can also use XPath's text() node test to get the text nodes only. For example
var target = document.querySelector('div.title');
var iter = document.evaluate('text()', target, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE);
var node;
var want = '';
while (node = iter.iterateNext()) {
want += node.data;
}

There are some overcomplicated solutions here but the operation is as straightforward as using .childNodes to get children of all node types and .filter to extract e.nodeType === Node.TEXT_NODEs. Optionally, we may want to do it recursively and/or ignore "empty" text nodes (all whitespace).
These examples convert the nodes to their text content for display purposes, but this is technically a separate step from filtering.
const immediateTextNodes = el =>
[...el.childNodes].filter(e => e.nodeType === Node.TEXT_NODE);
const immediateNonEmptyTextNodes = el =>
[...el.childNodes].filter(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
);
const firstImmediateTextNode = el =>
[...el.childNodes].find(e => e.nodeType === Node.TEXT_NODE);
const firstImmediateNonEmptyTextNode = el =>
[...el.childNodes].find(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
);
// example usage:
const text = el => el.textContent;
const p = document.querySelector("p");
console.log(immediateTextNodes(p).map(text));
console.log(immediateNonEmptyTextNodes(p).map(text));
console.log(text(firstImmediateTextNode(p)));
console.log(text(firstImmediateNonEmptyTextNode(p)));
// if you want to trim whitespace:
console.log(immediateNonEmptyTextNodes(p).map(e => text(e).trim()));
<p>
<span>IGNORE</span>
<b>IGNORE</b>
foo
<br>
bar
</p>
Recursive alternative to a NodeIterator:
const deepTextNodes = el => [...el.childNodes].flatMap(e =>
e.nodeType === Node.TEXT_NODE ? e : deepTextNodes(e)
);
const deepNonEmptyTextNodes = el =>
[...el.childNodes].flatMap(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
? e : deepNonEmptyTextNodes(e)
);
// example usage:
const text = el => el.textContent;
const p = document.querySelector("p");
console.log(deepTextNodes(p).map(text));
console.log(deepNonEmptyTextNodes(p).map(text));
<p>
foo
<span>bar</span>
baz
<span><b>quux</b></span>
</p>
Finally, feel free to join the text node array into a string if you wish using .join(""). But as with trimming and text content extraction, I'd probably not bake this into the core filtering function and leave it to the caller to handle as needed.

Why Does IE11 Handle Node.normalize() Incorrectly for the Minus Symbol?

I have been experiencing an issue where DOM text nodes with certain characters behave strangely in IE when using the Node.normalize() function to concatenate adjacent text nodes.
I have created a Codepen example which allows you to reproduce the bug in IE11: http://codepen.io/anon/pen/BxoKH
Output in IE11: '- Example'
Output in Chrome and earlier versions of IE: 'Test - Example'
As you can see, this truncates everything prior to the minus symbol which is apparently treated as a delimiting character, apparently due to a bug in the native implementation of normalize() in Internet Explorer 11 (but not IE10, or IE8, or even IE6).
Can anyone explain why this happens, and does anyone know of other sequences of characters which cause this issue?
Edit - I have written a codepen that will test sections of Unicode characters to identify characters that cause this behavior. It appears to affect many more characters than I originally realized:
http://codepen.io/anon/pen/Bvgtb/
This tests Unicode characters from 32-1000 and prints those that fail the test (truncate data when nodes are normalized) You can modify it to test other ranges of characters, but be careful of increasing the range too much in IE or it will freeze.
I've created an IE bug report and Microsoft reports being able to reproduce it based on the code sample I provided. Vote on it if you're also experiencing this issue:
https://connect.microsoft.com/IE/feedback/details/832750/ie11-node-normalize-dom-implementation-truncates-data-when-adjacent-text-nodes-contain-a-minus-sign

The other answers here are somewhat verbose and incomplete — they do not walk the full DOM sub-tree. Here's a more comprehensive solution:
function normalize (node) {
if (!node) { return; }
if (node.nodeType == 3) {
while (node.nextSibling && node.nextSibling.nodeType == 3) {
node.nodeValue += node.nextSibling.nodeValue;
node.parentNode.removeChild(node.nextSibling);
}
} else {
normalize(node.firstChild);
}
normalize(node.nextSibling);
}

I created a workaround by simply reimplementing the normalize method in JS, but struggled with this for many hours, so I figured I'd make a SO post to help other folks out, and hopefully get more information to help satisfy my curiosity about this bug which wasted most of my day, haha.
Here is a codepen with my workaround which works in all browsers: http://codepen.io/anon/pen/ouFJa
My workaround was based on some useful normalize code I found here: https://stackoverflow.com/a/20440845/1504529 but has been tailored to this specific IE11 bug rather than the one discussed by that post:
Here's the workaround, which works in all browsers I've tested, including IE11
function isNormalizeBuggy(){
var testDiv = document.createElement('div');
testDiv.appendChild(document.createTextNode('0-'));
testDiv.appendChild(document.createTextNode('2'));
testDiv.normalize();
return testDiv.firstChild.length == 2;
}
function safeNormalize(DOMNode) {
// If the normalize function doesn't have the bug relating to minuses,
// we use the native normalize function. Otherwise we use our custom one.
if(!isNormalizeBuggy()){
el.normalize();
return;
}
function getNextNode(node, ancestor, isOpenTag) {
if (typeof isOpenTag === 'undefined') {
isOpenTag = true;
}
var next;
if (isOpenTag) {
next = node.firstChild;
}
next = next || node.nextSibling;
if (!next && node.parentNode && node.parentNode !== ancestor) {
return getNextNode(node.parentNode, ancestor, false);
}
return next;
}
var adjTextNodes = [], nodes, node = el;
while ((node = getNextNode(node, el))) {
if (node.nodeType === 3 && node.previousSibling && node.previousSibling.nodeType === 3) {
if (!nodes) {
nodes = [node.previousSibling];
}
nodes.push(node);
} else if (nodes) {
adjTextNodes.push(nodes);
nodes = null;
}
}
adjTextNodes.forEach(function (nodes) {
var first;
nodes.forEach(function (node, i) {
if (i > 0) {
first.nodeValue += node.nodeValue;
node.parentNode.removeChild(node);
} else {
first = node;
}
});
});
};

Not the exact answer, but helped in my case.
function safeNormalize(el) {
function recursiveNormalize(elem)
{
for (var i = 0; i < elem.childNodes.length; i++) {
if (elem.childNodes[i].nodeType != 3) {
recursiveNormalize(elem.childNodes[i]);
}
else {
if (elem.childNodes[i].nextSibling != null && elem.childNodes[i].nextSibling.nodeType == 3) {
elem.childNodes[i].nodeValue = elem.childNodes[i].nodeValue + elem.childNodes[i].nextSibling.nodeValue;
elem.removeChild(elem.childNodes[i].nextSibling);
i--;
}
}
}
}
recursiveNormalize(el);
}

The normalise code looks a little convoluted, the following is a bit simpler. It traverses the siblings of the node to be normalised, collecting the text nodes until it hits an element. Then it calls itself and collects that element's text nodes, and so on.
I think seprating the two functions makes for cleaner (and a lot less) code.
// textNode is a DOM text node
function collectTextNodes(textNode) {
// while there are text siblings, concatenate them into the first
while (textNode.nextSibling) {
var next = textNode.nextSibling;
if (next.nodeType == 3) {
textNode.nodeValue += next.nodeValue;
textNode.parentNode.removeChild(next);
// Stop if not a text node
} else {
return;
}
}
}
// element is a DOM element
function normalise(element) {
var node = element.firstChild;
// Traverse siblings, call normalise for elements and
// collectTextNodes for text nodes
while (node) {
if (node.nodeType == 1) {
normalise(node);
} else if (node.nodeType == 3) {
collectTextNodes(node);
}
node = node.nextSibling;
}
}

function mergeTextNode(elem) {
var node = elem.firstChild, text
while (node) {
var aaa = node.nextSibling
if (node.nodeType === 3) {
if (text) {
text.nodeValue += node.nodeValue
elem.removeChild(node)
} else {
text = node
}
} else {
text = null
}
node = aaa
}
}

Extract text from HTML while preserving block-level element newlines

Background
Most questions about extracting text from HTML (i.e., stripping the tags) use:
jQuery( htmlString ).text();
While this abstracts browser inconsistencies (such as innerText vs. textContent), the function call also ignores the semantic meaning of block-level elements (such as li).
Problem
Preserving newlines of block-level elements (i.e., the semantic intent) across various browsers entails no small effort, as Mike Wilcox describes.
A seemingly simpler solution would be to emulate pasting HTML content into a <textarea>, which strips HTML while preserving block-level element newlines. However, JavaScript-based inserts do not trigger the same HTML-to-text routines that browsers employ when users paste content into a <textarea>.
I also tried integrating Mike Wilcox's JavaScript code. The code works in Chromium, but not in Firefox.
Question
What is the simplest cross-browser way to extract text from HTML while preserving semantic newlines for block-level elements using jQuery (or vanilla JavaScript)?
Example
Consider:
Select and copy this entire question.
Open the textarea example page.
Paste the content into the textarea.
The textarea preserves the newlines for ordered lists, headings, preformatted text, and so forth. That is the result I would like to achieve.
To further clarify, given any HTML content, such as:
<h1>Header</h1>
<p>Paragraph</p>
<ul>
<li>First</li>
<li>Second</li>
</ul>
<dl>
<dt>Term</dt>
<dd>Definition</dd>
</dl>
<div>Div with <span>span</span>.<br />After the break.</div>
How would you produce:
Header
Paragraph
First
Second
Term
Definition
Div with span.
After the break.
Note: Neither indentation nor non-normalized whitespace are relevant.

Consider:
/**
* Returns the style for a node.
*
* #param n The node to check.
* #param p The property to retrieve (usually 'display').
* #link http://www.quirksmode.org/dom/getstyles.html
*/
this.getStyle = function( n, p ) {
return n.currentStyle ?
n.currentStyle[p] :
document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}
/**
* Converts HTML to text, preserving semantic newlines for block-level
* elements.
*
* #param node - The HTML node to perform text extraction.
*/
this.toText = function( node ) {
var result = '';
if( node.nodeType == document.TEXT_NODE ) {
// Replace repeated spaces, newlines, and tabs with a single space.
result = node.nodeValue.replace( /\s+/g, ' ' );
}
else {
for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
result += _this.toText( node.childNodes[i] );
}
var d = _this.getStyle( node, 'display' );
if( d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
node.tagName == 'BR' || node.tagName == 'HR' ) {
result += '\n';
}
}
return result;
}
http://jsfiddle.net/3mzrV/2/
That is to say, with an exception or two, iterate through each node and print its contents, letting the browser's computed style tell you when to insert newlines.

This seems to be (nearly) doing what you want:
function getText($node) {
return $node.contents().map(function () {
if (this.nodeName === 'BR') {
return '\n';
} else if (this.nodeType === 3) {
return this.nodeValue;
} else {
return getText($(this));
}
}).get().join('');
}
DEMO
It just recursively concatenates the values of all text nodes and replaces <br> elements with line breaks.
But there is no semantics in this, it completely relies the original HTML formatting (the leading and trailing white spaces seem to come from how jsFiddle embeds the HTML, but you can easily trim those). For example, notice how it indents the definition term.
If you really want to do this on a semantic level, you need a list of block level elements, recursively iterate over the elements and indent them accordingly. You treat different block elements differently with respect to indentation and line breaks around them. This should not be too difficult.

based on https://stackoverflow.com/a/20384452/3338098
and fixed to support TEXT1<div>TEXT2</div>=>TEXT1\nTEXT2 and allow non-DOM nodes
/**
* Returns the style for a node.
*
* #param n The node to check.
* #param p The property to retrieve (usually 'display').
* #link http://www.quirksmode.org/dom/getstyles.html
*/
function getNodeStyle( n, p ) {
return n.currentStyle ?
n.currentStyle[p] :
document.defaultView.getComputedStyle(n, null).getPropertyValue(p);
}
//IF THE NODE IS NOT ACTUALLY IN THE DOM then this won't take into account <div style="display: inline;">text</div>
//however for simple things like `contenteditable` this is sufficient, however for arbitrary html this will not work
function isNodeBlock(node) {
if (node.nodeType == document.TEXT_NODE) {return false;}
var d = getNodeStyle( node, 'display' );//this is irrelevant if the node isn't currently in the current DOM.
if (d.match( /^block/ ) || d.match( /list/ ) || d.match( /row/ ) ||
node.tagName == 'BR' || node.tagName == 'HR' ||
node.tagName == 'DIV' // div,p,... add as needed to support non-DOM nodes
) {
return true;
}
return false;
}
/**
* Converts HTML to text, preserving semantic newlines for block-level
* elements.
*
* #param node - The HTML node to perform text extraction.
*/
function htmlToText( htmlOrNode, isNode ) {
var node = htmlOrNode;
if (!isNode) {node = jQuery("<span>"+htmlOrNode+"</span>")[0];}
//TODO: inject "unsafe" HTML into current DOM while guaranteeing that it won't
// change the visible DOM so that `isNodeBlock` will work reliably
var result = '';
if( node.nodeType == document.TEXT_NODE ) {
// Replace repeated spaces, newlines, and tabs with a single space.
result = node.nodeValue.replace( /\s+/g, ' ' );
} else {
for( var i = 0, j = node.childNodes.length; i < j; i++ ) {
result += htmlToText( node.childNodes[i], true );
if (i < j-1) {
if (isNodeBlock(node.childNodes[i])) {
result += '\n';
} else if (isNodeBlock(node.childNodes[i+1]) &&
node.childNodes[i+1].tagName != 'BR' &&
node.childNodes[i+1].tagName != 'HR') {
result += '\n';
}
}
}
}
return result;
}
the main change was
if (i < j-1) {
if (isNodeBlock(node.childNodes[i])) {
result += '\n';
} else if (isNodeBlock(node.childNodes[i+1]) &&
node.childNodes[i+1].tagName != 'BR' &&
node.childNodes[i+1].tagName != 'HR') {
result += '\n';
}
}
to check neighboring blocks to determine the appropriateness of adding a newline.

I would like to suggest a little edit from the code of svidgen:
function getText(n, isInnerNode) {
var rv = '';
if (n.nodeType == 3) {
rv = n.nodeValue;
} else {
var partial = "";
var d = getComputedStyle(n).getPropertyValue('display');
if (isInnerNode && d.match(/^block/) || d.match(/list/) || n.tagName == 'BR') {
partial += "\n";
}
for (var i = 0; i < n.childNodes.length; i++) {
partial += getText(n.childNodes[i], true);
}
rv = partial;
}
return rv;
};
I just added the line break before the for loop, in this way we have a newline before the block, and also a variable to avoid the newline for the root element.
The code should be invocated:
getText(document.getElementById("divElement"))

Use element.innerText
This not return extra nodes added from contenteditable elements.
If you use element.innerHTML the text will contain additional markup, but innerText will return what you see on the element's contents.
<div id="txt" contenteditable="true"></div>
<script>
var txt=document.getElementById("txt");
var withMarkup=txt.innerHTML;
var textOnly=txt.innerText;
console.log(withMarkup);
console.log(textOnly);
</script>

How to get the text node of an element?

<div class="title">
I am text node
<a class="edit">Edit</a>
</div>
I wish to get the "I am text node", do not wish to remove the "edit" tag, and need a cross browser solution.

var text = $(".title").contents().filter(function() {
return this.nodeType == Node.TEXT_NODE;
}).text();
This gets the contents of the selected element, and applies a filter function to it. The filter function returns only text nodes (i.e. those nodes with nodeType == Node.TEXT_NODE).

You can get the nodeValue of the first childNode using
$('.title')[0].childNodes[0].nodeValue
http://jsfiddle.net/TU4FB/

Another native JS solution that can be useful for "complex" or deeply nested elements is to use NodeIterator. Put NodeFilter.SHOW_TEXT as the second argument ("whatToShow"), and iterate over just the text node children of the element.
var root = document.querySelector('p'),
iter = document.createNodeIterator(root, NodeFilter.SHOW_TEXT),
textnode;
// print all text nodes
while (textnode = iter.nextNode()) {
console.log(textnode.textContent)
}
<p>
<br>some text<br>123
</p>
You can also use TreeWalker. The difference between the two is that NodeIterator is a simple linear iterator, while TreeWalker allows you to navigate via siblings and ancestors as well.

ES6 version that return the first #text node content
const extract = (node) => {
const text = [...node.childNodes].find(child => child.nodeType === Node.TEXT_NODE);
return text && text.textContent.trim();
}

If you mean get the value of the first text node in the element, this code will work:
var oDiv = document.getElementById("MyDiv");
var firstText = "";
for (var i = 0; i < oDiv.childNodes.length; i++) {
var curNode = oDiv.childNodes[i];
if (curNode.nodeName === "#text") {
firstText = curNode.nodeValue;
break;
}
}
You can see this in action here: http://jsfiddle.net/ZkjZJ/

Pure JavaScript: Minimalist
First off, always keep this in mind when looking for text in the DOM.
MDN - Whitespace in the DOM
This issue will make you pay attention to the structure of your XML / HTML.
In this pure JavaScript example, I account for the possibility of multiple text nodes that could be interleaved with other kinds of nodes. However, initially, I do not pass judgment on whitespace, leaving that filtering task to other code.
In this version, I pass a NodeList in from the calling / client code.
/**
* Gets strings from text nodes. Minimalist. Non-robust. Pre-test loop version.
* Generic, cross platform solution. No string filtering or conditioning.
*
* #author Anthony Rutledge
* #param nodeList The child nodes of a Node, as in node.childNodes.
* #param target A positive whole number >= 1
* #return String The text you targeted.
*/
function getText(nodeList, target)
{
var trueTarget = target - 1,
length = nodeList.length; // Because you may have many child nodes.
for (var i = 0; i < length; i++) {
if ((nodeList[i].nodeType === Node.TEXT_NODE) && (i === trueTarget)) {
return nodeList[i].nodeValue; // Done! No need to keep going.
}
}
return null;
}
Of course, by testing node.hasChildNodes() first, there would be no need to use a pre-test for loop.
/**
* Gets strings from text nodes. Minimalist. Non-robust. Post-test loop version.
* Generic, cross platform solution. No string filtering or conditioning.
*
* #author Anthony Rutledge
* #param nodeList The child nodes of a Node, as in node.childNodes.
* #param target A positive whole number >= 1
* #return String The text you targeted.
*/
function getText(nodeList, target)
{
var trueTarget = target - 1,
length = nodeList.length,
i = 0;
do {
if ((nodeList[i].nodeType === Node.TEXT_NODE) && (i === trueTarget)) {
return nodeList[i].nodeValue; // Done! No need to keep going.
}
i++;
} while (i < length);
return null;
}
Pure JavaScript: Robust
Here the function getTextById() uses two helper functions: getStringsFromChildren() and filterWhitespaceLines().
getStringsFromChildren()
/**
* Collects strings from child text nodes.
* Generic, cross platform solution. No string filtering or conditioning.
*
* #author Anthony Rutledge
* #version 7.0
* #param parentNode An instance of the Node interface, such as an Element. object.
* #return Array of strings, or null.
* #throws TypeError if the parentNode is not a Node object.
*/
function getStringsFromChildren(parentNode)
{
var strings = [],
nodeList,
length,
i = 0;
if (!parentNode instanceof Node) {
throw new TypeError("The parentNode parameter expects an instance of a Node.");
}
if (!parentNode.hasChildNodes()) {
return null; // We are done. Node may resemble <element></element>
}
nodeList = parentNode.childNodes;
length = nodeList.length;
do {
if ((nodeList[i].nodeType === Node.TEXT_NODE)) {
strings.push(nodeList[i].nodeValue);
}
i++;
} while (i < length);
if (strings.length > 0) {
return strings;
}
return null;
}
filterWhitespaceLines()
/**
* Filters an array of strings to remove whitespace lines.
* Generic, cross platform solution.
*
* #author Anthony Rutledge
* #version 6.0
* #param textArray a String associated with the id attribute of an Element.
* #return Array of strings that are not lines of whitespace, or null.
* #throws TypeError if the textArray param is not of type Array.
*/
function filterWhitespaceLines(textArray)
{
var filteredArray = [],
whitespaceLine = /(?:^\s+$)/; // Non-capturing Regular Expression.
if (!textArray instanceof Array) {
throw new TypeError("The textArray parameter expects an instance of a Array.");
}
for (var i = 0; i < textArray.length; i++) {
if (!whitespaceLine.test(textArray[i])) { // If it is not a line of whitespace.
filteredArray.push(textArray[i].trim()); // Trimming here is fine.
}
}
if (filteredArray.length > 0) {
return filteredArray ; // Leave selecting and joining strings for a specific implementation.
}
return null; // No text to return.
}
getTextById()
/**
* Gets strings from text nodes. Robust.
* Generic, cross platform solution.
*
* #author Anthony Rutledge
* #version 6.0
* #param id A String associated with the id property of an Element.
* #return Array of strings, or null.
* #throws TypeError if the id param is not of type String.
* #throws TypeError if the id param cannot be used to find a node by id.
*/
function getTextById(id)
{
var textArray = null; // The hopeful output.
var idDatatype = typeof id; // Only used in an TypeError message.
var node; // The parent node being examined.
try {
if (idDatatype !== "string") {
throw new TypeError("The id argument must be of type String! Got " + idDatatype);
}
node = document.getElementById(id);
if (node === null) {
throw new TypeError("No element found with the id: " + id);
}
textArray = getStringsFromChildren(node);
if (textArray === null) {
return null; // No text nodes found. Example: <element></element>
}
textArray = filterWhitespaceLines(textArray);
if (textArray.length > 0) {
return textArray; // Leave selecting and joining strings for a specific implementation.
}
} catch (e) {
console.log(e.message);
}
return null; // No text to return.
}
Next, the return value (Array, or null) is sent to the client code where it should be handled. Hopefully, the array should have string elements of real text, not lines of whitespace.
Empty strings ("") are not returned because you need a text node to properly indicate the presence of valid text. Returning ("") may give the false impression that a text node exists, leading someone to assume that they can alter the text by changing the value of .nodeValue. This is false, because a text node does not exist in the case of an empty string.
Example 1:
<p id="bio"></p> <!-- There is no text node here. Return null. -->
Example 2:
<p id="bio">
</p> <!-- There are at least two text nodes ("\n"), here. -->
The problem comes in when you want to make your HTML easy to read by spacing it out. Now, even though there is no human readable valid text, there are still text nodes with newline ("\n") characters in their .nodeValue properties.
Humans see examples one and two as functionally equivalent--empty elements waiting to be filled. The DOM is different than human reasoning. This is why the getStringsFromChildren() function must determine if text nodes exist and gather the .nodeValue values into an array.
for (var i = 0; i < length; i++) {
if (nodeList[i].nodeType === Node.TEXT_NODE) {
textNodes.push(nodeList[i].nodeValue);
}
}
In example two, two text nodes do exist and getStringFromChildren() will return the .nodeValue of both of them ("\n"). However, filterWhitespaceLines() uses a regular expression to filter out lines of pure whitespace characters.
Is returning null instead of newline ("\n") characters a form of lying to the client / calling code? In human terms, no. In DOM terms, yes. However, the issue here is getting text, not editing it. There is no human text to return to the calling code.
One can never know how many newline characters might appear in someone's HTML. Creating a counter that looks for the "second" newline character is unreliable. It might not exist.
Of course, further down the line, the issue of editing text in an empty <p></p> element with extra whitespace (example 2) might mean destroying (maybe, skipping) all but one text node between a paragraph's tags to ensure the element contains precisely what it is supposed to display.
Regardless, except for cases where you are doing something extraordinary, you will need a way to determine which text node's .nodeValue property has the true, human readable text that you want to edit. filterWhitespaceLines gets us half way there.
var whitespaceLine = /(?:^\s+$)/; // Non-capturing Regular Expression.
for (var i = 0; i < filteredTextArray.length; i++) {
if (!whitespaceLine.test(textArray[i])) { // If it is not a line of whitespace.
filteredTextArray.push(textArray[i].trim()); // Trimming here is fine.
}
}
At this point you may have output that looks like this:
["Dealing with text nodes is fun.", "Some people just use jQuery."]
There is no guarantee that these two strings are adjacent to each other in the DOM, so joining them with .join() might make an unnatural composite. Instead, in the code that calls getTextById(), you need to chose which string you want to work with.
Test the output.
try {
var strings = getTextById("bio");
if (strings === null) {
// Do something.
} else if (strings.length === 1) {
// Do something with strings[0]
} else { // Could be another else if
// Do something. It all depends on the context.
}
} catch (e) {
console.log(e.message);
}
One could add .trim() inside of getStringsFromChildren() to get rid of leading and trailing whitespace (or to turn a bunch of spaces into a zero length string (""), but how can you know a priori what every application may need to have happen to the text (string) once it is found? You don't, so leave that to a specific implementation, and let getStringsFromChildren() be generic.
There may be times when this level of specificity (the target and such) is not required. That is great. Use a simple solution in those cases. However, a generalized algorithm enables you to accommodate simple and complex situations.

.text() - for jquery
$('.title').clone() //clone the element
.children() //select all the children
.remove() //remove all the children
.end() //again go back to selected element
.text(); //get the text of element

This will ignore the whitespace as well so, your never got the Blank textNodes..code using core Javascript.
var oDiv = document.getElementById("MyDiv");
var firstText = "";
for (var i = 0; i < oDiv.childNodes.length; i++) {
var curNode = oDiv.childNodes[i];
whitespace = /^\s*$/;
if (curNode.nodeName === "#text" && !(whitespace.test(curNode.nodeValue))) {
firstText = curNode.nodeValue;
break;
}
}
Check it on jsfiddle : - http://jsfiddle.net/webx/ZhLep/

Simply via Vanilla JavaScript:
const el = document.querySelector('.title');
const text = el.firstChild.textContent.trim();

You can also use XPath's text() node test to get the text nodes only. For example
var target = document.querySelector('div.title');
var iter = document.evaluate('text()', target, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE);
var node;
var want = '';
while (node = iter.iterateNext()) {
want += node.data;
}

There are some overcomplicated solutions here but the operation is as straightforward as using .childNodes to get children of all node types and .filter to extract e.nodeType === Node.TEXT_NODEs. Optionally, we may want to do it recursively and/or ignore "empty" text nodes (all whitespace).
These examples convert the nodes to their text content for display purposes, but this is technically a separate step from filtering.
const immediateTextNodes = el =>
[...el.childNodes].filter(e => e.nodeType === Node.TEXT_NODE);
const immediateNonEmptyTextNodes = el =>
[...el.childNodes].filter(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
);
const firstImmediateTextNode = el =>
[...el.childNodes].find(e => e.nodeType === Node.TEXT_NODE);
const firstImmediateNonEmptyTextNode = el =>
[...el.childNodes].find(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
);
// example usage:
const text = el => el.textContent;
const p = document.querySelector("p");
console.log(immediateTextNodes(p).map(text));
console.log(immediateNonEmptyTextNodes(p).map(text));
console.log(text(firstImmediateTextNode(p)));
console.log(text(firstImmediateNonEmptyTextNode(p)));
// if you want to trim whitespace:
console.log(immediateNonEmptyTextNodes(p).map(e => text(e).trim()));
<p>
<span>IGNORE</span>
<b>IGNORE</b>
foo
<br>
bar
</p>
Recursive alternative to a NodeIterator:
const deepTextNodes = el => [...el.childNodes].flatMap(e =>
e.nodeType === Node.TEXT_NODE ? e : deepTextNodes(e)
);
const deepNonEmptyTextNodes = el =>
[...el.childNodes].flatMap(e =>
e.nodeType === Node.TEXT_NODE && e.textContent.trim()
? e : deepNonEmptyTextNodes(e)
);
// example usage:
const text = el => el.textContent;
const p = document.querySelector("p");
console.log(deepTextNodes(p).map(text));
console.log(deepNonEmptyTextNodes(p).map(text));
<p>
foo
<span>bar</span>
baz
<span><b>quux</b></span>
</p>
Finally, feel free to join the text node array into a string if you wish using .join(""). But as with trimming and text content extraction, I'd probably not bake this into the core filtering function and leave it to the caller to handle as needed.

How to check if element has any children in Javascript?

Simple question, I have an element which I am grabbing via .getElementById (). How do I check if it has any children?

A couple of ways:
if (element.firstChild) {
// It has at least one
}
or the hasChildNodes() function:
if (element.hasChildNodes()) {
// It has at least one
}
or the length property of childNodes:
if (element.childNodes.length > 0) { // Or just `if (element.childNodes.length)`
// It has at least one
}
If you only want to know about child elements (as opposed to text nodes, attribute nodes, etc.) on all modern browsers (and IE8 — in fact, even IE6) you can do this: (thank you Florian!)
if (element.children.length > 0) { // Or just `if (element.children.length)`
// It has at least one element as a child
}
That relies on the children property, which wasn't defined in DOM1, DOM2, or DOM3, but which has near-universal support. (It works in IE6 and up and Chrome, Firefox, and Opera at least as far back as November 2012, when this was originally written.) If supporting older mobile devices, be sure to check for support.
If you don't need IE8 and earlier support, you can also do this:
if (element.firstElementChild) {
// It has at least one element as a child
}
That relies on firstElementChild. Like children, it wasn't defined in DOM1-3 either, but unlike children it wasn't added to IE until IE9. The same applies to childElementCount:
if (element.childElementCount !== 0) {
// It has at least one element as a child
}
If you want to stick to something defined in DOM1 (maybe you have to support really obscure browsers), you have to do more work:
var hasChildElements, child;
hasChildElements = false;
for (child = element.firstChild; child; child = child.nextSibling) {
if (child.nodeType == 1) { // 1 == Element
hasChildElements = true;
break;
}
}
All of that is part of DOM1, and nearly universally supported.
It would be easy to wrap this up in a function, e.g.:
function hasChildElement(elm) {
var child, rv;
if (elm.children) {
// Supports `children`
rv = elm.children.length !== 0;
} else {
// The hard way...
rv = false;
for (child = element.firstChild; !rv && child; child = child.nextSibling) {
if (child.nodeType == 1) { // 1 == Element
rv = true;
}
}
}
return rv;
}

As slashnick & bobince mention, hasChildNodes() will return true for whitespace (text nodes). However, I didn't want this behaviour, and this worked for me :)
element.getElementsByTagName('*').length > 0
Edit: for the same functionality, this is a better solution:
element.children.length > 0
children[] is a subset of childNodes[], containing elements only.
Compatibility

You could also do the following:
if (element.innerHTML.trim() !== '') {
// It has at least one
}
This uses the trim() method to treat empty elements which have only whitespaces (in which case hasChildNodes returns true) as being empty.
NB: The above method doesn't filter out comments. (so a comment would classify a a child)
To filter out comments as well, we could make use of the read-only Node.nodeType property where Node.COMMENT_NODE (A Comment node, such as <!-- … -->) has the constant value - 8
if (element.firstChild?.nodeType !== 8 && element.innerHTML.trim() !== '' {
// It has at least one
}
let divs = document.querySelectorAll('div');
for(element of divs) {
if (element.firstChild?.nodeType !== 8 && element.innerHTML.trim() !== '') {
console.log('has children')
} else { console.log('no children') }
}
<div><span>An element</span>
<div>some text</div>
<div> </div> <!-- whitespace -->
<div><!-- A comment --></div>
<div></div>

You can check if the element has child nodes element.hasChildNodes(). Beware in Mozilla this will return true if the is whitespace after the tag so you will need to verify the tag type.
https://developer.mozilla.org/En/DOM/Node.hasChildNodes

Try the childElementCount property:
if ( element.childElementCount !== 0 ){
alert('i have children');
} else {
alert('no kids here');
}

Late but document fragment could be a node:
function hasChild(el){
var child = el && el.firstChild;
while (child) {
if (child.nodeType === 1 || child.nodeType === 11) {
return true;
}
child = child.nextSibling;
}
return false;
}
// or
function hasChild(el){
for (var i = 0; el && el.childNodes[i]; i++) {
if (el.childNodes[i].nodeType === 1 || el.childNodes[i].nodeType === 11) {
return true;
}
}
return false;
}
See:
https://github.com/k-gun/so/blob/master/so.dom.js#L42
https://github.com/k-gun/so/blob/master/so.dom.js#L741

A reusable isEmpty( <selector> ) function.
You can also run it toward a collection of elements (see example)
const isEmpty = sel =>
![... document.querySelectorAll(sel)].some(el => el.innerHTML.trim() !== "");
console.log(
isEmpty("#one"), // false
isEmpty("#two"), // true
isEmpty(".foo"), // false
isEmpty(".bar") // true
);
<div id="one">
foo
</div>
<div id="two">
</div>
<div class="foo"></div>
<div class="foo"><p>foo</p></div>
<div class="foo"></div>
<div class="bar"></div>
<div class="bar"></div>
<div class="bar"></div>
returns true (and exits loop) as soon one element has any kind of content beside spaces or newlines.

<script type="text/javascript">
function uwtPBSTree_NodeChecked(treeId, nodeId, bChecked)
{
//debugger;
var selectedNode = igtree_getNodeById(nodeId);
var ParentNodes = selectedNode.getChildNodes();
var length = ParentNodes.length;
if (bChecked)
{
/* if (length != 0) {
for (i = 0; i < length; i++) {
ParentNodes[i].setChecked(true);
}
}*/
}
else
{
if (length != 0)
{
for (i = 0; i < length; i++)
{
ParentNodes[i].setChecked(false);
}
}
}
}
</script>
<ignav:UltraWebTree ID="uwtPBSTree" runat="server"..........>
<ClientSideEvents NodeChecked="uwtPBSTree_NodeChecked"></ClientSideEvents>
</ignav:UltraWebTree>

We Keep Coding

JavaScript is the programming language of the Web.

How to get html string representation for all types of Dom nodes? - javascript

Related

JavaScript getting innerHtml for all text nodes in nested elements [duplicate]

Why Does IE11 Handle Node.normalize() Incorrectly for the Minus Symbol?

Extract text from HTML while preserving block-level element newlines

How to get the text node of an element?

How to check if element has any children in Javascript?

Categories

Resources