For simpler cases, the regex could be tweaked to achieve the desired result.
But if you are trying to do anything much more complicated than that, I would use Node, fetch the content of the page, parse it as HTML, then do whatever you need to do:
import fetch from 'node-fetch';
import { JSDOM } from 'jsdom';

const url = 'https://hibbard.eu/about/';

async function fetchData() {
  try {
    // Fetch the page and read the body as text
    const response = await fetch(url);
    const text = await response.text();

    // Parse the HTML so it can be queried like a browser DOM
    const dom = new JSDOM(text);

    // Collect the trimmed text of every link, dropping empty ones
    const links = [...dom.window.document.querySelectorAll('a')]
      .map(link => link.textContent.trim())
      .filter(text => text);

    console.log(links);
  } catch (error) {
    console.error('Error:', error);
  }
}

fetchData();
Notice that if you were to run both of the above, the output is slightly different. The shell one-liner chokes on my site title:
Oh hm, I wasn’t aware of this; apparently xmllint just doesn’t recognize these tags. SO suggests redirecting stderr to /dev/null then… but as the comment says, that’s not nice indeed. :-/ I’ll admit I have only used xmllint for actual XML so far. ^^
I would not know about tools that hackers use. I do not do hacking.
You do not say what operating system you use. Microsoft has provided an official JavaScript engine (outside of a browser) for Windows for about as long as JavaScript has existed, no need to do any hacking.
Is any of the sample code you provided actually JavaScript? I do not know JavaScript really well but I cannot find anything that says your code is JavaScript.
Hi Smmuel
I didn’t mean “hack” in the modern day typical usage of term but as a parallel for “code trick”.
Anyway, in this case I use the CentOS operating system, and the code example I gave is pseudocode, just to illustrate what abstract code that does what I’m trying to do might look like.
See the dictionaries. The Merriam-Webster dictionary defines hack (in part) as to gain access to a computer illegally, and in similar terms. There are unrelated definitions that are off-topic. It is true that nowadays people are beginning to use hack with positive connotations, but you say you did not mean it in the modern-day usage.
Please do not think my response is due to anything in the past. I did not look at who posted this until now.
That helps. That is the type of information that is best provided in a question initially.
I have never worked with Node.js, but I would gladly have tried it, and getting the textContent of an element in a webpage from a CentOS Bash terminal could be a very nice tutorial to start with. However, on the shared hosting where I actually need to do this, Node.js is not installed according to the node -v command, and I don’t have root access to install it there directly. So even if I use some “Node.js application” cPanel tool, I don’t think I’ll be able to work with Node.js as freely as I can with Perl or Python, which ship with this environment by default.
I don’t know how best to isolate only the 10.2.2 part. How do I match only the numbers and dots from the entire match of grep -oP '<span class="version-number">'.*
It is very brittle though, e.g. it relies on finding the exact string <span class="version-number"> and would break if it found something like <span class="version-number current"> or <span class='version-number'>
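Since Python ships with the environment in question, here is a rough, untested-in-your-setup sketch of a less brittle way to pull out just the version number, using only the standard library. It looks for any span whose class list contains version-number (so extra classes or single quotes don't matter) and then keeps only the digits-and-dots part. The names VersionSpanParser and extract_version are made up for this example, and I am assuming the version text sits directly inside that span.

```python
import re
from html.parser import HTMLParser


class VersionSpanParser(HTMLParser):
    """Collects text inside <span> elements whose class list includes 'version-number'."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # > 0 while we are inside a matching span
        self.chunks = []  # text pieces found inside matching spans

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self._depth:
            # A nested span inside a matching one: track it so the
            # closing tags are counted correctly.
            self._depth += 1
            return
        # The class attribute may be missing or valueless, hence the `or ""`.
        classes = (dict(attrs).get("class") or "").split()
        if "version-number" in classes:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def extract_version(html):
    """Return the first run of digits and dots inside the span, or None."""
    parser = VersionSpanParser()
    parser.feed(html)
    parser.close()
    match = re.search(r"\d+(?:\.\d+)*", "".join(parser.chunks))
    return match.group(0) if match else None
```

Something like extract_version(page_source) should then give you the bare version string, or None if no such span is found. The regex step alone, \d+(?:\.\d+)*, is the "numbers and dots only" part if you would rather keep the rest in grep.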
FWIW, you don’t need admin rights to install Node on a server, just shell access with git.
If you can install your own packages, here is the same script as above in Python (untested). You can use the requests library for fetching web content and BeautifulSoup for parsing HTML.
import requests
from bs4 import BeautifulSoup

url = 'https://example.com'

def fetch_data():
    try:
        # Fetch the page and parse it as HTML
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Collect the stripped text of every link, skipping empty ones
        links = [link.get_text().strip() for link in soup.find_all('a') if link.get_text().strip()]
        print(links)
    except Exception as error:
        print('Error:', error)

fetch_data()
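If installing packages turns out not to be an option, the same idea can be sketched with only the Python standard library (urllib and html.parser), so nothing needs to be installed. This is a minimal sketch, not a drop-in replacement for BeautifulSoup: LinkTextParser, link_texts and fetch_link_texts are names I made up here, and the parser assumes reasonably well-formed markup with no nested links.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkTextParser(HTMLParser):
    """Collects the stripped text content of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # list of text pieces while inside an <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            text = "".join(self._current).strip()
            if text:  # skip links with no visible text, e.g. image-only links
                self.links.append(text)
            self._current = None


def link_texts(html):
    """Return the non-empty link texts found in an HTML string."""
    parser = LinkTextParser()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_link_texts(url):
    """Download a page and return its link texts (needs network access)."""
    with urlopen(url, timeout=10) as response:
        # Fall back to UTF-8 if the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return link_texts(html)
```

Calling print(fetch_link_texts('https://example.com')) would then be the equivalent of fetch_data() above.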
Hi there! Well, I’ve tried it once. Basically, there is a way to run JavaScript in a shell terminal to extract data from a webpage. You can use a tool like node-fetch along with Node.js to achieve this.