Fetching an RSS Feed (Without a Library)
Last updated: Apr 10, 2020
Here’s a feeble attempt to stay up to date with the news without actually reading it (that’s what people do, right?). I wanted to write something that grabbed headlines from different sources to display on top of another project I was working on. A little research turned up RSS (Really Simple Syndication) feeds in all their 2000s glory. I dug around some more and looked for ways to incorporate them. This was actually a little more difficult than I’d originally thought. The Google Feed API was deprecated in 2015. Yahoo! Query Language, another tool used to retrieve data through a single web interface, was retired last year. How easy would this have been…
select title, link from rss where url = 'https://www.news-site.com/rss.xml'
Eventually I discovered two built-in JavaScript APIs for retrieving data for this use case (are there more?): one older and one newer.
XMLHttpRequest and Fetch both allow you to issue HTTP requests for exchanging data over the web.
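For reference, here’s roughly what a request looks like with the older XMLHttpRequest. I ended up using fetch for everything, so this sketch (with a placeholder path) is just for comparison:

var xhr = new XMLHttpRequest();
xhr.open('GET', '/path/to/resource');
xhr.onload = function () {
  // like fetch, XHR doesn't treat HTTP error codes as failures, so check the status
  if (xhr.status >= 200 && xhr.status < 300) {
    console.log(xhr.responseText);
  }
};
xhr.onerror = function () {
  console.error('Request failed');
};
xhr.send();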
Starting out, I was able to use fetch to request the XML from the nytimes with very little hacking to get it to work. Fetch is built around promises and responses: it takes the path of the desired resource and returns a promise that resolves to a response once it is available. That promise resolves even when the server answers with an HTTP error (like a 404), so it’s necessary to error check the response.
fetch('/path/to/resource')
  .then((response) => {
    // fetch doesn't reject on HTTP errors, so check that it's okay
    if (response.ok) {
      // do stuff with the response
    }
  })
Once the fetch returns and everything is good to go, there are multiple ways to deal with the data. I read it with the response’s text() method, which returns a promise that resolves to the body as a string. That string can then be parsed using the DOMParser interface, turning the raw XML into more manageable pieces, which can in turn be looped through while querying our relevant information. For my purposes I only wanted the titles and the links to the articles. Easy enough.
const DOMPARSER = new DOMParser();

response.text().then((xmlText) => {
  try {
    let doc = DOMPARSER.parseFromString(xmlText, "text/xml");
    doc.querySelectorAll('item').forEach((item) => {
      var title = item.querySelector('title').textContent;
      var link = item.querySelector('link').textContent;
      myHeadlineArray.push({"title": title, "link": link});
    });
  } catch (e) {
    console.error("Error in parsing feed.");
  }
});
DOMParser’s parseFromString() takes the string containing the XML to be parsed as well as the MIME type to parse it as. Then, for each item the parser found, the title and link values are captured with querySelector(), which returns the first descendant element that matches the selector. The final step is to push those newly extracted headlines into an array that is eventually passed back to be randomized and displayed!
The fetch approach worked perfectly for the nytimes’ XML, but when I went to query RSS feeds from other news sources, I ran into CORS errors. Basically, scripts running in the browser are only allowed to read responses from their own origin by default. So, a page on mydomain.com cannot read a resource from yourdomain.com unless yourdomain.com has given explicit permission to mydomain.com, or set the appropriate CORS header to give (public) permission for access. If the server doesn’t grant that permission, the client’s browser blocks the response and throws the CORS error. I assume that the nytimes has enabled CORS for all requests, and I set out to find a different solution for other news outlets (you don’t get your news from just one source, do you?).
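For illustration, here’s roughly what the failure looks like from the client side. The feed URL below is made up; the point is that when the remote server doesn’t send a header like Access-Control-Allow-Origin, the browser blocks the response and the promise rejects:

// hypothetical feed URL on a server that doesn't send CORS headers
fetch('https://www.other-news-site.com/rss.xml')
  .then((response) => response.text())
  .then((xmlText) => console.log(xmlText))
  .catch((err) => {
    // shows up in the console alongside the browser's CORS warning
    console.error("Couldn't fetch the feed (likely blocked by CORS):", err);
  });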
At this point I found that I could take one of multiple routes. There’s a browser plugin that allows you to send cross-domain requests. This might be fine in a development context, but it can’t be expected to work for everyone who visits the site without that plugin installed. Another way to do this would be to make periodic requests to the RSS feeds from my own server, cache the results, and have the page query that backend instead of the news sites directly (a rough sketch of that idea follows). However, since I’m making a proof of concept and hosting on GitHub Pages, which offers free static hosting but no backend, I needed to find a different option.
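If I did have a server, a minimal sketch of that caching route might look something like this. It uses Node with Express, assumes a Node version with a global fetch, and the feed list, refresh interval, and endpoint name are all made up for illustration:

const express = require('express');
const app = express();

// hypothetical feed list and refresh interval
const FEEDS = ['https://www.news-site.com/rss.xml'];
const REFRESH_MS = 15 * 60 * 1000;

let cache = {}; // feed URL -> raw XML string

async function refreshFeeds() {
  for (const url of FEEDS) {
    try {
      const response = await fetch(url); // CORS doesn't apply server-side
      if (response.ok) cache[url] = await response.text();
    } catch (e) {
      console.error('Failed to refresh', url, e);
    }
  }
}

refreshFeeds();
setInterval(refreshFeeds, REFRESH_MS);

// the page would query this endpoint instead of the news sites directly
app.get('/headlines', (req, res) => res.json(cache));

app.listen(3000);

Since CORS is enforced by browsers, the server-to-server requests aren’t subject to it, and I’d control whatever headers my own endpoint sends back.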
There are proxies that can be used for this exact purpose. They act as a kind of middleware between the client browser and the desired resource. I discovered rss2json and an open source CORS proxy project called CORS Anywhere. I’m sure there are many more. Anyway, rss2json is fine for this project, so I plugged the API URL into my project and it worked! There are some limitations to rss2json (you pay for more frequent updates, more requests, and more feeds), but I don’t anticipate that using the free tier will be an issue for this blog.
const apiURL = 'https://api.rss2json.com/v1/api.json?rss_url=';

function getNews(url) {
  var encodedURL = apiURL + encodeURIComponent(url);
  fetch(encodedURL)
    .then((response) => {
      // process the response as usual
    });
}
The rss2json proxy requires that the feed URL be encoded, hence the encodeURIComponent call. A limitation of this approach is that rss2json, as the name suggests, only returns JSON data, whereas fetching the nytimes directly (which didn’t require a proxy) returns the raw XML. This means the two return values get processed separately, which is annoying:
if (siteURL.indexOf('nytimes') === -1) {
  // this is not the New York Times, so it came through rss2json
  response.json().then((data) => {
    // process this data as JSON
  });
} else {
  response.text().then((data) => {
    // process this data in XML text form
  });
}
Sloppy but it works.
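For what it’s worth, the JSON branch ends up looking a lot like the XML one. This sketch assumes rss2json’s documented response shape, where the parsed entries live in an items array with title and link fields:

response.json().then((data) => {
  // rss2json wraps the feed entries in an items array
  (data.items || []).forEach((item) => {
    myHeadlineArray.push({"title": item.title, "link": item.link});
  });
});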