Learning F# — Web Scraping With F# Data

- fsharp

In my ongoing efforts to learn F# properly, I recently stumbled upon the F# Data library, which implements type providers and other useful tools for working with data in CSV, HTML, JSON and XML formats.

To try it out, I hacked together a quick .fsx script which uses the library’s HTML parser to scrape the top three F# posts from one of my favorite programming websites, dev.to. Please note that the site does have an API, which would be the preferred way of doing this; the following is for demonstration purposes only.

Simple Scraping

The first step is to reference the downloaded FSharp.Data assembly relative to the script’s directory and import its contents with the open keyword:

#r "fsharp-data/lib/net45/FSharp.Data.dll"

open FSharp.Data

Next, we can use HtmlDocument.Load to fetch and parse dev.to’s top F# posts page, from which we extract the individual articles via the CssSelect method and the appropriate CSS selector:

let doc = HtmlDocument.Load("https://dev.to/t/fsharp/top/infinity")
let posts = doc.CssSelect(".single-article")
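
To experiment with selectors without hitting the site, the same calls can be tried on an inline HTML string. The following is only a minimal sketch: the markup is made up for illustration (it is not dev.to’s real page structure), and the #r "nuget: ..." line assumes a recent dotnet fsi rather than the locally downloaded assembly used in this post.

```fsharp
// Assumption: run with a recent `dotnet fsi`, which can resolve NuGet references.
#r "nuget: FSharp.Data"

open FSharp.Data

// Hypothetical markup, loosely modelled on the selectors used in this post.
let sampleHtml = """<html><body>
<div class="single-article"><div class="content"><h3> First post </h3></div></div>
<div class="single-article"><div class="content"><h3>Second post</h3></div></div>
</body></html>"""

// HtmlDocument.Parse builds a document from a string instead of a URL.
let sampleDoc = HtmlDocument.Parse sampleHtml

// CssSelect returns a list of matching nodes.
let sampleTitles =
    sampleDoc.CssSelect(".single-article .content h3")
    |> List.map (fun node -> node.InnerText().Trim())

sampleTitles |> List.iter (printfn "%s")
```

Parsing a string keeps the feedback loop short while working out which selector matches which element.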

We then iterate over the posts, extract the title, author and reaction count and take the top three posts (which are already sorted by number of reactions):

let top3 =
    posts
    |> Seq.map(fun post ->
        post |> firstChildText ".content h3",
        post |> firstChildText "h4 > a" |> cleanName,
        post |> firstChildText ".reactions-count .engagement-count-number")
    |> Seq.take 3

This uses two small helpers, which have to be defined before top3 in the script: firstChildText, a little convenience function that smooths over the fact that CssSelect doesn’t yet support the :first-child pseudo-selector, and cleanName, which removes the unnecessary ・ characters around the author’s name:

let firstChildText selector (post : HtmlNode) =
    post.CssSelect(selector).[0].DirectInnerText().Trim()

let cleanName (name : string) = name.Replace("・", "")
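
Note that indexing with .[0] throws if a selector matches nothing, for example when the page markup changes. A more defensive variant could return an option instead. This is just a sketch: tryFirstChildText is a hypothetical name, the sample markup is invented, and the #r "nuget: ..." line assumes a recent dotnet fsi.

```fsharp
// Assumption: run with a recent `dotnet fsi`, which can resolve NuGet references.
#r "nuget: FSharp.Data"

open FSharp.Data

// Hypothetical helper: Some text for the first match, None if nothing matches.
let tryFirstChildText (selector : string) (post : HtmlNode) =
    post.CssSelect(selector)
    |> List.tryHead
    |> Option.map (fun node -> node.DirectInnerText().Trim())

// Made-up markup for illustration only.
let sample =
    HtmlDocument.Parse """<html><body><div class="single-article"><h3> A title </h3></div></body></html>"""

let samplePost = sample.CssSelect(".single-article") |> List.head

let found = samplePost |> tryFirstChildText "h3"
let missing = samplePost |> tryFirstChildText "h4 > a"

printfn "%A %A" found missing
```

Callers then decide what a missing value means, for instance by substituting a default with Option.defaultValue.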

Lastly, we use a simple for loop with pattern matching to output the results:

for (title, author, reactions) in top3 do
    printf "\"%s\" (%s): %s\n" title author reactions

Running this on my Mac with .NET Core produces the following output (which hopefully will include this post in the future 😉):

$ fsharpi --exec devto.fsx
"F# for JS Devs" (Jason): 84
"Mere Functional Programming in F#" (Kasey Speakman): 65
"F# is Pretty Cool" (Ben Lovy): 51

The full script can be found in this gist.

Asynchronous Scraping

After I finished the first version of this script, I thought that this could also be a good opportunity to finally explore F#’s async programming model. So I decided to write a second version which fetches the top posts for F#, Elm, and Haskell, and displays the top five across all three tags. The overall code is quite similar to the previous script, so I’ll skip over all the parts I reused. The full script is again available in a gist.

To generate a sequence of the top five posts, the following expression is used:

let top5 =
    ["fsharp"; "elm"; "haskell"]
    |> getPosts
    |> Seq.collect(fun posts ->
        posts
        |> Seq.map(fun post ->
            post |> firstChildText ".content h3",
            post |> firstChildText "h4 > a" |> cleanName,
            post |> firstChildText ".reactions-count .engagement-count-number"))
    |> Seq.sortBy(fun (_, _, score) -> -(int score))
    |> Seq.take 5
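
The ranking part of this pipeline can be tried in isolation on plain tuples. The sketch below uses invented sample data and a hypothetical topN function. It swaps in Seq.sortByDescending for the negated key and Seq.truncate for Seq.take, since Seq.take throws when the sequence holds fewer elements than requested.

```fsharp
// Invented sample data: one list of (title, author, reactions) tuples per tag.
let samplePostsByTag =
    [ [ ("Post A", "alice", "84"); ("Post B", "bob", "65") ]
      [ ("Post C", "carol", "212") ] ]

// Hypothetical helper: flatten all tags, sort by reaction count, keep at most n.
let topN n (postsByTag : seq<(string * string * string) list>) =
    postsByTag
    |> Seq.collect id
    |> Seq.sortByDescending (fun (_, _, score) -> int score)
    |> Seq.truncate n
    |> Seq.toList

let top2 = topN 2 samplePostsByTag
// top2 = [("Post C", "carol", "212"); ("Post A", "alice", "84")]

let top5OfThree = topN 5 samplePostsByTag
// Only three posts exist, so all three are returned instead of an exception.
```

The reaction counts stay strings here, as in the scraped data, and are only converted with int for the sort key.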

The top5 pipeline invokes the getPosts helper, which uses Async.Parallel to execute three calls to fetchTagAsync in parallel, as well as Async.RunSynchronously to wait for all of them to finish:

let getPosts tags =
    tags
    |> List.map fetchTagAsync
    |> Async.Parallel
    |> Async.RunSynchronously
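
To see what this combination does without any network access, the fetch can be replaced by a stand-in. In the sketch below, fakeFetchAsync is a hypothetical function that merely sleeps and returns a string. It illustrates that Async.Parallel yields an array whose results are in the same order as the inputs, regardless of which computation finishes first.

```fsharp
// Hypothetical stand-in for a real fetch: waits briefly, then returns a label.
let fakeFetchAsync (tag : string) =
    async {
        do! Async.Sleep 100
        return sprintf "posts for %s" tag
    }

let fakeResults =
    ["fsharp"; "elm"; "haskell"]
    |> List.map fakeFetchAsync    // three computations, not yet started
    |> Async.Parallel             // combined into one Async<string []>
    |> Async.RunSynchronously     // blocks until all three have finished

// fakeResults = [| "posts for fsharp"; "posts for elm"; "posts for haskell" |]
fakeResults |> Array.iter (printfn "%s")
```

Because the three sleeps overlap, the whole run takes roughly as long as one of them rather than the sum of all three.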

fetchTagAsync is relatively simple: inside an async workflow it first constructs a URL for the given tag, then uses a let! binding to await the document returned by HtmlDocument.AsyncLoad without blocking a thread, before returning the posts selected from it:

let fetchTagAsync tag =
    async {
        let url = sprintf "https://dev.to/t/%s/top/infinity" tag
        let! doc = HtmlDocument.AsyncLoad(url)
        return doc.CssSelect(".single-article")
    }
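
One thing to be aware of: if any of the parallel computations throws, for example because a request fails, the exception surfaces from Async.RunSynchronously and the whole run fails. A possible way around this is Async.Catch, sketched below with hypothetical names (flakyFetchAsync, tryFetchAsync) and a simulated failure instead of a real request.

```fsharp
// Hypothetical stand-in for a fetch that fails for one particular tag.
let flakyFetchAsync (tag : string) =
    async {
        if tag = "elm" then failwith "simulated network error"
        return [ sprintf "%s-post" tag ]
    }

// Hypothetical wrapper: turns a failure into an empty result.
let tryFetchAsync (fetch : string -> Async<'a list>) (tag : string) =
    async {
        let! result = fetch tag |> Async.Catch
        match result with
        | Choice1Of2 posts -> return posts
        | Choice2Of2 _ -> return []   // swallow the error; real code might log it
    }

let safeResults =
    ["fsharp"; "elm"; "haskell"]
    |> List.map (tryFetchAsync flakyFetchAsync)
    |> Async.Parallel
    |> Async.RunSynchronously

// safeResults = [| ["fsharp-post"]; []; ["haskell-post"] |]
```

Whether silently dropping a failed tag is acceptable depends on the use case. For a quick script like this one, letting the exception propagate is a reasonable default too.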

Executing the complete script produces the following result:

$ fsharpi --exec devto.fsx
"What the heck is polymorphism?" (Jan van Brügge): 212
"Tour of an Open-Source Elm SPA" (Richard Feldman): 140
"How I (Finally) Built an App in Elm" (Ali Spittel): 110
"F# for JS Devs" (Jason): 84
"Elm 0.19 brings better collections" (Robin Heggelund Hansen): 78

Summary

Despite having started to look at F# seriously only about two months ago, I already feel quite productive in it. Its pragmatic support for functional and object programming (but not object-oriented programming, see Don Syme’s talk “F# Code I Love”) and its good standard library make for a pleasant developer experience, and the language really lends itself to solving problems in terms of data and transformations.
