Learning F# — Web Scraping With F# Data
In my ongoing efforts to learn F# properly, I recently stumbled upon the F# Data library, which implements type providers and other useful tools for working with data in CSV, HTML, JSON and XML formats.
To try it out, I hacked together a quick .fsx script which uses the library’s HTML parser to scrape the top three F# posts from one of my favorite programming websites, dev.to. Please note that the site does have an API, which would be the preferred way of doing this; the following is for demonstration purposes only.
Simple Scraping
The first step is to reference the downloaded FSharp.Data assembly relative to the script’s directory and import its contents with the open keyword:
#r "fsharp-data/lib/net45/FSharp.Data.dll"
open FSharp.Data
Next we can use HtmlDocument.Load to fetch and parse dev.to’s top F# posts page, from which we extract the individual articles via the CssSelect method and the appropriate CSS selector:
let doc = HtmlDocument.Load("https://dev.to/t/fsharp/top/infinity")
let posts = doc.CssSelect(".single-article")
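To experiment with selectors without hitting the network, the same CssSelect call can be tried on an inline HTML string. This is only a sketch: the markup below is made up and merely mimics the class names used in this post, and the #r "nuget: ..." directive assumes a recent dotnet fsi rather than the local DLL referenced above.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Made-up markup that only mimics the class names used in this post.
let sampleHtml = """
<html><body>
  <div class="single-article"><div class="content"><h3> First post </h3></div></div>
  <div class="single-article"><div class="content"><h3> Second post </h3></div></div>
</body></html>"""

// Parse the string instead of loading a URL.
let sampleDoc = HtmlDocument.Parse(sampleHtml)

// Select every title node and read its trimmed text.
let sampleTitles =
    sampleDoc.CssSelect(".single-article .content h3")
    |> Seq.map (fun node -> node.InnerText().Trim())
    |> Seq.toList

printfn "%A" sampleTitles
```

Working against a fixed string like this makes it easier to tell whether a surprising result comes from the selector or from a change in the live page.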
We then iterate over the posts, extract the title, author and reaction count and take the top three posts (which are already sorted by number of reactions):
let top3 =
    posts
    |> Seq.map (fun post ->
        post |> firstChildText ".content h3",
        post |> firstChildText "h4 > a" |> cleanName,
        post |> firstChildText ".reactions-count .engagement-count-number")
    |> Seq.take 3
This uses two small helpers: firstChildText, a little convenience function to smooth over the fact that CssSelect doesn’t yet support the :first-child pseudo-selector, and cleanName, which removes unnecessary characters around the author’s name:
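// Direct text of the first node matching the selector, trimmed of whitespace.
let firstChildText selector (post : HtmlNode) =
    post.CssSelect(selector).[0].DirectInnerText().Trim()
// Remove the "・" character that surrounds the author's name.
let cleanName (name : string) = name.Replace("・", "")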
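Indexing with .[0] throws when a selector matches nothing, which can happen whenever the page markup changes. One possible variation, shown here as a sketch, returns an option instead. The name tryFirstChildText and the sample markup are hypothetical and not part of the original script, and the #r "nuget: ..." directive again assumes a recent dotnet fsi.

```fsharp
#r "nuget: FSharp.Data"
open FSharp.Data

// Hypothetical variant of firstChildText:
// Some text for the first match, None when nothing matches.
let tryFirstChildText (selector : string) (post : HtmlNode) =
    post.CssSelect(selector)
    |> Seq.tryHead
    |> Option.map (fun node -> node.DirectInnerText().Trim())

// Made-up article node to try the helper on.
let samplePost =
    HtmlDocument.Parse("""<html><body><div class="single-article"><h3> A title </h3></div></body></html>""")
        .CssSelect(".single-article")
    |> Seq.head

let foundTitle = samplePost |> tryFirstChildText "h3"
let missingAuthor = samplePost |> tryFirstChildText "h4 > a"

printfn "%A %A" foundTitle missingAuthor
```

Callers then have to decide what a missing value means, for example by skipping the post with Seq.choose or by substituting a default with Option.defaultValue.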
Lastly, we use a simple for loop with pattern matching to output the results:
for (title, author, reactions) in top3 do
    printf "\"%s\" (%s): %s\n" title author reactions
Running this on my Mac with .NET Core produces the following output (which hopefully will include this post in the future 😉):
$ fsharpi --exec devto.fsx
"F# for JS Devs" (Jason): 84
"Mere Functional Programming in F#" (Kasey Speakman): 65
"F# is Pretty Cool" (Ben Lovy): 51
The full script can be found in this gist.
Asynchronous Scraping
After I finished the first version of this script, I thought that this could also be a good opportunity to finally explore F#’s async programming model. So I decided to write a second version which fetches the top posts for F#, Elm, and Haskell, and displays the top five across all three tags. The overall code is quite similar to the previous script, so I’ll skip over all the parts I reused. The full script is again available in a gist.
To generate a sequence of the top five posts, the following expression is used:
let top5 =
    ["fsharp"; "elm"; "haskell"]
    |> getPosts
    |> Seq.collect (fun posts ->
        posts
        |> Seq.map (fun post ->
            post |> firstChildText ".content h3",
            post |> firstChildText "h4 > a" |> cleanName,
            post |> firstChildText ".reactions-count .engagement-count-number"))
    |> Seq.sortByDescending (fun (_, _, score) -> int score)
    |> Seq.take 5
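Two details of this pipeline can fail at runtime: converting the score with int throws on text that is not a number, and Seq.take throws when the sequence holds fewer elements than requested. The sketch below shows a more forgiving variant on made-up sample data. The parseScore helper and the sample tuples are hypothetical, and treating an unparsable count as 0 is an assumption, not something the original script does.

```fsharp
open System

// Hypothetical helper: parse a reaction count, falling back to 0
// for text that is not a valid integer.
let parseScore (text : string) =
    match Int32.TryParse(text.Trim()) with
    | true, value -> value
    | false, _ -> 0

// Made-up tuples in the (title, author, reactions) shape used above.
let samplePosts =
    [ ("Post A", "Alice", "84")
      ("Post B", "Bob", "212")
      ("Post C", "Carol", "")
      ("Post D", "Dave", "140") ]

// Seq.truncate returns at most two items and never throws on short input.
let topTwo =
    samplePosts
    |> Seq.sortByDescending (fun (_, _, score) -> parseScore score)
    |> Seq.truncate 2
    |> Seq.toList

printfn "%A" topTwo
```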
This invokes the getPosts helper, which uses Async.Parallel to execute three calls to fetchTagAsync in parallel, as well as Async.RunSynchronously to wait for all of them to finish:
let getPosts tags =
    tags
    |> List.map fetchTagAsync
    |> Async.Parallel
    |> Async.RunSynchronously
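The same pipeline shape can be tried without any network access. In this sketch, fakeFetchAsync is a hypothetical stand-in for fetchTagAsync that just sleeps briefly and returns a made-up string. It illustrates that Async.Parallel produces an array whose results are in the same order as the input workflows.

```fsharp
// Hypothetical stand-in for a network call:
// waits briefly, then returns a made-up result.
let fakeFetchAsync (tag : string) =
    async {
        do! Async.Sleep 100
        return sprintf "posts for %s" tag
    }

// Start all workflows together and block until every one has finished.
let fakeResults =
    ["fsharp"; "elm"; "haskell"]
    |> List.map fakeFetchAsync
    |> Async.Parallel
    |> Async.RunSynchronously

printfn "%A" fakeResults
```

Because the result is an array with one entry per tag, it can be passed straight to Seq.collect, as the top5 expression does.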
fetchTagAsync is relatively simple: inside an async workflow it first constructs a URL for the given tag, then uses an asynchronous let! binding to wait for the document to load without blocking the thread, before returning the posts:
let fetchTagAsync tag =
    async {
        let url = sprintf "https://dev.to/t/%s/top/infinity" tag
        let! doc = HtmlDocument.AsyncLoad(url)
        return doc.CssSelect(".single-article")
    }
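As written, a single failing request raises an exception that surfaces from Async.RunSynchronously and discards the other results. One way to tolerate a failing tag, sketched here under assumptions, is to wrap each workflow with Async.Catch. Both fakeLoadAsync and fetchOrEmptyAsync are hypothetical names, and the loader fails on purpose for the tag "broken" to simulate a network error.

```fsharp
// Hypothetical loader standing in for the real request;
// it fails on purpose for one tag.
let fakeLoadAsync (tag : string) =
    async {
        if tag = "broken" then failwith "simulated network error"
        return [ sprintf "%s post" tag ]
    }

// Turn a failure into an empty list instead of an exception.
let fetchOrEmptyAsync tag =
    async {
        let! outcome = fakeLoadAsync tag |> Async.Catch
        match outcome with
        | Choice1Of2 posts -> return posts
        | Choice2Of2 _ -> return []
    }

let survivingPosts =
    ["fsharp"; "broken"; "elm"]
    |> List.map fetchOrEmptyAsync
    |> Async.Parallel
    |> Async.RunSynchronously
    |> Seq.collect id
    |> Seq.toList

printfn "%A" survivingPosts
```

Silently dropping a failed tag is a design choice that suits a quick script; logging the exception in the Choice2Of2 branch would be a reasonable alternative.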
Executing this produces the following result:
$ fsharpi --exec devto.fsx
"What the heck is polymorphism?" (Jan van Brügge): 212
"Tour of an Open-Source Elm SPA" (Richard Feldman): 140
"How I (Finally) Built an App in Elm" (Ali Spittel): 110
"F# for JS Devs" (Jason): 84
"Elm 0.19 brings better collections" (Robin Heggelund Hansen): 78
Summary
Despite only having started seriously looking at F# around two months ago, I already feel quite productive in it. Its pragmatic support for functional and object programming (but not object-oriented; see Don Syme’s talk “F# Code I Love”) and its good standard library make for a pleasant developer experience, and the language really lends itself to solving problems in terms of data and transformations.