Emails in OCaml | MirageOS

The security of communications poses a seemingly never-ending challenge across Cyberspace. From sorting through mountains of spam to protecting our private messages from malicious hackers, cybersecurity has never been more important than it is today. It takes considerable technical skills and dependable infrastructure to run an email service, and sadly, most companies with the ability to handle the billions of emails sent daily make money off mining your sensitive data.

Five years ago, we set out to explore how to send and receive email securely. I was in the final year of an internship at Cambridge, and the goal was to develop an OCaml library that could parse and craft emails. Thus, Mr. MIME was born. I even gave a presentation on it at ICFP 2016 and introduced Mr. MIME in a previous post. Mr. MIME was also selected by the [NGI DAPSI initiative](https://tarides.com/blog/2022-03-08-secure-virtual-messages-in-a-bottle-with-scop) last year.

I'm thrilled to shine a spotlight on Mr. MIME as part of the MirageOS 4 release! It was essential to create several small libraries when building and testing Mr. MIME. I've included some samples of how to use Mr. MIME to parse and serialise emails in OCaml, as well as receiving and sending SMTP messages. I then explain how to use all of this via CLI tools. Since unikernels were the foundation on which I built Mr. MIME, the final section explains how to deploy unikernels to handle email traffic.

A Tour of the Many Email Libraries

The following libraries were created to support Mr. MIME:

pecu as the quoted-printable serialiser/deserialiser. Strictly following the standards, email transmission can use a 7-bit channel, so several encodings exist to safely transmit 8-bit contents over such channels. quoted-printable is one of them, where any non-ASCII characters are encoded.
Another encoding is the famous UTF-7 (the one from RFC2152, not the one from section 5.1.3 of RFC2060), which is available in the yuscii library. Please note, Yugoslavian engineers created the YUSCII encoding to replace the imperial ASCII one.
rosetta is a little library that normalises inputs such as KOI8-{U,R} or ISO-8859-* to Unicode. This lets mrmime produce only UTF-8 results, which removes the encoding problem. Then, according to RFC6532 and Postel's law, Mr. MIME can produce only UTF-8 emails.
ke is a small library that implements a ring buffer with bigarray. This library has only one purpose: to restrict a transmission's memory consumption via a ring buffer, like Xen's famous shared-memory ring buffer.
emile may be the most useful library for many users. It parses and re-encodes an email address according to standards. Email addresses are hard! Many details exist, and some of them have meaning while others don't. emile offers the most standardised way to parse email addresses, and it has the smallest dependency cone, so it can be used by any project, regardless of size.
unstrctrd may be the most obscure library, but it's the essential piece of Mr. MIME. From archaeological research into the multiple standards that have described emails over time, we discovered the most generic form of any value available in your header: the unstructured form. At least email addresses, Date (RFC822), and DKIM-Signature follow this form. More generally, this form can also be found in Debian package descriptions (the RFC822 form). unstrctrd implements a decoder for it.
prettym is the most recently developed library in this context. It's like the Format module combined with ke, and it produces a continuation that fills a fixed-length buffer. prettym describes how to encode emails while complying with the 80-column rule, so any email generated by Mr. MIME fits on a cathode-ray monitor! More importantly, along with the 7-bit limitation, this rule comes from the MTU limitation of routers, and the standards require it.
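To make the quoted-printable idea more concrete, here is a minimal sketch of such a codec using only the OCaml standard library. It is not pecu's API: the names `is_safe`, `encode`, and `decode` are hypothetical, and the sketch deliberately ignores soft line breaks, trailing-whitespace rules, and line-length limits that a real implementation has to handle.

```ocaml
(* Toy quoted-printable codec (illustration only, not pecu's API). *)

(* Printable ASCII other than '=' (plus the space) is passed through. *)
let is_safe c = (c >= '!' && c <= '~' && c <> '=') || c = ' '

(* Every other byte becomes "=XX", with XX its uppercase hex value. *)
let encode (s : string) : string =
  let buf = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      if is_safe c then Buffer.add_char buf c
      else Buffer.add_string buf (Printf.sprintf "=%02X" (Char.code c)))
    s;
  Buffer.contents buf

(* Reverse operation: "=XX" becomes one byte, anything else is copied. *)
let decode (s : string) : string =
  let len = String.length s in
  let buf = Buffer.create len in
  let rec go i =
    if i < len then begin
      let hex =
        if s.[i] = '=' && i + 2 < len
        then int_of_string_opt ("0x" ^ String.sub s (i + 1) 2)
        else None
      in
      match hex with
      | Some code -> Buffer.add_char buf (Char.chr code); go (i + 3)
      | None -> Buffer.add_char buf s.[i]; go (i + 1)
    end
  in
  go 0;
  Buffer.contents buf
```

With this sketch, the UTF-8 bytes of "é" (0xC3 0xA9) travel as the 7-bit-safe text `=C3=A9`, and decoding gives the original bytes back.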

From all of these, we developed mrmime, a library that can transform your email into an OCaml value and create an email from it. Some of this work is also needed in other contexts, especially the multipart format. We decided to extract the relevant piece of software into a new library specialised for HTTP (which shares many things with email), then integrate it into Dream. For example, see multipart_form.

A huge amount of work has been done on mrmime to ensure a kind of isomorphism, such as x = decode(encode(x)). For this goal, we created a fuzzer that can generate emails. Next, we encoded each generated email and then decoded the result. Finally, we compared the results and checked that they were semantically equal. This lets us generate many emails and check that Mr. MIME doesn't alter their values.
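The round-trip property can be sketched on a toy model. The code below is not the fuzzer used for mrmime: it assumes a drastically simplified "header" (a list of name/value pairs), a hypothetical `encode`/`decode` pair, and a small random generator built on the standard `Random` module.

```ocaml
(* Toy model: a header is a list of (name, value) pairs. *)
type header = (string * string) list

let encode (h : header) : string =
  String.concat "" (List.map (fun (k, v) -> k ^ ": " ^ v ^ "\r\n") h)

(* Only meant for strings produced by [encode]: it raises on other input. *)
let decode (s : string) : header =
  String.split_on_char '\n' s
  |> List.filter (fun line -> line <> "")
  |> List.map (fun line ->
         (* Drop the '\r' left over from the CRLF terminator. *)
         let line = String.sub line 0 (String.length line - 1) in
         let i = String.index line ':' in
         ( String.sub line 0 i,
           String.sub line (i + 2) (String.length line - i - 2) ))

(* Random string with a length in [min_len, max_len], chars in [lo, hi]. *)
let random_string st ~min_len ~max_len ~lo ~hi =
  let len = min_len + Random.State.int st (max_len - min_len + 1) in
  let span = Char.code hi - Char.code lo + 1 in
  String.init len (fun _ -> Char.chr (Char.code lo + Random.State.int st span))

(* Names are lowercase letters; values are printable ASCII without CR/LF. *)
let random_header st : header =
  List.init (Random.State.int st 5) (fun _ ->
      ( random_string st ~min_len:1 ~max_len:8 ~lo:'a' ~hi:'z',
        random_string st ~min_len:0 ~max_len:20 ~lo:' ' ~hi:'~' ))

(* Check x = decode (encode x) on [runs] generated headers. *)
let round_trips ~runs =
  let st = Random.State.make [| 42 |] in
  let rec go n =
    n = 0
    || (let h = random_header st in
        decode (encode h) = h && go (n - 1))
  in
  go runs
```

Calling `round_trips ~runs:1000` exercises the property on a thousand generated values. A real email fuzzer has to generate far richer inputs (folding, comments, encoded words, multipart bodies), which is where the hard cases appear.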

We also produced a large corpus of emails (a million) that follow the standards. It's really interesting work because it offers the community a free corpus of emails against which implementations can check their reliability through Mr. MIME. For a long time after we released Mr. MIME, users wondered how to confirm that what they decoded is what they wanted. It's easy! Just do as we did! Give a million emails to Mr. MIME and see for yourself. It never fails to decode them all!

At first, we discovered a problem with this implementation because we couldn't verify that Mr. MIME correctly parsed the emails, but we fixed that through our work on hamlet.

hamlet offers a large corpus of emails, which proves the reliability of Mr. MIME, and mrmime can parse any of these emails. They can be re-encoded, and mrmime doesn't alter anything at any step. We ensure correspondence between the parser and the encoder, and we can finally say that mrmime gives us the expected result after parsing an email.

Parsing and Serialising Emails with Mr. MIME

It's pretty easy to manipulate and craft an email with Mr. MIME, and from our work (especially on hamlet), we are convinced it's reliable. Here are some examples of Mr. MIME in OCaml to show you how to create an email and how to introspect and analyse an email:

open Mrmime

let romain_calascibetta =
  let open Mailbox in
  Local.[ w "romain"; w "calascibetta" ] @ Domain.(domain, [ a "gmail"; a "com" ])

let tarides =
  let open Mailbox in
  Local.[ w "contact" ] @ Domain.(domain, [ a "tarides"; a "com" ])

let date = Date.of_ptime ~zone:Date.Zone.GMT (Ptime_clock.now ())

let content_type =
  Content_type.(make `Text (Subtype.v `Text "plain") Parameters.empty)

let subject =
  let open Unstructured.Craft in
  compile [ v "A"; sp 1; v "simple"; sp 1; v "email" ]

let header =
  let open Header in
  empty
  |> add Field_name.date Field.(Date, date)
  |> add Field_name.subject Field.(Unstructured, subject)
  |> add Field_name.from Field.(Mailboxes, [ romain_calascibetta ])
  |> add (Field_name.v "To") Field.(Addresses, Address.[ mailbox tarides ])
  |> add Field_name.content_encoding Field.(Encoding, `Quoted_printable)

let stream_of_stdin () = match input_line stdin with
  | line -> Some (line, 0, String.length line)
  | exception _ -> None

let v =
  let part = Mt.part ~header stream_of_stdin in
  Mt.make Header.empty Mt.simple part

let () =
  let stream = Mt.to_stream v in
  let rec go () = match stream () with
    | Some (str, off, len) ->
      output_substring stdout str off len ;
      go ()
    | None -> () in
  go ()

(* $ ocamlfind opt -linkpkg -package mrmime,ptime.clock.os in.ml -o in.exe
   $ echo "Hello World\!" | ./in.exe > mail.eml
*)

In the example above, we wanted to create a simple email with an incoming body using the standard input. It shows that mrmime is able to encode the body correctly according to the given header. For instance, we used the quoted-printable encoding (implemented by pecu).

Then, in the example below from the standard input, we wanted to extract the incoming email's header and extract the email addresses (from the From, To, Cc, Bcc and Sender fields). Then, we show them:

open Mrmime

let ps =
  let open Field_name in
  Map.empty
  |> Map.add from Field.(Witness Mailboxes)
  |> Map.add (v "To") Field.(Witness Addresses)
  |> Map.add cc Field.(Witness Addresses)
  |> Map.add bcc Field.(Witness Addresses)
  |> Map.add sender Field.(Witness Mailbox)

let parse ic =
  let decoder = Hd.decoder ps in
  let rec go (addresses : Emile.mailbox list) =
    match Hd.decode decoder with
    | `Malformed err -> failwith err
    | `Field field ->
      ( match Location.prj field with
      | Field (_, Mailboxes, vs) ->
        go (vs @ addresses)
      | Field (_, Mailbox, v) ->
        go (v :: addresses)
      | Field (_, Addresses, vs) ->
        let vs =
          let f = function
            | `Group { Emile.mailboxes; _ } ->
              mailboxes
            | `Mailbox m -> [ m ] in
          List.(concat (map f vs)) in
        go (vs @ addresses)
      | _ -> go addresses )
    | `End _ -> addresses
    | `Await -> match input_line ic with
      | "" -> go addresses
      | line
        when String.length line >= 1
          && line.[String.length line - 1] = '\r' ->
        Hd.src decoder (line ^ "\n") 0
          (String.length line + 1) ;
        go addresses
      | line ->
        Hd.src decoder (line ^ "\r\n") 0
          (String.length line + 2) ;
        go addresses
      | exception _ ->
        Hd.src decoder "" 0 0 ;
        go addresses in
  go []

let () =
  let vs = parse stdin in
  List.iter (Format.printf "%a\n%!" Emile.pp_mailbox) vs

(* $ ocamlfind opt -linkpkg -package mrmime out.ml -o out.exe
   $ echo "Hello World\!" | ./in.exe | ./out.exe
   romain.calascibetta@gmail.com
   contact@tarides.com
*)

With this library, we can process emails correctly and verify some meta-information, or add metadata such as the Received: field.

Sending Emails with SMTP

Of course, when we talk about email, we must talk about SMTP (described by RFC5321). This protocol is an old one (see RFC821 - 1982), and it comes with many things such as:

8BITMIME support (1993)
PLAIN authentication (1999)
STARTTLS (2002)
or TLS to submit an email (2018)
and some others (such as pipelining or enhanced status codes)

Throughout this protocol's history, we tried to pay attention to CVEs like:

The TURN command (see CVE-1999-0512)
Authentication over an unsecured channel (see CVE-2017-15042)
And many others due to buffer overflow

Reimplementing the SMTP protocol becomes an archaeological job where we must be aware of its history through the evolution of its standards, usages, and experiments, so we tried to find the best way to implement the protocol.

We decided to implement a simple framework in order to describe the state machine of an SMTP server that can upgrade its flow to TLS, so we created colombe as a simple library to implement the foundations of the protocol. In the spirit of MirageOS projects, colombe doesn't depend on lwt, async, or any specific TCP/IP stack, so we keep full control over the incoming/outgoing flow during the process, especially when we want to test/mock our state machine.
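To illustrate what a protocol state machine that is independent of any I/O layer can look like, here is a small sketch. It is not colombe's actual API: the `step` type, `expect`, `client`, and `run_mock` are hypothetical names, and the dialogue only understands single-line replies.

```ocaml
(* A protocol step never performs I/O itself: it only says what it needs. *)
type 'a step =
  | Read of (string -> 'a step)  (* needs one line from the peer *)
  | Write of string * 'a step    (* wants to send one line *)
  | Return of 'a                 (* the dialogue is over *)

(* Wait for a reply starting with [code], then continue with [k]. *)
let expect code k =
  Read
    (fun line ->
      if String.length line >= 3 && String.sub line 0 3 = code then k ()
      else Return (Error ("unexpected reply: " ^ line)))

(* Tiny client dialogue: greeting, EHLO, QUIT. *)
let client ~domain : (unit, string) result step =
  expect "220" @@ fun () ->
  Write
    ( "EHLO " ^ domain,
      expect "250" @@ fun () ->
      Write ("QUIT", expect "221" @@ fun () -> Return (Ok ())) )

(* Mock driver: feeds canned replies and records what the machine writes.
   A real driver would do the same with a socket, with or without TLS. *)
let run_mock replies machine =
  let rec go replies written = function
    | Return r -> (r, List.rev written)
    | Write (line, k) -> go replies (line :: written) k
    | Read k -> (
        match replies with
        | [] -> (Error "connection closed", List.rev written)
        | r :: rest -> go rest written (k r))
  in
  go replies [] machine
```

Because the machine is a plain value, the same `client` can be driven by a test with canned replies, by a Unix socket, or by a MirageOS flow, and only the driver changes.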

With such a design, it becomes easy to integrate a TLS stack. We decided to provide (by default) the SMTP protocol with the STARTTLS command via the great ocaml-tls project. Of course, the end user can choose something else if they want.

From all the above, we recently implemented sendmail (and its derivative with STARTTLS), which is currently used by some projects, such as letters, Sihl, or Dream, to send an email to existing services (see Mailgun or Sendgrid). Thanks to these outsiders for using our work!

Manipulating Emails with CLI Tools

mrmime is the bedrock of our email stack. With mrmime, it's possible to manipulate emails as the user wants, so we developed several tools to help the user manipulate emails:

ocaml-dkim provides a tool to verify and sign an email. This tool is interesting because we put a lot of effort into ensuring that the verification is really memory-bound. Indeed, many tools that verify the DKIM signature do two passes: one to extract the signature and the second to verify. However, it's possible to combine these two steps into one and ensure that such verification can be "piped" into a larger process (such as an SMTP reception server).
uspf provides a tool that verifies meta-information about the email's source (such as the IP address of the sender) and ensures that the email didn't come from an untrusted source. Like ocaml-dkim, it's a simple tool that can be "piped" into a larger process.
ocaml-maildir is a MirageOS project that manipulates a maildir "store." Similar to MirageOS, ocaml-maildir provides a multitude of backends, depending on your context. Of course, the default backend is Unix, but we planned to use ocaml-maildir with Irmin.
ocaml-dmarc is finally the tool which aggregates SPF and DKIM meta-information to verify an incoming email (if it comes from an expected authority and wasn't altered).
spamtacus is a tool which analyses the incoming email to determine if it's spam or not. It filters incoming emails and rejects spam.
conan is an experimental tool that re-implements the command file to recognise the MIME type of a given file. Its status is still experimental, but outcomes are promising! We hope to continue the development of it to improve the whole MirageOS stack.
blaze is the end-user tool. It aggregates many small programs in the Unix spirit. Every tool can be used with a "pipe" (|), which allows users to do something more complex with their emails. It permits an introspection of our emails in order to aggregate some information, and it offers a "functional" way to craft and send an email, as you can see below:

$ blaze.make --from din@osau.re \
  | blaze.make wrap --mixed \
  | blaze.make put --type image/png --encoding base64 image.png \
  | blaze.submit --sender din@osau.re --password ****** osau.re
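The single-pass, memory-bound verification mentioned above for ocaml-dkim can be pictured with a toy model. This is not DKIM and not ocaml-dkim's API: the `X-Toy-Sum` field, the `checksum` function, and every other name below are invented for the illustration. The point is only that the expected value is learnt from the header while the body is checked as it streams by, so nothing but a small state is kept in memory.

```ocaml
(* State carried along the stream: never the message itself. *)
type state =
  | In_header of int option      (* expected checksum, once seen *)
  | In_body of int option * int  (* expected checksum, running checksum *)

(* Toy running checksum, standing in for a real hash. *)
let checksum acc line =
  let acc = ref acc in
  String.iter (fun c -> acc := ((!acc * 31) + Char.code c) land 0xFFFFFF) line;
  !acc

let prefix = "X-Toy-Sum: "

(* Consume one line and return the updated state. *)
let step state line =
  match state with
  | In_header expected ->
      let n = String.length prefix in
      if line = "" then In_body (expected, 0) (* empty line ends the header *)
      else if String.length line > n && String.sub line 0 n = prefix then
        In_header (int_of_string_opt (String.sub line n (String.length line - n)))
      else In_header expected
  | In_body (expected, sum) -> In_body (expected, checksum sum line)

(* One pass over a stream of lines; true only if the body matches. *)
let verify (next : unit -> string option) : bool =
  let rec go state =
    match next () with
    | Some line -> go (step state line)
    | None -> (
        match state with
        | In_body (Some expected, sum) -> expected = sum
        | _ -> false)
  in
  go (In_header None)

(* Helper turning a list of lines into a stream, handy for experiments. *)
let of_list lines =
  let rest = ref lines in
  fun () ->
    match !rest with
    | [] -> None
    | x :: tl ->
        rest := tl;
        Some x
```

Since `verify` only pulls lines from `next`, it can sit in the middle of a larger pipeline, such as an SMTP receiver that forwards each line as soon as it has been seen.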

Currently, our development mainly follows the same pattern:

Make a library that manipulates emails
Provide a simple tool that does the job implemented by our library
Integrate it into our "stack" with MirageOS

blaze is a part of this workflow where you can find:

blaze.dkim which uses ocaml-dkim
blaze.spf which uses uspf
blaze.mdir which uses ocaml-maildir
and many small tools such as:
- blaze.recv to produce a graph of the route of our email
- blaze.send/blaze.submit to send an email to a recipient/an authority
- blaze.srv which launches a simple SMTP server to receive an email
- blaze.descr which describes the structure of your email
- and some others...

It's interesting to split and prioritise the goals across everything email can do, instead of making a monolithic tool that supports far too wide a range of features, although that could also be useful. We ensure a healthy separation between all functionalities and make the user responsible through a self-learning experience, because even the most useful black box does not really help.

Deploying Email Services as Unikernels

As previously mentioned, we developed all of these libraries in the spirit of MirageOS. This mainly means that they should work everywhere, given that we gave great attention to dependencies and abstractions. The goal is to provide a full SMTP stack that's able to send and receive emails.

This work was funded by the NGI DAPSI project, which was jointly funded by the EU's Horizon 2020 research and innovation programme (contract No. 871498) and the Commissioned Research of National Institute of Information.

Such an endeavour takes a huge amount of work on the MirageOS side in order to "scale-up" our infrastructure and deploy many unikernels automatically, so we can propose a coherent final service. We currently use:

albatross as the daemon which deploys unikernels
ocurrent as the Continuous Integration pipeline that compiles unikernels from the source and asks albatross to deploy them

We have a self-contained infrastructure. It does not require extra resources, and you can bootstrap a full SMTP service from what we did with required layouts for SPF, DKIM, and DMARC. Our SMTP stack requires a DNS stack already developed and used by mirageos.org. From that, we provide a submit service and a receiver that redirects incoming emails to their real identities.

This graph shows our infrastructure:

As you can see, we have seven unikernels:

A simple submission server that can authenticate clients (or not) from a Git database
A DKIM signer that holds your private key and notifies the primary DNS server to record your public key, which lets receivers verify the integrity of your sent emails
The primary DNS server that handles your domain name
The SMTP relay that transfers incoming emails to their right destinations. For instance, for a given user (e.g., foo@<my-domain>) from the Git database, the relay knows that the real address is foo@gmail.com. Thus, it will retransfer the incoming email to the correct SMTP service.
The SMTP relay needs a DNS resolver to get the IP of the destination. This is our fifth unikernel, which ensures that we don't use extra resources and that we control everything necessary to send and receive emails.
The SMTP receiver does a sanity check on incoming emails, such as SPF and DKIM (DMARC), and prepends the incoming email with results.
Finally, we have a spam filter that prepends incoming emails with meta information, which helps us to determine if they're spam or not.
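The relay's address rewriting can be pictured with a few lines of OCaml. This is only a sketch under assumptions: the alias table is hard-coded here, whereas the stack described above reads users from a Git database, and `aliases` and `resolve` are hypothetical names.

```ocaml
module Users = Map.Make (String)

(* Hypothetical alias table, standing in for the Git database. *)
let aliases = Users.empty |> Users.add "foo" "foo@gmail.com"

(* Return the real destination of [address] if it belongs to [domain]
   and its local part is a known user. *)
let resolve ~domain address =
  match String.split_on_char '@' address with
  | [ local; d ] when String.lowercase_ascii d = domain ->
      Users.find_opt local aliases
  | _ -> None
```

For example, `resolve ~domain:"example.org" "foo@example.org"` yields `Some "foo@gmail.com"`, while an unknown user or a foreign domain yields `None` and the relay would refuse the recipient.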

An eighth unikernel can help provide a Let's Encrypt certificate under your domain name. This ensures a secure TLS connection from a recognised authority. At the boot of the submission server and the receiver, they ask this unikernel to obtain and use a certificate. Users can now submit emails in a secure way, and senders can transmit their emails in a secure way, too.

The SMTP stack is pretty complex, but any of these unikernels can be used separately from the others. Finally, a full tutorial to deploy this stack from scratch is available here, and the development of unikernels is available in the ptt (Poste, Télégraphe, and Téléphone) repository.