For our customer in motorsport, we are working on a performance analysis platform called equilibrium. What exactly that means doesn’t matter here - suffice it to say: data from and about race cars is managed. In this article I talk about some back-end implementation details which are way more boring than this intro makes them sound.

the problem

In equilibrium, we are working with a fairly big and fairly heterogeneous data set, and the sources of that data are not particularly trustworthy: the subsets are often generated manually and therefore contain mistakes and inconsistencies. Since the types are very document-y - there is a lot of bucketing, subtyping and collecting going on - we have chosen a document-oriented database in mongoDB*. The data itself, however, comes in all kinds of formats: the best case scenario is structured CSV files in a sort-of-structured directory tree, the worst case and more common scenario is “you have to extract data from these three different PDF and other files manually”. So getting it into the database means

  • we have to structure it
  • we have to clean it or straight up generate it via manual compilation
  • it somehow needs identifiers, and it has to retain its relational semantics

and on top of that, the “getting it into the database” step must be a reliable, repeatable process, as we are going to do it a lot during development, staging and, finally, once during deployment.

the solution™

After defining the intended structure in JSON documents - mongoDB takes BSON, but the structure is obviously the same - data must be brought into that form. Existing CSV, MS Excel and similar formats must be elevated via structured metadata; in our case, that means packaging collections of such data in a semantically meaningful way and adding a TOML** file with metadata and further information describing the collection. Other data must be collected manually in structured files - again, TOML in our case.

Subsequently, the TOML-formatted data and the preformatted data must be read and transformed into BSON, to be uploaded to the mongoDB database. This is most easily done by using the same types as when interacting with the database; however, an identifier must be chosen or defined to retain relational semantics. Each document, to put it more simply, needs a key. Writing these keys manually is not only impractical, it is also error-prone and not sustainable as equilibrium grows, so we need a way to instantiate data types with newly generated keys - sometimes derived from the internal fields of the type - which they can be identified with.

the implementation

As serde, the ubiquitous Rust serialization/deserialization framework, can’t do that out of the box, we have to design around it a little bit. What we want to end up with is something like this:

let toml = indoc!(
                  r#"
  startno = "99"
  team = "Target Competition"
  car = "4ed2adb7-64e7-3646-9c45-56c75857af2d"
  cw_ballast = 40

  [driver]
  first = "Josh"
  last = "Files"
  "#
);

assert!(toml::from_str::<Competitor>(&toml).is_err());

let competitor = Competitor::from_toml_str(&toml).unwrap();
assert_eq!(competitor.startno, "99");
assert_eq!(competitor.driver,
           Driver { first: "Josh".to_string(),
                    last:  "Files".to_string(), });
assert_eq!(competitor.team, "Target Competition");
let car_key =
  Uuid::parse_str("4ed2adb7-64e7-3646-9c45-56c75857af2d").unwrap();
assert_eq!(competitor.car, car_key);
assert_eq!(competitor.cw_ballast, 40);
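
For reference, the Competitor and Driver types behind this example look roughly like this - a sketch only: the derives, visibility and exact field types (notably the key) are assumptions for illustration, not a copy of our production code:

use serde::{Deserialize, Serialize};
use uuid::Uuid;

#[derive(Debug, PartialEq, Serialize, Deserialize)]
pub struct Driver {
  pub first: String,
  pub last:  String,
}

#[derive(Debug, PartialEq, Serialize, Deserialize)]
pub struct Competitor {
  pub key:        Uuid,   // the identifier the plain deserialization lacks
  pub startno:    String,
  pub team:       String,
  pub driver:     Driver,
  pub car:        Uuid,   // reference to a car document
  pub cw_ballast: u32,
}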

Specifically, there is a TOML string which lacks the “key” field, and we want to make sure it can’t be deserialized just like that:

assert!(toml::from_str::<Competitor>(&toml).is_err());

However, it should be perfectly fine to do it explicitly:

let competitor = Competitor::from_toml_str(&toml).unwrap();

So how do we achieve that? Well, first we need a proxy type to deserialize through:

#[derive(Deserialize)]
pub struct DocProxy {
  // the flatten attribute makes this a catch-all: every field of the
  // incoming document ends up in this map, whatever it is called
  #[serde(flatten)]
  proxy_data: HashMap<String, Json>,
}

Then, we need a trait which kind of collects functions for the host type:

pub trait ProxyHost {
  type Host: ProxyHost + for<'de> Deserialize<'de>;

  // names the fields the key should be derived from; None means the
  // key carries no semantics and may be chosen at random
  fn key_fields() -> Option<Vec<String>>;

  fn from_toml_str(toml: &str) -> Result<Self::Host> {
    toml::from_str::<DocProxy>(toml)?.to_doc()
  }

  fn from_json_str(json: &str) -> Result<Self::Host> {
    serde_json::from_str::<DocProxy>(json)?.to_doc()
  }
}

Here we can already see some of the generic goodness: the “toml::from_str” and “serde_json::from_str” calls are clear indicators of what is going on, yet no type needs to implement these functions itself - the default implementations take care of it.
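
For the Competitor type from the example, an implementation of the trait could then be as small as this - again a sketch; which fields form the key basis (start number and team here) is an assumption for illustration:

impl ProxyHost for Competitor {
  type Host = Competitor;

  // derive the key from the start number and the team name;
  // a host type with no natural key basis would return None here
  // and end up with a random UUID instead
  fn key_fields() -> Option<Vec<String>> {
    Some(vec!["startno".to_string(), "team".to_string()])
  }
}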

Then, finally, we need the actual conversion logic:

impl DocProxy {
  pub fn to_doc<Doc>(mut self) -> Result<Doc>
    where Doc: ProxyHost + for<'de> Deserialize<'de> {
    // if a key has been deserialized, use it
    if !self.proxy_data.contains_key("key") {
      let key_fields = Doc::key_fields();

      let key = match key_fields {
        None => json!(Uuid::new_v4()),
        Some(fields) => {
          // else, generate UUID from MD5 hash based on key elements
          let mut key_elements: Vec<String> = Vec::new();
          for field in fields {
            let value =
              self.proxy_data
                  .get(&field)
                  .ok_or(
                    Fubar::new(&format!("{} field missing", field))
                  )?;

            key_elements.push(
              serde_json::from_value(value.clone())?
            );
          }
          json!(Uuid::new_v3(&Uuid::NAMESPACE_OID,
                             &key_elements.join(" ").into_bytes()))
        }
      };

      self.proxy_data.insert("key".to_string(), key);
    }
    // this does what it looks like: it serializes the HashMap down
    // to JSON, then deserializes that JSON into the host type. the
    // simple reason why this JSON detour is necessary is that
    // deserialization does not work with HashMap; the alternative
    // would have been to leave everything as JSON, and that was
    // implemented but ended up being much harder to read.
    //
    // should this turn out to be a bottleneck for requests in
    // production, optimize not this code; optimize its usage. this
    // whole proxy type can be skipped if one can be sure that the
    // `key` field is populated in the base data - then, the host
    // type can be deserialized directly.
    Ok(
      serde_json::from_value(
        serde_json::to_value(self.proxy_data)?
      )?
    )
  }
}

All of this is essentially a very long way of saying “look boys, don’t touch it, we’ll handle it”, and that’s what we’re going for. The key is either already there, in which case we use it; or the type has no preferred basis for it, i.e. zero interest in what the key looks like, in which case one is generated at random; or the type does name key fields, in which case the key is derived deterministically from them.
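
In (sketched) test form, and assuming the “key_fields” from the trait implementation above, that behaviour looks like this:

// without a key in the source data, the key is derived
// deterministically from the key fields ("startno" and "team"
// in the sketch above) - two runs agree
let a = Competitor::from_toml_str(&toml).unwrap();
let b = Competitor::from_toml_str(&toml).unwrap();
assert_eq!(a.key, b.key);
assert_eq!(a.key,
           Uuid::new_v3(&Uuid::NAMESPACE_OID,
                        "99 Target Competition".as_bytes()));

// if the source data already carries a key, that key wins
let keyed = format!("key = \"{}\"\n{}", Uuid::new_v4(), toml);
let c = Competitor::from_toml_str(&keyed).unwrap();
assert_ne!(c.key, a.key);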

the conclusion

It may seem strange from the outside, but this is actually very cool and elegant. It achieves the desired abstraction without sacrificing the option to optimize where necessary and possible, and it works very robustly in our particular use case. In any case, Rust offers a lot of flexibility and - in my opinion - barriers in all the right places. It also makes it easy to spot overly complex code sections and therefore to reduce mental load.

Hope you enjoyed this article!

we.are.#baremetal

* Our first preference was actually couchbase, but couchbase’s Rust driver still has a lot of non-Rust dependencies and was very finicky to get running smoothly across developer machines and CI builds - whereas mongoDB’s Rust driver is stable and can be pulled in simply as a Cargo dependency.

** We chose TOML for several reasons, but most importantly because it is a format which can model JSON data perfectly while being more readable and therefore also easier to write. Yet this introduces another problem