Under the hood of a personal semantics application
The Meshin application attempts to unify a user's personal communication streams and provide value on top of them. The user registers streams from cloud-based carriers and installs the Meshin app on their smartphone. Cloud-based carriers, including Gmail, Facebook messages, Twitter DMs and LinkedIn messages, as well as smartphone data sources such as the phone book, call history and SMS messages, are continuously crawled and indexed by Meshin. A separate index is maintained for each user. The index includes aggregated contacts and each contact's corresponding message stream across all carriers and smartphone data sources. Contacts represent both people and organizations. A person contact is derived from the display name of an email address or from similar metadata in other carriers. An organization contact is currently derived from the Internet domain name of an email address.

Once initial indexing is performed, the Meshin smartphone app lets the user group and filter communications in various ways; we call this slicing and dicing streams. Simple slices include streams from specific contacts, both people and organizations. It is also possible to define explicit groups of contacts. Each slice-and-dice stream of messages allows further filtering by time and by keyword search. Meshin associates a read/unread status with every message, so the user can see what's new in each stream. Currently Meshin's read/unread status is maintained independently of the carrier's read/unread status. In addition to explicit user-defined groups of contacts, Meshin creates an implicit group of high-importance contacts. The importance metric is calculated by a proprietary algorithm based on interaction patterns across all carriers, using factors such as frequency of communication, recency, directness and reply timing.
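The scoring details are proprietary, but a toy sketch can show how such a metric might combine those factors. The weights, decay and formula below are entirely illustrative, not Meshin's actual algorithm:

```python
import math
from datetime import datetime

def importance(contact_msgs, now, half_life_days=30.0):
    """Toy importance score over one contact's messages across all carriers.

    contact_msgs: list of dicts with keys
      'ts'      - message datetime
      'direct'  - True if addressed directly to/from the user (not CC/bulk)
      'reply_s' - seconds until a reply followed, or None if never
    All names and weights here are illustrative.
    """
    score = 0.0
    for m in contact_msgs:
        age_days = (now - m['ts']).total_seconds() / 86400.0
        # recency: exponential decay with a configurable half-life
        recency = math.exp(-age_days * math.log(2) / half_life_days)
        # directness: direct messages weigh more than CC/bulk traffic
        direct = 1.0 if m['direct'] else 0.3
        # reply timing: fast replies weigh more; no reply counts as a day
        reply_s = m['reply_s'] if m['reply_s'] is not None else 86400.0
        reply = 1.0 / (1.0 + reply_s / 3600.0)
        # frequency enters implicitly: each message adds to the sum
        score += recency * direct * (1.0 + reply)
    return score
```

Frequency is captured by summing over messages, so a contact with many recent, direct, quickly-answered messages dominates one with sparse, stale, indirect traffic.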
The Meshin index is implemented as a triple store on top of a key-value back-end database engine. A separate crawler and indexer exist for each carrier to deal with the specifics of that carrier's API and data model. Indexers convert the carrier data model into Meshin triples. An example of such a triple is a connection between a person object and an address object. For example, the email indexer produces the following triples.
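In a schematic subject-predicate-object notation (the syntax and identifiers are illustrative; `h(…)` stands for the identity hash function):

```
(address:john.smith@example.com, person-of-address, person:h("John Smith"))
```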
The Twitter indexer then produces.
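Again schematically (the handle is illustrative):

```
(address:@jsmith, person-of-address, person:h("John Smith"))
```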
Now, if both indexers agree on the way person object identity is generated, we achieve disambiguation: both address objects point to the same person object.
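Such an identity-generation scheme might look like the following sketch. The normalization steps, noise-word list and nickname table are illustrative; Meshin's actual pipeline is not described here:

```python
import hashlib
import re

# Illustrative noise words stripped before hashing (prefixes, suffixes, titles).
_NOISE = {"mr", "mrs", "ms", "dr", "prof", "jr", "sr", "ii", "iii", "phd", "md"}
# Tiny sample nickname table; a real one would be much larger.
_NICKNAMES = {"johnny": "john", "bill": "william", "bob": "robert"}

def person_identity(display_name: str) -> str:
    """Hash a normalized form of a display name into a person identity."""
    tokens = re.findall(r"[a-z]+", display_name.lower())
    tokens = [t for t in tokens if t not in _NOISE]
    tokens = [_NICKNAMES.get(t, t) for t in tokens]
    if len(tokens) > 2:
        # Drop middle names/initials, keep first and last name.
        tokens = [tokens[0], tokens[-1]]
    canonical = " ".join(tokens)
    return hashlib.sha1(canonical.encode()).hexdigest()[:12]
```

Under this sketch, "john smith", "Mr John K Smith III PhD" and "Johnny Smith" all normalize to the same canonical string and hence the same hash.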
One straightforward approach is to apply a hash function to a normalized form of the person's name. But what if the Twitter metadata includes a variation of the name that is not exactly the same as the one in the email metadata? A variation can be a simple capitalization alternative (john smith), the inclusion of middle names, prefixes and suffixes (Mr John K Smith III PhD), or simply a variation of the first name (Johnny Smith). Meshin employs several normalization steps in order to arrive at an unambiguous identity. Often, though, this approach still fails and we end up with several person objects representing the same person. In this case Meshin allows the user to manually merge several person objects into one. The opposite problem arises when disambiguation is too aggressive: in the previous example, John Smith via email and John Smith on Twitter may be two different people. In this case we again let the user split a person object into several, each with a dedicated address association. As I mentioned above, an organization object is recognized based on the Internet domain name of the email address using a correspondence dictionary. For example, the email indexer produces the following triples.
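Schematically, for an address whose domain is known to the dictionary (the domain and organization are illustrative):

```
(address:jane.doe@xerox.com, person-of-address,       person:h("Jane Doe"))
(address:jane.doe@xerox.com, organization-of-address, organization:h(dbpedia:Xerox))
```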
The Meshin organization dictionary is mainly derived from the DBpedia dataset. Each organization can have multiple Internet domain names associated with it. The organization node identity is generated as a hash of the DBpedia identity. Additional attributes, such as the Wikipedia page link, home page link and a short abstract, can be included.
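A sketch of the lookup, with a two-entry slice standing in for the real dictionary (the domains and DBpedia identifiers shown are illustrative):

```python
import hashlib

# Tiny illustrative slice of the domain -> organization correspondence
# dictionary; the real one is derived from the DBpedia dataset, and several
# domains may map to the same organization.
ORG_DICTIONARY = {
    "xerox.com": "dbpedia:Xerox",
    "parc.com":  "dbpedia:PARC_(company)",
}

def organization_of_address(email: str):
    """Resolve an email address to an organization node identity, or None."""
    domain = email.rsplit("@", 1)[-1].lower()
    dbpedia_id = ORG_DICTIONARY.get(domain)
    if dbpedia_id is None:
        return None
    # Organization node identity is a hash of the DBpedia identity.
    return hashlib.sha1(dbpedia_id.encode()).hexdigest()[:12]
```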
Messages are represented by corresponding objects whose identity is derived from a unique reference in the corresponding carrier or, if one does not exist, from a hash of a combination of key attributes. A message object has attributes such as timestamp, subject and body preview, and it references participating addresses through an intermediate object. For a typical email message the indexer produces the following triples.
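Schematically (the identifiers, the `timestamp`/`subject` attribute names and the `role` marker are illustrative; one participant object links each address to the message):

```
(message:m1, timestamp, 2011-06-01T12:00:00Z)
(message:m1, subject,   "Lunch on Friday?")
(participant:m1/from, address-of-participant, address:jane.doe@xerox.com)
(participant:m1/from, message-of-participant, message:m1)
(participant:m1/from, role,                   from)
(participant:m1/to,   address-of-participant, address:john.smith@example.com)
(participant:m1/to,   message-of-participant, message:m1)
(participant:m1/to,   role,                   to)
```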
In addition, message objects are referenced by an inverted keyword index over all text attributes, such as subject and body. The keyword index is stored within the same key-value back-end database engine. The combination of the "person-of-address", "organization-of-address", "address-of-participant" and "message-of-participant" predicates, an ordered date-time predicate and the inverted keyword index allows Meshin to express the variety of slice-and-dice queries mentioned above. Such queries are expressed as dataflows specific to the Meshin back-end database engine.
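The real dataflows are specific to the back-end engine, but their shape can be sketched with plain Python dicts and sets. Here `store` maps predicate names to subject-to-objects dicts, and the `:rev` keys denote reverse indexes, a convention of this sketch rather than of Meshin:

```python
def messages_from_person(store, person_id, keyword=None):
    """Dataflow sketch: person -> addresses -> participants -> messages,
    optionally intersected with the inverted keyword index."""
    # Reverse lookup: all addresses whose person-of-address is this person.
    addresses = store["person-of-address:rev"].get(person_id, set())
    # Reverse lookup: all participant objects referencing those addresses.
    participants = set()
    for a in addresses:
        participants |= store["address-of-participant:rev"].get(a, set())
    # Forward lookup: all messages those participants belong to.
    messages = set()
    for p in participants:
        messages |= store["message-of-participant"].get(p, set())
    # Optional keyword filter via the inverted index.
    if keyword is not None:
        messages &= store["keyword-index"].get(keyword, set())
    return messages
```

A group stream is just the union of this dataflow over the group's members, and time filtering would intersect with a range scan on the ordered date-time predicate.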
The indexer submits triples to the Meshin store in batches. Batch sizes differ depending on the mode the indexer operates in: when indexing a new user-and-carrier combination, batches can be large, while when the indexer "catches up" with newly arrived messages they are small. After each indexer submission the store checks for affected predicates and may trigger additional analysis within the user's shard. Such a "trigger" is used, for example, to calculate person prominence (importance, in the earlier terms). We decided to perform this calculation in the store instead of the indexer because we wanted the computation to be closer to the data and to enable computation over data across carriers.
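The submit-then-trigger cycle can be sketched as a minimal observer pattern; the class and method names below are assumptions of this sketch, not Meshin's API:

```python
class TripleStore:
    """Minimal sketch of batch submission with per-predicate triggers."""

    def __init__(self):
        self.triples = set()
        self.triggers = {}  # predicate -> list of callbacks

    def on(self, predicate, callback):
        """Register an analysis callback for a predicate."""
        self.triggers.setdefault(predicate, []).append(callback)

    def submit(self, batch):
        """Ingest one indexer batch of (subject, predicate, object) triples,
        then fire callbacks for every predicate the batch touched."""
        self.triples |= set(batch)
        affected = {p for (_, p, _) in batch}
        for predicate in affected:
            for cb in self.triggers.get(predicate, []):
                cb(predicate)
```

A prominence recalculation would be registered as a callback on the participant predicates, so it runs in the store, next to the data, whenever new messages arrive from any carrier.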
The Meshin index store is powered by a back-end database engine based on Redis. Redis is an open source in-memory database with basic key-value capabilities plus support for higher-order values such as lists, sets, hashes and sorted sets. Redis is often mentioned in the context of NoSQL alternatives. The decision to use Redis instead of a traditional SQL database or a full-fledged triple store was mainly driven by the requirement for low-level optimization and full control over query execution plans via hardcoded dataflows. Additionally, many features of an SQL database or a big triple store would remain unused while still carrying significant runtime overhead. Michael Stonebraker et al. have a great discussion of the latter point in their paper. In my next post I will describe in greater detail the usage of Redis as a simplified triple database back-end.
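As a teaser, one natural mapping (a sketch under my own assumptions, not necessarily Meshin's layout) stores the objects of each (subject, predicate) pair as members of a Redis set, with a mirrored set for reverse lookups. The snippet simulates the two Redis commands it needs so it runs without a server:

```python
class FakeRedis:
    """Stand-in for a Redis connection, exposing just SADD and SMEMBERS."""

    def __init__(self):
        self.data = {}

    def sadd(self, key, member):
        self.data.setdefault(key, set()).add(member)

    def smembers(self, key):
        return self.data.get(key, set())

def add_triple(r, subject, predicate, obj):
    # Forward index: subject+predicate -> objects.
    r.sadd(f"spo:{predicate}:{subject}", obj)
    # Reverse index: object+predicate -> subjects.
    r.sadd(f"ops:{predicate}:{obj}", subject)

def objects(r, subject, predicate):
    return r.smembers(f"spo:{predicate}:{subject}")
```

With both directions materialized, the dataflows above become chains of SMEMBERS and set-intersection calls, all executing inside one engine.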