Read-heavy reporting engine Architecture Design Pattern on AWS

Read-heavy reporting engine Architecture Design Pattern on AWS

Previous patterns help address pushing data and events from one service to others, optimizing scale for write-heavy services. This pattern optimizes the data for different accessing and querying patterns that clients need, optimizing scale for read-heavy services.

Most applications are read-intensive. This is particularly true for our Big Media clients where there are far fewer users generating content than users consuming the content. The ratio between them can be huge like 1:100000.

Caching data and creating specialized views of the most frequent access patterns help services scale effectively.

Caching means temporarily copying frequently used into memory or a shared repository. Caching data is one of the most frequently used strategies to improve performance and scale reads.

In Materialized Views, the data is replicated and transformed in a way that is optimized for querying and displaying. Materialized views can be new tables or completely different data stores where the data is mapped to be displayed in a new format or limited to a specific subset of the data.

Materialized views are also useful to “bridge” different stores to take advantage of their stronger capabilities.

With Index Tables, the data is replicated in a new table using specialized indexes specific to common queries. Specialized indexes can be composite keys, secondary keys, and partially denormalized ones.

“It’s very common with read-heavy applications to hit the limits of downstream data engines that are not specialized for the different querying patterns that clients use.

Caching data and creating specialized views of the most queried data can help mitigate the load impact of a read-heavy service.”

Dynamo streams and lambda functions are the perfect tools to create specialized views. In the example, we have three endpoints — search, tweet and timeline. Each one needs a slightly different querying pattern where the data needs to be optimized in a particular way.

  • /search queries the Tweets Index in Elasticsearch. Providing things like phonetic search, typos, related terms, and suggestions. There is no need to index all the data from the original tweet, maybe it only includes the text, location, media url — for a pretty preview, and hashtags. We use the stream to trigger a lambda on TweetCreated that strips all the data we don’t need and indexes the tweet.

  • /timeline is created from the most interesting tweets on the network, and the activity of my connections.

We use a Dynamo table to keep the Top Tweets — an indexed table limited to the 1000 most viewed items. Tweets are updated via the stream on the TweetViewed event. A lambda function receives the event, queries the Tweets Collection, and saves the result.

Getting the activity of someone’s connection is easier on a Graph Database like Neptune. Another lambda triggered by the TweetCreated event creates a record on Neptune maintaining activities for our connection’s streams of tweets.

In Microsoft’s Cloud Architecture, this is a mix of different patterns — Data pipeline, Cache-aside, Materialized views, and Index Table.

Handling cache can be hard, remember to follow industry best practices. Maintaining materialized views, and index tables can be complicated. Duplicating the data will add cost, effort, and logic. With very large datasets it can be very difficult to maintain consistency, and keeping the data in sync could slow down the system. Views and Index tables are not fully consistent.