OLAP infrastructure at Vizury
Ad-Tech companies like Vizury are constantly querying multi-dimensional data based on client’s request or internal reporting purposes. For example, one needs a system which is able to answer queries such as:
- What was Vizury’s revenue for the month of April 2014?
- What was the breakdown of impressions served by banner sizes during the first week of April 2014?
- What were top 10 campaigns by clicks during the period of April 7 to April 10?
The system which enables queries like these is called OLAP: Online Analytical Processing. OLAP stores the most granular data in a database and sums up the rows as needed. This leads to reading large number of rows for queries.
Olap at Vizury:
- The underlying database: We use Vertica, which is a columnar database useful for persistent storage of the most granular data.
- Middle layer: We use Mondrian server to store multi-dimensional data structure. This middle layer serves as a “smart cache”. For example, if the first query looks at data for April 1st, then the Mondrian layer caches it. For a subsequent query for dates April 1 and 2, the Mondrian layer retrieves data for April 1st from its cache and queries the database for the data for April 2nd.
- Front end: We use Saiku as the front-end server. It provides a rich user interface to perform multidimensional queries.
At Vizury, we use open source components wherever possible. Both Mondrian and Saiku are open source and the advantage with open source solutions is that you can modify the source code of various components and customize them to suit your requirements.
Tips for implementing OLAP infrastructure
Here are some key lessons learnt while deploying and tuning our OLAP infrastructure.
- Use of columnar database is a must: Initially, we started out by using MySQL, but were soon faced with scaling issues. So we switched to Vertica and our scaling woes were gone. We realized that OLAP workloads work best with columnar databases. This is because of two reasons:(a)In row oriented databases like MySQL, entire row needs to be read even though you may need only a few columns. On the other hand, in columnar databases, storage is column by
column, and hence only required columns need to be read. Now, in a typical OLAP query a few out of many columns need to be queried and so when using columnar databases, I/O is reduced.(b) Since the storage is columnar, each column can be efficiently encoded and compressed, further decreasing I/O.
- Columnar databases are CPU bound: Having used MySQL earlier, we were accustomed to the idea that “database workloads are disk bound”. However, in scaling up Vertica, we realised that it is bound by the CPU capacity. For optimum Vertica performance, the CPU utilization should be mostly below 70%. If it is not so, shifting to higher number of cores or more machines helps boost Vertica performance. Once the CPU is utilization is below 70%, we were able to derive small amount of performance boost by shifting to a disk with higher IOPS.
- Some columns are more compressible than others: If there are fewer values over a large number of rows, then the compression can be very efficient. For instance, Vizury has over 500 campaigns (each campaign is a string), while the number of rows are more than a billion. The campaign column can be compressed to a very small size by using a dictionary to map campaigns to integers.
- Aggregated tables help: We figured that while we store the most granular data in the database, most (>50%) queries never access some of the dimensions. Hence, we aggregated the tables across those dimensions to come up with aggregated tables. We saw up to 30% increase in average query times by using aggregated tables.
- Vertica design tool is not of much help: We tried changing projections as per Vertica design tool, but were not able to derive noticeable performance gain due to this.
- Mondrian needs to be customized: We were able to make several changes to the Mondrian server to maximize its performance. In particular, we found that Mondrian makes needless “read tuple” queries apart from the required “segment load” queries. By avoiding a large fraction of these “read tuple” queries we observed a huge increase in the performance.
We were able to bring down the average response times of OLAP infrastructure continuously, while the number of rows in the database continued to increase, as depicted by the following graph:
A big shoutout to Arvind Jaganathan who worked on all the above the improvements, and brought them from theory to practice! We continue to work on several improvements to further bring down the response time.