Motivation
When you are designing a database (OLTP or OLAP), it is common to design an index based on a query pattern. In this blog, we are going to talk about bitmap indexes and when to choose them over another index such as a B+ tree.
Database indexing (OLAP and OLTP)
OLAP (online analytical processing) refers to databases designed for analytics teams, typically in large organizations (for example Snowflake or Amazon Redshift). OLTP (online transaction processing) refers to customer-facing databases that specialize in handling transactions.
Aspect | OLTP | OLAP |
Workload | Transactional processing; customer facing | Analytical processing; internal BI team facing |
Design goal | Optimized for the speed and efficiency of individual transactions | Optimized for complex queries over large amounts of data, and for ease of querying by data analysts |
Data model | Normalized, to reduce redundancy | Usually more denormalized: the same column is copied into multiple tables to reduce the number of joins in an RDBMS |
Access (query) pattern | High-volume, high-frequency, small transactions that require fast response times | Less frequent but larger queries that can tolerate longer response times |
Performance | Fast response times | Can usually tolerate a couple of minutes |
On an OLAP platform, the fact table can have billions of rows, and a SQL query that has to scan all of them without an index can take a very long time.
In simple words, a normalized database means less duplication of the same data across different tables within the database.
Indexing is a strategy that trades some space for time. It is the same idea as the index pages of a book, which let you quickly locate the information you are looking for.
Databases offer many kinds of indexes, such as:
B+ tree
Hash index
Bitmap index
A book sacrifices a couple of pages for the index and table of contents, and in return it saves the reader a lot of time locating what they need.
The question comes down to how to select the right index for your task. Let's dive into some concepts.
Cardinality
In simple words, cardinality in the context of a database refers to the number of distinct entries in a column. Let's take a look at the table below, which has about a billion rows.
Employee_ID | gender | Province |
1 | M | ON |
2 | F | AB |
3 | F | PE |
4 | M | SK |
5 | F | AB |
6 | F | ON |
.........
1,000,000,000 | F | QC |
1,000,000,001 | M | NS |
1,000,000,002 | F | NL |
Cardinality refers to the number of distinct elements in each column:
Employee_ID: cardinality is 1,000,000,002
Gender: cardinality is 2
Province: cardinality is the number of provinces and territories in Canada, up to 13 (AB, BC, MB, NB, NL, NS, NT, NU, ON, PE, QC, SK, YT)
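As a small illustration, the sketch below counts distinct values per column for the first six rows of the table. The `rows` list and the `cardinality` helper are made up for this example; in SQL the same number comes from `COUNT(DISTINCT column)`.

```python
# The first six rows of the table above: (employee_id, gender, province).
rows = [
    (1, "M", "ON"),
    (2, "F", "AB"),
    (3, "F", "PE"),
    (4, "M", "SK"),
    (5, "F", "AB"),
    (6, "F", "ON"),
]


def cardinality(rows, column_index):
    """Count the distinct values found in one column."""
    return len({row[column_index] for row in rows})


cardinalities = {
    "employee_id": cardinality(rows, 0),
    "gender": cardinality(rows, 1),
    "province": cardinality(rows, 2),
}
print(cardinalities)
```

On this tiny sample, employee_id already has as many distinct values as there are rows, while gender stays at 2 no matter how many rows are added.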
The general rule of thumb is that the higher the cardinality, the less likely you are to use a bitmap index. In other words, only use a bitmap index when the column does not have many unique values.
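A rough back-of-the-envelope calculation shows why. The sketch below assumes a plain, uncompressed bitmap index that stores one bitmap per distinct value and one bit per row; real engines typically compress their bitmaps, so actual sizes will differ. The function name is hypothetical.

```python
def uncompressed_bitmap_index_bytes(num_rows, cardinality):
    """One bitmap per distinct value, one bit per row, 8 bits per byte."""
    return cardinality * num_rows // 8


num_rows = 1_000_000_002  # row count of the example table

sizes = {
    "gender": uncompressed_bitmap_index_bytes(num_rows, 2),
    "province": uncompressed_bitmap_index_bytes(num_rows, 13),
    # Every employee_id is unique, so cardinality equals the row count.
    "employee_id": uncompressed_bitmap_index_bytes(num_rows, num_rows),
}

for column, size in sizes.items():
    print(f"{column}: {size:,} bytes")
```

Under these assumptions the gender index needs about 0.25 GB and the province index about 1.6 GB, while an index on employee_id would need more than 10^17 bytes, which is why a bitmap is a poor fit for high-cardinality columns.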
Bitmap
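Let's take the gender column as an example. A bitmap index keeps one bitmap per distinct value, with one bit per row: the bit is 1 if the row holds that value and 0 otherwise. For the first six rows of the table above, it looks like this:
Employee_ID | gender = M | gender = F |
1 | 1 | 0 |
2 | 0 | 1 |
3 | 0 | 1 |
4 | 1 | 0 |
5 | 0 | 1 |
6 | 0 | 1 |
We only need to store two bitmaps, one per distinct value.
For the province column, we can create one bitmap per province in the same way.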
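To make the layout concrete, here is a minimal sketch that builds one bitmap per distinct value for the first six rows of the table, using a Python integer as the bitmap. The helper names `build_bitmap_index` and `as_bits` are made up for this illustration and do not reflect how any particular database stores its index.

```python
from collections import defaultdict


def build_bitmap_index(values):
    """Return {distinct value: bitmap}; bit i is set when row i holds that value."""
    index = defaultdict(int)
    for row_number, value in enumerate(values):
        index[value] |= 1 << row_number
    return dict(index)


def as_bits(bitmap, num_rows):
    """Render a bitmap as a string of 0s and 1s, with the first row on the left."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(num_rows))


# Gender and province of Employee_ID 1 to 6 from the table above.
gender = ["M", "F", "F", "M", "F", "F"]
province = ["ON", "AB", "PE", "SK", "AB", "ON"]

gender_index = build_bitmap_index(gender)
province_index = build_bitmap_index(province)

for name, index in (("gender", gender_index), ("province", province_index)):
    for value, bitmap in index.items():
        print(f"{name} = {value}: {as_bits(bitmap, len(gender))}")
```

Each printed line is one bitmap: six bits for six rows, and exactly one bitmap has a 1 in any given position.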
You only need up to 13 bitmaps for the analysis. Let's say you wish to count how many employees are in Ontario and Quebec; you would run:
SELECT
    COUNT(e.employee_id)
FROM
    employee AS e
WHERE e.province IN ('ON', 'QC');
Behind the scenes, the database takes the ON and QC bitmaps and performs a bitwise OR operation, because a row matches if it is in either province. Counting the 1 bits in the result gives the answer.
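The following sketch mimics that query on the nine rows visible in the table (the first six and the last three). A row can hold only one province, so `IN ('ON', 'QC')` means "ON or QC" and the two bitmaps are combined with bitwise OR. The helper name is hypothetical, and a real engine would work on compressed bitmaps rather than Python integers.

```python
def build_bitmap_index(values):
    """Return {distinct value: bitmap}; bit i is set when row i holds that value."""
    index = {}
    for row_number, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


# Province of the nine rows visible in the table above.
province = ["ON", "AB", "PE", "SK", "AB", "ON", "QC", "NS", "NL"]
province_index = build_bitmap_index(province)

# WHERE province IN ('ON', 'QC'): a row matches if either bit is set.
matching_rows = province_index["ON"] | province_index["QC"]

# COUNT(...): the number of 1 bits in the combined bitmap.
employee_count = bin(matching_rows).count("1")
print(employee_count)
```

Note that a bitwise AND of the ON and QC bitmaps would always be zero here, since no row is in both provinces at once.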
Depending on your query pattern, the database can combine bitmaps with bitwise AND, bitwise OR, or bitwise NOT operations.
Those access patterns are quite common in business analytics, so a bitmap index is a good choice when the cardinality of a column is small relative to its total number of rows.
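As a sketch of how the three operations map to filter conditions, the example below combines the gender and province bitmaps for the first six rows of the table. The helper names are made up for this illustration. Python integers have no fixed width, so the NOT result is masked to the number of rows.

```python
def build_bitmap_index(values):
    """Return {distinct value: bitmap}; bit i is set when row i holds that value."""
    index = {}
    for row_number, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


def employee_ids(bitmap):
    """List the Employee_ID (1-based) of every row whose bit is set."""
    return [i + 1 for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


gender = ["M", "F", "F", "M", "F", "F"]
province = ["ON", "AB", "PE", "SK", "AB", "ON"]
all_rows = (1 << len(gender)) - 1  # six 1 bits, one per row

gender_index = build_bitmap_index(gender)
province_index = build_bitmap_index(province)

# WHERE gender = 'F' AND province = 'ON'
female_in_on = gender_index["F"] & province_index["ON"]

# WHERE province = 'ON' OR province = 'AB'
on_or_ab = province_index["ON"] | province_index["AB"]

# WHERE NOT province = 'ON' (mask so only real rows remain)
not_on = ~province_index["ON"] & all_rows

print(employee_ids(female_in_on))
print(employee_ids(on_or_ab))
print(employee_ids(not_on))
```

Each filter is answered with a single pass of bitwise operations over the bitmaps, without touching the table rows themselves.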
Summary
In this section, we discussed bitmap indexes on an OLAP platform and how to choose them based on cardinality.
Extra reading
If you are interested:
-
- clear explanation of bitmap in 5 mins
YouTube: B tree and B+ tree by Abdul Bari
- goes into full detail, but is really clear if you are really interested in databases