Speeding up Superset by choosing the right database

Javier Fortea
Job&Talent Engineering
11 min read · Mar 11, 2019

Figure 1: Sample screenshot of a Superset dashboard called “World’s Bank Data”. Credits: https://superset.incubator.apache.org/

Data is one of the key ingredients in the success and development of any organization, and here at Jobandtalent we take this really seriously. For this reason, we gather and analyze lots of data in order to extract insights and trends, and ultimately to support internal decision making.

To do that, raw data needs to be transformed into information and put into context. For that, we have different ETL (“Extract, transform, load”) pipelines that are in charge of selecting the required data from different data sources (in our case, most of the data comes from our Redshift cluster), processing and aggregating it, and finally loading the result into a database where it can be consumed for further analysis.

Besides data processing, visualizations are an essential tool for presenting and making sense of the processed data. They usually take the form of dashboards that group different kinds of charts, putting data in context, highlighting trends and correlations, etc.

In this post, we’ll talk about the tool that we use for our visualizations, Superset, focusing on some performance challenges that we found, and how we solved them.

Superset overview

Apache Superset is a data exploration and visualization web application originally developed by Airbnb. It’s currently “an effort undergoing incubation at The Apache Software Foundation (ASF)”, meaning that the project has yet to be fully endorsed by the ASF, but it is already mature in terms of stability and feature richness.

It’s a modern business intelligence tool, using some well-known Python libraries and frameworks (Flask, SQLAlchemy, Gunicorn…).

Figure 2: Setting up a chart in Superset.

We use it as our main visualization and dashboards tool at Jobandtalent because of its ease of use and its support for the databases and the data flow that we need.

Creating your first dashboard in Superset is just a matter of setting up the database connection to your data sources: Superset will automatically identify the tables there, together with their columns, and even create some predefined metrics (e.g. average and sum metrics are automatically created for numeric columns). You can later create your own table metrics that can be re-used in your charts.

Superset is really flexible, allowing you to create independent charts that can later be added to one or more dashboards. You can choose from a wide variety of chart types (time series, heat maps, Sankey diagrams, box plots, etc.; you can get an idea of the diversity by checking the partial list in fig. 3) to achieve the best visualization of your data, or interactively explore your data using the “SQL Lab” that Superset includes, saving queries for later use.

Figure 3: Visualization types. Partial view of the list of the different kinds of visualizations you can choose from in Superset.

Our previous data flow

Our data warehouse sits on Redshift. Our operational databases are replicated into Redshift using “AWS Database Migration Service” (DMS), so our ETL processes (which are executed in Airflow) can easily access all the data they need.

Our ETL pipelines processed the data and used to store the results in separate tables in Redshift, which were then visualized on our Superset dashboards. Redshift was therefore the main data source for all of our dashboards in Superset. Our data pipelines used Redshift both as data source and as data storage, as described graphically in figure 4:

Figure 4: Diagram of our previous data flow (our ETL processes, executed by Airflow, used to read data from Redshift and generate new data that was stored in the same database, from where Superset dashboards consumed it).

Performance problems with our first Superset setup

At first, our setup showed serious performance issues. The heaviest dashboards took too long to load the first time (when they were not cached), and some of their charts were not shown at all when many filters were applied, just an error message instead. When this happened, the whole Superset website often became unresponsive, and any other request that came in at that moment also took a long time to get a response.

After checking our Superset configuration, we confirmed that the cache (we use Redis as the cache layer) was correctly configured, as the behaviour of the website already suggested. The problem clearly affected only the dashboards whose data had not been cached yet, something we could easily reproduce using the “Force refresh dashboard” option in Superset.

The AWS instance in charge of serving the Superset website had enough resources, so that wasn’t the limiting factor. Another interesting point is that Superset tries to execute as many queries as possible in parallel, yet even with complex dashboards containing a large number of charts, the Redshift dashboard in the AWS console was showing quite a low number of concurrent queries.

We found out that the Gunicorn configuration we were using (the configuration used in the Docker image referenced in the Superset documentation and repository) was severely limiting the number of processes that Superset could create.

Gunicorn uses the “pre-fork worker model”, which means that a set of worker processes is created when the server starts. You can choose which type of worker class you want to use, from the simplest synchronous worker (“sync”) to different asynchronous worker classes (“eventlet”, “gevent”, etc.) and a threaded worker (“gthread”), as well as how many processes and/or threads you want to start, among other useful options.
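For reference, these options can be kept in a small Gunicorn configuration file; a minimal sketch could look like this (the values below are illustrative, not our exact production settings):

```python
# gunicorn_conf.py: illustrative sketch, not our exact production values
import multiprocessing

bind = "0.0.0.0:8088"                          # Superset's default port
workers = multiprocessing.cpu_count() * 2 + 1  # more than the 2 "sync" workers we started with
worker_class = "gevent"                        # asynchronous workers, so slow queries don't block a whole process
timeout = 300                                  # dashboards can trigger long-running queries
```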

After increasing the number of pre-forked processes and changing the default worker class (initially, we were using only 2 processes and the “sync” worker class), we were able to substantially increase the number of requests and queries handled at the same time (see the Redshift concurrent queries graph below, figure 5). In addition to changing the server configuration, we also started to use a multi-server architecture for high-availability and performance reasons.

Figure 5: View of Redshift concurrent queries in the “Cluster Performance” section.

Improving Redshift performance

After these changes, our business intelligence web application was able to deal with a much larger number of concurrent requests, but it still felt slow when it had to deal with non-cached complex dashboards (some of our dashboards have a large number of filters, so it’s quite common for the users of our dashboards to visualize non-cached data).

To improve the performance of Redshift, it is necessary to understand its architecture quite well (reading Data Warehouse System Architecture and checking Internal Architecture and System Operation is a good start). The most important ideas are that Redshift is a distributed database with parallel processing (it has a “leader” node and a number of “compute” nodes that can execute queries cooperatively) and that it is column-oriented (the main advantage being that some aggregations can be done easily without reading unnecessary data).

To achieve reasonable performance from Redshift, you need to configure different elements correctly (more about that below), and even then, it is a database designed with a very specific workload in mind. Redshift Workload Management (WLM) allows you to create query queues associated with certain users or groups, with some resources guaranteed; since Redshift is also used by other teams, we created a queue to guarantee resources for Superset.
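One possible way to set this up (user and group names below are made up, and the queue itself is defined in the cluster’s WLM configuration) is to put the database user that Superset connects with into a dedicated Redshift user group that a WLM queue matches on:

```python
# Sketch: create a Redshift user group for Superset so a WLM queue can
# reserve resources for it (names and credentials are placeholders).
from sqlalchemy import create_engine, text

# The "redshift+psycopg2" dialect is provided by the sqlalchemy-redshift package
redshift_engine = create_engine(
    "redshift+psycopg2://admin:secret@redshift-cluster.example.com:5439/warehouse"
)

with redshift_engine.begin() as conn:
    conn.execute(text("CREATE GROUP superset_group"))
    conn.execute(text("ALTER GROUP superset_group ADD USER superset_ro"))

# The WLM JSON in the cluster parameter group then defines a queue with
# "user_group": ["superset_group"] and a guaranteed share of memory/concurrency.
```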

We also tested different DistKeys (distribution keys) for our tables. In Redshift, you can define how the data is distributed among the nodes (‘All’ to keep a copy of the table on every node, ‘Even’ to spread the rows evenly, or ‘Key’ to distribute rows according to the values of a chosen column). For example, if your tables are small and you distribute them to all the nodes, you can avoid some network traffic and aggregations at the “leader” node (you have to choose the most efficient strategy for each table depending on your data access pattern).

Figure 6: Redshift distribution styles. Using “ALL distribution”, a copy of the entire data is distributed to every node; with “Even distribution”, Redshift distributes the rows in a round-robin fashion; and with “Key distribution”, rows are distributed according to the values in the column that you choose.

One of Redshift’s limitations is that you cannot define indexes as in a traditional RDBMS. You can define a single SortKey per table, composed of one or more columns, and Redshift will store your data on disk according to it. We reviewed the SortKeys we already had on our tables and added them to a few tables that didn’t have them yet.
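As an illustration of how these options are declared (using the SQLAlchemy-based table creation we mention later on), here is a sketch with sqlalchemy-redshift and hypothetical table and column names:

```python
# Requires the sqlalchemy-redshift package; table and columns are hypothetical.
import sqlalchemy as sa

metadata = sa.MetaData()

marketing_costs = sa.Table(
    "marketing_costs",
    metadata,
    sa.Column("campaign_id", sa.Integer),
    sa.Column("day", sa.Date),
    sa.Column("cost", sa.Numeric(12, 2)),
    redshift_diststyle="KEY",        # distribute rows by a column...
    redshift_distkey="campaign_id",  # ...so rows for the same campaign live on the same node
    redshift_sortkey="day",          # store blocks ordered by day, roughly Redshift's answer to an index
)

# metadata.create_all(redshift_engine) would emit the CREATE TABLE with these options
```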

After all these changes, we improved the performance of Redshift, but the execution times were still not fast enough for an interactive web environment, so we decided to look for an alternative database for our dashboards.

Choosing PostgreSQL as the main database for our dashboards

As the tables that Superset was consuming were quite small, and after better understanding the drawbacks of using a database like Redshift for this purpose, we decided to check the performance using a traditional RDBMS. At that point, we were pretty confident that the improvement could be quite significant.

We chose PostgreSQL because it is a database that we know pretty well (it’s the database of choice at the company), we knew what to expect from it, and how to bring out the best in it.

After some successful and promising tests using anonymized production data in our local environment, we decided to go ahead and adapt our ETL processes so that the same data we were processing and storing in Redshift would also be available in Postgres.

Easing the path: using dblink and postgres_fdw

Once we had decided to start using Postgres in our dashboards, we were interested in moving forward as fast as possible, because the user experience of the dashboards was quite worrying.

We found out that the fastest way of confirming that our decision was correct was to use “dblink”. Dblink is a Postgres extension that allows you to execute queries in an external database from PostgreSQL. We just need to configure the connection from Postgres to Redshift using a “CREATE SERVER” statement (defining the host, port and dbname) and then map the Postgres user to the Redshift user (using a “CREATE USER MAPPING” statement where we specify the foreign database user credentials).

Figure 7: Connecting from PostgreSQL to Redshift using dblink. Credits: AWS Big Data Blog (https://aws.amazon.com/blogs/big-data/join-amazon-redshift-and-amazon-rds-postgresql-with-dblink/)
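As a rough sketch of that configuration (host, database, users and passwords below are placeholders), executed once inside the Postgres database:

```python
# Configure the dblink/postgres_fdw connection from Postgres to Redshift.
from sqlalchemy import create_engine, text

pg_engine = create_engine("postgresql://admin:secret@analytics-postgres/analytics")

setup_sql = """
CREATE EXTENSION IF NOT EXISTS postgres_fdw;
CREATE EXTENSION IF NOT EXISTS dblink;

CREATE SERVER redshift_server
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'redshift-cluster.example.com', port '5439', dbname 'warehouse');

CREATE USER MAPPING FOR etl_user
    SERVER redshift_server
    OPTIONS (user 'redshift_user', password 'secret');
"""

with pg_engine.begin() as conn:
    conn.execute(text(setup_sql))
```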

After that, we can execute queries in Postgres that fetch data from Redshift. Of course, we would not get any performance advantage if Superset connected to Postgres and Postgres hit Redshift for each query; but we can use this connection to create a “materialized view” or just a normal table with the data retrieved from Redshift (e.g. we can execute the usual “INSERT INTO…SELECT…”, using “dblink” in the query part to get the data from the external database).
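For example, a materialized view built on top of that connection could look like this (table and column names are hypothetical):

```python
# Materialize Redshift data inside Postgres through the dblink connection above.
from sqlalchemy import create_engine, text

pg_engine = create_engine("postgresql://etl_user:secret@analytics-postgres/analytics")

with pg_engine.begin() as conn:
    conn.execute(text("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS marketing_costs_mv AS
        SELECT *
        FROM dblink('redshift_server',
                    'SELECT campaign_id, day, cost FROM marketing_costs')
             AS t(campaign_id integer, day date, cost numeric);
    """))
    # Subsequent runs can update it with: REFRESH MATERIALIZED VIEW marketing_costs_mv;
```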

You can find more information about this technique in the article JOIN Amazon Redshift AND Amazon RDS PostgreSQL WITH dblink on the “AWS Big Data Blog”, or read how Intermix uses this technique to turn Postgres into a caching layer for Redshift data.

As the data consumed in our dashboards is generated only by our ETL processes, we just needed to add one more task to our pipelines that copies the data from Redshift to Postgres, as a final step after the data has been processed and stored in Redshift.

In figure 8, you can see the visual representation of one of our ETL pipelines in Airflow after adding this last task. The pipeline flow goes from left to right: tasks in purple read or write data from Redshift (data is retrieved in the first step, processed in the second one, and then loaded into another table in Redshift in the third step); the rightmost orange task is the new one, in charge of copying the data from Redshift to PostgreSQL using the technique we just explained (just “dblink” and the right configuration in PostgreSQL to connect to Redshift).

Figure 8: ETL graph view in Airflow interface. View of an ETL process in Airflow with a task (the one furthest to the right) that copies the data from Redshift to Postgres.
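A hypothetical version of that extra task (DAG name, connection ids, table and column names are made up for illustration) could look roughly like this, using Airflow’s PostgresOperator to run the dblink-based copy inside Postgres:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator

with DAG("marketing_costs", start_date=datetime(2019, 1, 1), schedule_interval="@daily") as dag:
    # In the real pipeline this task runs after the processing tasks have finished.
    copy_to_postgres = PostgresOperator(
        task_id="copy_marketing_costs_to_postgres",
        postgres_conn_id="analytics_postgres",
        sql="""
            TRUNCATE marketing_costs;
            INSERT INTO marketing_costs
            SELECT *
            FROM dblink('redshift_server',
                        'SELECT campaign_id, day, cost FROM marketing_costs')
                 AS t(campaign_id integer, day date, cost numeric);
        """,
    )
```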

As the tables we create in Redshift and Postgres share the same structure with some small differences (e.g. in Postgres we create indexes, which don’t exist in Redshift, while in Redshift our tables have SortKeys, which are not valid in Postgres), we also re-implemented the table creation using SQLAlchemy and sqlalchemy-redshift, ending up with simpler and more maintainable code.

Adapting our Superset dashboards

Superset allowed us to easily change the charts of our dashboards to get the data from Postgres instead of Redshift. It was just a matter of creating the new database connection and editing the tables to change their source “database”, which automatically linked our charts to the tables in Postgres. We needed to fix a few metrics because of small differences between Redshift and Postgres, but configuring Superset to use the Postgres database was really straightforward.

Figure 9: Detail of the “Edit Table” form in Superset. There you can change which database a table belongs to (automatically changing where the dashboard gets its data from).

New data flow

The results using Postgres have been really positive: the dashboards load really fast even when applying many filters, the website no longer becomes unresponsive, and all of this was possible without spending lots of development time or computational resources.

We no longer need to store the data generated by our ETL processes in Redshift in order to use it in our dashboards, so we are adapting our pipelines to write the results only to our Postgres database. At the moment, we are in an intermediate step where our pipelines still store the results both in Redshift and in PostgreSQL, but they already do so in parallel, loading the data into PostgreSQL directly from the processed data, without going through Redshift anymore.

Figure 10: the same ETL pipeline as in Figure 8, updated to write to Redshift and to PostgreSQL in parallel and independently. The data is loaded into PostgreSQL directly from the output of the second task (identified as “compute_marketing_costs”), without any dependency on Redshift.

As having all the data aggregated in one data source greatly simplifies some operations, we plan to use “AWS Database Migration Service” (DMS) to sync the resulting data from Postgres back to Redshift, so our ETL pipelines can stop writing to Redshift. This way, they will have fewer responsibilities, but the data will still be available and up to date in our warehouse, relying on a trustworthy tool like AWS DMS, which we already use to make the data from all our databases available in Redshift.

Figure 11: Diagram of our new data flow (our ETL processes run by Airflow read data from Redshift, but generate data that is stored in a Postgres database in RDS, from where Superset dashboards consume it).

Conclusion

Redshift is an amazing database, but it’s not a general-purpose database. It performs well with big-data analytical workloads where you need to process lots of data in a relatively small amount of time, but it’s not a good fit when you are processing small tables and/or need low latency. If that’s your case, you will probably obtain much better results using a traditional RDBMS.

If, instead, you need to execute analytical queries across a very large number of rows, Redshift will do a good job, as it can use the computing resources of the whole cluster and get you where traditional databases simply can’t.

We need to look for the right tool for our needs. Even if Superset (and other dashboard solutions) supports lots of databases, not all of them are a good fit for the job, and which one is will probably depend on your circumstances (data size, query complexity…).

Special thanks to David for his time and efforts investigating the performance issues. Thanks Michele, Joaquín, Jose, John, Luis, Sergio for reviewing this post and for all the valuable feedback!

If you want to know more about what it’s like to work at Jobandtalent, you can read the first impressions of some of our teammates in this blog post or visit our Twitter.
