How To Query Multiple Databases and Generate Reports
Dr. Ramon Lawrence | Director of Distributed Database Laboratory | University of British Columbia
It is very common that the data needed for a report, business decision, or application is spread across several data sources which may be relational databases, spreadsheets, documents, NoSQL sources, or "Big Data" sources. This often happens as companies add new systems to their organization via growth in new business, mergers and acquisitions, or replacing and upgrading existing systems. These data sources may be physically distributed, in virtualization and cloud environments, and be hosted on software of various operating systems and database vendors. The challenge is how to manage and use this data with minimal investment in time and cost. This article covers the various options for querying and reporting data from multiple sources and different techniques and software that is available. Please send comments or suggestions to Dr. Ramon Lawrence.
We will use a simple example that consists of two databases. One database, called customers, consists of customer information and has one table:
customer(custId, custName, address)
The other database, called orders, contains order information:
orderinfo(orderId, orderDate, custId, totalAmount) orderitem(orderId, productId, quantity, itemCost)
The primary key fields are underlined, and there is a foreign key from orderitem.orderId to order.orderId. Click here for the MySQL DDL for this example.
Types of Multiple Database Queries
The goal is to be able to extract the necessary information, preferably with one query, from the data sources. The first task is to determine what "multiple database" means in your particular situation as that determines the options available. There are three key scenarios:
- Multiple Databases on One Server Instance
- Multiple Databases on Multiple Servers from the Same Database Vendor
- Multiple Databases on Multiple Servers from Different Vendors
Multiple Databases on One Server Instance
The easiest case is when all of the databases are on a single server instance. For example, you can have a MySQL server installation on machine X that hosts both the customers and orders database. It is possible to use SQL to write one query that combines the information from many databases. There are two requirements:
- You must prefix all table references with the database name (e.g. customers.customer instead of just customer).
- The account issuing the query must have sufficient privileges on the tables in both databases to execute the query.
The query to calculate the total order amount per customer is:
SELECT O.custId, C.custName, SUM(totalAmount) FROM orders.orderinfo O INNER JOIN customers.customer C ON O.custId = C.custId GROUP BY O.custID, C.custName
This query is identical to a query where all tables are in a single database except for providing the database name in the table alias (in bold). Every major database vendor supports this functionality. Below is a table with documentation references for each vendor.
|Microsoft SQL Server||Supported. Make sure to use databaseName.schemaName.tableName for table alias. References: Multiple Database Query, FROM clause syntax and specifying table aliases|
|Oracle||Supported. Include schema name in table alias. Reference: SELECT syntax
Use dblink to access remote servers including non-Oracle machines.
Supported by MySQL. Make sure to include database name in table alias.
References: MySQL Join Syntax, Tutorial on Inner JOIN
|PostgreSQL||NOT supported by PostgreSQL if orders and customers are two databases. Reference:
No Multiple Database Support
However, if you create orders and customers as two schemas in a single database, then it does work. A schema in PostgreSQL is a namespace for objects that allow you to control ownership and permissions of its contained objects. In some situations, rather than creating separate databases you can create one database with multiple schemas that has the same effect.
|SQLite||Supported. Use the ATTACH command to add additional databases to a connection. Then prefix the table name with the database name as usual. Reference: Tutorial|
|Sybase||Supported. Reference: SELECT syntax|
|HSQL||Supported with multiple schemas. Reference: Documentation|
|IBM DB2||Supported for schemas in the same database but not for different databases instances. For different database instances look at federation. Reference: Example|
In summary, if all your databases are on one server instance, then multiple database querying is as easy as prefixing the table name with the database or schema name depending on your database software. In other cases, you need to have one database with multiple schemas to make this technique work.Back to Scenarios
Multiple Databases on Multiple Servers from the Same Database Vendor
The second scenario is when you have multiple databases that reside on different servers but the database software is the same on all servers. For example, you may have two server instances on machines X and Y, but both server instances are Microsoft SQL Server. Different "servers" may refer to distinct physical machines or multiple virtual machines on the same physical hardware. The key factor is that there are multiple installations of the database software.
In this case, it is no longer possible to issue a single query that retrieves data from multiple servers without server configuration and setup. First, you must have sufficient privileges to modify one or more of the databases involved in the query as well as permission to read the corresponding tables. Second, not all database vendors support the ability to issue one query across multiple instances. The table below contains information on each database system support and instructions for use.
|Microsoft SQL Server||Use linked servers that allow connection to any data source (not just other SQL Server instances). References: Linked Servers Overview, Distributed Queries, Article with Example, Linked servers with Excel Example, SQL Server Linked Server Forum (for help with platform-specific installation issues)|
|Oracle||Use Oracle Database Gateways which is an add-on feature to Oracle. It allows querying both Oracle and non-Oracle sources. References: Oracle Tutorial|
|MySQL||No direct support. It may be possible to federate with another database (Oracle, MySQL) or use a multiple query tool such as UnityJDBC.|
|PostgreSQL||Use dblink which supports connections to other PostgreSQL databases. This is an add-on module to the PostgreSQL installation.|
|Sybase||Use Component Integration Services (similar to linked tables in Microsoft SQL Server). References: Sybase Documentation|
|IBM DB2||Use Federated Servers or IBM WebSphere Information Integration.|
When the data is on multiple instances from the same vendor, support depends on the particular database system. Microsoft and Sybase products support data linking to any source for free while for Oracle and IBM it is an expensive add-on feature. PostgreSQL has limited support with dblink, but the SQL statements need to explicitly reference the other sources. MySQL has no built-in support nor does HSQLDB. If your product does not have the support or it is too costly, other options include extracting the data and inserting it locally, replication, or client-side code. You can also use various multiple query products such as Toad Data Point (formerly Toad for Data Analysts) or UnityJDBC.
All federation systems need to be concerned with security. Server-side database modifications require careful analysis of security and since they require server modification may not be possible at all in some environments. External federation tools have the advantage that server-side modifications are not required.Back to Scenarios
Multiple Databases on Multiple Servers from Different Vendors
The most challenging case is when the databases are on different servers and some of the servers run different database software. For example, the customers database may be hosted on machine X on Oracle, and the orders database may be hosted on machine Y with Microsoft SQL Server. Even if both databases are hosted on machine X but one is on Oracle and the other on Microsoft SQL Server, the problem is the same: somehow the information in these databases must be shared across the different platforms. Many commercial databases support this feature using some form of federation, integration components, or table linking (e.g. IBM, Oracle, Microsoft), but support in the open-source databases (HSQL, MySQL, PostgreSQL) is limited.
There are various techniques to handling this problem:
- Table Linking and Federation - link tables from one source into another for querying
- Custom Code - write code and multiple queries to manually combine the data
- Data Warehousing/ETL - extract, transform, and load the data into another source
- Mediation Software - write one query that is translated by a mediator to extract the data required
Table Linking and Federation
Several database products, such Microsoft Access, Microsoft SQL Server, Sybase, IBM, and Oracle, have the ability to "link" a table from another source into the current server instance. Then you can write queries as if that table was actually stored on the particular server. This is a very useful feature that will often allow you to avoid loading data into another source while still being able to execute one query. Unfortunately, not all database vendors, especially the open source products, MySQL and PostgreSQL, support table linking. The systems that support it are described in the previous table.
The most straightforward approach is to manually combine the data from the sources. For example, you could write a program that uses a database connection interface (ODBC, JDBC, etc.) to pull the data from each source into the program and then write your own code to perform the query processing to get the merged result. Although this works for small cases, the whole benefit of the relational model and SQL is that the declarative nature of the language makes queries easier to write and optimize. How likely are you to write code with no errors? How about with good performance? How general will the code be or will you have to write individual solutions for each query? There is a reason why the query processing and optimization system is the most complicated part of a database - it is challenging to get it correct and efficient. In all but the simplest cases, you will spend way more time than necessary by doing it yourself.
Another technique is to write code that extracts data from one database and migrates it to another. This is basically rolling your own ETL software (discussed in the next section). Once again, you will likely invest a lot of effort to produce a non-general system that does not have the required efficiency.
A common solution to handle multiple data sources is to produce a single data warehouse that contains data from multiple sources. This data has been extracted from its original source, translated, summarized, and aggregated into a form suitable for efficient querying and reporting, and loaded into another database. The major benefit of data warehousing is an organization now has an integrated repository suitable for reporting that contains information from many locations. This article cannot go into depth on data warehousing and the steps involved, but some good references can be found here: Data Warehouse Introduction, Course Notes. If there is a significant amount of data to be analyzed (> 1 TB), then data warehousing will guarantee the best performance as the data is loaded once and placed in a system optimized for query performance.
The major issue with data warehousing is that it represents a significant infrastructure investment that is time-consuming to produce. Designing a data warehouse is a complicated procedure that requests buy-in from many organization stakeholders. It may take months or years to produce a data warehouse and involve outside consultants and significant personnel time. The tools to perform ETL functions and data warehouse management are expensive. Once produced a data warehouse must be maintained (personnel, licensing costs, maintenance, etc.). A data warehouse is not an investment to be considered lightly.
The second major issue with data warehousing relates to its freshness. The freshness of a data warehouse is how up-to-date (or real-time) the information it contains. A data warehouse may only have information that is 24 hours old (based on when it was extracted from the operational systems) which may not be sufficient for some business decisions. A larger issue is that adding new information to a data warehouse as the organizational needs evolve is time consuming. For example, if an organization acquires another smaller organization and wants to merge its customer information into its data warehouse, this may take several months before it can be accomplished. Maintaining a data warehouse is costly.
The idea of mediation software is to leave the data where it resides and only extract the required data on demand. The user writes one query submitted to the mediation software that is responsible for optimizing the query to determine an efficient execution plan, translating each query to extract the relevant data from each source, and merging the results from sources into a single answer. From the user perspective, one query produces one answer from a single "virtual" database. Mediator systems have a long history in the database research community. There are also several commercial products that provide this functionality.
Advantages of using mediation software:
- No new systems are required as they leave the data on the data sources (no data movement).
- No data source modifications are required. Avoids security issues related to federated or linking servers at the database server level.
- Data is always up-to-date and real-time as it comes directly from the source.
- No database vendor lock-in compared to using linked servers.
- Query translation allowing portable SQL and code. Can migrate database servers and application is not affected.
- Rapid deployment and installation.
- Very good performance and scalability. Can create multiple federation servers separate from the database servers so data sources are not overloaded.
- Easy to use. Integration happens in a standard SQL query.
Disadvantages of using mediation software:
- Lower performance than data warehousing for very large data sets as the data must be migrated from each system to answer a query. This issue is present in all approaches that do not centralize the data as large amounts of data may need to be extracted and combined from the sources.
- May not be required for simple situations that can be solved by linking a few tables.
UnityJDBC is a JDBC-based mediation software that allows one query to combine information from many sources. It supports standard SQL syntax and functions and allows joins across databases. UnityJDBC automatically translates the query into the specific dialect for each database system and has an advanced optimizer to extract only the information required to answer the query.
Applications of UnityJDBC:
- Web-based reporting - Reporting software such as Jasper Reports and Pentaho have limits on generating reports requiring data from multiple sources. With UnityJDBC, the multiple sources appear as one to the reporting software and real-time integration queries can be done easily using only SQL.
- Developing portable applications - If you are developing an application that may use different databases, a major challenge is that each database dialect is slightly different (especially functions). UnityJDBC provides function translation and substitution for major database dialects. You can develop your application with one dialect of SQL and UnityJDBC will translate it to MySQL, Oracle, SQL Server, or any database that is used.
- Rapid ETL and Querying - Do you quickly need data for a report or application from multiple places and do not have the time or permission to modify the servers? In a few minutes, you can use UnityJDBC to write a JDBC program to extract data using SQL, load it into another source, or use the data for a quick report or decision. No new software, no hassle, problem solved in 30 minutes.
- Multiple Database Query Tool - Do you ever wish that your favorite query tool (e.g. Aqua Data Studio, SQuirreL SQL, DatabaseSpy, RazorSQL) would allow you to write a query across multiple databases? Just register the UnityJDBC driver with your query tool, and you can write queries that span databases in 5 minutes.
JBoss Enterprise Integration
As part of the JBoss enterprise server, JBoss Enterprise Integration provides mediator software allowing queries to span multiple data sources.
Please send comments or suggestions to Dr. Ramon Lawrence.Back to Scenarios