How to query sharded tables

Supermetrics Support Site Team

Modified on: Thu, 15 Dec, 2022 at 10:38 AM

When Supermetrics transfers data from BigQuery, it creates sharded tables. Follow these instructions to select and query data from a single table shard (one day of data), optimize your queries, and prepare the data for transfer to Tableau or PowerBI.

Instructions

Query a sharded table

Running this query in the Google Cloud Console will select data from a single table shard:

select *

from `project.dataset.datasource_tablename_20220101`

To query all the data that exists in a set of sharded tables, however, you need to ensure that BigQuery combines the data from all of these tables together using a wildcard.

The following query uses a star (*) as the wildcard. This instructs BigQuery to “query data from all tables that match this naming convention”:

select *

from `project.dataset.datasource_tablename_*`

Prepare data for Tableau or PowerBI

While Looker Studio (formerly Data Studio) natively reads sharded tables, most BI tools don’t. This means that if you try to connect Tableau or PowerBI directly to your sharded tables, they’ll reach them as hundreds of individual tables that aren’t connected to each other.

Creating a BigQuery view that uses a wildcard before loading the data into a BI tool solves this problem. A view combines the results of the underlying sharded tables into a single object that both Tableau and PowerBI can access.

Instructions

Enter this query in the Google Cloud Console:

select *

from `project.dataset.datasource_tablename_*`

After it’s run, you’ll see a SAVE button that allows you to Save view. Select this, name the new view, and then choose which dataset it’ll live in.

If you’d prefer to do this using SQL syntax, you can run the following command instead:

create or replace view dataset.viewname

select *

from `project.dataset.datasource_tablename_*`

Optimize queries

Creating a view (“Prepare data for Tableau or PowerBI” above) allows you to select all data in a sharded table set. In some cases, however, you might not need all of that data in every query.

Sharded tables have some useful properties that can improve each query’s speed and reduce its cost while offering precise control over the data they pull.

Both of the techniques below enable BigQuery to determine exactly which tables need to be scanned before it executes the query. These deeply reduced scan sizes improve the speed and costs of the associated queries.

Move the wildcard

The easiest way to specify your selection of data is to move the query’s wildcard to a different position.

For example, this query will select data from only January 2022:

select *

from `project.dataset.datasource_tablename_202201*`

With the wildcard just after the month value, the query now returns data for only a single month. By extension, this query will return all data from all of 2022:

select *

from `project.dataset.datasource_tablename_2022*`

Use the _TABLE_SUFFIX

A dynamically defined time range can make a query even more efficient.

For example, this query will return all data after January 2022:

select *

from `project.dataset.datasource_tablename_*`

where _TABLE_SUFFIX > ‘20220131’

In this case, adding _TABLE_SUFFIX to the WHERE clause limits the amount of data being returned by the query.

Delete sharded tables

The process for deleting sharded tables is delicate. If carried out incorrectly, it could result in the deletion of data in unintended locations.

Learn how to delete sharded tables in our guide.

More resources

About BigQuery sharded tables

How to query sharded tables

Instructions

Instructions

Move the wildcard

Use the _TABLE_SUFFIX

More resources

Products

Resources

Company