Skip to main content
Skip to main content
Edit this page

How to query a remote ClickHouse server

In this guide, we're going to learn how to query a remote ClickHouse server from chDB.

Setup

Let's first create a virtual environment:

And now we'll install chDB. Make sure you have version 2.0.2 or higher:

And now we're going to install pandas, and ipython:

We're going to use ipython to run the commands in the rest of the guide, which you can launch by running:

You can also use the code in a Python script or in your favorite notebook.

An intro to ClickPy

The remote ClickHouse server that we're going to query is ClickPy. ClickPy keeps track of all the downloads of PyPI packages and lets you explore the stats of packages via a UI. The underlying database is available to query using the play user.

You can learn more about ClickPy in its GitHub repository.

Querying the ClickPy ClickHouse service

Let's import chDB:

We're going to query ClickPy using the remoteSecure function. This function takes in a host name, table name, and username at a minimum.

We can write the following query to return the number of downloads per day of the openai package as a Pandas DataFrame:

Now let's do the same to return the downloads for scikit-learn:

Merging Pandas DataFrames

We now have two DataFrames, which we can merge together based on date (which is the x column) like this:

We can then compute the ratio of Open AI downloads to scikit-learn downloads like this:

Querying Pandas DataFrames

Next, let's say we want to find the dates with the best and worst ratios. We can go back to chDB and compute those values:

If you want to learn more about querying Pandas DataFrames, see the Pandas DataFrames developer guide.