Load sample data into ClickHouse® ================================= The official ClickHouse® website offers `a list of example datasets `_ to get you started. Each dataset has a description on how to download, upload, and transform the data samples as needed. This article takes a closer look at how to use the ``Anonymized Web Analytics Data`` `example dataset `_. This contains two tables: - ``hits_v1``, with data for every user action - ``visits_v1``, with information about every user session The steps below show you how to download the dataset, set up a connection with the server, and load the data into your cluster. ClickHouse already offers detailed instructions `on setting up this dataset `_, but these steps add some more details on how to run commands by using a ClickHouse client running in Docker. Download the dataset -------------------- Download the original dataset directly from `the dataset documentation page `_. You can do this using cURL, where the generic command looks like this: .. code:: curl address_to_file_in_format_tsv_xz | unxz --threads=`nproc` > file-name.tsv .. note:: The ``nproc`` Linux command, which prints the number of processing units, is not available on macOS. To use the above command, add an alias for ``nproc`` into your ``~/.zshrc`` file: ``alias nproc="sysctl -n hw.logicalcpu"``. This command allows you to download and extract data from the URLs specified in the `ClickHouse documentation `_. Once done, you should have two files available: ``hits_v1.tsv`` and ``visits_v1.tsv``. Set up the service and database ------------------------------- If you don't yet have an Aiven for ClickHouse service, follow the steps in our :doc:`getting started guide ` to create one. When you create a service, a default database was already added. However, you can create separate databases specific to your use case. We will create a database with the name ``datasets``, keeping it the same as in the ClickHouse documentation. To create the new database, take the following steps: 1. Log in to the `Aiven web console `_, and select your service from the **Services** page. 2. In your service's page, select **Databases and tables** from the sidebar. 3. In the **Databases and tables** page, select **Create database** > **ClickHouse database**. 4. In the **Create ClickHouse database** window, enter name ``datasets`` for your database and select **Create database**. Connect to the ClickHouse database ---------------------------------- We will be using the ClickHouse client to connect to the server. Follow :doc:`the separate guide ` to familiarize yourself with how to set up and start using the ClickHouse client. To connect to the server, use the connection details that you can find in the **Connection information** section of the **Overview** page in the Aiven web console. You will need **Host**, **Port**, **User**, and **Password**. .. code:: bash docker run --interactive \ --rm clickhouse/clickhouse-server clickhouse-client \ --user USERNAME \ --password PASSWORD \ --host HOST \ --port PORT \ --secure Once you're connected, you can run queries from within the ClickHouse client. Create tables --------------- The next step is to add new tables to your newly created database. The ClickHouse documentation includes sample ``CREATE TABLE`` commands with `the recommended table structure `_, use it to create the tables for both ``hits_v1`` and ``visits_v1``: .. code:: sql CREATE TABLE datasets.hits_v1 [...] .. code:: sql CREATE TABLE datasets.visits_v1 [...] .. note:: If no database is specified, the default one is used. Load data --------- Now that you have a dataset with two empty tables, we'll load data into each of the tables. However, because we need to access files outside the docker container, we'll run the command specifying ``--query`` parameter. To do this: #. Go to the folder where you stored the downloaded files for ``hits_v1.tsv`` and ``visits_v1.tsv``. #. Run the following command: .. code:: cat hits_v1.tsv | docker run \ --interactive \ --rm clickhouse/clickhouse-server clickhouse-client \ --user USERNAME \ --password PASSWORD \ --host HOST \ --port PORT \ --secure \ --max_insert_block_size=100000 \ --query="INSERT INTO datasets.hits_v1 FORMAT TSV" ``hits_v1.tsv`` contains approximately 7Gb of data. Depending on your internet connection, it can take some time to load all the items. #. Run the corresponding command for ``visits_v1.tsv``: .. code:: cat visits_v1.tsv | docker run \ --interactive \ --rm clickhouse/clickhouse-server clickhouse-client \ --user USERNAME \ --password PASSWORD \ --host HOST \ --port PORT \ --secure \ --max_insert_block_size=100000 \ --query="INSERT INTO datasets.visits_v1 FORMAT TSV" You should now see the two tables in your database and you are ready to try out some queries. Run queries ----------- Once the data is loaded, you can run queries against the sample data you imported. For example, here is a command to query the number of items in the ``hits_v1`` table: .. code:: sql SELECT COUNT(*) FROM datasets.hits_v1 Another example uses some additional query features to find the longest lasting sessions: .. code:: sql SELECT StartURL AS URL, MAX(Duration) AS MaxDuration FROM datasets.visits_v1 GROUP BY URL ORDER BY MaxDuration DESC LIMIT 10 View tables in the console -------------------------- You can also use the database and added tables with the data in the `Aiven web console `_. You can find them by selecting **Databases & Tables** from the sidebar of your service's page in `Aiven Console `_.