Skip to main content
Skip to main content
Edit this page

Recipes Dataset

The RecipeNLG dataset is available for download here. It contains 2.2 million recipes. The size is slightly less than 1 GB.

Download and Unpack the Dataset

  1. Go to the download page https://recipenlg.cs.put.poznan.pl/dataset.
  2. Accept Terms and Conditions and download zip file.
  3. Option: Using the md5sum dataset.zip to validate the zip file and it should be equal to 3a168dfd0912bb034225619b3586ce76.
  4. Unpack the zip file with unzip dataset.zip. You will get the full_dataset.csv file in the dataset directory.

Create a Table

Run clickhouse-client and execute the following CREATE query:

Insert the Data

Run the following command:

This is a showcase how to parse custom CSV, as it requires multiple tunes.

Explanation:

  • The dataset is in CSV format, but it requires some preprocessing on insertion; we use table function input to perform preprocessing;
  • The structure of CSV file is specified in the argument of the table function input;
  • The field num (row number) is unneeded - we parse it from file and ignore;
  • We use FORMAT CSVWithNames but the header in CSV will be ignored (by command line parameter --input_format_with_names_use_header 0), because the header does not contain the name for the first field;
  • File is using only double quotes to enclose CSV strings; some strings are not enclosed in double quotes, and single quote must not be parsed as the string enclosing - that's why we also add the --format_csv_allow_single_quote 0 parameter;
  • Some strings from CSV cannot parse, because they contain \M/ sequence at the beginning of the value; the only value starting with backslash in CSV can be \N that is parsed as SQL NULL. We add --input_format_allow_errors_num 10 parameter and up to ten malformed records can be skipped;
  • There are arrays for ingredients, directions and NER fields; these arrays are represented in unusual form: they are serialized into string as JSON and then placed in CSV - we parse them as String and then use JSONExtract function to transform it to Array.

Validate the Inserted Data

By checking the row count:

Query:

Result:

Example Queries

Top Components by the Number of Recipes:

In this example we learn how to use arrayJoin function to expand an array into a set of rows.

Query:

Result:

The Most Complex Recipes with Strawberry

Result:

In this example, we involve has function to filter by array elements and sort by the number of directions.

There is a wedding cake that requires the whole 126 steps to produce! Show that directions:

Query:

Result:

Online Playground

The dataset is also available in the Online Playground.