UDFBench

Welcome to the UDFBench repository!

UDFBench is a UDF-centric database benchmark that focuses on UDFs and their implications to SQL query performance. The benchmark is based on a real-world schema and data, and comprises 21 queries involving 42 scalar, aggregate and table User-Defined Functions (UDFs).

This repository provides an initial deployment of UDFBench on five data engines: MonetDB, PostgreSQL, DuckDB, SQLite, and Spark. And it also provides instructions on how it can be deployed on new systems.

Citing UDFBench

If you use UDFBench, please cite the relevant publication:

@article{udfbench,
  title        = {The UDFBench Benchmark for General-purpose UDF Queries},
  author       = {Yannis Foufoulas and Theoni Palaiologou and Alkis Simitsis},
  journal      = {Proc. {VLDB} Endow.},
  volume       = {18},
  year         = {2025}
}

How to Use

For installation and execution instructions, please refer to the instructions.md file in the repository.

Schema and Data

Schema

The UDFBench schema hails from a real-world application that performs information extraction, text mining and analytics over Open Access publications, and is a curated subset of the OpenAIRE Graph Dataset.

The UDFBench schema comprises 2 core tables: artifacts and projects. These connect to 7 tables containing author metadata, author lists per artifact (in JSON), artifact abstracts and accessing fees, artifacts related to a project, and to 1 table with view statistics for each artifact.

The schema includes fields of various data types and sizes, and allows building queries with expensive relational operators such as aggregation, grouping, and join paths of moderate length (up to 5).

Note that the UDFBench schema contains denormalized features such as nested attributes (JSON list, dict) and the UDFBench queries connect to external denormalized files, hence, enabling testing complex UDF functionality beyond relational. The external files contain information from sources such as arxiv, crossred, pubmed, etc. and are in various formats such as csv, txt, xml, json, etc.

The ER-diagram of the UDFBench schema is as follows:

Dataset

UDFBench contains four dataset sizes: tiny, small, medium, and large.

The tiny dataset is intended for deployment testing and it is included in this repository in the dataset directory.

The other three sizes are hosted in Zenodo and can be automatically deployed into the UDFBench schema using automated scripts (see instructions.md).

Each dataset consists of 10 CSV files corresponding to the 10 UDFBench schema tables.

The external files used in the UDFBench queries are also available in the same Zenodo repository.

Queries

UDFBench comprises 21 SQL queries with UDFs organized in four query classes:

(QC1) Queries with UDFs and a few relational operators (emphasis on UDFs). These queries are useful for evaluating UDF related overheads.
(QC2) Queries with a blend of UDFs and expensive relational operators (e.g., join, group) or nesting. These queries are useful for evaluating the interaction of UDFs with heavy relational operators.
(QC3) Queries with UDFs and complex relational logic. These queries are usefule for evaluating the impact of UDFs in query execution/optimization.
(QC4) Queries with UDF and DML operations (e.g., insert, update).

The queries are implemented for all supported engines and can be found in the queries directory under each engine directory.

Query List

Query Class 1 (QC1)

Q1 - Date Extraction with Scalar UDFs

Description: Extract the year, month, and day from a date using three scalar UDFs.
Input: 1 string.
Output: 3 integer values.

Q2 - Date Extraction with Table UDF

Description: Extract the year, month, and day from a date using a single table-returning UDF.
Input: 1 string column.
Output: 3 integer columns.

Q3 - Multi-language Scalar UDFs

Description: Extract the year, month, and day from a date using scalar UDFs implemented in different languages.
Input: 1 string.
Output: 3 integer values.

Q4 - Average and Median with Aggregate UDFs

Description: Compute the average and median author count per artifact using two aggregate UDFs.
Input: 1 integer.
Output: 2 float values.

Q5 - Aggregate Calculations with Table UDF

Description: Compute the average and median author count per artifact using a single table-returning UDF.
Input: 1 integer.
Output: 2 float values.

Q6 - File Format Conversion

Description: Convert and union two disk files (XML and JSON formats) into a single CSV file.
Input: 2 file paths.
Output: 1 CSV file.

Q7 - JSON File Analysis

Description: Analyze a JSON file with publication metadata to compute the average citation and author counts.
Input: 1 file path.
Output: 2 float values.

Q8 - Publication Metadata Analysis

Description: Analyze publication metadata from two tables (author lists and citations) to compute the average citation and author counts.
Input: 2 tables.
Output: 2 float values.

Q9 - Author Pair Counting

Description: Clean author list data and count all possible author pairs for publications with fewer than 50 authors.
Input: 1 table.
Output: 1 integer value.

Query Class 2 (QC2)

Q10 - Iterative k-Means Clustering

Description: Perform iterative k-means clustering on publications grouped by type, based on funding amounts accessed through funding projects.
Input: 2 tables.
Output: 3 values (1 string, 1 integer, 1 float).

Q11 - Recursive k-Means Clustering

Description: Perform recursive k-means clustering on publications grouped by type, based on funding amounts accessed through funding projects.
Input: 2 tables.
Output: 3 values (1 string, 1 integer, 1 float).

Q12 - Noisy Artifact Rankings

Description: Return the 10 most-viewed artifacts from the past year, adding Gaussian noise to the counts.
Input: 1 table.
Output: 1 string, 1 float.

Q13 - JSON Link Extraction

Description: Extract links between publications and projects from an external JSON file.
Input: 1 file.
Output: 2 string values.

Query Class 3 (QC3)

Q14 - Recent Author Affiliations

Description: Return the most recent affiliation of the first author for publications funded by the European Commission.
Input: 4 tables.
Output: 2 string values.

Q15 - XML Link Extraction

Description: Extract links between publications and projects from an external XML file, excluding links that already exist in the database.
Input: 1 file.
Output: 2 string values.

Q16 - Author Collaboration Analysis

Description: Analyze author collaborations within projects by computing:
- Author pairs collaborating during the project.
- Author pairs collaborating before the project.
- Author pairs collaborating after the project.
Input: 4 tables.
Output: 3 string columns and 3 integer columns.

Q17 - TF/IDF Computation

Description: Compute the term frequency-inverse document frequency (TF/IDF) for each term in artifact abstracts.
Input: 1 table.
Output: 2 string values.

Q18 - Jaccard Similarity

Description: Compute Jaccard similarity between document abstracts from two external files and return the top 5 similar documents per source document.
Input: 2 files.
Output: 2 strings, 1 float.

Q19 - Project Artifact Linking

Description: Return the pivoted count of links to artifacts of different types for each project.
Input: 2 tables.
Output: 1 string, 4 integers.

Query Class 4 (QC4)

Q20 - Date Cleaning and Updating

Description: Update artifact table dates after cleaning.
Input: 1 table.
Output: Updated table.

Q21 - JSON Link Insertion

Description: Extract links between publications and projects from a JSON file and insert them into the projects_artifacts table.
Input: 1 file.
Output: Updated table.

UDFs

UDFBench comprises 42 UDFs, categorized as follows:

24 Scalar UDFs
4 Aggregate UDFs
14 Table UDFs, covering various subtypes (see the UDFBench paper for details)

The UDFs are implemented for all supported engines and can be found in the udfs directory under each engine directory, organized by their types.

UDF List

Scalar UDFs

ID	Name	Description	Type	#Inputs	Input Type	#Outputs	Output Type	Pipeline	Parallelizable	State	Cost
U1	Addnoise	Adds Gaussian noise to a value and returns a float	S-1	1	numeric	1	float	x	x		O(n)
U2	Clean	Performs a data cleaning task on the string tokens of a JSON list	S-1	1	string	1	string	x	x	x	O(n)
U3	Cleandate	Reads a date and converts it to a common format (if needed) and handles problem dates	S-1	1	string	1	string	x	x	x	O(n)
U4	Converttoeuro	Converts currency to euro, returns a float	S-1	1	numeric	1	float	x	x		O(n)
U5	Extractclass	Extracts class from a string with format `funder::class::projectid`	S-1	1	string	1	string	x	x		O(n)
U6	Extractcode	Extracts project ID from a structured string containing the funder's ID, funding class, and project ID	S-1	1	string	1	string	x	x	x	O(n)
U7	Extractday	Reads a date (as a string) and extracts an integer with the day	S-1	1	string	1	int	x	x	x	O(n)
U8	Extractfunder	Extracts funder from a string with format `funder::class::projectid`	S-1	1	string	1	string	x	x		O(n)
U9	Extractid	Extracts project ID from a string with format `funder::class::projectid`	S-1	1	string	1	string	x	x		O(n)
U10	Extractmonth	Reads a date (as a string) and extracts an integer with the month	S-1	1	string	1	int	x	x	x	O(n)
U11	Extractprojectid	Processes a text snippet and extracts a 6-digit project identifier	S-1	1	string	1	string	x	x	x	O(n)
U12	Extractyear	Reads a date (as a string) and extracts an integer with the year	S-1	1	string	1	int	x	x	x	O(n)
U13	Filterstopwords	Removes stopwords from the input text	S-1	1	string	1	string	x	x	x	O(n)
U14	Frequentterms	Returns a space-separated text containing the most N% frequent tokens	S-1	1	string	1	string	x	x		O(n)
U15	Jaccard	Processes two JSON lists with tokens and calculates the Jaccard distance	S-1	2	json	1	float	x	x		O(n*k)
U16	Jpack	Converts a string to a JSON list with tokens	S-1	P	P	1	json	x	x		O(n*k)
U17	Jsoncount	Returns the length of a JSON list	S-1	1	json	1	int	x	x		O(n)
U18	Jsort	Processes a JSON list and returns a sorted JSON list	S-1	1	string	1	string	x	x		O(n)
U19	Jsortvalues	Processes a JSON list, if a value contains more than one token it sorts the value tokens	S-1	1	string	1	string	x	x		O(n)
U20	Keywords	Removes punctuation from text and returns the keywords in one string	S-1	1	string	1	string	x	x	x	O(n)
U21	Log10	Computes and returns the logarithm	S-1	1	numeric	1	float	x	x		O(n)
U22	Lower	Converts the input text to lowercase	S-1	1	string	1	string	x	x		O(n)
U23	Removeshortterms	Processes a JSON list, if a value contains more than one token it removes tokens with length less than 3 chars	S-1	1	string	1	string	x	x		O(n)
U24	Stem	Stems the input text	S-1	1	string	1	string	x	x	x	O(n)

Aggregate UDF

ID	Name	Description	Type	#Inputs	Input Type	#Outputs	Output Type	Pipeline	Blocking	Parallelizable	State	Cost
U25	Avg	Computes average	A-2	1	numeric	1	float	x		x		O(n)
U26	Count	Computes count	A-2	1	P	1	int	x		x		O(n)
U27	Max	Computes max	A-2	1	P	1	int	x		x	x	O(n)
U28	Median	Computes median	A-2	1	numeric	1	numeric		x			O(nlogn)

Table UDFs

ID	Name	Description	Type	#Inputs	Input Type	#Outputs	Output Type	Pipeline	Blocking	Parallelizable	Side Effect	State	Cost
U29	Extractfromdate	Reads a date (as string), returns 3 column values (day, month, year)	T-3	1	string	3	rset	x		x		x	O(n)
U30	Jsonparse	Parses a JSON dict per time and returns a tuple with the values	T-3	1	json	P	P	x		x			O(n*k)
U31	Combinations	Reads a JSON list and returns a table with all the combinations per an integer parameter	T-4	1	json	P	P	x		x			O(n*k^2)
U32	Extractkeys	Selects keys from XML parsed input	T-2	1	xml	P	P	x		x		x	O(n)
U33	Strsplitv	Processes a string at a time and returns its tokens in separate rows	T-4	1	string	1	string	x		x
U34	Jgroupordered	Performs a group by with an aggregate defined as a (named) parameter on a subquery ordered by an attribute	T-5	P	P	P	P	x					O(n)
U35	Kmeans	Iterative version of the kmeans function	T-5	P	numeric	P	P		x				O(n^2)
U36	Kmeans	Clusters input data using recursive kmeans, returns cluster id and data point	T-5	P	numeric	P	P		x			x	O(n^2)
U37	Xmlparse	Parses XML input and returns a table	T-5	1	xml	P	P		x				O(n)
U38	Pivot	Converts rows of a specific attribute (optionally grouped by another attribute) into columns, while applying an aggregation within the transformed dataset, and returns one tuple per input group	T-6	P	P	P	P		x				O(n*k)
U39	Top	Processes one group at a time and returns the top N values of an attribute	T-7	P	P	P	P	x		x			O(n*logk)
U40	File	Parses an external file (CSV, XML, JSON) and returns a table	T-8	1	string	P	P	x					O(n)
U41	Output	Exports the results of a subquery to local storage in various formats and returns true if it succeeds	T-9	P	P	1	bool	x			x		O(n)
U42	getstats	Inputs a table with integer values and returns the avg and the median for each input column	T-10	P	numeric	P	float		x				O(n*logn)

UDFBench Experimental Suite

UDFBench is a flexible framework for conducting experiments across different database engines. It evaluates the supported queries and UDFs, and allows the configuration of the following settings:

Dataset Size: Choose between small, medium, or large datasets to match your benchmarking requirements.
Parallelism: Control the number of threads to run.
Cache State: Select between cold (fresh start) or hot cache scenarios.
Storage Medium: Test performance on different storage meduims: hdd, ssd, and memory.
Resource Utilization Monitoring:
- Use collectl to monitor system resource usage (CPU, memory, disk, etc.).
- Store the results for post-experiment analysis.

More details on how to orchestrate and run an experiment can be found in the instructions.md file.

UDFBench Paper Experiments

For the experiments presented in the UDFBench paper, we used the following software versions: MonetDB 5 server(v11.50.0), PostgreSQL (17), DuckDB (1.0.0), SQLite3 (3.37.2), Spark/PySpark (3.5.5), Python (3.10.12), gcc (11.4.0), PyPy (7.3.12), CFFI (1.15.1), APSW (3.40.0.0), NumPy (1.26.4), Pandas (2.2.0), PyArrow(18.0.0), Scikit-learn (1.4.0), NLTK (3.8.1), Radon 6.0.1.

Note: If these experiments run on different software versions the results may vary and certain features may not be supported.

Disclaimer

This software is provided "as is" without any warranties, express or implied.

Thank you for exploring the UDFBench!

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
automations		automations
dataset		dataset
engines		engines
figs		figs
LICENSE		LICENSE
README.md		README.md
collectl_results.sh		collectl_results.sh
instructions.md		instructions.md
integration_instructions.txt		integration_instructions.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

UDFBench

Citing UDFBench

How to Use

Schema and Data

Schema

Dataset

Queries

Query List

UDFs

UDF List

UDFBench Experimental Suite

UDFBench Paper Experiments

Disclaimer

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

License

athenarc/UDFBench

Folders and files

Latest commit

History

Repository files navigation

UDFBench

Citing UDFBench

How to Use

Schema and Data

Schema

Dataset

Queries

Query List

UDFs

UDF List

UDFBench Experimental Suite

UDFBench Paper Experiments

Disclaimer

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages