Creating new table from existing table very slow. Must I really use arrow? #1179
Unanswered
IanWorthington
asked this question in Q&A
Replies: 2 comments
-
For further and better support, a smaller, well-formatted and self-contained example would be great. Please see https://reprex.tidyverse.org/ for guidance.
-
Thanks. In the end I've been unable to find anything that avoids creating intermediate BLOCK files on disk, apart from a pure SQL solution (or writing out to CSV/Parquet and reading back in). Pure SQL reduced the execution time from over 4000 seconds to 300.
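For anyone reading along, a minimal sketch of what such a pure-SQL approach could look like, assuming the `^gspc` table, the column names, and the `DbLocation` from the question below; the exact statement that produced the 300-second result is not shown in the thread:

```r
library(DBI)
library(duckdb)

con = dbConnect( duckdb(), dbdir = DbLocation )

# Hypothetical CREATE TABLE ... AS SELECT: the join and derived columns are
# computed entirely inside DuckDB, so nothing is materialised in R.
dbExecute( con, '
  CREATE OR REPLACE TABLE test2 AS
  SELECT t1.DayIndex,
         t2.FutureDayIndex,
         t2.FutureDayIndex - t1.DayIndex                   AS FutureDayIndexDelta,
         100 * (t2.FutureClose - t1."close") / t1."close"  AS FutureCloseChangePerc
  FROM "^gspc" t1
  LEFT JOIN ( SELECT DayIndex AS FutureDayIndex, "close" AS FutureClose FROM "^gspc" ) t2
    ON t1.DayIndex <= t2.FutureDayIndex
' )

dbDisconnect( con )
```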
-
I've just started looking at duckdb to see if it can solve one of my memory-limited problems with tibbles.
The problem is to build a database table of all possible timelines of closing stock prices from an existing table of dates and prices.
Creating the source table is a basic download, manipulate, and store:
```r
library(dplyr)
library(lubridate)
library(DBI)
library(duckdb)

symbol     = "^gspc"
DbLocation = r"(E:\Backup\My Code Projects\R Studio Projects\Stocks exploration\data\sp500.duckdb)"
tableName  = symbol

symbolData = tidyquant::tq_get( symbol, from = "1800-01-01" )

symbolData2 =
  symbolData %>%
  mutate( isTradingDay = TRUE,
          TradingDayIndex = row_number()
  ) %>%
  padr::pad() %>%
  tidyr::fill( close, symbol ) %>%
  mutate( DayIndex = row_number(),
          isTradingDay = ifelse( is.na(isTradingDay), FALSE, isTradingDay ),
          isEow = ( lubridate::wday(date) == 6 ),   # 6 = Friday
          next.date = lead(date),
          isEom = ( month(date) != month(next.date) )
  ) %>%
  select( -next.date )

# Save in db
con = dbConnect( duckdb(), dbdir = DbLocation )
dbWriteTable( con, tableName, symbolData2, overwrite = TRUE, append = FALSE )
dbDisconnect( con )
```
The problem arises when I attempt to use this table. I wish to build a new table which contains the future prices starting from each date in the first table. This requires a lot of memory, so I would like to do it directly from one table to another without materialising the data into a data frame. However, dbWriteTable doesn't seem to support a lazy table, and the only solution I could find was to convert it to an arrow object first.
But this was super slow, and I'm sure there must be a better solution.
```r
library(dbplyr)
library(stringr)

con = dbConnect( duckdb(), dbdir = DbLocation )

# Build a pre-sql pipeline
PreSqlStep =
  tbl( con, tableName ) %>%
  select( date, close, DayIndex )

# convert to sql
PreSqlStep.AsSql = sql_render( PreSqlStep )

# use in a complex sql statement
ComplexSql = str_glue( .sep = " ",
  "SELECT *",
  "FROM ({PreSqlStep.AsSql}) t1",
  "LEFT JOIN",
  "(SELECT DayIndex AS FutureDayIndex, \"close\" AS FutureClose FROM \"^gspc\" ) t2",
  "ON t1.DayIndex <= t2.FutureDayIndex"
)

output =
  tbl( con, sql(ComplexSql) ) %>%
  arrange( DayIndex, FutureDayIndex ) %>%
  mutate( FutureDayIndexDelta  = FutureDayIndex - DayIndex,
          FutureCloseChangePerc = 100 * (FutureClose - close) / close
  ) %>%
  select( -c(date, close) ) %>%
  arrow::to_arrow()

dbWriteTableArrow( con, "test2", output, overwrite = TRUE )

OutputData =
  tbl( con, "test2" ) %>%
  filter( DayIndex == 1 ) %>%
  collect()

dbDisconnect( con )
```
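As an aside (not something tried in the thread), dbplyr's `compute()` can ask the backend to materialise a lazy query as a table inside DuckDB itself, which might avoid both the arrow conversion and collecting into R. A sketch, reusing `ComplexSql` and the open `con` from above before the dbDisconnect:

```r
library(dplyr)
library(dbplyr)

# Sketch only: compute() issues a CREATE TABLE ... AS for the lazy query,
# so the result stays inside DuckDB. Drop any existing "test2" table first.
tbl( con, sql(ComplexSql) ) %>%
  arrange( DayIndex, FutureDayIndex ) %>%
  mutate( FutureDayIndexDelta  = FutureDayIndex - DayIndex,
          FutureCloseChangePerc = 100 * (FutureClose - close) / close ) %>%
  select( -c(date, close) ) %>%
  compute( name = "test2", temporary = FALSE )
```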