Creating new table from existing table very slow. Must I really use arrow? #1179
Unanswered
IanWorthington
asked this question in Q&A
Replies: 2 comments
-
For further and better support, a smaller, well-formatted and self-contained example would be great. Please see https://reprex.tidyverse.org/ for guidance.
-
Thanks. In the end I've been unable to find anything that avoids creating intermediate BLOCK files on disk, apart from a pure SQL solution (or writing out to CSV/Parquet and reading back in). Pure SQL reduced the execution time from over 4000 seconds to 300.
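For anyone reading along, a minimal sketch of what such a pure-SQL approach could look like, assuming the `^gspc` table, the column names, and the `DbLocation` from the question below; the exact statement that produced the 300-second result is not shown in the thread:

```r
library(DBI)
library(duckdb)

con = dbConnect( duckdb(), dbdir = DbLocation )

# Hypothetical CREATE TABLE ... AS SELECT: the join and derived columns are
# computed entirely inside DuckDB, so nothing is materialised in R.
dbExecute( con, '
  CREATE OR REPLACE TABLE test2 AS
  SELECT t1.DayIndex,
         t2.FutureDayIndex,
         t2.FutureDayIndex - t1.DayIndex                   AS FutureDayIndexDelta,
         100 * (t2.FutureClose - t1."close") / t1."close"  AS FutureCloseChangePerc
  FROM "^gspc" t1
  LEFT JOIN ( SELECT DayIndex AS FutureDayIndex, "close" AS FutureClose FROM "^gspc" ) t2
    ON t1.DayIndex <= t2.FutureDayIndex
' )

dbDisconnect( con )
```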
-
I've just started looking at duckdb to see if it can solve one of my memory-limited problems with tibbles.
The problem is to build a database table of all possible timelines of closing stock prices from an existing table of dates and prices.
Creating the source table is a basic download, manipulate, and store:
```r
library(dplyr)
library(lubridate)
library(DBI)
library(duckdb)

symbol     = "^gspc"
DbLocation = r"(E:\Backup\My Code Projects\R Studio Projects\Stocks exploration\data\sp500.duckdb)"
tableName  = symbol

symbolData = tidyquant::tq_get( symbol, from = "1800-01-01" )

symbolData2 =
  symbolData %>%
  mutate( isTradingDay = TRUE,
          TradingDayIndex = row_number()
  ) %>%
  padr::pad() %>%
  tidyr::fill( close, symbol ) %>%
  mutate( DayIndex = row_number(),
          isTradingDay = ifelse( is.na(isTradingDay), FALSE, isTradingDay ),
          isEow = ( lubridate::wday(date) == 6 ),   # 6 = Friday
          next.date = lead(date),
          isEom = ( month(date) != month(next.date) )
  ) %>%
  select( -next.date )

# Save in db
con = dbConnect( duckdb(), dbdir = DbLocation )
dbWriteTable( con, tableName, symbolData2, overwrite = TRUE, append = FALSE )
dbDisconnect( con )
```
The problem arises when I attempt to use this table. I wish to build a new table which contains the future prices starting from each date in the first table. This requires a lot of memory, so I would like to do it directly from one table to another without materialising the data into a data frame. However, dbWriteTable doesn't seem to support a lazy table, and the only solution I could find was to convert it to an arrow object first.
But this was super slow, and I'm sure there must be a better solution.
```r
library(dbplyr)
library(stringr)

con = dbConnect( duckdb(), dbdir = DbLocation )

# Build a pre-sql pipeline
PreSqlStep =
  tbl( con, tableName ) %>%
  select( date, close, DayIndex )

# convert to sql
PreSqlStep.AsSql = sql_render( PreSqlStep )

# use in a complex sql statement
ComplexSql = str_glue( .sep = " ",
  "SELECT *",
  "FROM ({PreSqlStep.AsSql}) t1",
  "LEFT JOIN",
  "(SELECT DayIndex AS FutureDayIndex, \"close\" AS FutureClose FROM \"^gspc\" ) t2",
  "ON t1.DayIndex <= t2.FutureDayIndex"
)

output =
  tbl( con, sql(ComplexSql) ) %>%
  arrange( DayIndex, FutureDayIndex ) %>%
  mutate( FutureDayIndexDelta  = FutureDayIndex - DayIndex,
          FutureCloseChangePerc = 100 * (FutureClose - close) / close
  ) %>%
  select( -c(date, close) ) %>%
  arrow::to_arrow()

dbWriteTableArrow( con, "test2", output, overwrite = TRUE )

OutputData =
  tbl( con, "test2" ) %>%
  filter( DayIndex == 1 ) %>%
  collect()

dbDisconnect( con )
```
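As an aside (not something tried in the thread), dbplyr's `compute()` can ask the backend to materialise a lazy query as a table inside DuckDB itself, which might avoid both the arrow conversion and collecting into R. A sketch, reusing `ComplexSql` and the open `con` from above before the dbDisconnect:

```r
library(dplyr)
library(dbplyr)

# Sketch only: compute() issues a CREATE TABLE ... AS for the lazy query,
# so the result stays inside DuckDB. Drop any existing "test2" table first.
tbl( con, sql(ComplexSql) ) %>%
  arrange( DayIndex, FutureDayIndex ) %>%
  mutate( FutureDayIndexDelta  = FutureDayIndex - DayIndex,
          FutureCloseChangePerc = 100 * (FutureClose - close) / close ) %>%
  select( -c(date, close) ) %>%
  compute( name = "test2", temporary = FALSE )
```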