
These functions ingest data from a file. In many cases, they return immediately because they only read the metadata. The data itself is read only when it is processed.

read_parquet_duckdb() reads a Parquet file using DuckDB's read_parquet() table function.

read_csv_duckdb() reads a CSV file using DuckDB's read_csv_auto() table function.

read_json_duckdb() reads a JSON file using DuckDB's read_json() table function.

read_file_duckdb() uses arbitrary readers to read data. See https://duckdb.org/docs/data/overview for documentation of the available functions and their options. To read multiple files with the same schema, pass a wildcard or a character vector to the path argument.
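As a minimal sketch of reading several files with the same schema, the following writes two small CSV files and reads them once through a glob pattern and once through a character vector of paths. The file names part1.csv and part2.csv and the temporary directory are made up for illustration, and dplyr::collect() is called with an explicit namespace so that the snippet does not depend on which packages are attached.

```r
library(duckplyr)

# Two small CSV files with the same schema, in a fresh temporary directory
# (directory and file names are hypothetical)
dir <- tempfile("duckplyr_multi_")
dir.create(dir)
p1 <- file.path(dir, "part1.csv")
p2 <- file.path(dir, "part2.csv")
write.csv(data.frame(a = 1:2, b = c("u", "v")), p1, row.names = FALSE)
write.csv(data.frame(a = 3:4, b = c("w", "x")), p2, row.names = FALSE)

# Option 1: a glob pattern that matches both files
from_glob <- read_csv_duckdb(file.path(dir, "part*.csv"))

# Option 2: a character vector of explicit paths
from_vector <- read_csv_duckdb(c(p1, p2))

# Materialize explicitly and count the combined rows
n_glob <- nrow(dplyr::collect(from_glob))
n_vector <- nrow(dplyr::collect(from_vector))
```

Both frames should contain the rows of both files. The order of rows across files is not something this sketch relies on.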

Usage

read_parquet_duckdb(
  path,
  ...,
  prudence = c("thrifty", "lavish", "stingy"),
  options = list()
)

read_csv_duckdb(
  path,
  ...,
  prudence = c("thrifty", "lavish", "stingy"),
  options = list()
)

read_json_duckdb(
  path,
  ...,
  prudence = c("thrifty", "lavish", "stingy"),
  options = list()
)

read_file_duckdb(
  path,
  table_function,
  ...,
  prudence = c("thrifty", "lavish", "stingy"),
  options = list()
)

Arguments

path

Path to files. The glob patterns * and ? are supported.

...

These dots are for future extensions and must be empty.

prudence

Memory protection: controls whether DuckDB may convert intermediate results in DuckDB-managed memory to data frames in R memory.

  • "thrifty": up to a maximum size of 1 million cells,

  • "lavish": regardless of size,

  • "stingy": never.

The default is "thrifty" for the ingestion functions, and may be different for other functions. See vignette("prudence") for more information.
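To illustrate the strictest setting, here is a small sketch that reads a CSV file with prudence = "stingy". Under that setting, column names remain available, direct column access is expected to fail, and explicit materialization with collect() still works. The file contents mirror the Examples section, and the try() wrapper is only there so that the expected error does not stop the script.

```r
library(duckplyr)

# Small CSV file, same shape as in the Examples section
path <- tempfile("duckplyr_test_", fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[4:6]), path, row.names = FALSE)

df <- read_csv_duckdb(path, prudence = "stingy")

# Metadata such as column names is available without materialization
col_names <- names(df)

# Direct column access would materialize the data, so it is expected to fail
res <- try(df$a, silent = TRUE)
access_failed <- inherits(res, "try-error")

# Explicit materialization is always allowed
a_values <- dplyr::collect(df)$a
```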

options

Arguments to the DuckDB function indicated by table_function.

table_function

The name of a table-valued DuckDB function such as "read_parquet", "read_csv", "read_csv_auto" or "read_json".
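As a sketch of how table_function and options fit together, the snippet below reads the same CSV file once through the generic read_file_duckdb() with table_function = "read_csv" and once through read_csv_duckdb(). The delim option is taken from the Examples section; the expectation that both calls yield the same columns is an assumption of this sketch rather than a documented guarantee.

```r
library(duckplyr)

# Small CSV file, same shape as in the Examples section
path <- tempfile("duckplyr_test_", fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[4:6]), path, row.names = FALSE)

# Generic reader: the table function is named explicitly,
# and its arguments are passed via `options`
df_generic <- read_file_duckdb(path, "read_csv", options = list(delim = ","))

# Dedicated wrapper with the same options
df_csv <- read_csv_duckdb(path, options = list(delim = ","))

# Compare the column names of both results
same_names <- identical(names(df_generic), names(df_csv))
```

The generic form is mainly useful for table functions that have no dedicated wrapper.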

Value

A duckplyr frame, see as_duckdb_tibble() for details.

Fine-tuning prudence

[Experimental]

The prudence argument can also be a named numeric vector with at least one of cells or rows to limit the cells (values) and rows in the resulting data frame after automatic materialization. If both limits are specified, both are enforced. The equivalent of "thrifty" is c(cells = 1e6).
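The numeric form can be sketched as follows. The first frame uses c(cells = 1e6), the stated equivalent of "thrifty", so accessing a column of a six-cell file should succeed. The second frame uses a deliberately tiny limit of two rows, chosen only for illustration, so that accessing a column of the three-row file is expected to fail.

```r
library(duckplyr)

# Small CSV file with 3 rows and 2 columns (6 cells)
path <- tempfile("duckplyr_test_", fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[4:6]), path, row.names = FALSE)

# Equivalent of "thrifty": automatic materialization up to 1 million cells
df_cells <- read_csv_duckdb(path, prudence = c(cells = 1e6))
a_values <- df_cells$a

# Illustrative row limit below the size of the file:
# automatic materialization is expected to be refused
df_rows <- read_csv_duckdb(path, prudence = c(rows = 2))
res <- try(df_rows$a, silent = TRUE)
row_limit_hit <- inherits(res, "try-error")
```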

Examples

# Create simple CSV file
path <- tempfile("duckplyr_test_", fileext = ".csv")
write.csv(data.frame(a = 1:3, b = letters[4:6]), path, row.names = FALSE)

# Reading is immediate
df <- read_csv_duckdb(path)

# Names are always available
names(df)
#> [1] "a" "b"

# With the default prudence = "thrifty", small results are materialized upon access
try(print(df$a))
#> [1] 1 2 3

# Materialize explicitly
collect(df)$a
#> [1] 1 2 3

# Automatic materialization with prudence = "lavish"
df <- read_csv_duckdb(path, prudence = "lavish")
df$a
#> [1] 1 2 3

# Specify column types
read_csv_duckdb(
  path,
  options = list(delim = ",", types = list(c("DOUBLE", "VARCHAR")))
)
#> # A duckplyr data frame: 2 variables
#>       a b    
#>   <dbl> <chr>
#> 1     1 d    
#> 2     2 e    
#> 3     3 f    

# Create and read a simple JSON file
path <- tempfile("duckplyr_test_", fileext = ".json")
writeLines('[{"a": 1, "b": "x"}, {"a": 2, "b": "y"}]', path)

# Reading needs the json extension
db_exec("INSTALL json")
db_exec("LOAD json")
read_json_duckdb(path)
#> # A duckplyr data frame: 2 variables
#>       a b    
#>   <dbl> <chr>
#> 1     1 x    
#> 2     2 y
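The examples above cover CSV and JSON. A comparable sketch for Parquet follows. It assumes that the installed duckplyr version provides duckdb_tibble() and compute_parquet() for writing the test file; any other tool that writes Parquet would serve equally well.

```r
library(duckplyr)

# Write a small Parquet file from a duckplyr frame
# (assumes compute_parquet() and duckdb_tibble() are available)
pq_path <- tempfile("duckplyr_test_", fileext = ".parquet")
compute_parquet(duckdb_tibble(a = 1:3, b = letters[4:6]), pq_path)

# Reading is immediate, only metadata is needed at this point
pq <- read_parquet_duckdb(pq_path)
pq_names <- names(pq)

# Materialize explicitly to inspect the data
pq_rows <- nrow(dplyr::collect(pq))
```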