This is a pure Julia implementation of the Apache Arrow data standard. This package provides Julia AbstractVector objects for
referencing data that conforms to the Arrow standard. This allows users to seamlessly interface Arrow-formatted data with a great deal of existing Julia code.
Please see this document for a description of the Arrow memory layout.
The package can be installed by typing the following in a Julia REPL:
julia> using Pkg; Pkg.add("Arrow")
Arrow.jl currently requires Julia 1.12+.
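Once the package is installed, a minimal round trip could look like the sketch below. It assumes only the long-standing Arrow.write and Arrow.Table entry points; the column names and values are made up for illustration.

```julia
using Arrow

# A NamedTuple of vectors is a valid Tables.jl column table.
source = (id = [1, 2, 3], name = ["a", "b", "c"])

# Serialize to an in-memory buffer, rewind, and read it back.
io = IOBuffer()
Arrow.write(io, source)
seekstart(io)
tbl = Arrow.Table(io)

# Columns come back as AbstractVectors that reference the Arrow bytes,
# so ordinary Julia code can operate on them.
id_is_vector = tbl.id isa AbstractVector
total = sum(tbl.id)
names = collect(tbl.name)
```

Because the columns are ordinary AbstractVectors, functions such as sum, collect, or broadcasting work on them without an explicit conversion step.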
When developing on Arrow.jl it is recommended that you run the following to ensure that any changes to ArrowTypes.jl are immediately available to Arrow.jl without requiring a release:
julia --project -e 'using Pkg; Pkg.develop(path="src/ArrowTypes")'
Current write-path notes:
- Arrow.tobuffer includes a direct single-partition fast path for eligible inputs
- Arrow.tobuffer(Tables.partitioner(...)) also includes a targeted direct multi-record-batch path for single-column top-level strings and single-column non-missing binary/code-units columns
- Arrow.write(io, Tables.partitioner(...)) now reuses that same targeted direct multi-record-batch path instead of always going through the legacy Writer orchestration
- Multi-column partitions, dictionary-encoded top-level columns, map-heavy inputs, and missing-binary partitions retain the existing writer path
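As an illustration of the partitioned input shape these notes describe, the sketch below writes two single-column string partitions and counts the resulting record batches. Which internal path a given input takes is an implementation detail and is not observable here; the example only assumes that Arrow.tobuffer returns a rewound IOBuffer and that Tables.jl is available in the environment.

```julia
using Arrow, Tables

# Two partitions of a single-column string table; each partition is
# written as its own record batch.
parts = Tables.partitioner([(name = ["a", "b"],), (name = ["c"],)])

# Read the serialized bytes out of the buffer returned by Arrow.tobuffer.
bytes = read(Arrow.tobuffer(parts))

# Arrow.Stream iterates record batches one at a time.
nbatches = count(_ -> true, Arrow.Stream(bytes))

# Arrow.Table presents all batches as one logical table.
tbl = Arrow.Table(bytes)
all_names = collect(tbl.name)
```

The same partitioner can be passed to Arrow.write(io, parts) when the destination is an existing IO rather than a fresh buffer.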
This implementation supports the 1.0 version of the specification, including support for:
- All primitive data types
- All nested data types
- Dictionary encodings and messages
- Dictionary-encoded CategoricalArray interop, including missing-value roundtrips through Arrow.Table, copy, and DataFrame(...; copycols=true)
- Extension types
- Lightweight schema/field metadata overlays via Arrow.withmetadata(...) for Tables.jl-compatible sources before serialization
- Base Julia Enum logical types via the JuliaLang.Enum extension label, with native Julia roundtrips back to the original enum type while convert=false and non-Julia consumers still see the primitive storage type
- View-backed Utf8/Binary columns, including recovery from under-reported variadic buffer counts by inferring the required external buffers from valid view elements
- Streaming, file, record batch, and replacement and isdelta dictionary messages
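To make the dictionary-encoding and missing-value support concrete, here is a small sketch that uses the dictencode keyword of Arrow.write. The column names and values are invented, and the example deliberately avoids CategoricalArrays and Arrow.withmetadata(...) so that it depends only on Arrow itself.

```julia
using Arrow

# Columns with repeated values and missing entries.
source = (
    label = ["low", "high", "low", missing],
    score = [1.5, 2.0, missing, 4.0],
)

# dictencode = true asks the writer to dictionary-encode the columns.
io = IOBuffer()
Arrow.write(io, source; dictencode = true)
seekstart(io)
tbl = Arrow.Table(io)

# Materialize the columns to compare them with the input;
# isequal treats missing == missing as true.
labels = collect(tbl.label)
scores = collect(tbl.score)
labels_match = isequal(labels, source.label)
scores_match = isequal(scores, source.score)
```

Dictionary encoding tends to pay off for columns with few distinct values, such as the label column here.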
It currently doesn't include support for:
- Tensor or sparse tensor IPC payload semantics; Arrow.jl now recognizes those message headers explicitly and rejects them with precise errors instead of falling through to a generic unsupported-message path
- C data interface
- Writing Run-End Encoded arrays; Arrow.jl now reads REE arrays and exposes them as read-only vectors, but still rejects REE on write paths
Flight RPC status:
- Experimental Arrow.Flight support is available in-tree
- Requires Julia 1.12+
- Includes generated protocol bindings for the FlightService RPC surface while keeping the gRPC client constructors in the modular client boundary under src/flight/client/ instead of in the generated protocol module
- Keeps the top-level Flight module shell thin, with exports and generated-protocol setup split out of src/flight/Flight.jl
- Includes high-level FlightData <-> Arrow IPC helpers for Arrow.Table, Arrow.Stream, and DoPut/DoExchange payload generation, Arrow.Flight.pathdescriptor(...) for PATH descriptors without manual proto assembly, opt-in app_metadata surfacing through include_app_metadata=true on Arrow.Flight.stream(...)/Arrow.Flight.table(...), explicit batch-wise app_metadata=... emission on Arrow.Flight.flightdata(...), Arrow.Flight.putflightdata!(...), and source-based Arrow.Flight.doexchange(...), and a reusable Arrow.Flight.withappmetadata(...) wrapper so source-level batch metadata can stay attached without manual keyword threading
- Keeps the Flight IPC conversion layer modular under src/flight/convert/, with src/flight/convert.jl retained as a thin entrypoint
- Owns Flight protocol, descriptor, IPC, and server/runtime surfaces only; package-owned interop and performance proofs run through external Python clients instead of a Julia Flight client runtime
- Includes a transport-agnostic server core (Service, ServerCallContext, ServiceDescriptor, MethodDescriptor) for local Flight method dispatch, path lookup, handler testing, packaged backend capability checks through Arrow.Flight.flight_server_backend_capabilities(...), transport-neutral gRPC-over-HTTP/2 framing helpers, high-level DoExchange assembly through Arrow.Flight.exchangeservice(...), Arrow.Flight.tableservice(...), and Arrow.Flight.streamservice(...), and source-based local invocation through Arrow.Flight.doexchange(service, context, source; ...), Arrow.Flight.table(service, context, source; ...), and Arrow.Flight.stream(service, context, source; ...)
- Keeps the transport-agnostic server core modular under src/flight/server/, with src/flight/server.jl retained as a thin entrypoint
- Includes built-in PureHTTP2.jl transport helpers in the Flight server core for package-owned h2c listeners, unary RPCs, client-streaming, server-streaming, and live bidirectional DoExchange gRPC-over-HTTP/2 handling through Arrow.Flight.purehttp2_flight_server(...); long-lived connection and handler workers now run on Julia's thread pool instead of sticky @async tasks so CPU-heavy Flight callbacks can overlap on multi-threaded runtimes, and the listener now exposes a bounded max_active_requests admission gate so overload returns a gRPC status instead of silently growing unbounded active compute work
- The packaged Flight server backend contract now reports the built-in :purehttp2 path as the only default live listener profile, retires :grpcserver, and exposes a weakdep-backed :nghttp2 profile only when Nghttp2Wrapper.jl is loaded; that backend is currently limited to unary plus buffered server-streaming methods with trailer-borne grpc-status
- Includes package-owned live Python-client coverage for authenticated ListFlights, GetFlightInfo, GetSchema, DoGet, DoPut, DoExchange, ListActions, and DoAction through test/flight_purehttp2.jl
- Keeps targeted Flight verification modular under test/flight/, with test/flight.jl retained as the shared default entrypoint for generated protocol, server-core, and IPC coverage, and the PureHTTP2/nghttp2 listener proofs isolated in dedicated runner files
- Includes test/flight_purehttp2.jl as the PureHTTP2-first temporary-environment runner for shared Flight interop coverage plus package-owned listener proofs
- Includes test/flight_purehttp2_perf.jl as a focused large-transport runner that benchmarks large-response DoGet on the package-owned PureHTTP2 listener through a reusable backend-factory seam
- Includes test/flight_nghttp2_probe.jl as a substrate probe that verifies Nghttp2Wrapper.jl exports the low-level session / callback / submit hooks needed for the Flight adapter, proves a small PureHTTP2 client interop smoke against Nghttp2Wrapper.HTTP2Server, and measures a raw 2 MiB h2c response on the C-wrapper server without widening default CI
- Includes test/flight_nghttp2.jl as the focused weakdep-backed nghttp2 listener runner; it proves live Python-client unary plus server-streaming Flight calls over Nghttp2Wrapper.jl and prints same-harness large DoGet comparison numbers against the default PureHTTP2 backend
- The current nghttp2 backend still does not support request-streaming Handshake, DoPut, or DoExchange, so PureHTTP2 remains the canonical package-owned backend and test/flight_purehttp2_perf.jl remains the default large-transport proof for the product lane
- Handshake token propagation and PollFlightInfo currently remain server-core/local proofs because the current external Python client surfaces used in tests do not cover those contracts directly
- Dedicated CI jobs now exercise the Flight interop suite on stable and nightly Linux through test/flight_purehttp2.jl; the built-in PureHTTP2 server substrate is the package-owned live backend direction, with Python-client smoke coverage on the same listener surface
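For readers unfamiliar with the gRPC-over-HTTP/2 framing mentioned above, the following standalone sketch shows the standard gRPC length-prefixed message format: one compression-flag byte, a 4-byte big-endian length, then the payload. The helper names grpc_frame and grpc_unframe are hypothetical and are not taken from Arrow.Flight; the actual framing helpers in the package may be shaped differently.

```julia
# Hypothetical helper names for illustration only; not Arrow.Flight API.

# Prefix a payload with the 5-byte gRPC message header (uncompressed).
function grpc_frame(payload::Vector{UInt8})
    io = IOBuffer()
    write(io, 0x00)                           # compression flag: 0 = uncompressed
    write(io, hton(UInt32(length(payload))))  # 4-byte big-endian payload length
    write(io, payload)
    return take!(io)
end

# Split a buffer holding zero or more complete gRPC frames into payloads.
function grpc_unframe(buf::Vector{UInt8})
    payloads = Vector{Vector{UInt8}}()
    io = IOBuffer(buf)
    while !eof(io)
        flag = read(io, UInt8)
        flag == 0x00 || error("compressed gRPC messages are not handled in this sketch")
        len = Int(ntoh(read(io, UInt32)))
        bytesavailable(io) >= len || error("truncated gRPC frame")
        push!(payloads, read(io, len))
    end
    return payloads
end

# Two frames back to back: a 3-byte payload and an empty payload.
framed = vcat(grpc_frame(UInt8[0x01, 0x02, 0x03]), grpc_frame(UInt8[]))
payloads = grpc_unframe(framed)
```

In a Flight exchange each payload would carry one serialized protobuf message such as FlightData; the sketch stops at the byte-framing layer and does not decode protobuf.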
Third-party data formats:
- CSV, parquet and avro support via the existing CSV.jl, Parquet.jl and Avro.jl packages
- Other Tables.jl-compatible packages automatically supported (DataFrames.jl, JSONTables.jl, JuliaDB.jl, SQLite.jl, MySQL.jl, JDBC.jl, ODBC.jl, XLSX.jl, etc.)
- No current Julia packages support ORC
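The Tables.jl interop noted above works in both directions, which the sketch below illustrates with a plain row table rather than a third-party package. The data is invented; a CSV.jl or DataFrames.jl source would be passed to Arrow.write in the same position, assuming those packages are installed.

```julia
using Arrow, Tables

# A vector of NamedTuples is a Tables.jl row table.
rows = [(city = "Oslo", temp = 4.5), (city = "Lima", temp = 19.0)]

io = IOBuffer()
Arrow.write(io, rows)
seekstart(io)
tbl = Arrow.Table(io)

# Arrow.Table is itself a Tables.jl source, so generic Tables.jl
# functions can consume it.
roundtrip = Tables.rowtable(tbl)
cities = [r.city for r in roundtrip]
temps = [r.temp for r in roundtrip]
```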
Canonical extension highlights:
- UUID now writes the canonical arrow.uuid extension name by default while retaining reader compatibility with legacy JuliaLang.UUID metadata
- Arrow.TimestampWithOffset{U} provides a canonical arrow.timestamp_with_offset logical type without conflating offset-only semantics with ZonedDateTime
- Arrow.Bool8 provides an explicit opt-in writer/reader surface for the canonical arrow.bool8 extension without changing the default packed-bit Bool path
- Arrow.JSONText{String} provides a text-backed logical type for the canonical arrow.json extension without parsing payloads during read or write
- arrow.opaque now reads as the underlying storage type without warning, and explicit writer metadata can be generated with Arrow.opaquemetadata(type_name, vendor_name)
- Arrow.variantmetadata(), Arrow.fixedshapetensormetadata(...), and Arrow.variableshapetensormetadata(...) generate canonical metadata strings for advanced canonical extensions
- arrow.fixed_shape_tensor and arrow.variable_shape_tensor are recognized on read as canonical passthrough extensions over their storage types, and Arrow.jl now validates their canonical metadata plus top-level storage shape before accepting them
- arrow.parquet.variant is recognized on read as a canonical passthrough extension over its storage type; Arrow.jl currently validates that its canonical metadata is the required empty string, but does not yet implement deeper variant semantics or an automatic writer surface
- Legacy JuliaLang.ZonedDateTime-UTC and JuliaLang.ZonedDateTime files remain readable for backward compatibility
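As a small illustration of an extension-typed column, the sketch below round-trips UUID values. It checks only that the values come back as UUIDs; it does not inspect which extension name was written, and the UUID literals are arbitrary.

```julia
using Arrow, UUIDs

# Fixed UUID values keep the example deterministic.
ids = [
    UUID("123e4567-e89b-12d3-a456-426614174000"),
    UUID("00000000-0000-0000-0000-000000000001"),
]

io = IOBuffer()
Arrow.write(io, (id = ids,))
seekstart(io)
tbl = Arrow.Table(io)

# With default conversion, elements are returned as Julia UUIDs
# rather than as the underlying fixed-size storage.
first_id = tbl.id[1]
all_ids = collect(tbl.id)
```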
See the full documentation for details on reading and writing Arrow data.