Querying files
TL;DR
How to query a file?
This feature is only available in the shell mode. If you want similar features in the MySQL server, you need to create a vtable.
JSON
Run SELECT * FROM read_json('path/to/file.json')
in your terminal.
CSV
Run SELECT * FROM read_csv('path/to/file.csv')
in your terminal.
Parquet
Run SELECT * FROM read_parquet('path/to/file.parquet')
in your terminal.
YAML
Run SELECT * FROM read_yaml('path/to/file.yaml')
in your terminal.
TOML
Run SELECT * FROM read_toml('path/to/file.toml')
in your terminal.
Introduction
Anyquery is able to run SQL queries on JSON, CSV, Parquet, YAML, and TOML files. The shell mode provides syntactic sugar to query these files. In the MySQL server, you need to create a vtable to query these files which is explained here.
Remote files
While you can query local files, you can also query remote files. You can query files from HTTP, HTTPS, S3, and GCS. The syntax is the same as querying local files.
HTTPS
S3
The aws_access_key_id
, aws_access_key_secret
, region
(optional), and version
(optional) can be passed as query parameters. Anyquery will also attempt to read them from the environment variables and ~/.aws/config
.
GCS
To query a file from GCS, you need to set GOOGLE_APPLICATION_CREDENTIALS
to the path of your service account key or GOOGLE_CREDENTIALS
to the content of your service account key.
All files will be cached in the local filesystem (for 24 hours) to avoid downloading them multiple times. To clear the cache, you can use the clear_file_cache
function.
Stdin
You can also query files from stdin. The syntax is the same as querying local files.
File formats
JSON
To query a JSON file, you need to use the read_json
function. The function takes one or two arguments. The first argument is the path to the JSON file. The second argument is optional and is the JSON path to the data you want to query.
You can also specify the parameters with named arguments.
Shapes supported
The following shapes are supported:
records
:
columns
:
object
:
In this case, there is only one row, and each key is a column.
CSV
To query a CSV file, you need to use the read_csv
function. The function takes one, two, or three arguments. The first argument is the path to the CSV file. The second argument is optional and is if the first row is a header. The third argument is optional and is the delimiter.
You also specify a schema for the CSV file, and any query will make the best effort to parse the file according to the schema.
TSV
To query a TSV file, use the read_csv
function with the delimiter set to \t
.
HTML
You can query HTML tables using the read_html
function. The function takes two arguments. The first argument is the URL of the HTML page. The second argument is the selector of the table.
If the CSS selector points to an element that is not a table, it will return all elements that match the selector.
To extract an attribute from an element, you can use the ->>
operator to access the JSON object’s attribute.
You can also pass the parameters cache
, cache_ttl
, or ttl
to the read_html
function to reduce the time-to-live of the cache. Specify the time in seconds enclosed in single quotes.
By default, the cache is set to 60 seconds for the read_html
table. This means that if a query is run within 60 seconds of the first query with the same URL, the result will be fetched from the cache and not from the actual website.
Parquet
To query a Parquet file, you need to use the read_parquet
function. The function takes one argument which is the path to the Parquet file.
YAML
To query a YAML file, you need to use the read_yaml
function. The function takes one argument, which is the path to the YAML file.
Each key in the YAML file represents a column. Therefore, only one row is returned. This structure is similar to the objects
shape in JSON.
TOML
To query a TOML file, you need to use the read_toml
function. The function takes one argument, which is the path to the TOML file.
Each key in the TOML file is a column. Therefore, only one row is returned. It’s similar to the objects
shape in JSON.
MySQL server
To query files in the MySQL server, you need to create a virtual table. It’s a table that points to the file. The virtual table is created using the CREATE VIRTUAL TABLE
statement. It uses the same arguments as the shell mode.
Each table named read_file_format in the shell has a corresponding file_format_reader table in the MySQL server. file_format_reader tables are also available in the shell mode.
Limitations
- You cannot observe the schema of the file using the
PRAGMA table_info
orDESCRIBE
statement. This is due toanyquery
rewriting your query on the fly to create a temporary virtual table for the duration of the query. To observe the schema, you need to create a virtual table as specified in the MySQL server section. - You cannot use the
CREATE VIEW
statement with theread_*
functions.