Feature request: a public dataset #117

raybellwaves · 2020-09-29T18:08:57Z

Similar to df = dd.read_parquet('gs://dask-nyc-taxi/yellowtrip.parquet', storage_options={'token': 'anon'}).

Something like df = dd.read_parquet('az://yellowtrip.parquet', storage_options={'account_name': 'dask-nyc-taxi'})

martindurant · 2020-09-29T18:15:05Z

As far as I know, we (Anaconda) don't have any public/free data on Azure. Maybe not even any account at all. I wouldn't be surprised if MS would be happy to host this, and might even have some standard datasets around (https://azure.microsoft.com/en-ca/services/open-datasets/#overview ?)

raybellwaves · 2020-09-29T18:31:35Z

Thanks for responding.

"might even have some standard datasets around"

Yes. @hayesgb recently added support to access these datasets. I'll have a rummage to see if there is a simple tabular dataset as a parquet file that we can point to.

I also have a feature request to read the Azureml datasets using adlfs: #102. It looks like a lot of these are partitioned parquet files (https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/#AzureNotebooks).

No idea who the point of contact from Azure would be (@lostmygithubaccount?)

lostmygithubaccount · 2020-09-29T18:35:32Z

for Open Datasets it is @meyetman

you can access via adlfs with the anonymous access:

from adlfs import AzureBlobFileSystem

storage_options = {'account_name': 'azureopendatastorage'}

fs = AzureBlobFileSystem(**storage_options)
fs.ls('isdweatherdatacontainer')

martindurant · 2020-09-29T18:40:10Z

Many of them have example notebooks attached with simple blob code - but there are various file types.

https://opendatasets-api.azure.com/discoveryapi/OpenDataset/DownloadNotebook?serviceType=AzureNotebooks&package=azure-storage&registryId=us-decennial-census-zip

rabernat · 2022-01-26T14:41:10Z

@martindurant's link is broken.

I would love to see an example of simple anonymous access to ABS with adlfs. In pangeo-forge/pangeo-forge-recipes#267, I open the file https://ai4edataeuwest.blob.core.windows.net/ecmwf/20220122/18z/0p4-beta/scda/20220122180000-36h-scda-fc.grib2 with the fsspec http interface, so it is clearly "public". But I could not find a combination of options to open the same file with adlfs.

Could anyone (maybe @TomAugspurger) post an example equivalent to s3fs's anon=True?

TomAugspurger · 2022-01-26T14:44:15Z

@rabernat you provide the "storage account name", which is the <storage-account-name> part of https://<storage-account-name>.blob.core.windows.net. The "storage container" is the first component of the path. It's a bit confusing at first:

In [1]: import adlfs

In [2]: fs = adlfs.AzureBlobFileSystem("ai4edataeuwest")

In [3]: fs.ls("/ecmwf")
Out[3]:
['ecmwf/20220121',
 'ecmwf/20220122',
 'ecmwf/20220123',
 'ecmwf/20220124',
 'ecmwf/20220125',
 'ecmwf/20220126']
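
For anyone starting from a public https URL like the one in the question, the account/container split can be sketched as below. This is only an illustration: `split_blob_url` is a hypothetical helper (it is not part of adlfs), and it assumes the standard `<account>.blob.core.windows.net` host form.

```python
from urllib.parse import urlparse


def split_blob_url(url):
    """Split a public blob URL into (account_name, "container/blob-path").

    Hypothetical helper for illustration; not part of adlfs.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    suffix = ".blob.core.windows.net"
    if not host.endswith(suffix):
        raise ValueError(f"not an Azure Blob Storage URL: {url!r}")
    # The storage account name is the first label of the host.
    account_name = host[: -len(suffix)]
    # Everything after the host is "<container>/<blob path>".
    path = parsed.path.lstrip("/")
    return account_name, path


url = (
    "https://ai4edataeuwest.blob.core.windows.net/ecmwf/20220122/18z/"
    "0p4-beta/scda/20220122180000-36h-scda-fc.grib2"
)
account_name, path = split_blob_url(url)

# These two values are what adlfs-backed readers expect.
storage_options = {"account_name": account_name}
az_url = f"az://{path}"
```

The resulting `account_name` should be usable as in `adlfs.AzureBlobFileSystem(account_name)`, with `path` passed to `fs.ls` or `fs.open`. Whether anonymous access works still depends on the container being public.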

TomAugspurger · 2022-01-26T14:47:48Z

Dunno if people are still looking for a Parquet file, but this should work if you are:

In [1]: import dask.dataframe as dd

In [2]: dd.read_parquet("az://gbif/occurrence/2022-01-01/occurrence.parquet/000000", storage_options=dict(account_name="ai4edataeuwest"))
Out[2]:
Dask DataFrame Structure:
              gbifid datasetkey occurrenceid kingdom  phylum   class   order  family   genus species infraspecificepithet taxonrank scientificname verbatimscientificname verbatimscientificnameauthorship countrycode locality stateprovince occurrencestatus individualcount publishingorgkey decimallatitude decimallongitude coordinateuncertaintyinmeters coordinateprecision elevation elevationaccuracy    depth depthaccuracy eventdate    day  month   year taxonkey specieskey basisofrecord institutioncode collectioncode catalognumber recordnumber identifiedby dateidentified license rightsholder recordedby typestatus establishmentmeans lastinterpreted mediatype   issue
npartitions=1


               int64     object       object  object  object  object  object  object  object  object               object    object         object                 object                           object      object   object        object           object           int32           object         float64          float64                       float64             float64   float64           float64  float64       float64    object  int32  int32  int32    int32      int32        object          object         object        object       object       object         object  object       object     object     object             object          object    object  object
                 ...        ...          ...     ...     ...     ...     ...     ...     ...     ...                  ...       ...            ...                    ...                              ...         ...      ...           ...              ...             ...              ...             ...              ...                           ...                 ...       ...               ...      ...           ...       ...    ...    ...    ...      ...        ...           ...             ...            ...           ...
 ...          ...            ...     ...          ...        ...        ...                ...             ...       ...     ...
Dask Name: read-parquet, 1 tasks

raybellwaves mentioned this issue Sep 29, 2020

[FEA] Support reading files from blob using adlfs rapidsai/cudf#6348

Closed

raybellwaves mentioned this issue Sep 29, 2020

Feature request: use adlfs to read azureml.opendatasets #102

Closed

raybellwaves closed this as completed Sep 29, 2020
