
Feature request: a public dataset #117

Closed
raybellwaves opened this issue Sep 29, 2020 · 7 comments

@raybellwaves
Contributor

Similar to df = dd.read_parquet('gs://dask-nyc-taxi/yellowtrip.parquet', storage_options={'token': 'anon'}).

Something like df = dd.read_parquet('az://yellowtrip.parquet', storage_options={'account_name': 'dask-nyc-taxi'})

cc. @martindurant @TomAugspurger

@martindurant
Member

As far as I know, we (Anaconda) don't have any public/free data on Azure. Maybe not even any account at all. I wouldn't be surprised if MS would be happy to host this, and they might even have some standard datasets around (https://azure.microsoft.com/en-ca/services/open-datasets/#overview ?)

@raybellwaves
Contributor Author

Thanks for responding.

> might even have some standard datasets around

Yes. @hayesgb recently added support to access these datasets. I'll have a rummage to see if there is a simple tabular dataset as a parquet file that we can point to.

I also have a feature request to read the Azureml datasets e.g. here using adlfs: #102. It looks like a lot of these are partitioned parquet files (https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/#AzureNotebooks).

No idea who the point of contact from Azure would be (@lostmygithubaccount?)

@lostmygithubaccount

for Open Datasets it is @meyetman

you can access via adlfs with the anonymous access:

from adlfs import AzureBlobFileSystem

storage_options = {'account_name': 'azureopendatastorage'}

fs = AzureBlobFileSystem(**storage_options)
fs.ls('isdweatherdatacontainer')

@martindurant
Member

Many of them have example notebooks attached with simple blob code - but there are various file types.

https://opendatasets-api.azure.com/discoveryapi/OpenDataset/DownloadNotebook?serviceType=AzureNotebooks&package=azure-storage&registryId=us-decennial-census-zip

@rabernat

rabernat commented Jan 26, 2022

@martindurant's link is broken.

I would love to see an example of simple anonymous access to ABS with adlfs. In pangeo-forge/pangeo-forge-recipes#267, I open the file https://ai4edataeuwest.blob.core.windows.net/ecmwf/20220122/18z/0p4-beta/scda/20220122180000-36h-scda-fc.grib2 with the fsspec http interface, so it is clearly "public". But I could not find a combination of options to open the same file with adlfs.

Could anyone (maybe @TomAugspurger) post an example equivalent to s3fs's anon=True?

@TomAugspurger
Contributor

@rabernat you provide the "storage account name", which is the <storage-account-name> part of https://<storage-account-name>.blob.core.windows.net. The "storage container" is the first component of the path. It's a bit confusing at first:

In [1]: import adlfs

In [2]: fs = adlfs.AzureBlobFileSystem("ai4edataeuwest")

In [3]: fs.ls("/ecmwf")
Out[3]:
['ecmwf/20220121',
 'ecmwf/20220122',
 'ecmwf/20220123',
 'ecmwf/20220124',
 'ecmwf/20220125',
 'ecmwf/20220126']
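
For anyone starting from a public HTTPS blob URL, here is a minimal sketch of the account/container split described above, using only the standard library. The helper name `split_blob_url` is hypothetical (it is not part of adlfs), and it assumes the usual `<account>.blob.core.windows.net` host form:

```python
from urllib.parse import urlparse

# Assumption: public blob URLs look like
# https://<account>.blob.core.windows.net/<container>/<blob path>
BLOB_HOST_SUFFIX = ".blob.core.windows.net"


def split_blob_url(url):
    """Hypothetical helper: return (account_name, path) for a blob HTTPS URL.

    `path` is "<container>/<blob path>", the form passed to fs.ls() above.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if not host.endswith(BLOB_HOST_SUFFIX):
        raise ValueError(f"not an Azure Blob Storage URL: {url}")
    # The storage account name is the host with the fixed suffix removed.
    account_name = host[: -len(BLOB_HOST_SUFFIX)]
    # The container is the first path component; keep it in the path.
    path = parsed.path.lstrip("/")
    return account_name, path


url = (
    "https://ai4edataeuwest.blob.core.windows.net/ecmwf/20220122/18z/"
    "0p4-beta/scda/20220122180000-36h-scda-fc.grib2"
)
account_name, path = split_blob_url(url)
storage_options = {"account_name": account_name}  # no credentials supplied
az_url = f"az://{path}"
```

The resulting `account_name` would go to `adlfs.AzureBlobFileSystem(...)` and `path` to calls such as `fs.ls(...)`, while `az_url` plus `storage_options` is the form used with fsspec-aware readers. Whether anonymous access succeeds still depends on the container being public.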

@TomAugspurger
Contributor

Dunno if people are still looking for a Parquet file, but this should work if you are:

In [1]: import dask.dataframe as dd

In [2]: dd.read_parquet("az://gbif/occurrence/2022-01-01/occurrence.parquet/000000", storage_options=dict(account_name="ai4edataeuwest"))
Out[2]:
Dask DataFrame Structure:
              gbifid datasetkey occurrenceid kingdom  phylum   class   order  family   genus species infraspecificepithet taxonrank scientificname verbatimscientificname verbatimscientificnameauthorship countrycode locality stateprovince occurrencestatus individualcount publishingorgkey decimallatitude decimallongitude coordinateuncertaintyinmeters coordinateprecision elevation elevationaccuracy    depth depthaccuracy eventdate    day  month   year taxonkey specieskey basisofrecord institutioncode collectioncode catalognumber recordnumber identifiedby dateidentified license rightsholder recordedby typestatus establishmentmeans lastinterpreted mediatype   issue
npartitions=1


               int64     object       object  object  object  object  object  object  object  object               object    object         object                 object                           object      object   object        object           object           int32           object         float64          float64                       float64             float64   float64           float64  float64       float64    object  int32  int32  int32    int32      int32        object          object         object        object       object       object         object  object       object     object     object             object          object    object  object
                 ...        ...          ...     ...     ...     ...     ...     ...     ...     ...                  ...       ...            ...                    ...                              ...         ...      ...           ...              ...             ...              ...             ...              ...                           ...                 ...       ...               ...      ...           ...       ...    ...    ...    ...      ...        ...           ...             ...            ...           ...
 ...          ...            ...     ...          ...        ...        ...                ...             ...       ...     ...
Dask Name: read-parquet, 1 tasks
