Feature request: a public dataset #117
As far as I know, we (Anaconda) don't have any public/free data on Azure. Maybe not even an account at all. I wouldn't be surprised if MS would be happy to host this, and they might even have some standard datasets around (https://azure.microsoft.com/en-ca/services/open-datasets/#overview?).
Thanks for responding.
Yes. @hayesgb recently added support to access these datasets. I'll have a rummage to see if there is a simple tabular dataset as a parquet file that we can point to. I also have a feature request to read the Azureml datasets, e.g. here, using adlfs: #102. It looks like a lot of these are partitioned parquet files (https://azure.microsoft.com/en-us/services/open-datasets/catalog/nyc-taxi-limousine-commission-green-taxi-trip-records/#AzureNotebooks). No idea who the point of contact from Azure would be (@lostmygithubaccount?).
For Open Datasets it is @meyetman. You can access them via adlfs with anonymous access:

```python
from adlfs import AzureBlobFileSystem

storage_options = {'account_name': 'azureopendatastorage'}
fs = AzureBlobFileSystem(**storage_options)
fs.ls('isdweatherdatacontainer')
```
Many of them have example notebooks attached with simple blob code, but there are various file types.
@martindurant's link is broken. I would love to see an example of simple anonymous access to ABS with adlfs. In pangeo-forge/pangeo-forge-recipes#267, I open the file https://ai4edataeuwest.blob.core.windows.net/ecmwf/20220122/18z/0p4-beta/scda/20220122180000-36h-scda-fc.grib2 with the fsspec http interface, so it is clearly "public". But I could not find a combination of options to open the same file with adlfs. Could anyone (maybe @TomAugspurger) post an example equivalent to s3fs's anonymous access?
@rabernat you provide the "storage account name", which is the first component of the blob URL's hostname (ai4edataeuwest in your example).
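To make the mapping concrete, here is a small sketch (the helper name is mine, purely illustrative) of how an Azure Blob Storage https URL decomposes into the pieces adlfs expects: the storage account name comes from the first hostname component, and the rest of the path is the container-qualified blob path.

```python
from urllib.parse import urlparse

def abs_url_to_account_and_path(url):
    """Split an ABS https URL of the form
    https://<account>.blob.core.windows.net/<container>/<blob-path>
    into (account_name, "container/blob-path") for use with adlfs."""
    parsed = urlparse(url)
    account = parsed.netloc.split(".")[0]   # first hostname component
    path = parsed.path.lstrip("/")          # container + blob path
    return account, path

account, path = abs_url_to_account_and_path(
    "https://ai4edataeuwest.blob.core.windows.net/ecmwf/20220122/18z/"
    "0p4-beta/scda/20220122180000-36h-scda-fc.grib2"
)
# account == "ai4edataeuwest"
# path   == "ecmwf/20220122/18z/0p4-beta/scda/20220122180000-36h-scda-fc.grib2"
# Then, anonymously:
#   fs = AzureBlobFileSystem(account_name=account)
#   f = fs.open(path)
```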
Dunno if people are still looking for a Parquet file, but this should work if you are:

```python
In [1]: import dask.dataframe as dd

In [2]: dd.read_parquet("az://gbif/occurrence/2022-01-01/occurrence.parquet/000000", storage_options=dict(account_name="ai4edataeuwest"))
Out[2]:
Dask DataFrame Structure:
              gbifid datasetkey occurrenceid kingdom phylum class order family genus species infraspecificepithet taxonrank scientificname verbatimscientificname verbatimscientificnameauthorship countrycode locality stateprovince occurrencestatus individualcount publishingorgkey decimallatitude decimallongitude coordinateuncertaintyinmeters coordinateprecision elevation elevationaccuracy depth depthaccuracy eventdate day month year taxonkey specieskey basisofrecord institutioncode collectioncode catalognumber recordnumber identifiedby dateidentified license rightsholder recordedby typestatus establishmentmeans lastinterpreted mediatype issue
npartitions=1
              int64 object object object object object object object object object object object object object object object object object object int32 object float64 float64 float64 float64 float64 float64 float64 float64 object int32 int32 int32 int32 int32 object object object object object object object object object object object object object object object
              ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Dask Name: read-parquet, 1 tasks
```
Similar to

```python
df = dd.read_parquet('gs://dask-nyc-taxi/yellowtrip.parquet', storage_options={'token': 'anon'})
```

Something like

```python
df = dd.read_parquet('az://yellowtrip.parquet', storage_options={'account_name': 'dask-nyc-taxi'})
```

cc @martindurant @TomAugspurger
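To summarize the pattern discussed in this thread, here is an illustrative sketch (the dict and helper are mine, not part of any library) of how "anonymous" public access is spelled for each fsspec backend mentioned. For adlfs, supplying only the account name with no credential gives anonymous reads; s3fs and gcsfs each need an explicit anonymous token.

```python
# Per-backend storage_options for anonymous/public reads via fsspec.
# The adlfs entry uses the public Open Datasets account from earlier
# in this thread as an example value.
ANON_STORAGE_OPTIONS = {
    "s3": {"anon": True},                             # s3fs
    "gs": {"token": "anon"},                          # gcsfs
    "az": {"account_name": "azureopendatastorage"},   # adlfs: no credential => anonymous
}

def anon_options(protocol):
    """Look up the anonymous-access storage_options for a URL protocol."""
    return ANON_STORAGE_OPTIONS[protocol]
```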