Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-13168: [C++][R] Enable runtime timezone database for Windows #12536

Closed
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
7c8cbc2
First pass at R bindings
wjones127 Feb 28, 2022
d92b1ab
Enable R timezone unit tests
wjones127 Mar 1, 2022
bf82303
Get timezone tests passing in R; locale still a problem
wjones127 Mar 2, 2022
be6b344
Force Windows time locale to be C
wjones127 Mar 2, 2022
a1be197
Try to fix onLoad warning
wjones127 Mar 3, 2022
89ce1bf
Try requireNamespace
wjones127 Mar 3, 2022
191885c
Add tzdb download to CI
wjones127 Mar 8, 2022
b58d102
First pass at R bindings
wjones127 Feb 28, 2022
05d1dee
Enable R timezone unit tests
wjones127 Mar 1, 2022
48c5584
Get timezone tests passing in R; locale still a problem
wjones127 Mar 2, 2022
647e3ac
Force Windows time locale to be C
wjones127 Mar 2, 2022
22fc930
PR feedback on R bindings
wjones127 Mar 8, 2022
1c609aa
More fixes for CI scripts
wjones127 Mar 8, 2022
64af70c
Add section to developer docs for downloading
wjones127 Mar 8, 2022
410c2e5
Add new test to illuminate issue
wjones127 Mar 9, 2022
5a33d30
Make tzdb test skip if not found in downloads
wjones127 Mar 10, 2022
ab5d038
Get other test running
wjones127 Mar 10, 2022
668c8cc
Fix some failures
wjones127 Mar 10, 2022
32df619
Just skip the locale test for now
wjones127 Mar 11, 2022
7ebf6a8
Cleanup
wjones127 Mar 11, 2022
42f0a6a
Adjust script to avoid tar Windows drive issue
wjones127 Mar 14, 2022
baa2263
Do a better job of skipping test
wjones127 Mar 14, 2022
7c09080
Document for users how to get tzdb
wjones127 Mar 14, 2022
d883a48
Add underscore to WIN32
wjones127 Mar 21, 2022
60ccf65
Update docs/source/cpp/build_system.rst
wjones127 Mar 23, 2022
ece8ce0
Update cpp/src/arrow/public_api_test.cc
wjones127 Mar 23, 2022
7da26f6
Address PR feedback
wjones127 Mar 23, 2022
2471139
Link docs
wjones127 Mar 23, 2022
843cdb6
Use environment variable for public api tests
wjones127 Mar 24, 2022
29dc875
Allow tests set location via environment variable
wjones127 Mar 24, 2022
c78d3cd
Document environment variable
wjones127 Mar 24, 2022
fe6469d
Cleanup
wjones127 Mar 24, 2022
9f97016
Apply suggestions from code review
wjones127 Mar 25, 2022
ad161ae
Fix type that led to empty string
wjones127 Mar 25, 2022
62e8178
Update r/R/arrow-package.R
wjones127 Mar 25, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions .github/workflows/cpp.yml
Original file line number Diff line number Diff line change
Expand Up @@ -239,6 +239,9 @@ jobs:
with:
fetch-depth: 0
submodules: recursive
- name: Download Timezone Database
shell: bash
run: ci/scripts/download_tz_database.sh
- name: Build
shell: bash
run: ci/scripts/cpp_build.sh $(pwd) $(pwd)/build
Expand Down Expand Up @@ -319,6 +322,9 @@ jobs:
run: |
export CMAKE_BUILD_PARALLEL_LEVEL=$NUMBER_OF_PROCESSORS
ci/scripts/cpp_build.sh "$(pwd)" "$(pwd)/build"
- name: Download Timezone Database
shell: bash
run: ci/scripts/download_tz_database.sh
- name: Download MinIO
shell: msys2 {0}
run: |
Expand Down
14 changes: 14 additions & 0 deletions ci/appveyor-cpp-setup.bat
Original file line number Diff line number Diff line change
Expand Up @@ -115,3 +115,17 @@ powershell.exe -Command "Start-Process clcache-server" || exit /B
if "%ARROW_S3%" == "ON" (
appveyor DownloadFile https://dl.min.io/server/minio/release/windows-amd64/minio.exe -FileName C:\Windows\Minio.exe || exit /B
)


@rem
@rem Download IANA Timezone Database for unit tests
@rem
@rem (Doc section: Download timezone database)
curl https://data.iana.org/time-zones/releases/tzdata2021e.tar.gz --output tzdata.tar.gz
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this always be the same DB that R and Python will be using?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No that 2021e is a version, which won't necessarily align with the R and Python ones in the future. I don't think it matters that we update it, unless the format changes somehow. This is just for testing that it works, and we don't ship it.

But the R unit tests use the one provided by the tzdb package, so we are testing that as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently US and EU have DST, but they plan to abolish it soon. At that point we will have Python and R DBs with fresh DSTless times and arrow c++ using DST for tests. In isolation that's ok, but we do have tests comparing pandas and pyarrow results and similar for lubridate. It's not a big problem for sure, but if there is like a tzdata-latest.tar.gz that would be great.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The times in the past won't change, so unless we test against a random "now", updates to the timezone database (such as for DST policy changes) shouldn't impact those tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah this should be fine as is. We will never be testing between timezone databases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We currently test arrow vs pandas tzdb in CI. We have a test for a timestamp in 2033 and if it is in DST and DST is abolished it will error if we'll be using a pre-abolishment db with arrow and post-abolishment db with pandas. This is super irrelevant and as it's a simple fix and I'm sorry for wasting your time :).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah got it. We'll address in the follow up PR where we'll have PyArrow use tzdb. For now the python timestamp tests are skipped on Windows

if sys.platform == 'win32':
# TODO: We should test on windows once ARROW-13168 is resolved.
pytest.skip('Timezone database is not available on Windows yet')

mkdir tzdata
tar --extract --file tzdata.tar.gz --directory tzdata
move tzdata %USERPROFILE%\Downloads\tzdata
@rem Also need Windows timezone mapping
curl https://raw.githubusercontent.com/unicode-org/cldr/master/common/supplemental/windowsZones.xml ^
--output %USERPROFILE%\Downloads\tzdata\windowsZones.xml
@rem (Doc section: Download timezone database)
30 changes: 30 additions & 0 deletions ci/scripts/download_tz_database.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

set -ex

# Download database
curl https://data.iana.org/time-zones/releases/tzdata2021e.tar.gz --output ~/Downloads/tzdata2021e.tar.gz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, they just released 2022a a few days ago


# Extract
mkdir -p ~/Downloads/tzdata
tar --extract --file ~/Downloads/tzdata2021e.tar.gz --directory ~/Downloads/tzdata

# Download Windows timezone mapping
curl https://raw.githubusercontent.com/unicode-org/cldr/master/common/supplemental/windowsZones.xml --output ~/Downloads/tzdata/windowsZones.xml
7 changes: 0 additions & 7 deletions cpp/src/arrow/compute/kernels/scalar_cast_string.cc
Original file line number Diff line number Diff line change
Expand Up @@ -150,12 +150,6 @@ struct TemporalToStringCastFunctor<O, TimestampType> {
return Status::OK();
}));
} else {
#ifdef _WIN32
// TODO(ARROW-13168):
return Status::NotImplemented(
"Casting a timestamp with time zone to string is not yet supported on "
"Windows.");
#else
switch (ty.unit()) {
case TimeUnit::SECOND:
RETURN_NOT_OK(ConvertZoned<std::chrono::seconds>(input, timezone, &builder));
Expand All @@ -176,7 +170,6 @@ struct TemporalToStringCastFunctor<O, TimestampType> {
DCHECK(false);
return Status::NotImplemented("Unimplemented time unit");
}
#endif
}
std::shared_ptr<Array> output_array;
RETURN_NOT_OK(builder.Finish(&output_array));
Expand Down
43 changes: 14 additions & 29 deletions cpp/src/arrow/compute/kernels/scalar_cast_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -34,6 +34,7 @@
#include "arrow/testing/extension_type.h"
#include "arrow/testing/gtest_util.h"
#include "arrow/testing/random.h"
#include "arrow/testing/util.h"
#include "arrow/type.h"
#include "arrow/type_fwd.h"
#include "arrow/type_traits.h"
Expand Down Expand Up @@ -1146,6 +1147,16 @@ constexpr char kTimestampSecondsJson[] =
constexpr char kTimestampExtremeJson[] =
R"(["1677-09-20T00:00:59.123456", "2262-04-13T23:23:23.999999"])";

class CastTimezone : public ::testing::Test {
protected:
void SetUp() override {
#ifdef _WIN32
// Initialize timezone database on Windows
ASSERT_OK(InitTestTimezoneDatabase());
#endif
}
};

TEST(Cast, TimestampToDate) {
// See scalar_temporal_test.cc
auto timestamps = ArrayFromJSON(timestamp(TimeUnit::NANO), kTimestampJson);
Expand Down Expand Up @@ -1181,12 +1192,7 @@ TEST(Cast, TimestampToDate) {
}
}

TEST(Cast, ZonedTimestampToDate) {
#ifdef _WIN32
// TODO(ARROW-13168): we lack tzdb on Windows
GTEST_SKIP() << "ARROW-13168: no access to timezone database on Windows";
#endif

TEST_F(CastTimezone, ZonedTimestampToDate) {
{
// See TestZoned in scalar_temporal_test.cc
auto timestamps =
Expand Down Expand Up @@ -1377,12 +1383,7 @@ TEST(Cast, TimestampToTime) {
}
}

TEST(Cast, ZonedTimestampToTime) {
#ifdef _WIN32
// TODO(ARROW-13168): we lack tzdb on Windows
GTEST_SKIP() << "ARROW-13168: no access to timezone database on Windows";
#endif

TEST_F(CastTimezone, ZonedTimestampToTime) {
CheckCast(ArrayFromJSON(timestamp(TimeUnit::NANO, "Pacific/Marquesas"), kTimestampJson),
ArrayFromJSON(time64(TimeUnit::NANO), R"([
52259123456789, 50003999999999, 56480001001001, 65000000000000,
Expand Down Expand Up @@ -1573,8 +1574,7 @@ TEST(Cast, TimestampToString) {
}
}

#ifndef _WIN32
TEST(Cast, TimestampWithZoneToString) {
TEST_F(CastTimezone, TimestampWithZoneToString) {
for (auto string_type : {utf8(), large_utf8()}) {
CheckCast(
ArrayFromJSON(timestamp(TimeUnit::SECOND, "UTC"), "[-30610224000, -5364662400]"),
Expand Down Expand Up @@ -1608,21 +1608,6 @@ TEST(Cast, TimestampWithZoneToString) {
R"(["1968-11-30 13:30:44.123456789-0700", "2016-02-29 10:42:23.456789246-0700"])"));
}
}
#else
// TODO(ARROW-13168): we lack tzdb on Windows
TEST(Cast, TimestampWithZoneToString) {
for (auto string_type : {utf8(), large_utf8()}) {
ASSERT_RAISES(NotImplemented, Cast(ArrayFromJSON(timestamp(TimeUnit::SECOND, "UTC"),
"[-34226955, 1456767743]"),
CastOptions::Safe(string_type)));

ASSERT_RAISES(NotImplemented,
Cast(ArrayFromJSON(timestamp(TimeUnit::SECOND, "America/Phoenix"),
"[-34226955, 1456767743]"),
CastOptions::Safe(string_type)));
}
}
#endif

TEST(Cast, DateToDate) {
auto day_32 = ArrayFromJSON(date32(), "[0, null, 100, 1, 10]");
Expand Down
19 changes: 13 additions & 6 deletions cpp/src/arrow/compute/kernels/scalar_temporal_test.cc
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
#include "arrow/compute/kernels/test_util.h"
#include "arrow/testing/gtest_util.h"
#include "arrow/testing/matchers.h"
#include "arrow/testing/util.h"
#include "arrow/type.h"
#include "arrow/util/checked_cast.h"
#include "arrow/util/formatting.h"
Expand Down Expand Up @@ -407,6 +408,14 @@ class ScalarTemporalTest : public ::testing::Test {
RoundTemporalOptions round_to_15_quarters =
RoundTemporalOptions(15, CalendarUnit::QUARTER);
RoundTemporalOptions round_to_15_years = RoundTemporalOptions(15, CalendarUnit::YEAR);

protected:
void SetUp() override {
#ifdef _WIN32
// Initialize timezone database on Windows
ASSERT_OK(InitTestTimezoneDatabase());
#endif
}
};

TEST_F(ScalarTemporalTest, TestTemporalComponentExtractionAllTemporalTypes) {
Expand Down Expand Up @@ -564,8 +573,6 @@ TEST_F(ScalarTemporalTest, TestOutsideNanosecondRange) {
CheckScalarUnary("subsecond", unit, times, float64(), subsecond);
}

#ifndef _WIN32
// TODO: We should test on windows once ARROW-13168 is resolved.
TEST_F(ScalarTemporalTest, TestIsLeapYear) {
auto is_leap_year_marquesas =
"[false, true, false, false, false, false, false, false, false, false, false, "
Expand Down Expand Up @@ -792,7 +799,6 @@ TEST_F(ScalarTemporalTest, TestNonexistentTimezone) {
ASSERT_RAISES(Invalid, Subsecond(timestamp_array));
}
}
#endif

TEST_F(ScalarTemporalTest, Week) {
auto unit = timestamp(TimeUnit::NANO);
Expand Down Expand Up @@ -1607,8 +1613,6 @@ TEST_F(ScalarTemporalTest, TestTemporalDifferenceErrors) {
CallFunction("weeks_between", {arr1, arr1}, &options));
}

// TODO: We should test on windows once ARROW-13168 is resolved.
#ifndef _WIN32
TEST_F(ScalarTemporalTest, TestAssumeTimezone) {
std::string timezone_utc = "UTC";
std::string timezone_kolkata = "Asia/Kolkata";
Expand Down Expand Up @@ -1875,6 +1879,9 @@ TEST_F(ScalarTemporalTest, StrftimeCLocale) {
}

TEST_F(ScalarTemporalTest, StrftimeOtherLocale) {
#ifdef _WIN32
GTEST_SKIP() << "There is a known bug in strftime for locales on Windows (ARROW-15922)";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this on Windows or specifically MinGW? i.e., would the non-MinGW Windows CI pass if you remove this skip? I'm wondering if we can narrow the condition.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If failed on Windows 2019 C++17, which uses MSVC. But seems like it was actually passing on MinGW. So maybe I can try skipping for just MSVC?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bit weird, MinGW is just a different compiler but targetting the same runtime libraries...

Copy link
Member Author

@wjones127 wjones127 Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I think the test is likely skipping on the LocaleExists("fr_FR.UTF-8") condition; IIRC MinGW doesn't support locales apart from "C" and "POSIX". Sadly, no indication in CI whether the test was skipped: https://github.com/apache/arrow/runs/5504310182?check_suite_focus=true

#else
if (!LocaleExists("fr_FR.UTF-8")) {
GTEST_SKIP() << "locale 'fr_FR.UTF-8' doesn't exist on this system";
}
Expand All @@ -1886,6 +1893,7 @@ TEST_F(ScalarTemporalTest, StrftimeOtherLocale) {
["01 janvier 1970 00:00:59,123", "18 août 2021 15:11:50,456", null])";
CheckScalarUnary("strftime", timestamp(TimeUnit::MILLI, "UTC"), milliseconds, utf8(),
expected, &options);
#endif
}

TEST_F(ScalarTemporalTest, StrftimeInvalidLocale) {
Expand Down Expand Up @@ -2579,7 +2587,6 @@ TEST_F(ScalarTemporalTest, TestCeilFloorRoundTemporalKolkata) {
CheckScalarUnary("round_temporal", unit, times, unit, round_1_hours, &round_to_1_hours);
CheckScalarUnary("round_temporal", unit, times, unit, round_2_hours, &round_to_2_hours);
}
#endif // !_WIN32

} // namespace compute
} // namespace arrow
13 changes: 0 additions & 13 deletions cpp/src/arrow/compute/kernels/scalar_temporal_unary.cc
Original file line number Diff line number Diff line change
Expand Up @@ -1046,7 +1046,6 @@ struct RoundTemporal {
// ----------------------------------------------------------------------
// Convert timestamps to a string representation with an arbitrary format

#ifndef _WIN32
Result<std::locale> GetLocale(const std::string& locale) {
try {
return std::locale(locale.c_str());
Expand Down Expand Up @@ -1130,18 +1129,6 @@ struct Strftime {
return Status::OK();
}
};
#else
// TODO(ARROW-13168)
template <typename Duration, typename InType>
struct Strftime {
static Status Call(KernelContext* ctx, const Scalar& in, Scalar* out) {
return Status::NotImplemented("Strftime not yet implemented on windows.");
}
static Status Call(KernelContext* ctx, const ArrayData& in, ArrayData* out) {
return Status::NotImplemented("Strftime not yet implemented on windows.");
}
};
#endif

// ----------------------------------------------------------------------
// Convert timestamps from local timestamp without a timezone to timestamps with a
Expand Down
28 changes: 28 additions & 0 deletions cpp/src/arrow/config.cc
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@

#include "arrow/util/config.h"
#include "arrow/util/cpu_info.h"
#include "arrow/vendored/datetime.h"

namespace arrow {

Expand Down Expand Up @@ -62,6 +63,8 @@ std::string MakeSimdLevelString(QueryFlagFunction&& query_flag) {
}
}

util::optional<std::string> timezone_db_path;

}; // namespace

const BuildInfo& GetBuildInfo() { return kBuildInfo; }
Expand All @@ -73,7 +76,32 @@ RuntimeInfo GetRuntimeInfo() {
MakeSimdLevelString([&](int64_t flags) { return cpu_info->IsSupported(flags); });
info.detected_simd_level =
MakeSimdLevelString([&](int64_t flags) { return cpu_info->IsDetected(flags); });
info.using_os_timezone_db = USE_OS_TZDB;
#if !USE_OS_TZDB
info.timezone_db_path = timezone_db_path;
#else
info.timezone_db_path = util::optional<std::string>();
#endif
return info;
}

Status Initialize(const GlobalOptions& options) noexcept {
if (options.timezone_db_path.has_value()) {
#if !USE_OS_TZDB
try {
arrow_vendored::date::set_install(options.timezone_db_path.value());
arrow_vendored::date::reload_tzdb();
} catch (const std::runtime_error& e) {
return Status::IOError(e.what());
}
timezone_db_path = options.timezone_db_path.value();
#else
return Status::Invalid(
"Arrow was set to use OS timezone database at compile time, "
"so a downloaded database cannot be provided at runtime.");
#endif // !USE_OS_TZDB
}
return Status::OK();
}

} // namespace arrow
18 changes: 18 additions & 0 deletions cpp/src/arrow/config.h
Original file line number Diff line number Diff line change
Expand Up @@ -19,7 +19,9 @@

#include <string>

#include "arrow/status.h"
#include "arrow/util/config.h" // IWYU pragma: export
#include "arrow/util/optional.h"
#include "arrow/util/visibility.h"

namespace arrow {
Expand Down Expand Up @@ -62,6 +64,13 @@ struct RuntimeInfo {

/// The SIMD level available on the OS and CPU
std::string detected_simd_level;

/// Whether using the OS-based timezone database
/// This is set at compile-time.
bool using_os_timezone_db;

/// The path to the timezone database; by default None.
util::optional<std::string> timezone_db_path;
};

/// \brief Get runtime build info.
Expand All @@ -77,4 +86,13 @@ const BuildInfo& GetBuildInfo();
ARROW_EXPORT
RuntimeInfo GetRuntimeInfo();

struct GlobalOptions {
/// Path to text timezone database. This is only configurable on Windows,
/// which does not have a compatible OS timezone database.
util::optional<std::string> timezone_db_path;
};

ARROW_EXPORT
Status Initialize(const GlobalOptions& options) noexcept;

} // namespace arrow
Loading