Spp 7526 thousand pound correction #61

Merged
merged 42 commits into from
Oct 18, 2023
bc3e330
Add line length to config
dllmorgan Aug 18, 2022
b2cc2ba
Add converage/test files to git ignore
dllmorgan Aug 18, 2022
ca05133
Add prototype method
dllmorgan Aug 18, 2022
7d73983
Tweak typing to support python 3.8 envs
dllmorgan Sep 5, 2022
2524cf0
First cut of conversion from prototype and adding documentation
dllmorgan Sep 15, 2022
db72ffc
Try new formula display approach
dllmorgan Sep 16, 2022
a91aeeb
Update display of 2nd formula
dllmorgan Sep 16, 2022
213ccab
Tweak 2nd formula
dllmorgan Sep 16, 2022
25131e5
Update documentation
dllmorgan Sep 16, 2022
21afeae
Update import order to stop isort check complaining
dllmorgan Sep 16, 2022
9bc4483
Update main repo readme to include doc links
dllmorgan Sep 16, 2022
1500fea
Tweak formating in readme
dllmorgan Sep 16, 2022
95aa66a
Update following review
dllmorgan Oct 6, 2022
8f1d5b8
Further tweaks following feedback
dllmorgan Oct 12, 2022
2fefc2d
Consistency renaming
dllmorgan Oct 12, 2022
bfcd0d4
Added csv output to example
dllmorgan Oct 17, 2022
157dceb
Updates following UAT discussions
dllmorgan Nov 1, 2022
13eed78
Tweaks following UAT
dllmorgan Nov 8, 2022
52b4aaa
Re-add case where adjustment takes place in example
dllmorgan Nov 8, 2022
801b535
Allow checking of target variables when no adjustment
dllmorgan Nov 11, 2022
badfceb
Thousand pounds code allignment (#47)
JasonBellONS Sep 1, 2023
52682f9
Merge branch 'main' into SPP-7526-Thousand-Pound-Correction
JasonBellONS Sep 1, 2023
1632e7d
Spp 9939 validate input (#48)
JasonBellONS Sep 8, 2023
6c92ada
Spp 9970 introduce logging (#51)
JasonBellONS Sep 15, 2023
2f632ac
Spp 9800 pandas wrapper (#53)
JasonBellONS Sep 21, 2023
002a492
Spp 10021 stop marker and input types (#54)
JasonBellONS Sep 26, 2023
bf01f39
Spp 9629 empty field processing (#55)
JasonBellONS Oct 3, 2023
dbf9ae0
Spp10168 amend logging config (#58)
JasonBellONS Oct 11, 2023
ed73fa5
Spp 9969 decimal processing (#57)
JasonBellONS Oct 11, 2023
095861b
Spp 10072 thousand pounds uat test cases (#59)
jhmjam Oct 11, 2023
39d6e4f
Merge branch 'main' into SPP-7526-Thousand-Pound-Correction
JasonBellONS Oct 12, 2023
7a92b32
precision added to thousand pounds tests
jhmjam Oct 16, 2023
3a9de8e
Merge branch 'main' into SPP-7526-Thousand-Pound-Correction
JasonBellONS Oct 17, 2023
4da8705
Updating release candidate number
JasonBellONS Oct 17, 2023
604e02f
running black, isort, flake8
JasonBellONS Oct 17, 2023
910baf9
Spp 10187 update thousand pounds output to string (#62)
jhmjam Oct 17, 2023
566dd1f
updated csv file to ot return decimal
jhmjam Oct 17, 2023
2943a76
merge conflicts
jhmjam Oct 17, 2023
1d79fc4
tests numbering added
jhmjam Oct 17, 2023
e53a143
Docs amendment
JasonBellONS Oct 18, 2023
3e538ef
Merge remote-tracking branch 'origin/SPP-7526-Thousand-Pound-Correcti…
JasonBellONS Oct 18, 2023
19b7227
changed a test to match uat
jhmjam Oct 18, 2023
3 changes: 2 additions & 1 deletion .flake8
@@ -1,2 +1,3 @@
 [flake8]
-max-line-length = 120
+max-line-length = 160
+extend-ignore = E203
1 change: 1 addition & 0 deletions README.md
@@ -7,3 +7,4 @@ This library contains pandas statistical methods that are only suitable for use
For further information about the methods contained in this module see the [method specifications](https://github.com/ONSdigital/Statistical-Method-Specifications)

For user documentation and example data relating to the methods in this module see the [supporting information](https://github.com/ONSdigital/sml-supporting-info)

135 changes: 135 additions & 0 deletions docs/thousand_pounds.md
@@ -0,0 +1,135 @@
Method specification
====================

* [SML](https://github.com/ONSdigital/Statistical-Method-Specifications/blob/thousand_pound/thousand_pounds_correction.md)

Overall method
=============

This process determines whether a given value is too large by a factor of ~1000. If so, the value is corrected to be smaller, and any accompanying values are also adjusted accordingly.

The method is to be run for a single instance of a principal question (with 0..n linked questions) at a time. If there are multiple principal questions or multiple references to be processed then the method should be called multiple times, once for each principal question (+ linked) dataset.

* Dataclasses are used to provide structure where required
* All dataclass datasets are immutable upon creation
* Missing/null values are not adjusted (they are placed in the output dataset 'as-is' regardless of calculations)
* Uses built-in python libraries only
* Assumes that all processing can be dealt with 'in-memory' and output is provided as single dataset
* Errors are caught internally within the method and are converted into managed output errors (TPC Marker of "E")

Inputs
------

The method requires at least one of the following as the previous period predictor variable, in the given priority order:

1. Returned, cleaned response
2. Imputed or constructed value

The calling process is responsible for determining whether a returned, cleaned response is suitable for use in the method and for setting the 'predictive' variable accordingly, giving the following effective priority order:

1. Returned, cleaned response
2. Imputed or constructed value
3. Auxiliary variable, e.g. registered annual turnover

The calling/wrangling process is also responsible for auditing the metadata about the values used in this method (e.g. the source of the predictive value, tpc_ratio, etc).

Note: ensure that predictor and auxiliary variables are of the same denomination as the current period variables.

* **unique_identifier**: *(String)* - Unique identifier e.g. "q500" (question identifier), or "12345678901-202209" (contributor reference & period)
* **principal_variable**: *(Float)* - Numeric value that the method is working on
* **predictive**: *(Float)* - *Optional* - Numeric value used for comparison. A previous 'valid' value (i.e. Returned/Imputed/Constructed)
* **auxiliary**: *(Float)* - *Optional* - Alternative numeric value used when a predictive value is not available and required by the user
* **upper_limit**: *(Float)* - Upper bound of 'error value' threshold
* **lower_limit**: *(Float)* - Lower bound of 'error value' threshold
* **target_variables**: *(List of Float)* - *Optional* - List of linked question and values to potentially be adjusted
* **precision**: *(Int)* - *Optional* - Precision used by the decimal package to ensure a specified accuracy throughout method processing

Note:

* Although *predictive* and *auxiliary* are both optional, at least one has to be provided for the calculation, else a method error is produced
* The principal_identifier is unused directly by the method and is passed-through as-is into the output dataset. This attribute is provided to allow a user context to be provided as required. For example, it could contain a contributor reference, an IDBR period and a question code ('19900001234-202207-q500'), a unique system generated ID ('cfacf706-36a5-4acb-935f-67e7b07c0470'), just the principal question code ('q150'), etc. It is a text/string field and no parsing or validation is undertaken by the method.
* If no target_variables are provided only the principal_variable may be adjusted
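
As a rough illustration of the priority order described above, the sketch below picks the comparison value from the two optional inputs. The helper name `choose_comparison_value` is hypothetical and is not taken from the library; it assumes a missing input is passed as `None`.

```python
from typing import Optional


def choose_comparison_value(predictive: Optional[float], auxiliary: Optional[float]) -> float:
    """Return the value the principal variable is compared against.

    Hypothetical helper: prefers the predictive value and falls back to the
    auxiliary value only when no predictive value was supplied.
    """
    if predictive is not None:
        return predictive
    if auxiliary is not None:
        return auxiliary
    # Neither optional input was supplied, which the notes treat as a method error.
    raise ValueError("At least one of predictive or auxiliary must be provided")


print(choose_comparison_value(60000, 15000))  # predictive takes priority
print(choose_comparison_value(None, 200))  # falls back to auxiliary
```

In the documented method this failure would surface as a managed error (TPC marker "E") rather than an exception reaching the caller; the exception here only keeps the sketch short.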

Calculation
-----------

A ratio is calculated from the latest returned principal value and the corresponding previous period value. The $comparisonValue$ is either the predictive variable (if provided) or the auxiliary variable (if the predictive is not available and an appropriate auxiliary variable exists).

The resulting ratio is compared against an upper and lower limit.

Where $comparisonValue \ne 0$ we use:

<p align="center">
<img src="https://latex.codecogs.com/svg.image?{\color{Orange}lowerLimit&space;<&space;\frac{principalValue}{comparisonValue}&space;<&space;upperLimit}" />
</p>

If the ratio is strictly between the limits, we determine that a pounds thousands error has been detected. A ratio exactly equal to either limit is not treated as an error.

When a pounds thousands error has been detected we apply the following correction to the principalValue and all linked values:

<p align="center">
<img src="https://latex.codecogs.com/svg.image?{\color{Orange}adjustedValue&space;=&space;\frac{value}{1000}}" />
</p>

* If the previous period's value is zero, the method does not continue: a thousand pounds error is neither detected nor corrected.
* If any linked variable values are 'missing', they are not adjusted and are placed in the output dataset 'as-is'.
* If a pounds thousands error has *not* been detected, the principal variable and any linked variables are not adjusted and are placed in the output dataset 'as-is'.
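
The calculation above can be sketched in a few lines using the standard-library `decimal` module. This is a minimal sketch, not the library's implementation: the function name `apply_thousands_check` and its return shape are assumptions, and missing linked values are represented as `None`.

```python
from decimal import Decimal
from typing import List, Optional, Tuple

Values = List[Optional[Decimal]]


def apply_thousands_check(
    principal: Decimal,
    comparison: Decimal,
    lower_limit: Decimal,
    upper_limit: Decimal,
    linked: Values,
) -> Tuple[Optional[Decimal], Decimal, Values]:
    """Return (ratio, principal value, linked values) after the check."""
    if comparison == 0:
        # No ratio can be formed, so nothing is detected or corrected.
        return None, principal, linked
    ratio = principal / comparison
    if lower_limit < ratio < upper_limit:
        # Error detected: scale the principal and every non-missing linked value.
        adjusted = [None if v is None else v / 1000 for v in linked]
        return ratio, principal / 1000, adjusted
    # Ratio on or outside the limits: values pass through as-is.
    return ratio, principal, linked


limits = (Decimal("350"), Decimal("1350"))
# Principal 269980 against comparison 200, with two made-up linked values (one missing).
print(apply_thousands_check(Decimal("269980"), Decimal("200"), *limits, [Decimal("500"), None]))
# Principal 3500 against comparison 10: the ratio equals the lower limit, so nothing changes.
print(apply_thousands_check(Decimal("3500"), Decimal("10"), *limits, [Decimal("1000")]))
```

The first call gives a ratio of 1349.9, which lies inside the limits, so the principal becomes 269.98 and the non-missing linked value becomes 0.5. The second call gives a ratio of exactly 350 and leaves every value unchanged.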

Error Detection
---------------

The method explicitly checks for the following error states:

* Predictive and auxiliary are both missing or are both 0
* Principal variable is missing (note, a principal variable of 0 is not an error)
* At least one of the upper or lower limits is missing or 0

The method catches unexpected errors, sets the TPC marker to 'E' and populates the accompanying error attribute. For example, this happens if non-numeric values are provided as inputs to numeric attributes (such as the principal value).
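
One way to picture the managed-error behaviour is a validation step that never raises and instead returns a marker plus a description. The sketch below is an assumption about shape only: `check_inputs` and `to_decimal` are hypothetical names, and the error messages are invented for illustration.

```python
from decimal import Decimal
from typing import Optional, Tuple


def to_decimal(value) -> Optional[Decimal]:
    """Convert a supplied value to Decimal, keeping missing values as None."""
    return None if value is None else Decimal(str(value))


def check_inputs(principal, predictive, auxiliary, upper_limit, lower_limit) -> Tuple[Optional[str], str]:
    """Return (tpc_marker, error_description); the marker is None when inputs pass."""
    try:
        if to_decimal(principal) is None:
            raise ValueError("principal variable is missing")
        pred, aux = to_decimal(predictive), to_decimal(auxiliary)
        if pred is None and aux is None:
            raise ValueError("predictive and auxiliary are both missing")
        if pred == 0 and aux == 0:
            raise ValueError("predictive and auxiliary are both 0")
        for name, limit in (("upper_limit", upper_limit), ("lower_limit", lower_limit)):
            converted = to_decimal(limit)
            if converted is None or converted == 0:
                raise ValueError(f"{name} is missing or 0")
    except Exception as exc:  # deliberately broad: every failure becomes marker "E"
        return "E", f"{type(exc).__name__}: {exc}"
    return None, ""


print(check_inputs(7000, None, None, 1350, 350))  # both comparators missing
print(check_inputs("Cheese", "Toast", "Jam", "Rhubarb", "Custard"))  # text in numeric fields
print(check_inputs(0, 10, 20, 1350, 350))  # a principal of 0 is not an error
```

The broad `except` is intentional here, because the text above says unexpected errors are converted into output errors rather than propagated.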

Outputs
-------

* **principal_identifier**: *(String)* - Unique identifier. Contains the same value as was input to the method.
* **principal_adjusted_value**: *(Float)* - Output value that may or may not have been adjusted
* **target_variables**: *(List of Variables)* - List of linked questions, original values and adjusted values (if appropriate)
* **tpc_ratio**: *(Float)* - Calculated ratio of the principal value. Used for testing against the given limits.
* **tpc_marker**: *(String)* - C = Correction applied | N = No correction applied | E = Process failure
* **error_description**: *(String)* - Error information populated when the TPC marker = E. Will be empty/blank on successful runs
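
For readers unfamiliar with immutable dataclasses, the sketch below shows how an output record with these fields could be declared with `frozen=True`. The class names `TargetVariable` and `ThousandPoundsOutput`, and the three fields of `TargetVariable`, are assumptions made for illustration; only the top-level field names come from the list above.

```python
from dataclasses import FrozenInstanceError, dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class TargetVariable:
    # Hypothetical shape for one linked question.
    identifier: str
    original_value: Optional[float]
    adjusted_value: Optional[float]


@dataclass(frozen=True)
class ThousandPoundsOutput:
    principal_identifier: str
    principal_adjusted_value: Optional[float]
    target_variables: Tuple[TargetVariable, ...]
    tpc_ratio: Optional[float]
    tpc_marker: str  # "C", "N" or "E"
    error_description: str = ""


result = ThousandPoundsOutput(
    principal_identifier="q500",
    principal_adjusted_value=269.98,
    target_variables=(TargetVariable("q501", 500.0, 0.5),),
    tpc_ratio=1349.9,
    tpc_marker="C",
)

try:
    result.tpc_marker = "N"  # frozen dataclasses reject attribute assignment
except FrozenInstanceError:
    print("output is immutable")
```

A tuple is used for `target_variables` so that the collection itself cannot be appended to after creation.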

Data example
-------------

|principal_identifier|principal_variable|predictive|aux|threshold_upper|threshold_lower|tpc_marker|tpc_ratio|principal_adjusted_value|target_variable|target_original_value|target_adjusted_value|
|---|---|---|---|---|---|---|---|---|---|---|---|
|[A] Valid config with linked questions|50000000|60000|15000|1350|350|C|1000.0|50000.0|q101|500|0.5|
|[A] Valid config with linked questions|-|-|-|-|-|-|-|-|q102|1000|1|
|[A] Valid config with linked questions|-|-|-|-|-|-|-|-|q103|1500|1.5|
|[A] Valid config with linked questions|-|-|-|-|-|-|-|-|q104|||
|[B] Missing auxiliary|60000000|60000||1350|350|C|400.0|60000.0||||
|[C] Missing predictive|269980||200|1350|350|C|1349.9|269.98||||
|[D] Missing predictive and auxiliary|7000|||1350|350|E||7000||||
|[E] Predictive and auxiliary are 0|8000|0|0|1350|350|E||8000|q451|500|500|
|[E] Predictive and auxiliary are 0|-|-|-|-|-|-|-|-|q452|1000|1000|
|[F] Missing principal variable||10|20|1350|350|E|||q501|1234|1234|
|[F] Missing principal variable|-|-|-|-|-|-|-|-|q502|2345|2345|
|[G] Principal variable is 0|0|10|20|1350|350|N|0|0|q601|500|500|
|[G] Principal variable is 0|-|-|-|-|-|-|-|-|q602|1000|1000|
|[H] Ratio is exactly lower limit|3500|10|20|1350|350|N|350|3500|q701|1000|1000|
|[I] Ratio is exactly upper limit|13500|10|20|1350|350|N|1350|13500|q801|1000|1000|
|[J] Upper and Lower limits are 0|0|-1|-1|0|0|E||0||||
|[K] Text is passed into numeric fields|"Cheese"|"Toast"|"Jam"|"Rhubarb"|"Custard"|E||"Cheese"||||

NB: This is a mixture of input and output data that illustrates behaviour only; it is not indicative of the input/output dataset structure used by the method.

Examples of usage
-----------------

* [Description/Readme](../sml_small/pounds_thousands/readme.md)
* [Code Example](../sml_small/pounds_thousands/example.py)

Implementation
--------------

* [Method implementation](../sml_small/pounds_thousands/pounds_thousands.py)
* [Method Unit Tests](../sml_small/pounds_thousands/test_pounds_thousands.py)