Data Validation

Heliconia Demographic Survey Data

We use the R package pointblank to review and validate the plot-level descriptors (HDP_plots.csv) and clean demographic data set (heliconia_survey_clean.csv) in preparation for archiving in Dryad and publication in Bruna et al. (2023). The report below includes:

  1. the different validation tests that were conducted,
  2. the date of the most recent test,
  3. each test’s criteria for ‘pass’, ‘warn’ and ‘stop’,
  4. the number of ‘units’ (i.e., rows or columns) assessed in each test,
  5. how many of these units passed or failed, and
  6. a button for downloading a .csv file of the records flagged by a particular validation test. Note that flagged records are not necessarily errors. For instance, the validation procedure for ‘plant size - height’ returns as ‘stop’ all plants >2 m tall. Heliconia plants can exceed this threshold; this test is simply designed to flag any such individuals. In contrast, the data set should not have any duplicated rows. A notification of ‘fail’ for this test indicates an error that can be corrected by downloading the .csv file, reviewing the duplicated rows, and uploading the necessary corrections.

Last run: 2023-09-20


Dataset Structure: Data types

Tests to determine if columns are correctly coded as integer, character, etc.
Test criteria: Strict (‘stop’ if any rows fail).
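
The whole-number tests in this section rely on the remainder after division by 1 (`x %% 1 == 0` in R). As a minimal illustration of that logic outside pointblank, the Python sketch below counts assessed, passing, and failing units for one column. The function names, the sample rows, and the decision to skip missing values are assumptions made for the example, not details taken from the report.

```python
import math


def is_whole_number(value):
    """Return True when value has no fractional part (Python analogue of x %% 1 == 0 in R)."""
    return value % 1 == 0


def check_whole_numbers(rows, column):
    """Count assessed, passing, and failing units for one column.

    Assumption: missing values (None or NaN) are skipped rather than counted as failures.
    """
    measured = [
        r[column]
        for r in rows
        if r.get(column) is not None and not math.isnan(r[column])
    ]
    failed = [v for v in measured if not is_whole_number(v)]
    return {"units": len(measured), "pass": len(measured) - len(failed), "fail": len(failed)}


# Hypothetical rows: the second height has a fractional part, the third is missing.
rows = [{"ht": 45.0, "shts": 3}, {"ht": 12.5, "shts": 2}, {"ht": None, "shts": 1}]
print(check_whole_numbers(rows, "ht"))
print(check_whole_numbers(rows, "shts"))
```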

Pointblank Validation
Data Validation

Table type: tibble. Thresholds: WARN 1, STOP 0.02, NOTIFY not set.

STEP  FUNCTION         DESCRIPTION                          COLUMNS  VALUES        UNITS  PASS (FRAC)  FAIL (FRAC)
1     col_vals_expr()  Height is measured to nearest cm     -        ht%%1 == 0    57K    57K (1.00)   0 (0.00)
2     col_vals_expr()  Shoots is integer                    -        shts%%1 == 0  57K    57K (1.00)   0 (0.00)
3     col_vals_expr()  Number of inflorescences is integer  -        infl%%1 == 0  2K     2K (1.00)    0 (0.00)

Started: 2023-09-20 13:42:48 UTC. Duration: < 1 s. Finished: 2023-09-20 13:42:48 UTC.

Dataset Structure: Plot & Subplot IDs

Test for any nonexistent values of plot_id (e.g., ‘FF-10’, ‘CF-23’) or subplot (e.g., ‘H23’, ‘A11’).
Test criteria: Strict (‘stop’ if any rows fail).
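
The membership test can be pictured as a lookup against the sets of valid identifiers: 13 plot IDs (CF-1 to CF-6, FF-1 to FF-7) and 100 subplots (letters A to J crossed with numbers 1 to 10). The Python sketch below builds those sets and lists any values outside them; the helper name `invalid_values` and the sample inputs are hypothetical.

```python
import string

# Valid identifiers as listed in the validation table for this section.
VALID_PLOT_IDS = {f"CF-{i}" for i in range(1, 7)} | {f"FF-{i}" for i in range(1, 8)}
VALID_SUBPLOTS = {
    f"{letter}{number}"
    for letter in string.ascii_uppercase[:10]  # A through J
    for number in range(1, 11)                 # 1 through 10
}


def invalid_values(values, allowed):
    """Return (row index, value) pairs for every value not in the allowed set."""
    return [(i, v) for i, v in enumerate(values) if v not in allowed]


# Hypothetical input containing the nonexistent codes used as examples in the text.
print(invalid_values(["FF-10", "CF-3", "CF-23"], VALID_PLOT_IDS))
print(invalid_values(["H23", "A10", "A11"], VALID_SUBPLOTS))
```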

Pointblank Validation
Data Validation

Table type: tibble. Thresholds: WARN 1, STOP 0.02, NOTIFY not set.

STEP  FUNCTION           COLUMNS  VALUES   UNITS  PASS (FRAC)  FAIL (FRAC)
1     col_vals_in_set()  plot_id  set (a)  66K    66K (1.00)   0 (0.00)
2     col_vals_in_set()  subplot  set (b)  66K    66K (1.00)   0 (0.00)

Set (a), allowed plot_id values: CF-1, CF-2, CF-3, CF-4, CF-5, CF-6, FF-1, FF-2, FF-3, FF-4, FF-5, FF-6, FF-7

Set (b), allowed subplot values: A1, A2, A3, A4, A5, A6, A7, A8, A9, A10, B1, B2, B3, B4, B5, B6, B7, B8, B9, B10, C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, D1, D2, D3, D4, D5, D6, D7, D8, D9, D10, E1, E2, E3, E4, E5, E6, E7, E8, E9, E10, F1, F2, F3, F4, F5, F6, F7, F8, F9, F10, G1, G2, G3, G4, G5, G6, G7, G8, G9, G10, H1, H2, H3, H4, H5, H6, H7, H8, H9, H10, I1, I2, I3, I4, I5, I6, I7, I8, I9, I10, J1, J2, J3, J4, J5, J6, J7, J8, J9, J10

Started: 2023-09-20 13:42:49 UTC. Duration: < 1 s. Finished: 2023-09-20 13:42:49 UTC.

Dataset Structure: Duplicated or Missing Values

Tests for duplicated rows, missing plant_id numbers, or duplicate plant_id numbers (test is done for every survey year).
Test criteria: Strict (‘stop’ if any rows fail).
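
To make the three kinds of check concrete, the Python sketch below finds exact duplicate rows, rows with a missing plant_id, and plant_id values repeated within a single survey year. It is an illustration only: the `year` and `ht` column names and the sample rows are assumptions, and the report itself runs these checks with pointblank's `rows_distinct()` and `col_vals_not_null()`.

```python
from collections import Counter


def duplicated_rows(rows):
    """Return each row (as a dict) that appears more than once with identical values."""
    counts = Counter(tuple(sorted(r.items())) for r in rows)
    return [dict(key) for key, n in counts.items() if n > 1]


def missing_plant_ids(rows):
    """Return the indices of rows whose plant_id is missing."""
    return [i for i, r in enumerate(rows) if r.get("plant_id") is None]


def duplicate_ids_within_year(rows):
    """Return (year, plant_id) pairs that occur more than once.

    Assumption: each plant should have exactly one record per survey year.
    """
    counts = Counter(
        (r["year"], r["plant_id"]) for r in rows if r.get("plant_id") is not None
    )
    return sorted(key for key, n in counts.items() if n > 1)


# Hypothetical survey rows.
rows = [
    {"plant_id": 1, "year": 1998, "ht": 30},
    {"plant_id": 1, "year": 1999, "ht": 35},
    {"plant_id": 2, "year": 1998, "ht": 12},
    {"plant_id": 2, "year": 1998, "ht": 12},     # exact duplicate of the row above
    {"plant_id": None, "year": 1999, "ht": 20},  # missing plant_id
    {"plant_id": 3, "year": 1999, "ht": 8},
    {"plant_id": 3, "year": 1999, "ht": 9},      # same plant and year, different height
]
print(duplicated_rows(rows))
print(missing_plant_ids(rows))
print(duplicate_ids_within_year(rows))
```

Note that an exact duplicate row is also reported by the within-year check, so a single data-entry problem can surface in more than one test.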

Pointblank Validation
Data Validation

Table type: tibble. Thresholds: WARN 1, STOP 0.02, NOTIFY not set.

STEP  FUNCTION             DESCRIPTION                                COLUMNS     UNITS  PASS (FRAC)  FAIL (FRAC)
1     rows_distinct()      duplicated rows                            -           66K    66K (1.00)   0 (0.00)
2     col_vals_not_null()  -                                          plant_id    66K    66K (1.00)   0 (0.00)
3     rows_distinct()      Check for duplicate plant ID numbers       plant_id    9K     9K (1.00)    0 (0.00)
4     rows_distinct()      Check for duplicate tag numbers in a plot  tag_number  64     0 (0.00)     64 (1.00)

Started: 2023-09-20 13:42:50 UTC. Duration: 4.0 s. Finished: 2023-09-20 13:42:54 UTC.

Plant Characteristics: Size & Flowering

Tests to determine how many values of plant size (shts, ht) or inflorescence number (infl) are outside the range of most values.
Test criteria: ‘warn’ if \(\geq\) 1 rows fail conditions, ‘stop’ if \(\geq\) 2% of rows fail conditions.
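
The table headers in this report give the thresholds as WARN 1 and STOP 0.02. The Python sketch below shows one way to read those numbers, following the convention that a threshold of 1 or more is a count of failing units while a threshold below 1 is a fraction of all units. The function names are hypothetical and the code is a simplified illustration, not pointblank's implementation.

```python
def exceeds(threshold, n_failed, n_units):
    """Return True when the failing units meet or exceed the threshold.

    Assumption: thresholds >= 1 are absolute counts; thresholds < 1 are fractions.
    """
    if threshold is None or n_units == 0:
        return False
    if threshold >= 1:
        return n_failed >= threshold
    return n_failed / n_units >= threshold


def evaluate(n_units, n_failed, warn_at=1, stop_at=0.02):
    """Report whether the 'warn' and 'stop' conditions are triggered for one test."""
    return {
        "warn": exceeds(warn_at, n_failed, n_units),
        "stop": exceeds(stop_at, n_failed, n_units),
    }


# Hypothetical counts: 8 failures in 66,000 units trigger 'warn' but not 'stop'.
print(evaluate(66000, 8))
# 50 failures in 1,000 units (5%) trigger both.
print(evaluate(1000, 50))
```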

Pointblank Validation
Data Validation

Table type: tibble. Thresholds: WARN 1, STOP 0.02, NOTIFY not set.

STEP  FUNCTION            DESCRIPTION                     COLUMNS  VALUES    UNITS  PASS (FRAC)  FAIL (FRAC)
1     col_vals_between()  shoots between 0 and 20         shts     [0, 20]   66K    66K (0.99)   8 (0.01)
2     col_vals_between()  height between 0 and 200cm      ht       [0, 200]  66K    66K (0.99)   2 (0.01)
3     col_vals_between()  inflorescences between 0 and 3  infl     [0, 3]    66K    66K (0.99)   15 (0.01)

Started: 2023-09-20 13:42:55 UTC. Duration: < 1 s. Finished: 2023-09-20 13:42:55 UTC.

Plant Characteristics: Growth

Tests for unusual changes in plant size (both height and shoot number) from \(Year_{t}\) to \(Year_{t+1}\).
Test criteria: ‘warn’ if \(\geq\) 1 rows fail conditions, ‘stop’ if \(\geq\) 2% of rows fail conditions.
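
As a rough illustration of the growth checks, the Python sketch below computes the change in height and shoot number between two consecutive surveys and lists which limits are exceeded. The limits (proportional change below 2, height change within [−100, 100] cm, shoot change within [−5, 5]) come from the validation table; the formula for `ht_pc` (absolute change divided by the earlier height) and the function name are assumptions, since the report does not state how `ht_pc` is derived.

```python
def growth_flags(ht_prev, ht_next, shts_prev, shts_next):
    """Return the names of the growth checks that a plant fails between two surveys."""
    flags = []
    ht_diff = ht_next - ht_prev
    shts_diff = shts_next - shts_prev

    # Assumed definition: proportional change relative to the earlier height.
    if ht_prev > 0:
        ht_pc = abs(ht_diff) / ht_prev
        if not ht_pc < 2:
            flags.append("ht_pc")

    # Closed intervals, matching the bracketed ranges shown in the table.
    if not -100 <= ht_diff <= 100:
        flags.append("ht_diff")
    if not -5 <= shts_diff <= 5:
        flags.append("shts_diff")
    return flags


# Hypothetical plants: (height t, height t+1, shoots t, shoots t+1).
print(growth_flags(50, 60, 2, 3))    # modest growth
print(growth_flags(50, 160, 2, 3))   # height more than triples
print(growth_flags(120, 10, 8, 2))   # large regression in height and shoots
```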

Pointblank Validation
Check growth & regression

Table type: tibble. Thresholds: WARN 1, STOP 0.02, NOTIFY not set.

STEP  FUNCTION            DESCRIPTION                  COLUMNS    VALUES       UNITS  PASS (FRAC)  FAIL (FRAC)
1     col_vals_lt()       |% change in height| < 200%  ht_pc      2            66K    66K (0.99)   420 (0.01)
2     col_vals_between()  |∆ height| < 100cm           ht_diff    [−100, 100]  66K    66K (0.99)   11 (0.01)
3     col_vals_between()  |∆ shoot number| < 5         shts_diff  [−5, 5]      66K    66K (0.99)   201 (0.01)

Started: 2023-09-20 13:42:56 UTC. Duration: < 1 s. Finished: 2023-09-20 13:42:56 UTC.

Seedlings: Initial size

Tests for seedlings whose size at initial marking was unusually large. Conducted for both height and shoot number.
Test criteria: ‘warn’ if \(\geq\) 1 rows fail conditions, ‘stop’ if \(\geq\) 2% of rows fail conditions.

Pointblank Validation
Check seedlings

Table type: tibble. Thresholds: WARN 1, STOP 0.02, NOTIFY not set.

STEP  FUNCTION       DESCRIPTION    COLUMNS  VALUES  UNITS  PASS (FRAC)  FAIL (FRAC)
1     col_vals_lt()  shoots < 3     shts     3       3K     3K (0.99)    12 (0.01)
2     col_vals_lt()  height < 30cm  ht       30      3K     3K (0.99)    3 (0.01)

Started: 2023-09-20 13:42:57 UTC. Duration: < 1 s. Finished: 2023-09-20 13:42:57 UTC.

Seedlings: Data Entry Errors

Check whether, during data entry, the size of a seedling (1) was accidentally entered in the “inflorescences” column, which would code a new seedling as being reproductive.

Test criteria: Strict (‘stop’ if any rows fail).

Pointblank Validation
Check for ‘reproductive’ seedlings

Table type: tibble. Thresholds: WARN not set, STOP 1, NOTIFY not set.

STEP  FUNCTION       DESCRIPTION  COLUMNS  VALUES  UNITS  PASS (FRAC)  FAIL (FRAC)
1     col_vals_lt()  infl < 1     infl     1       3K     3K (1.00)    0 (0.00)

Started: 2023-09-20 13:42:58 UTC. Duration: < 1 s. Finished: 2023-09-20 13:42:58 UTC.

Zombie plants

Zombie plants are those that were recorded as ‘Dead’ in a survey but for which there is a measurement in a subsequent year (indicative of the plant losing all above-ground parts and then new shoots emerging prior to the next survey). This validation generates a .csv of any plants meeting this condition (labeled as ‘zombie’) for review and correction.
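
The definition above translates into a two-pass scan over the survey records: first note the earliest year in which each plant was recorded as dead, then look for any measurement of that plant in a later year. The Python sketch below is one minimal way to do this; the field names `status` and `ht`, the function name, and the sample records are assumptions made for the example.

```python
def find_zombies(records):
    """Return the sorted IDs of plants measured in a year after being recorded as dead.

    Assumption: each record is a dict with plant_id, year, status, and ht,
    where ht is None when no measurement was taken.
    """
    # Pass 1: earliest year in which each plant was recorded as dead.
    first_dead = {}
    for r in records:
        if r["status"] == "dead":
            pid = r["plant_id"]
            first_dead[pid] = min(r["year"], first_dead.get(pid, r["year"]))

    # Pass 2: any measurement after that year marks the plant as a zombie.
    zombies = set()
    for r in records:
        pid = r["plant_id"]
        if pid in first_dead and r["year"] > first_dead[pid] and r["ht"] is not None:
            zombies.add(pid)
    return sorted(zombies)


# Hypothetical records: plant 10 reappears after being recorded as dead; plant 11 does not.
records = [
    {"plant_id": 10, "year": 2001, "status": "alive", "ht": 42},
    {"plant_id": 10, "year": 2002, "status": "dead", "ht": None},
    {"plant_id": 10, "year": 2003, "status": "alive", "ht": 15},
    {"plant_id": 11, "year": 2001, "status": "alive", "ht": 30},
    {"plant_id": 11, "year": 2002, "status": "dead", "ht": None},
]
print(find_zombies(records))
```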

Pointblank Validation
Check for zombies

Table type: tibble. Thresholds: WARN 1, STOP 0.02, NOTIFY not set.

STEP  FUNCTION          DESCRIPTION        COLUMNS  VALUES  UNITS  PASS (FRAC)  FAIL (FRAC)
1     col_vals_equal()  Check for Zombies  zombie   zombie  0      0 (NA)       0 (NA)

Started: 2023-09-20 13:43:01 UTC. Duration: < 1 s. Finished: 2023-09-20 13:43:01 UTC.

Plant Mortality: Plant size

Tests for plants with 6 or more shoots dying from one year to the next. Note: these are not errors; they are plants whose size in the year before being recorded as ‘dead’ in a survey was in the top 2% of dying plants.

Test criteria: ‘warn’ if \(\geq\) 1 rows fail conditions, ‘stop’ if \(\geq\) 2% of rows fail conditions.

Pointblank Validation
Check large plants dying

Table type: tibble. Thresholds: WARN 1, STOP 0.02, NOTIFY not set.

STEP  FUNCTION       DESCRIPTION       COLUMNS  VALUES  UNITS  PASS (FRAC)  FAIL (FRAC)
1     col_vals_lt()  6 or more shoots  shts     6       2K     2K (0.98)    32 (0.02)

Started: 2023-09-20 13:43:01 UTC. Duration: < 1 s. Finished: 2023-09-20 13:43:02 UTC.