You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
If you concatenate a table with 100 rows and 3 columns with a table with 0 rows and 5 columns using pyarrow.concat_tables, the result will be as follows:
What Happens:
Column Matching by Name:
Columns are matched by name across both tables.
If a column is present in one table but not the other, it is filled with null values in the table that lacks it.
Schema Unification:
The resulting table will have columns from both tables.
Columns unique to the second table (0 rows) will be added to the first table with null values.
Row Count:
The resulting table will have the total number of rows from all concatenated tables, which in this case is 100 rows (since the second table has 0 rows).
Example:
import pyarrow as pa
Table with 100 rows and 3 columns
table1 = pa.table({
'a': list(range(100)),
'b': [x * 2 for x in range(100)],
'c': [x % 2 for x in range(100)]
})
If you concatenate a table with 100 rows and 3 columns with a table with 0 rows and 5 columns using pyarrow.concat_tables, the result will be as follows:
What Happens:
Columns are matched by name across both tables.
If a column is present in one table but not the other, it is filled with null values in the table that lacks it.
The resulting table will have columns from both tables.
Columns unique to the second table (0 rows) will be added to the first table with null values.
The resulting table will have the total number of rows from all concatenated tables, which in this case is 100 rows (since the second table has 0 rows).
Example:
import pyarrow as pa
Table with 100 rows and 3 columns
table1 = pa.table({
'a': list(range(100)),
'b': [x * 2 for x in range(100)],
'c': [x % 2 for x in range(100)]
})
Table with 0 rows and 5 columns
table2 = pa.table({
'a': pa.array([], type=pa.int32()),
'b': pa.array([], type=pa.int32()),
'c': pa.array([], type=pa.int32()),
'd': pa.array([], type=pa.string()),
'e': pa.array([], type=pa.float32())
})
Concatenate tables
result = pa.concat_tables([table1, table2])
print(result.schema)
print(result)
Output:
a: int64
b: int64
c: int64
d: string
e: float32
a: [[0, 1, 2, ..., 99]]
b: [[0, 2, 4, ..., 198]]
c: [[0, 1, 0, ..., 1]]
d: [[null, null, ..., null]]
e: [[null, null, ..., null]]
Key Observations:
The resulting schema is a union of the schemas from both tables.
Columns d and e are added from table2, even though they don't exist in table1.
Columns d and e are filled with null values in table1's rows.
Columns a, b, and c are unchanged and retain their values.
The resulting table has 100 rows, matching the non-empty table's row count.
Why This Happens:
pyarrow.concat_tables aligns the columns by name and fills in null values where columns are missing.
This allows you to concatenate tables with different schemas without losing any columns.
Practical Use Case:
This behavior is useful when:
You have sparse data where not all columns are present in every partition.
You're combining data from multiple sources with slightly different schemas.
Would you like more details or examples?
The text was updated successfully, but these errors were encountered: