HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…#6443
HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…#6443tanishq-chugh wants to merge 8 commits intoapache:masterfrom
Conversation
| if (typeInfo.getCategory() != ObjectInspector.Category.PRIMITIVE) { | ||
| logTypeWarning(colName, type); | ||
| } else { | ||
| nonPrimColNames.add(colName); |
There was a problem hiding this comment.
the variable name should be PrimColNames instead of nonPrimColNames. As the primitve type will enter the else flow.
There was a problem hiding this comment.
@Aggarwal-Raghav My bad, i validated the columnTypes/names being returned for primitive types and used the wrong variable name. Updated in commit - 4a6804d .
Thanks for pointing this out !
| } else { | ||
| colTypes.add(type); | ||
| } | ||
| Map<String, String> colTypeMap = new HashMap<>(); |
There was a problem hiding this comment.
Thanks for the PR! When I created HIVE-29551, I had in mind do it without a HashMap if possible. There are two types of usages, depending on where the column names came from:
- ColumnStatsSemanticAnalyzer#getColumnName
- Utilities.getColumnNamesFromFieldSchema
The latter iterates over a list of FieldSchema, so the type info can be obtained from these items as well.
The HashMap is only needed when the ASTNode has 3 children.
4a6804d to
85c0ebe
Compare
thomasrebele
left a comment
There was a problem hiding this comment.
Thank you for the refactoring! I've got some ideas to simplify the code, aiming to make it easier to maintain the code of ColumnStatsSemanticAnalyzer in the future.
| return rwt; | ||
| } | ||
|
|
||
| private record StatsEligibleColumns(List<String> columnNames, List<String> columnTypes) { |
There was a problem hiding this comment.
Instead of creating a new type, could you please use List<FieldSchema>, which contains both the name and the type of the column?
| return new StatsEligibleColumns(colNames, colTypes); | ||
| } | ||
|
|
||
| private List<String> getColumnName(ASTNode tree) throws SemanticException { |
There was a problem hiding this comment.
I would suggest to rename the function, maybe "getExplictColumnNames", though there may be a better name.
There was a problem hiding this comment.
Renamed the function to getExplicitColumnNamesFromAst in commit - 84d81f9
| colNames.clear(); | ||
| colNames.addAll(primColNames); |
There was a problem hiding this comment.
Modifying the argument can be avoided when implementing my other comments.
There was a problem hiding this comment.
Yes, the code has been updated such that modifying this argument is avoided, in commit - 84d81f9
| } | ||
|
|
||
| protected static List<String> getColumnTypes(Table tbl, List<String> colNames) { | ||
| protected static List<String> getColumnTypesByName(Table tbl, List<String> colNames) { |
There was a problem hiding this comment.
I recommend to refactor getColumnTypesByName to return List<FieldSchema>.
| colNames = statsCols.columnNames(); | ||
| } else { | ||
| colNames = getColumnName(ast); | ||
| } |
There was a problem hiding this comment.
The handling of the AST should stay at once place to avoid code duplication here and in #rewriteAST. Maybe a new method List<FieldSchema> getColumns(ASTNode). To keep the behavior the same, I would do roughly the following:
- Collect the column names as string using the original method
- Verify the names with checkForPartitionColumns and validateSpecifiedColumnNames (and removing the calls to these functions in ColumnStatsSemanticAnalyzer#rewriteAST and ColumnStatsSemanticAnalyzer#analyze)
- Collect the columns as
List<FieldSchema> - The caller extracts the names (with
org.apache.hadoop.hive.ql.exec.Utilities#getColumnNamesFromFieldSchema) and the types (I don't know of an existing function, at least I couldn't find one in Utilities).
This approach avoids the need to modify the column names later, and should make the code easier to understand. It would be nice (if that optimization does not make the code too complex) to optimize the case ast.getChildCount() == 2, so that step 1 and 3 only collect the columns once.
There was a problem hiding this comment.
Thanks for pointing this out @thomasrebele !
And, yes this definitely makes more sense and helps to keep code clean. I have made all these changes in commit - 84d81f9
| default: | ||
| if (tree.getChildCount() != 3) { | ||
| throw new SemanticException("Internal error. Expected number of children of ASTNode to be" | ||
| + " either 2 or 3. Found : " + tree.getChildCount()); |
There was a problem hiding this comment.
If we modify the method that way, the expected number of children is 3, so the exception message would need to be changed.
There was a problem hiding this comment.
Updated the exception message in commit - 84d81f9
|



…tColumnTypes
What changes were proposed in this pull request?
Improve time complexity in ColumnStatsSemanticAnalyzer#getColumnTypes
Why are the changes needed?
Performance Improvement
Does this PR introduce any user-facing change?
No
How was this patch tested?
Manual Testing + CI