HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge… by tanishq-chugh · Pull Request #6443 · apache/hive

tanishq-chugh · 2026-04-18T13:13:18Z

…tColumnTypes

What changes were proposed in this pull request?

Improve time complexity in ColumnStatsSemanticAnalyzer#getColumnTypes

Why are the changes needed?

Performance Improvement

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manual Testing + CI

Aggarwal-Raghav · 2026-04-18T17:43:58Z

+        if (typeInfo.getCategory() != ObjectInspector.Category.PRIMITIVE) {
+          logTypeWarning(colName, type);
+        } else {
+          nonPrimColNames.add(colName);


The variable should be named primColNames instead of nonPrimColNames, since primitive types enter the else branch.

@Aggarwal-Raghav My bad, I validated the columnTypes/names being returned for primitive types but used the wrong variable name. Updated in commit 4a6804d.
Thanks for pointing this out!

thomasrebele · 2026-04-21T12:09:27Z

-          } else {
-            colTypes.add(type);
-          }
+    Map<String, String> colTypeMap = new HashMap<>();


Thanks for the PR! When I created HIVE-29551, I had in mind to do it without a HashMap if possible. There are two kinds of usage, depending on where the column names come from:

1. ColumnStatsSemanticAnalyzer#getColumnName
2. Utilities.getColumnNamesFromFieldSchema

The latter iterates over a list of FieldSchema, so the type info can be obtained from these items as well.

The HashMap is only needed when the ASTNode has 3 children.
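
To illustrate the trade-off discussed in this thread, here is a minimal, self-contained sketch. It does not reproduce the Hive code: `Column` is a hypothetical stand-in for `FieldSchema`, and the three method names are invented for illustration. It contrasts a nested scan (the kind of lookup that leads to quadratic runtime), a one-time HashMap index, and the case where names and types come from the same list so no lookup is needed.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ColumnTypeLookupSketch {
  // Hypothetical stand-in for Hive's FieldSchema: only a name and a type string.
  record Column(String name, String type) {}

  // Quadratic: every requested name triggers a scan over the whole schema.
  static List<String> typesByScan(List<Column> schema, List<String> names) {
    List<String> types = new ArrayList<>();
    for (String name : names) {
      for (Column col : schema) {
        if (col.name().equalsIgnoreCase(name)) {
          types.add(col.type());
          break;
        }
      }
    }
    return types;
  }

  // Linear: index the schema once, then resolve each name with a map lookup.
  static List<String> typesByMap(List<Column> schema, List<String> names) {
    Map<String, String> typeByName = new HashMap<>();
    for (Column col : schema) {
      // putIfAbsent keeps the first match, like the scan above.
      typeByName.putIfAbsent(col.name().toLowerCase(Locale.ROOT), col.type());
    }
    List<String> types = new ArrayList<>();
    for (String name : names) {
      String type = typeByName.get(name.toLowerCase(Locale.ROOT));
      if (type != null) {
        types.add(type);
      }
    }
    return types;
  }

  // No lookup at all: when the names come from the schema itself,
  // the types can be read in the same pass.
  static List<String> typesOfAll(List<Column> schema) {
    List<String> types = new ArrayList<>();
    for (Column col : schema) {
      types.add(col.type());
    }
    return types;
  }

  public static void main(String[] args) {
    List<Column> schema = List.of(new Column("id", "int"), new Column("name", "string"));
    System.out.println(typesByMap(schema, List.of("name", "id")));
    System.out.println(typesOfAll(schema));
  }
}
```

Under these assumptions, the map is only worth building when the caller supplies an explicit list of names, which matches the remark that it is needed only for the 3-children case.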

thomasrebele

Thank you for the refactoring! I've got some ideas to simplify the code, aiming to make it easier to maintain the code of ColumnStatsSemanticAnalyzer in the future.

thomasrebele · 2026-04-28T10:14:24Z

    return rwt;
  }

+  private record StatsEligibleColumns(List<String> columnNames, List<String> columnTypes) {


Instead of creating a new type, could you please use List<FieldSchema>, which contains both the name and the type of the column?

Made this change in commit - 84d81f9

thomasrebele · 2026-04-28T10:16:38Z

+    return new StatsEligibleColumns(colNames, colTypes);
  }

  private List<String> getColumnName(ASTNode tree) throws SemanticException {


I would suggest renaming the function, maybe to "getExplicitColumnNames", though there may be a better name.

Renamed the function to getExplicitColumnNamesFromAst in commit - 84d81f9

thomasrebele · 2026-04-28T10:17:25Z

+    colNames.clear();
+    colNames.addAll(primColNames);


Modifying the argument can be avoided when implementing my other comments.

Yes, the code has been updated such that modifying this argument is avoided, in commit - 84d81f9
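
As a small illustration of the point in this thread, the sketch below contrasts filtering a caller's list in place (the `clear()` / `addAll()` pattern in the diff above) with returning a new list. The class and method names and the `Predicate` parameter are hypothetical and not taken from the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class NoArgumentMutationSketch {
  // Mutating style: the caller's list is rewritten as a side effect.
  static void keepEligibleInPlace(List<String> colNames, Predicate<String> eligible) {
    List<String> kept = new ArrayList<>();
    for (String name : colNames) {
      if (eligible.test(name)) {
        kept.add(name);
      }
    }
    colNames.clear();
    colNames.addAll(kept);
  }

  // Non-mutating style: the input is left untouched and the result is returned.
  static List<String> eligibleColumns(List<String> colNames, Predicate<String> eligible) {
    List<String> kept = new ArrayList<>();
    for (String name : colNames) {
      if (eligible.test(name)) {
        kept.add(name);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // List.of is immutable, so this call only works with the non-mutating variant.
    List<String> input = List.of("id", "payload", "name");
    System.out.println(eligibleColumns(input, n -> !n.equals("payload")));
  }
}
```

The non-mutating variant also accepts immutable lists, whereas the in-place variant would throw `UnsupportedOperationException` on them.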

thomasrebele · 2026-04-28T10:26:43Z

  }

-  protected static List<String> getColumnTypes(Table tbl, List<String> colNames) {
+  protected static List<String> getColumnTypesByName(Table tbl, List<String> colNames) {


I recommend refactoring getColumnTypesByName to return List<FieldSchema>.

Made this change in commit - 84d81f9

thomasrebele · 2026-04-28T10:44:30Z

+        colNames = statsCols.columnNames();
+      } else {
+        colNames = getColumnName(ast);
+      }


The handling of the AST should stay in one place to avoid code duplication here and in #rewriteAST. Maybe add a new method List<FieldSchema> getColumns(ASTNode). To keep the behavior the same, I would do roughly the following:

1. Collect the column names as strings using the original method.
2. Verify the names with checkForPartitionColumns and validateSpecifiedColumnNames (and remove the calls to these functions in ColumnStatsSemanticAnalyzer#rewriteAST and ColumnStatsSemanticAnalyzer#analyze).
3. Collect the columns as List<FieldSchema>.
4. The caller extracts the names (with org.apache.hadoop.hive.ql.exec.Utilities#getColumnNamesFromFieldSchema) and the types (I don't know of an existing function, at least I couldn't find one in Utilities).

This approach avoids the need to modify the column names later, and should make the code easier to understand. It would be nice (if that optimization does not make the code too complex) to optimize the case ast.getChildCount() == 2, so that steps 1 and 3 only collect the columns once.

Thanks for pointing this out, @thomasrebele!
And yes, this definitely makes more sense and helps keep the code clean. I have made all these changes in commit 84d81f9.
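
The flow suggested in this thread could be sketched roughly as follows. This is a simplified model, not the code from commit 84d81f9. `Column` stands in for `FieldSchema`, a `null` list of explicit names stands in for the `ast.getChildCount() == 2` case, `IllegalArgumentException` stands in for `SemanticException`, and the checks inside `getColumns` only approximate what `checkForPartitionColumns` and `validateSpecifiedColumnNames` are described as doing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class GetColumnsSketch {
  // Hypothetical stand-in for Hive's FieldSchema.
  record Column(String name, String type) {}

  // explicitNames == null models "no column list given" (the 2-children case);
  // partitionCols is assumed to hold lower-cased names.
  static List<Column> getColumns(List<Column> tableCols, Set<String> partitionCols,
      List<String> explicitNames) {
    if (explicitNames == null) {
      // Steps 1 and 3 collapse: the schema already pairs each name with its type.
      return List.copyOf(tableCols);
    }
    // Index the schema once so each explicit name is resolved in constant time.
    Map<String, Column> byName = new HashMap<>();
    for (Column col : tableCols) {
      byName.putIfAbsent(col.name().toLowerCase(Locale.ROOT), col);
    }
    List<Column> result = new ArrayList<>();
    for (String raw : explicitNames) {          // step 1: names from the statement
      String name = raw.toLowerCase(Locale.ROOT);
      if (partitionCols.contains(name)) {       // step 2: validation
        throw new IllegalArgumentException("Partition column not allowed: " + raw);
      }
      Column col = byName.get(name);
      if (col == null) {
        throw new IllegalArgumentException("Unknown column: " + raw);
      }
      result.add(col);                          // step 3: collect as columns
    }
    return result;
  }

  // Step 4: the caller derives names and types from the same list.
  static List<String> names(List<Column> cols) {
    List<String> out = new ArrayList<>();
    for (Column col : cols) {
      out.add(col.name());
    }
    return out;
  }

  static List<String> types(List<Column> cols) {
    List<String> out = new ArrayList<>();
    for (Column col : cols) {
      out.add(col.type());
    }
    return out;
  }

  public static void main(String[] args) {
    List<Column> table = List.of(new Column("a", "int"), new Column("b", "string"));
    List<Column> cols = getColumns(table, Set.of("ds"), List.of("B"));
    System.out.println(names(cols) + " " + types(cols));
  }
}
```

Because names and types are both derived from one validated list, nothing has to be patched up afterwards, which is the property the review comment asks for.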

thomasrebele · 2026-04-28T10:55:41Z

-    default:
+    if (tree.getChildCount() != 3) {
      throw new SemanticException("Internal error. Expected number of children of ASTNode to be"
          + " either 2 or 3. Found : " + tree.getChildCount());


If we modify the method that way, the expected number of children is 3, so the exception message would need to be changed.

Updated the exception message in commit - 84d81f9

sonarqubecloud · 2026-04-29T07:44:03Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.3% Duplication on New Code

See analysis details on SonarQube Cloud

asf-ci-hive added tests pending tests passed and removed tests pending labels Apr 18, 2026

Aggarwal-Raghav reviewed Apr 18, 2026

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Apr 19, 2026

thomasrebele reviewed Apr 21, 2026

tanishq-chugh added 3 commits April 27, 2026 01:01

HIVE-29551: Avoid quadratic runtime in ColumnStatsSemanticAnalyzer#ge…

63fa5af

…tColumnTypes

Update the wrong column name used

b3bb0a5

Refactor code to incorporate logic for different ast children values

85c0ebe

tanishq-chugh force-pushed the HIVE-29551 branch from 4a6804d to 85c0ebe Compare April 26, 2026 19:34

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Apr 26, 2026

tanishq-chugh added 2 commits April 27, 2026 23:18

Fix sonarqube issue - 1

c8ec783

Fix sonarqube issue - 2

bca3eeb

asf-ci-hive added tests pending tests passed and removed tests passed tests pending labels Apr 27, 2026

thomasrebele reviewed Apr 28, 2026

Refactor code to address review comments

84d81f9

asf-ci-hive added tests pending tests unstable and removed tests passed tests pending labels Apr 28, 2026

Fix SonarQube issue - 3

4a1205a

asf-ci-hive added tests pending tests unstable and removed tests unstable tests pending labels Apr 29, 2026

Fix for column names being lowercased

b50be7a

asf-ci-hive added tests pending and removed tests unstable labels Apr 29, 2026

asf-ci-hive added tests passed and removed tests pending labels Apr 29, 2026

4 participants