Teradata Error 2802: Duplicate Row Error When Saving a Spark DataFrame via JDBC


Raymond | 2021-03-08
In the article Save DataFrame to SQL Databases via JDBC in PySpark, I showed examples of saving a Spark DataFrame to a relational database such as Teradata. Recently, several friends told me that they ran into errors when using the JDBC approach to save data to Teradata, so I am summarizing one approach to address that issue.

About the error

The detailed exception looks like the following:

py4j.protocol.Py4JJavaError: An error occurred while calling o56.save.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, raymond-pc.mshome.net, executor driver): java.sql.BatchUpdateException: [Teradata JDBC Driver] [TeraJDBC 16.20.00.13] [Error 1338] [SQLState HY000] A failure occurred while executing a PreparedStatement batch request. Details of the failure can be found in the exception chain that is accessible with getNextException.
        at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:149)
        at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeBatchUpdateException(ErrorFactory.java:138)
        at com.teradata.jdbc.jdbc_4.TDPreparedStatement.executeBatchDMLArray(TDPreparedStatement.java:277)
        at com.teradata.jdbc.jdbc_4.TDPreparedStatement.executeBatch(TDPreparedStatement.java:2755)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:691)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1(JdbcUtils.scala:858)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$saveTable$1$adapted(JdbcUtils.scala:856)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2(RDD.scala:994)
        at org.apache.spark.rdd.RDD.$anonfun$foreachPartition$2$adapted(RDD.scala:994)
        at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2133)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:127)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.sql.SQLException: [Teradata Database] [TeraJDBC 16.20.00.13] [Error -2802] [SQLState 23000] Duplicate row error in TestDb.spark_jdbc_test.
        at com.teradata.jdbc.jdbc_4.util.ErrorFactory.makeDatabaseSQLException(ErrorFactory.java:302)
        at com.teradata.jdbc.jdbc_4.statemachine.ReceiveEndSubState.action(ReceiveEndSubState.java:95)
        at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.subStateMachine(StatementReceiveState.java:311)
        at com.teradata.jdbc.jdbc_4.statemachine.StatementReceiveState.action(StatementReceiveState.java:200)
        at com.teradata.jdbc.jdbc_4.statemachine.StatementController.runBody(StatementController.java:137)
        at com.teradata.jdbc.jdbc_4.statemachine.PreparedBatchStatementController.run(PreparedBatchStatementController.java:58)
        at com.teradata.jdbc.jdbc_4.TDStatement.executeStatement(TDStatement.java:389)
        at com.teradata.jdbc.jdbc_4.TDPreparedStatement.executeBatchDMLArray(TDPreparedStatement.java:257)
        ... 15 more

Reproduce the error

The following PySpark script can reproduce the error.

from pyspark.sql import SparkSession
import pandas as pd

appName = "PySpark Teradata Example"
master = "local"

# Create Spark session
spark = SparkSession.builder \
    .appName(appName) \
    .master(master) \
    .getOrCreate()

database = "TestDb"
table = "spark_jdbc_test"
user = "dbc"
password = "dbc"
driver = 'com.teradata.jdbc.TeraDriver'

# Construct a pandas DataFrame
users = {'ID': [1, 1, 3], 'Value': ['A', 'A', 'C']}
pdf = pd.DataFrame(users, columns=['ID', 'Value'])

# Create Spark DataFrame from pandas DataFrame
df = spark.createDataFrame(pdf)

# Save the result to database
df.write \
    .format('jdbc') \
    .option('driver', driver) \
    .option('url', f'jdbc:teradata://192.168.119.128/Database={database}') \
    .option("dbtable", table) \
    .option('user', user) \
    .option('password', password) \
    .mode('overwrite') \
    .save()

spark.stop()

The script is very simple: it creates a pandas DataFrame in memory, converts it to a Spark DataFrame, and saves the result to Teradata using the JDBC driver. If the table already exists, it is dropped and recreated.

Root cause

In fact, the error message itself already points out the issue:

java.sql.SQLException: [Teradata Database] [TeraJDBC 16.20.00.13] [Error -2802] [SQLState 23000] Duplicate row error in TestDb.spark_jdbc_test.

By default (in Teradata session mode), tables are created as SET tables, which do not allow duplicate rows, and the input DataFrame contains two identical rows. Thus, to address this issue, we just need to deduplicate the data or specify a primary index for the table.

Resolutions

To resolve this issue, we can remove duplicates from the DataFrame. In Spark, we can achieve this using the distinct or dropDuplicates function.

df.distinct().write \
    .format('jdbc') \
    .option('driver', driver) \
    .option('url', f'jdbc:teradata://192.168.119.128/Database={database}') \
    .option("dbtable", table) \
    .option('user', user) \
    .option('password', password) \
    .mode('overwrite') \
    .save()

In the database, only two records will be saved:

   ID  Value
1  3   C
2  1   A
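The same deduplication can also be done on the pandas side, before the DataFrame is ever converted to Spark. Here is a minimal sketch (using the sample data from the reproduction script) that mirrors what distinct does, via pandas drop_duplicates:

```python
import pandas as pd

# Same sample data as in the reproduction script
users = {'ID': [1, 1, 3], 'Value': ['A', 'A', 'C']}
pdf = pd.DataFrame(users, columns=['ID', 'Value'])

# Drop fully duplicated rows before calling spark.createDataFrame(pdf);
# drop_duplicates keeps the first copy of each duplicated row
deduped = pdf.drop_duplicates().reset_index(drop=True)
print(deduped)
```

For small inputs this is often simpler than deduplicating in Spark, since the data already lives in a pandas DataFrame.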

If you do need to keep the duplicates, one approach is to add a unique number to each row so that every row becomes unique.

We can either use the index column of the pandas DataFrame, or use a Spark function like monotonically_increasing_id to assign a unique value.

Note: this function does not return sequential values.

df.withColumn('idx', monotonically_increasing_id()).write \
    .format('jdbc') \
    .option('driver', driver) \
    .option('url', f'jdbc:teradata://192.168.119.128/Database={database}') \
    .option("dbtable", table) \
    .option('user', user) \
    .option('password', password) \
    .option("createTableOptions", "PRIMARY INDEX (idx)") \
    .mode('overwrite') \
    .save()

Remember to import this function first:

from pyspark.sql.functions import monotonically_increasing_id

Option createTableOptions is used to explicitly specify the primary index for the Teradata table.
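If sequential values are preferred instead, one option (a sketch, not from the original article) is to promote the pandas row index to a column before converting to Spark:

```python
import pandas as pd

users = {'ID': [1, 1, 3], 'Value': ['A', 'A', 'C']}
pdf = pd.DataFrame(users, columns=['ID', 'Value'])

# reset_index turns the sequential pandas index into a regular column,
# which we rename to 'idx'; unlike monotonically_increasing_id, these
# values are guaranteed to be 0, 1, 2, ...
pdf = pdf.reset_index().rename(columns={'index': 'idx'})
```

The resulting pandas DataFrame can then be passed to spark.createDataFrame and written with the same createTableOptions as above.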

The saved table has three columns and three rows, as expected:

   ID  Value  idx
1  1   A      1
2  1   A      0
3  3   C      2

Archives of the TeradataForum

Message Posted: Tue, 25 May 2004 @ 15:41:17 GMT

     
Subj:   Re: Identifying row which caused 2802 Error (Duplicate Row)
 
From:   Michael Larkins

Hi Phillips:

Without knowing a single thing about your table, I am guessing that you are doing a join within the UPDATE statement to identify which row to UPDATE or to obtain the data for the change.

In either case, from the error you are receiving, it sounds like you may be doing a product join. You can verify this by looking at the EXPLAIN output. If it is not a product join, it sounds like you have not specified enough columns in your WHERE clause to make the join unique. I have seen this error under these circumstances because what it attempts to do is INSERT a copy of each row with the new value. Hence, the duplicate row error. In the old days, it would have inserted new rows into your table; those are the same ones we called the "good old days"...lol

My recommendation is that you go back to your logical model and verify that you are using enough columns to make each row unique.

Hope this helps,

Michael Larkins
Certified Teradata Master
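To act on this advice programmatically, one quick way (a pandas sketch, assuming the data fits in memory, using the sample data from the article above) to identify the rows that would trigger error 2802 is to flag every fully duplicated row before loading:

```python
import pandas as pd

# Sample data containing an exact duplicate row
users = {'ID': [1, 1, 3], 'Value': ['A', 'A', 'C']}
pdf = pd.DataFrame(users, columns=['ID', 'Value'])

# keep=False marks *all* copies of a duplicated row, not just the later ones,
# so you can inspect exactly which rows collide
dupes = pdf[pdf.duplicated(keep=False)]
print(dupes)
```

Inspecting the flagged rows makes it easy to decide whether to drop them or to add a distinguishing column, as discussed above.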

     