Spark MongoDB Connector Not Returning Correct Count or Data While Reading

We are using Scala 2.11, Spark 2.4, and MongoDB Spark Connector 2.4.4.

Use Case 1 -

We wanted to read a sharded MongoDB collection and copy its data to another MongoDB collection. After the Spark job completed successfully, we noticed that many records were missing from the output collection.
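A copy job of this kind can be sketched as below. This is a minimal sketch under stated assumptions, not the exact job: the URIs, database names and collection names are placeholders, and it assumes the connector's DataFrame source `com.mongodb.spark.sql.DefaultSource` is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object MongoToMongoCopy {
  // Placeholder URIs (assumptions): point these at the real mongos hosts,
  // databases and collections.
  val sourceUri = "mongodb://source-mongos:27017/sourceDb.sourceCollection"
  val targetUri = "mongodb://target-mongos:27017/targetDb.targetCollection"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mongo-to-mongo-copy").getOrCreate()

    // Bulk read of the whole source collection as a DataFrame.
    val source = spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", sourceUri)
      .load()

    // Append every row that was read to the target collection.
    source.write
      .format("com.mongodb.spark.sql.DefaultSource")
      .mode("append")
      .option("uri", targetUri)
      .save()

    spark.stop()
  }
}
```

A successful run only means that every row Spark read was written. It says nothing about whether the read itself was complete, so the source and target counts still have to be compared separately.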

Use Case 2 -

We read a MongoDB collection into a DataFrame, and calling count on it returned a different number on each execution.
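The behaviour can be checked with something like the following, which reads the same collection several times and compares the counts. It is a sketch only: the URI is a placeholder, and `isStable` is a helper name introduced here for illustration.

```scala
import org.apache.spark.sql.SparkSession

object CountCheck {
  // Hypothetical helper: true when every run returned the same count.
  def isStable(counts: Seq[Long]): Boolean = counts.distinct.size <= 1

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mongo-count-check").getOrCreate()

    // Placeholder URI (assumption).
    val sourceUri = "mongodb://source-mongos:27017/sourceDb.sourceCollection"

    // Load the collection from scratch on each run and count its rows.
    val counts: Seq[Long] = (1 to 3).map { _ =>
      spark.read
        .format("com.mongodb.spark.sql.DefaultSource")
        .option("uri", sourceUri)
        .load()
        .count()
    }

    println(s"Counts across runs: ${counts.mkString(", ")}")
    println(s"Stable: ${isStable(counts)}")
    spark.stop()
  }
}
```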

Analysis -

We realized that the MongoDB Spark Connector was missing data when reading a whole collection in bulk as a DataFrame.

We tried the various partitioners listed at https://www.mongodb.com/docs/spark-connector/v2.4/configuration/ but none of them worked for us.
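Switching partitioners is a matter of setting the `partitioner` read option. The sketch below loops over the partitioner names from the connector 2.4 configuration page and prints the count each one produces. The URI is a placeholder and `readOptions` is a hypothetical helper. Each read is wrapped in `Try` because some partitioners may fail outright on a given deployment, for example when a required command or privilege is not available.

```scala
import org.apache.spark.sql.SparkSession
import scala.util.{Failure, Success, Try}

object PartitionerComparison {
  // Partitioner names as listed in the connector 2.4 configuration page.
  val partitioners: Seq[String] = Seq(
    "MongoDefaultPartitioner",
    "MongoSamplePartitioner",
    "MongoShardedPartitioner",
    "MongoSplitVectorPartitioner",
    "MongoPaginateByCountPartitioner",
    "MongoPaginateBySizePartitioner"
  )

  // Hypothetical helper: read options for one partitioner.
  def readOptions(uri: String, partitioner: String): Map[String, String] =
    Map("uri" -> uri, "partitioner" -> partitioner)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mongo-partitioner-comparison").getOrCreate()

    // Placeholder URI (assumption).
    val sourceUri = "mongodb://source-mongos:27017/sourceDb.sourceCollection"

    partitioners.foreach { p =>
      // Count the collection with this partitioner; report failures instead of aborting.
      val result = Try {
        spark.read
          .format("com.mongodb.spark.sql.DefaultSource")
          .options(readOptions(sourceUri, p))
          .load()
          .count()
      }
      result match {
        case Success(n) => println(s"$p -> $n")
        case Failure(e) => println(s"$p failed: ${e.getMessage}")
      }
    }

    spark.stop()
  }
}
```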

Finally, we tried MongoShardedPartitioner. This led to a constant count on each execution, but the count was greater than the actual number of records in the collection. This seems to be a limitation of the MongoDB Spark Connector. MongoShardedPartitioner seemed to be the closest possible solution for this kind of situation, but it was very slow.
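One way to tell whether such a surplus consists of documents returned more than once is to compare the total row count with the number of distinct `_id` values. The sketch below is an illustration under assumptions: the URI is a placeholder, the shard key is assumed to be `_id` (set via `partitionerOptions.shardkey`), and `surplus` is a hypothetical helper. Note that it scans the collection twice, which makes a slow read slower still.

```scala
import org.apache.spark.sql.SparkSession

object ShardedReadCheck {
  // Hypothetical helper: rows returned beyond the number of distinct _id values.
  def surplus(total: Long, distinctIds: Long): Long = total - distinctIds

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mongo-sharded-read-check").getOrCreate()

    // Placeholder URI (assumption).
    val sourceUri = "mongodb://source-mongos:27017/sourceDb.sourceCollection"

    val df = spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", sourceUri)
      .option("partitioner", "MongoShardedPartitioner")
      // Assumption: the collection is sharded on _id; use the real shard key here.
      .option("partitionerOptions.shardkey", "_id")
      .load()

    val total = df.count()                                // every row returned by the read
    val distinctIds = df.select("_id").distinct().count() // unique documents by _id

    println(s"total=$total distinctIds=$distinctIds surplus=${surplus(total, distinctIds)}")
    spark.stop()
  }
}
```

A surplus of zero would mean the extra rows are not duplicates by `_id`, and the difference from the expected count would have to be explained some other way.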

Crux -

This seems to be a limitation of the MongoDB Spark Connector: we could not reliably read all the data from a MongoDB collection and copy it to another collection. When we specified a where criterion, we could find the affected records in the source MongoDB collection, but the bulk copy always missed data.

Thankfully, we had a copy of the data on HDFS, so we ran a job to copy the data from HDFS to MongoDB instead of from MongoDB to MongoDB.
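A job along these lines might look like the sketch below. The file format is not stated above, so Parquet is assumed purely for illustration. The HDFS path and the target URI are placeholders as well.

```scala
import org.apache.spark.sql.SparkSession

object HdfsToMongoCopy {
  // Placeholder locations (assumptions).
  val hdfsPath = "hdfs:///data/source_collection_export"
  val targetUri = "mongodb://target-mongos:27017/targetDb.targetCollection"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hdfs-to-mongo-copy").getOrCreate()

    // Assumption: the HDFS copy is stored as Parquet; use the matching reader otherwise.
    val fromHdfs = spark.read.parquet(hdfsPath)

    // Write the HDFS data to the target collection, bypassing the MongoDB read path.
    fromHdfs.write
      .format("com.mongodb.spark.sql.DefaultSource")
      .mode("append")
      .option("uri", targetUri)
      .save()

    spark.stop()
  }
}
```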
