Skip to main content

Posts

Showing posts from September, 2024

Spark MongoDB Connector Not leading to correct count or data while reading

  We are using Scala 2.11 , Spark 2.4 and Spark MongoDB Connector 2.4.4 Use Case 1 - We wanted to read a Shareded Mongo Collection and copy its data to another Mongo Collection. We noticed that after Spark Job successful completion. Output MongoDB did not had many records. Use Case 2 -  We read a MongoDB collection and doing count on dataframe lead to different count on each execution. Analysis,  We realized that MongoDB Spark Connector is missing data on bulk read as a dataframe. We tried various partitioner, listed on page -  https://www.mongodb.com/docs/spark-connector/v2.4/configuration/  But, none of them worked for us. Finally, we tried  MongoShardedPartitioner  this lead to constant count on each execution. But, it was greater than the actual count of records on the collection. This seems to be limitation with MongoDB Spark Connector. But,  MongoShardedPartitioner  seemed closest possible solution to this kind of situation. But, it performed really very slow. Crux -  This seems

Scala Spark building Jar leads java.lang.StackOverflowError

  Exception -  [Thread-3] ERROR scala_maven.ScalaCompileMojo - error: java.lang.StackOverflowError [Thread-3] INFO scala_maven.ScalaCompileMojo - at scala.collection.generic.TraversableForwarder$class.isEmpty(TraversableForwarder.scala:36) [Thread-3] INFO scala_maven.ScalaCompileMojo - at scala.collection.mutable.ListBuffer.isEmpty(ListBuffer.scala:45) [Thread-3] INFO scala_maven.ScalaCompileMojo - at scala.collection.mutable.ListBuffer.toList(ListBuffer.scala:306) [Thread-3] INFO scala_maven.ScalaCompileMojo - at scala.collection.mutable.ListBuffer.result(ListBuffer.scala:300) [Thread-3] INFO scala_maven.ScalaCompileMojo - at scala.collection.mutable.Stack$StackBuilder.result(Stack.scala:31) [Thread-3] INFO scala_maven.ScalaCompileMojo - at scala.collection.mutable.Stack$StackBuilder.result(Stack.scala:27) [Thread-3] INFO scala_maven.ScalaCompileMojo - at scala.collection.generic.GenericCompanion.apply(GenericCompanion.scala:50) [Thread-3] INFO scala_maven.ScalaCompileMojo - a

MongoDB Chunk size many times bigger than configure chunksize (128 MB)

  Shard Shard_0 at Shard_0/xyz.com:27018 { data: '202.04GiB', docs: 117037098, chunks: 5, 'estimated data per chunk': '40.4GiB', 'estimated docs per chunk': 23407419 } --- Shard Shard_1 at Shard_1/abc.com:27018 { data: '201.86GiB', docs: 116913342, chunks: 4, 'estimated data per chunk': '50.46GiB', 'estimated docs per chunk': 29228335 } Per MongoDB-  Starting in 6.0.3, we balance by data size instead of the number of chunks. So the 128MB is now only the size of data we migrate at-a-time. So large data size per chunk is good now, as long as the data size per shard is even for the collection. refer -  https://www.mongodb.com/community/forums/t/chunk-size-many-times-bigger-than-configure-chunksize-128-mb/212616 https://www.mongodb.com/docs/v6.0/release-notes/6.0/#std-label-release-notes-6.0-balancing-policy-changes