The Spark MongoDB connector may not fetch all the fields present in the BSON documents stored in a collection.
This is because a Mongo collection can contain documents with different schemas. Typically, all documents in a collection serve a similar or related purpose. A document is a set of key-value pairs, and documents have dynamic schemas: documents in the same collection do not need to have the same set of fields or structure, and common fields in a collection's documents may hold different types of data.
So, when we read a MongoDB collection using the Spark connector, it infers the schema from the first rows it samples, which may not contain fields that are present in subsequent rows.
Suppose we have a MongoDB collection, default.fruits, with the following documents:
{ "_id" : 1, "type" : "apple"}
{ "_id" : 2, "type" : "orange", "qty" : 10 }
{ "_id" : 3, "type" : "banana" }
Code to connect to and read the Mongo collection:
Note: we are using Spark 2.4, Scala 2.11, and mongo-spark-connector_2.11:2.3.5.
- Execute the command below to launch spark-shell with the connector:
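A typical way to do this is to let spark-shell resolve the connector from Maven with --packages; the connection URI below (a local MongoDB on the default port) is an assumption.

spark-shell \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.5 \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1:27017/default.fruits"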
- Or, download the jars and then execute:
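Alternatively, a sketch assuming the connector jar and its MongoDB Java driver dependency have already been downloaded to the current directory (the file names and driver version are illustrative):

spark-shell \
  --jars mongo-spark-connector_2.11-2.3.5.jar,mongo-java-driver-3.9.1.jar \
  --conf "spark.mongodb.input.uri=mongodb://127.0.0.1:27017/default.fruits"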
- Execute the following to read the Mongo collection default.fruits:
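A minimal sketch of the read in spark-shell, assuming a local MongoDB at mongodb://127.0.0.1:27017; the variable names are our own:

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// Point the connector at the default.fruits collection (URI is an assumption)
val readConfig = ReadConfig(Map(
  "uri"        -> "mongodb://127.0.0.1:27017",
  "database"   -> "default",
  "collection" -> "fruits"
))

// Load the collection as a DataFrame; the schema is inferred by sampling documents
val fruitsDF = MongoSpark.load(spark, readConfig)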
- Check the schema:
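The inferred schema can be inspected with printSchema(); the output below is illustrative (the exact numeric types depend on how the documents were inserted) and, per the scenario above, it has no qty column:

fruitsDF.printSchema()
// root
//  |-- _id: double (nullable = true)
//  |-- type: string (nullable = true)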
- Print the data:
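Illustrative output; the qty value stored for _id = 2 does not appear because the column is not part of the inferred schema:

fruitsDF.show()
// +---+------+
// |_id|  type|
// +---+------+
// |1.0| apple|
// |2.0|orange|
// |3.0|banana|
// +---+------+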
Note: we missed the "qty" field for "_id" = 2.
The solution to the above problem is to specify the schema externally, rather than allowing Spark to infer it from the data.
- Pick one document in JSON string format with the minimal schema that is needed and save it in a file. For example, create a file named "sample.json" containing the following:
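For instance, the document for _id = 2 carries all three fields we care about:

{ "_id": 2, "type": "orange", "qty": 10 }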
- Read the JSON file:
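A short sketch; the path to sample.json (local working directory or HDFS, depending on your setup) is an assumption:

// Read the one-document JSON file and keep its schema
val sampleDF = spark.read.json("sample.json")
val fruitsSchema = sampleDF.schema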
- Read from the MongoDB collection specifying the schema, like below:
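A sketch using the DataFrameReader API with the external schema; the URI, database, and collection values mirror the assumptions above:

// Pass the schema explicitly instead of letting the connector infer one
val fruitsDF2 = spark.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .schema(fruitsSchema)
  .option("uri", "mongodb://127.0.0.1:27017")
  .option("database", "default")
  .option("collection", "fruits")
  .load()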
- Check the schema:
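Illustrative output; spark.read.json infers JSON integers as long, so the sample-derived schema looks like this:

fruitsDF2.printSchema()
// root
//  |-- _id: long (nullable = true)
//  |-- qty: long (nullable = true)
//  |-- type: string (nullable = true)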
- Print the data:
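Illustrative output, assuming the stored numeric values convert cleanly to the long types in the sample schema; qty now appears, with null for documents that lack it:

fruitsDF2.show()
// +---+----+------+
// |_id| qty|  type|
// +---+----+------+
// |  1|null| apple|
// |  2|  10|orange|
// |  3|null|banana|
// +---+----+------+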